Authors:Matthew Gwilliam, Roy Zhang, Namitha Padmanabhan, Hongyang Du, Abhinav Shrivastava
Abstract:
Implicit neural representation (INR) methods for video compression have recently achieved visual quality and compression ratios that are competitive with traditional pipelines. However, due to the need for per-sample network training, the encoding speeds of these methods are too slow for practical adoption. We develop a library to allow us to disentangle and review the components of methods from the NeRV family, reframing their performance in terms of not only size-quality trade-offs, but also impacts on training time. We uncover principles for effective video INR design and propose a state-of-the-art configuration of these components, Rabbit NeRV (RNeRV). When all methods are given equal training time (equivalent to 300 NeRV epochs) for 7 different UVG videos at 1080p, RNeRV achieves +1.27% PSNR on average compared to the best-performing alternative for each video in our NeRV library. We then tackle the encoding speed issue head-on by investigating the viability of hyper-networks, which predict INR weights from video inputs, to disentangle training from encoding to allow for real-time encoding. We propose masking the weights of the predicted INR during training to allow for variable, higher quality compression, resulting in 1.7% improvements to both PSNR and MS-SSIM at 0.037 bpp on the UCF-101 dataset, and we increase hyper-network parameters by 0.4% for 2.5%/2.7% improvements to PSNR/MS-SSIM with equal bpp and similar speeds. Our project website is available at https://mgwillia.github.io/vinrb/ and our code is available at https://github.com/mgwillia/vinrb.
English: The study builds a library to disentangle and review the components of NeRV-family video INRs, proposing Rabbit NeRV (RNeRV) as a state-of-the-art configuration, and separately investigates hyper-networks that predict INR weights from video inputs, with weight masking during training, to enable real-time encoding.
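To make the INR framing concrete, here is a minimal, illustrative PyTorch sketch of a NeRV-style decoder: a network maps a frame-index embedding to an RGB frame, so encoding a video means overfitting the weights to that video. The layer sizes and names (TinyNeRV, frame_embedding) are hypothetical, not the paper's configuration.

```python
# Minimal sketch of a NeRV-style video INR: the network maps a frame-index
# embedding to an RGB frame, so the video is "encoded" in the weights.
# All sizes and names here are hypothetical.
import math
import torch
import torch.nn as nn

class TinyNeRV(nn.Module):
    def __init__(self, embed_dim=64, base_hw=(9, 16), channels=128):
        super().__init__()
        self.base_hw, self.channels = base_hw, channels
        self.stem = nn.Linear(embed_dim, channels * base_hw[0] * base_hw[1])
        # Each block upsamples 2x with PixelShuffle, as in the NeRV family.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels * 4, 3, padding=1),
                          nn.PixelShuffle(2), nn.GELU())
            for _ in range(3))
        self.head = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, t_embed):                        # (B, embed_dim)
        x = self.stem(t_embed).view(-1, self.channels, *self.base_hw)
        for blk in self.blocks:                        # 9x16 -> 72x128
            x = blk(x)
        return torch.sigmoid(self.head(x))             # (B, 3, 72, 128)

def frame_embedding(t, T, dim=64):
    """Sinusoidal embedding of the normalized frame index t / T."""
    freqs = torch.arange(dim // 2, dtype=torch.float32) * math.pi
    phase = freqs * (t / T)
    return torch.cat([phase.sin(), phase.cos()]).unsqueeze(0)  # (1, dim)
```

Encoding then amounts to minimizing, say, an MSE loss between net(frame_embedding(t, T)) and frame t; the hyper-network variant discussed in the abstract instead predicts such weights directly from the video, which is what removes per-video training from the encoding path.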
Authors:Jiacheng Cui, Xinyue Bi, Yaxin Luo, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen
Abstract:
Residual connection has been extensively studied and widely applied at the model architecture level. However, its potential in the more challenging data-centric approaches remains unexplored. In this work, we introduce the concept of Data Residual Matching for the first time, leveraging data-level skip connections to facilitate data generation and mitigate data information vanishing. This approach maintains a balance between newly acquired knowledge through pixel space optimization and existing core local information identification within raw data modalities, specifically for the dataset distillation task. Furthermore, by incorporating optimization-level refinements, our method significantly improves computational efficiency, achieving superior performance while reducing training time and peak GPU memory usage by 50%. Consequently, the proposed method Fast and Accurate Data Residual Matching for Dataset Distillation (FADRM) establishes a new state-of-the-art, demonstrating substantial improvements over existing methods across multiple dataset benchmarks in both efficiency and effectiveness. For instance, with ResNet-18 as the student model and a 0.8% compression ratio on ImageNet-1K, the method achieves 47.7% test accuracy in single-model dataset distillation and 50.0% in multi-model dataset distillation, surpassing RDED by +5.7% and outperforming state-of-the-art multi-model approaches, EDC and CV-DD, by +1.4% and +4.0%. Code is available at: https://github.com/Jiacheng8/FADRM.
English: This paper introduces Data Residual Matching, a data-centric approach that uses data-level skip connections to balance newly acquired knowledge with core local information in the raw data for dataset distillation; the resulting FADRM method halves training time and peak GPU memory while achieving state-of-the-art accuracy across multiple benchmarks.
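The data-level skip connection can be pictured with a short, hedged sketch: each step updates the synthetic image in pixel space, then blends information back from a raw anchor image so core local structure does not vanish. The function names, the blend coefficient alpha, and the plain gradient step are illustrative assumptions, not FADRM's exact procedure.

```python
# Illustrative sketch of a data-level residual connection for dataset
# distillation. "matching_loss" stands in for whatever feature/statistic
# matching objective drives the pixel-space optimization.
import torch

def data_residual_step(x_syn, x_real, matching_loss, lr=0.1, alpha=0.9):
    """One update: a gradient step on the matching loss, then a residual
    blend with the raw image (the data-level skip connection)."""
    x_syn = x_syn.detach().requires_grad_(True)
    loss = matching_loss(x_syn)                 # newly acquired knowledge
    loss.backward()
    with torch.no_grad():
        x_new = x_syn - lr * x_syn.grad
        x_new = alpha * x_new + (1 - alpha) * x_real  # residual from raw data
    return x_new
```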
Authors:Sixun Dong, Wei Fan, Teresa Wu, Yanjie Fu
Abstract:
Time series forecasting traditionally relies on unimodal numerical inputs, which often struggle to capture high-level semantic patterns due to their dense and unstructured nature. While recent approaches have explored representing time series as text using large language models (LLMs), these methods remain limited by the discrete nature of token sequences and lack the perceptual intuition humans typically apply, such as interpreting visual patterns. In this paper, we propose a multimodal contrastive learning framework that transforms raw time series into structured visual and textual perspectives. Rather than using natural language or real-world images, we construct both modalities directly from numerical sequences. We then align these views in a shared semantic space via contrastive learning, enabling the model to capture richer and more complementary representations. Furthermore, we introduce a variate selection module that leverages the aligned representations to identify the most informative variables for multivariate forecasting. Extensive experiments on fifteen short-term and six long-term forecasting benchmarks demonstrate that our approach consistently outperforms strong unimodal and cross-modal baselines, highlighting the effectiveness of multimodal alignment in enhancing time series forecasting. Code is available at: https://github.com/Ironieser/TimesCLIP.
English Summary: This paper introduces a multimodal contrastive learning framework that transforms time series into aligned visual and textual representations constructed directly from the numerical sequences, achieving superior forecasting performance through semantic alignment and variate selection.
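The alignment objective is CLIP-style contrastive learning between the two constructed views. A minimal sketch, assuming precomputed embeddings from the visual and textual branches:

```python
# Symmetric InfoNCE between two views of the same batch of time series:
# the matching (series_i, view_i) pairs are positives, all others negatives.
import torch
import torch.nn.functional as F

def info_nce(z_vis, z_txt, temperature=0.07):
    z_vis = F.normalize(z_vis, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_vis @ z_txt.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(len(z_vis))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```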
Authors:Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen
Abstract:
We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.
English: Calligrapher is a diffusion-based framework that combines text customization with artistic typography, using self-distillation and localized style injection to accurately reproduce intricate styles and precise glyphs, surpassing traditional models in digital design applications.
Authors:Yuqing Wang, Shangding Gu
Abstract:
Data selection plays a crucial role in data-driven decision-making, including in large language models (LLMs), and is typically task-dependent. Properties such as data quality and diversity have been extensively studied and are known to enhance model performance. However, it remains unclear whether there exist other quantitative and general principles of data selection that can consistently improve performance, especially for complex tasks with limited prior knowledge. In this paper, we demonstrate that selecting more uniformly distributed data can improve training efficiency while enhancing performance. Specifically, we establish that more uniform (less biased) distribution leads to a larger minimum pairwise distance between data points, denoted by $h_{\min}$, and prove that a smaller $h_{\min}$ can slow down the training dynamics of gradient descent (GD). Moreover, we theoretically show that the approximation error of neural networks decreases as $h_{\min}$ increases. Our analysis introduces a convergence framework for GD beyond the Neural Tangent Kernel (NTK) regime, applicable to a broad class of architectures, including transformers, without requiring Lipschitz smoothness. This framework further provides theoretical justification for the use of residual connections and function compositions in deep neural architectures. In the end, we conduct comprehensive experiments for supervised fine-tuning across various settings, including different optimization strategies, model sizes, and training datasets. The results consistently demonstrate that selecting data by maximizing pairwise distance significantly accelerates training and achieves comparable or better performance in LLMs across diverse datasets. Code and Datasets are available at the link: https://github.com/SafeRL-Lab/data-uniformity.
English: This research demonstrates that selecting more uniformly distributed data enhances training efficiency and performance in large language models by increasing the minimum pairwise distance between data points, which accelerates gradient descent convergence and reduces approximation error.
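One simple way to act on this principle is greedy max-min (farthest-point) selection, which approximately maximizes the minimum pairwise distance $h_{\min}$. The sketch below is an illustrative implementation, not the paper's exact selection code.

```python
# Greedy farthest-point selection: repeatedly add the point farthest from
# the current subset, approximately maximizing the minimum pairwise
# distance h_min of the selected data.
import numpy as np

def select_uniform_subset(X, k):
    """Pick k row indices of X by max-min distance."""
    chosen = [0]                                   # arbitrary seed point
    d = np.linalg.norm(X - X[0], axis=1)           # distance to chosen set
    for _ in range(k - 1):
        nxt = int(d.argmax())                      # farthest remaining point
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

def h_min(X):
    """Minimum pairwise distance of a dataset (the paper's h_min)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return D.min()
```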
Authors:Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin
Abstract:
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance, with a 7.4% FVD improvement and prediction durations minutes longer than prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at https://github.com/Kevin-thu/Epona/.
English: The proposed Epona model introduces an autoregressive diffusion framework with decoupled spatiotemporal factorization and modular trajectory integration, enabling long-horizon, high-resolution world modeling that outperforms existing methods in both video prediction quality and real-time motion planning.
Authors:Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi
Abstract:
Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack grounded locations and relations for visual entities, and existing grounded caption datasets lack detailed descriptions, relations, and dense object descriptions on high-resolution images. To fill this gap for the community, we present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset of the real world. We design a three-stage labeling pipeline, containing open-world perception, detailed object caption generation, and dense caption merging. The first stage obtains entity-level masks and labels. The second stage generates object-level, detailed captions with the guidance of masks and labels from the first stage. The final stage merges object captions and masks into spatial and relational dense captions. To accelerate the labeling process and improve caption quality, we present two VLM models: the Detailed Region Caption model and the Spatial Caption Merging model. Extensive experiments on various settings, including vision-language understanding, visual grounding, and region caption generation, demonstrate the effectiveness of our DenseWorld-1M dataset and labeling models.
English Summary: The DenseWorld-1M dataset addresses limitations in existing caption datasets by providing detailed, dense grounded captions through a three-stage labeling pipeline and specialized VLM models, demonstrating effectiveness across multiple vision-language tasks.
Authors:Moein Heidari, Yasamin Medghalchi, Mahdi Khoursha, Reza Rezaeian, Ilker Hacihaliloglu
Abstract:
Parameter-efficient fine-tuning (PEFT) has gained widespread adoption across various applications. Among PEFT techniques, Low-Rank Adaptation (LoRA) and its extensions have emerged as particularly effective, allowing efficient model adaptation while significantly reducing computational overhead. However, existing approaches typically rely on global low-rank factorizations, which overlook local or multi-scale structure and fail to capture complex patterns in the weight updates. To address this, we propose WaRA, a novel PEFT method that leverages wavelet transforms to decompose the weight update matrix into a multi-resolution representation. By performing low-rank factorization in the wavelet domain and reconstructing updates through an inverse transform, WaRA obtains compressed adaptation parameters that harness multi-resolution analysis, enabling it to capture both coarse and fine-grained features while providing greater flexibility and sparser representations than standard LoRA. Through comprehensive experiments and analysis, we demonstrate that WaRA achieves superior performance on diverse vision tasks, including image generation, classification, and semantic segmentation, significantly enhancing generated image quality while reducing computational complexity. Although WaRA was primarily designed for vision tasks, we further showcase its effectiveness in language tasks, highlighting its broader applicability and generalizability. The code is publicly available at https://github.com/moeinheidari7829/WaRA.
English: The proposed WaRA method enhances parameter-efficient fine-tuning by applying wavelet transforms for multi-resolution analysis, achieving superior performance in vision and language tasks with reduced computational costs.
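As a rough sketch of the idea (using PyWavelets for a one-level Haar transform; the actual factorization, wavelet choice, and initialization in WaRA may differ), the weight update can be parameterized by low-rank factors per wavelet subband and reconstructed with the inverse transform:

```python
# Hedged sketch: low-rank factors live in the wavelet domain, and delta-W
# is reconstructed by the inverse 2D Haar transform. A one-level transform
# splits a d x d update into 4 subbands (cA approximation + cH/cV/cD
# details), each of size d/2 x d/2.
import numpy as np
import pywt

rng = np.random.default_rng(0)
d, r = 64, 4                                    # weight dim, low rank

subband_factors = {
    # A is small random, B is zero, so delta-W starts at zero (LoRA-style).
    name: (rng.normal(size=(d // 2, r)) * 0.01, np.zeros((r, d // 2)))
    for name in ("cA", "cH", "cV", "cD")
}

def delta_W(factors):
    """Reconstruct the full-resolution weight update from low-rank subbands."""
    bands = {k: A @ B for k, (A, B) in factors.items()}
    return pywt.idwt2((bands["cA"], (bands["cH"], bands["cV"], bands["cD"])),
                      "haar")                   # (d, d) update matrix
```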
Authors:Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun
Abstract:
The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructions, reason about complex traffic scenes, and make their own decisions. However, the literature remains fragmented and is rapidly expanding. This survey offers the first comprehensive overview of VLA for Autonomous Driving (VLA4AD). We (i) formalize the architectural building blocks shared across recent work, (ii) trace the evolution from early explainer models to reasoning-centric VLA models, and (iii) compare over 20 representative models, charting VLA's progress in the autonomous driving domain. We also consolidate existing datasets and benchmarks, highlighting protocols that jointly measure driving safety, accuracy, and explanation quality. Finally, we detail open challenges - robustness, real-time efficiency, and formal verification - and outline future directions of VLA4AD. This survey provides a concise yet complete reference for advancing interpretable, socially aligned autonomous vehicles. The GitHub repo is available at https://github.com/JohnsonJiang1996/Awesome-VLA4AD.
English: This survey provides the first comprehensive overview of Vision-Language-Action models for autonomous driving, analyzing their evolution, architectural components, and over 20 representative models while identifying key challenges like robustness and real-time efficiency.
Authors:Pei Zhan, Peng Tang, Yangzhuo Li, Puwen Wei, Shanqing Guo
Abstract:
Local differential privacy (LDP) involves users perturbing their inputs to provide plausible deniability of their data. However, this also makes LDP vulnerable to poisoning attacks. In this paper, we first introduce novel poisoning attacks for ranking estimation. These attacks are intricate, as attackers do not merely adjust the frequency of target items; instead, they leverage a limited number of fake users to precisely modify frequencies, effectively altering item rankings to maximize gains. To tackle this challenge, we introduce the concepts of attack cost and optimal attack item (set), and propose corresponding strategies for the kRR, OUE, and OLH protocols. For kRR, we iteratively select optimal attack items and allocate suitable fake users. For OUE, we iteratively determine optimal attack item sets and consider the incremental changes in item frequencies across different sets. For OLH, we develop a harmonic cost function based on hash pre-images to select hash values that support a larger number of effective attack items. Lastly, we present an attack strategy based on confidence levels to quantify the probability of a successful attack and the number of attack iterations more precisely. We demonstrate the effectiveness of our attacks through theoretical and empirical evidence, highlighting the necessity for defenses against these attacks. The source code and data have been made available at https://github.com/LDP-user/LDP-Ranking.git.
English: This paper introduces sophisticated poisoning attacks on local differential privacy ranking estimation, proposing attack strategies and cost functions for the kRR, OUE, and OLH protocols, while demonstrating their effectiveness and the need for defensive measures.
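To see why such protocols are attackable at all, consider a minimal kRR simulation: fake users who always report the target item are amplified by the server's unbiased estimator. This toy example only illustrates the attack surface; the paper's contribution is choosing optimal attack items and fake-user allocations under a cost budget.

```python
# Toy kRR poisoning demo: honest users follow k-ary randomized response,
# fake users always report the target item, and the server's unbiased
# estimator inflates the target's estimated frequency.
import numpy as np

def krr_estimate(reports, k, eps):
    """Server-side unbiased frequency estimator for k-ary randomized response."""
    p = np.exp(eps) / (np.exp(eps) + k - 1)    # prob. a user reports the truth
    q = 1.0 / (np.exp(eps) + k - 1)            # prob. of each other value
    counts = np.bincount(reports, minlength=k)
    return (counts / len(reports) - q) / (p - q)

rng = np.random.default_rng(0)
k, eps, n_real, n_fake, target = 10, 1.0, 100_000, 2_000, 3
true_vals = rng.integers(0, k, n_real)
keep = rng.random(n_real) < np.exp(eps) / (np.exp(eps) + k - 1)
other = (true_vals + rng.integers(1, k, n_real)) % k  # uniform over other items
honest = np.where(keep, true_vals, other)
fake = np.full(n_fake, target)                 # fake users push the target item
est = krr_estimate(np.concatenate([honest, fake]), k, eps)
print(f"target freq: true ~{1/k:.3f}, estimated {est[target]:.3f}")
```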
Authors:Hyunjong Kim, Sangyeop Kim, Jongheon Jeong, Yeongjae Cho, Sungzoon Cho
Abstract:
Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
English: Recent progress in language models has spurred the need for explainable image captioning metrics, leading to the development of EXPERT, a reference-free system that provides structured evaluations based on fluency, relevance, and descriptiveness, achieving top performance and high-quality explanations.
Authors:Lijun Sheng, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Abstract:
Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA research generally suffers from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and obscure their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP, a model trained with a sigmoid loss, and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.
English: TTA-VLM is introduced as a comprehensive benchmark to address limitations in current test-time adaptation research for vision-language models, enabling fair and holistic evaluations across multiple methods, datasets, and metrics beyond just accuracy.
Authors:Boyue Xu, Ruichao Hou, Tongwei Ren, Gangshan Wu
Abstract:
Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from the auxiliary modality to the dominant modality by jointly modeling the frequency, spatial, and channel-wise features. Additionally, we design the memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at https://github.com/xuboyue1999/mmtrack.git.
English Summary: The proposed visual and memory dual adapter (VMDA) strengthens multi-modal tracking by transferring discriminative frequency, spatial, and channel-wise cues from the auxiliary to the dominant modality, and by storing, updating, and retrieving global temporal cues with a memory adapter inspired by human memory, achieving state-of-the-art results on RGB-Thermal, RGB-Depth, and RGB-Event tracking.
Authors:Boyue Xu, Ruichao Hou, Tongwei Ren, Dongming Zhou, Gangshan Wu, Jinde Cao
Abstract:
Prompt-learning-based multi-modal trackers have made strong progress by using lightweight visual adapters to inject auxiliary-modality cues into frozen foundation models. However, they still underutilize two essentials: modality-specific frequency structure and long-range temporal dependencies. We present Learning Frequency and Memory-Aware Prompts, a dual-adapter framework that injects lightweight prompts into a frozen RGB tracker. A frequency-guided visual adapter adaptively transfers complementary cues across modalities by jointly calibrating spatial, channel, and frequency components, narrowing the modality gap without full fine-tuning. A multilevel memory adapter with short, long, and permanent memory stores, updates, and retrieves reliable temporal context, enabling consistent propagation across frames and robust recovery from occlusion, motion blur, and illumination changes. This unified design preserves the efficiency of prompt learning while strengthening cross-modal interaction and temporal coherence. Extensive experiments on RGB-Thermal, RGB-Depth, and RGB-Event benchmarks show consistent state-of-the-art results over fully fine-tuned and adapter-based baselines, together with favorable parameter efficiency and runtime. Code and models are available at https://github.com/xuboyue1999/mmtrack.git.
English Summary: The proposed dual-adapter framework enhances multi-modal tracking by incorporating frequency-guided visual prompts and multi-level memory storage to improve cross-modal interaction and temporal consistency without full fine-tuning.
Authors:Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung
Abstract:
Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think-with-images paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.
English: The abstract introduces a paradigm shift in multimodal reasoning where AI transitions from merely thinking about images to thinking with them, utilizing visual information as a dynamic cognitive workspace across three developmental stages.
Authors:Longliang Liu, Miaojie Feng, Junda Cheng, Jijun Xiang, Xuan Zhu, Xin Yang
Abstract:
Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features from both branches, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation. The code is publicly available at: https://github.com/longliangLiu/PriOr-Flow.
English: PriOr-Flow introduces a dual-branch framework with a DCCL operator and ODDC module to mitigate panoramic distortion in optical flow estimation, achieving state-of-the-art performance on benchmark datasets.
Authors:Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
Abstract:
The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.
English: This paper introduces VMoBA, a novel sparse attention mechanism that accelerates Video Diffusion Models by adapting to spatio-temporal patterns, achieving significant speed improvements while maintaining or enhancing video generation quality.
Authors:Ji Zhang, Shihan Wu, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen
Abstract:
Despite the great promise of Prompt Tuning (PT) in adapting large Vision-Language Pretrained Models (VLPMs) to downstream tasks, they often struggle to overcome the Base-New Tradeoff (BNT) dilemma: as VLPMs are better tuned to a base task, their ability to generalize to new tasks diminishes. Recent work on conditional PT addresses this problem by replacing static prompts with dynamic Visual Image Information (VII)-conditioned prompts, improving the model's generalization to new tasks to some extent. In this work, we first identify a critical issue with existing conditional PT methods: using VII as the "condition" of prompts yields suboptimal performance, and even random noise-conditioned prompts can outperform the VII-conditioned counterparts. On further analysis, we find that learning dynamic prompts conditioned on Textual Class Information (TCI) is the key to solving the BNT problem. Motivated by this, we then propose Class-adaptive Prompt Tuning (CaPT), which enables fast adaptation of tuned models to new classes by learning TCI-conditioned prompts from base classes. Remarkably, CaPT can be used as a plugin to mitigate the BNT problem for existing unconditional PT schemes. Extensive experiments on 11 datasets show that CaPT consistently improves the performance of five strong unconditional PT baselines with negligible additional computational cost. Additionally, by integrating CaPT with our recently proposed DePT framework, we devise a new conditional PT approach, termed DeCaPT, which outperforms the state-of-the-art conditional PT scheme by 3.49% in harmonic-mean accuracy (H ACC), averaged over the 11 datasets. Code: https://github.com/Koorye/CaPT.
English: Prompt Tuning for vision-language models often faces the Base-New Tradeoff dilemma, where improved base task performance reduces generalization to new tasks, but the proposed Class-adaptive Prompt Tuning (CaPT) method effectively addresses this by learning dynamic prompts conditioned on textual class information, enhancing performance across multiple datasets with minimal added cost.
Authors:Jianing Jin, Jiangyong Ying, Huiyu Duan, Liu Yang, Sijing Wu, Yunhao Li, Yushuo Zheng, Xiongkuo Min, Guangtao Zhai
Abstract:
As camera-equipped robotic platforms become increasingly integrated into daily life, robotic-generated videos have begun to appear on streaming media platforms, enabling us to envision a future where humans and robots coexist. We innovatively propose the concept of Robotic-Generated Content (RGC) to term these videos generated from egocentric perspective of robots. The perceptual quality of RGC videos is critical in human-robot interaction scenarios, and RGC videos exhibit unique distortions and visual requirements that differ markedly from those of professionally-generated content (PGC) videos and user-generated content (UGC) videos. However, dedicated research on quality assessment of RGC videos is still lacking. To address this gap and to support broader robotic applications, we establish the first Robotic-Generated Content Database (RGCD), which contains a total of 2,100 videos drawn from three robot categories and sourced from diverse platforms. A subjective VQA experiment is conducted subsequently to assess human visual perception of robotic-generated videos. Finally, we conduct a benchmark experiment to evaluate the performance of 11 state-of-the-art VQA models on our database. Experimental results reveal significant limitations in existing VQA models when applied to complex, robotic-generated content, highlighting a critical need for RGC-specific VQA models. Our RGCD is publicly available at: https://github.com/IntMeGroup/RGC-VQA.
English: The study introduces Robotic-Generated Content (RGC) as a new video category from robots' perspectives, establishes the first RGC database with 2,100 videos, and reveals that current video quality assessment models perform poorly on RGC, highlighting the need for specialized models.
Authors:Ziwei Chen, Ziling Liu, Zitong Huang, Mingqi Gao, Feng Zheng
Abstract:
Viewpoint missing of objects is common in scene reconstruction, as camera paths typically prioritize capturing the overall scene structure rather than individual objects. This makes it highly challenging to achieve high-fidelity object-level modeling while maintaining accurate scene-level representation. Addressing this issue is critical for advancing downstream tasks requiring high-fidelity object reconstruction. In this paper, we introduce Scene-Consistent Object Refinement via Proxy Generation and Tuning (SCORP), a novel 3D enhancement framework that leverages 3D generative priors to recover fine-grained object geometry and appearance under missing views. Starting with proxy generation by substituting degraded objects using a 3D generation model, SCORP then progressively refines geometry and texture by aligning each proxy to its degraded counterpart in 7-DoF pose, followed by correcting spatial and appearance inconsistencies through registration-constrained enhancement. This two-stage proxy tuning ensures the high-fidelity geometry and appearance of the original object in unseen views while maintaining consistency in spatial positioning, observed geometry, and appearance. Across challenging benchmarks, SCORP achieves consistent gains over recent state-of-the-art baselines on both novel view synthesis and geometry completion tasks. SCORP is available at https://github.com/PolySummit/SCORP.
English: SCORP is a novel 3D enhancement framework that addresses viewpoint gaps in scene reconstruction by generating and refining object proxies to achieve high-fidelity geometry and appearance while maintaining spatial consistency.
Authors:Mingcheng Qu, Yuncong Wu, Donglin Di, Yue Gao, Tonghua Su, Yang Song, Lei Fan
Abstract:
Spatial transcriptomics (ST) provides crucial insights into tissue micro-environments, but is limited by its high cost and complexity. As an alternative, predicting gene expression from pathology whole slide images (WSI) is gaining increasing attention. However, existing methods typically rely on single patches or a single pathology modality, neglecting the complex spatial and molecular interactions between target and neighboring information (e.g., gene co-expression). This leads to a failure in establishing connections among adjacent regions and capturing intricate cross-modal relationships. To address these issues, we propose NH2ST, a framework that integrates spatial context and both pathology and gene modalities for gene expression prediction. Our model comprises a query branch and a neighbor branch to process paired target patch and gene data and their neighboring regions, where cross-attention and contrastive learning are employed to capture intrinsic associations and ensure alignments between pathology and gene expression. Extensive experiments on six datasets demonstrate that our model consistently outperforms existing methods, achieving improvements of over 20% in PCC metrics. Code is available at https://github.com/MCPathology/NH2ST.
English: The proposed NH2ST framework enhances gene expression prediction from pathology images by integrating spatial context and multi-modal data through cross-attention and contrastive learning, consistently outperforming existing methods by over 20% in PCC metrics across six datasets.
Authors:Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin
Abstract:
Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. Particularly, we design a Flash Memory module, containing a low-capacity context memory to aggregate long-context temporal information and model the distribution of information density, and a high-capacity augmentation memory to retrieve detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, i.e., EgoSchema, MLVU, LVBench, MVBench and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. Code is available at https://github.com/IVGSZ/Flash-VStream.
English Summary: Flash-VStream is an efficient video language model that processes long videos in real-time using a dual-memory module to reduce latency while maintaining state-of-the-art performance on multiple benchmarks.
Authors:Shiming Chen, Bowen Duan, Salman Khan, Fahad Shahbaz Khan
Abstract:
Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and their associated attributes, facilitating effective alignment and providing interpretable similarity without the need for additional training. Extensive experiments demonstrate that our method offers several advantages, including enhanced interpretability, improved accuracy, and strong domain generalization. Code is available at: https://github.com/shiming-chen/LaZSL.
English Summary: LaZSL is a novel vision-language model that enhances interpretability in zero-shot learning by aligning local visual features with semantic attributes through optimal transport, improving both accuracy and explainability without extra training.
Authors:Mahshid Shiri, Cigdem Beyan, Vittorio Murino
Abstract:
An innovative few-shot anomaly detection approach is presented, leveraging the pre-trained CLIP model for medical data, and adapting it for both image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). A dual-branch design is proposed to separately capture normal and abnormal features through learnable adapters in the CLIP vision encoder. To improve semantic alignment, learnable text prompts are employed to link visual features. Furthermore, SigLIP loss is applied to effectively handle the many-to-one relationship between images and unpaired text prompts, showcasing its adaptation in the medical field for the first time. Our approach is validated on multiple modalities, demonstrating superior performance over existing methods for AC and AS, in both same-dataset and cross-dataset evaluations. Unlike prior work, it does not rely on synthetic data or memory banks, and an ablation study confirms the contribution of each component. The code is available at https://github.com/mahshid1998/MadCLIP.
English: This study introduces a novel few-shot anomaly detection method using the CLIP model for medical data, achieving superior performance in both image classification and pixel segmentation without synthetic data or memory banks.
Authors:Yongjian Wu, Yang Zhou, Jiya Saiyin, Bingzheng Wei, Yan Xu
Abstract:
We propose VisTex-OVLM, a novel image prompted object detection method that introduces visual textualization -- a process that projects a few visual exemplars into the text feature space to enhance Object-level Vision-Language Models' (OVLMs) capability in detecting rare categories that are difficult to describe textually and nearly absent from their pre-training data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method maintains the original architecture of the OVLM, preserving its generalization capabilities while enhancing performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets which have minimal overlap with OVLM's pre-training data and achieves state-of-the-art results on few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at https://github.com/WitGotFlg/VisTex-OVLM.
English: VisTex-OVLM introduces visual textualization to project visual exemplars into text space, enhancing object detection for rare categories while preserving pre-trained alignment, achieving state-of-the-art results on few-shot benchmarks without altering the original model architecture.
Authors:Shiao Wang, Ju Huang, Qingchuan Ma, Jinfeng Gao, Chunyi Xu, Xiao Wang, Lan Chen, Bo Jiang
Abstract:
Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effectiveness of cross-modal interactions. In this paper, we propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network, termed Mamba-FETrack V2. Specifically, we first design a lightweight Prompt Generator that utilizes embedded features from each modality, together with a shared prompt pool, to dynamically generate modality-specific learnable prompt vectors. These prompts, along with the modality-specific embedded features, are then fed into a Vision Mamba-based FEMamba backbone, which facilitates prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Finally, the fused representations are passed to the tracking head for accurate target localization. Extensive experimental evaluations on multiple RGB-Event tracking benchmarks, including the short-term COESOT dataset and the long-term FE108 and FELT V2 datasets, demonstrate the superior performance and efficiency of the proposed tracking framework. The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack
English: This paper introduces Mamba-FETrack V2, an efficient RGB-Event object tracking framework that uses a Vision Mamba network and a prompt generator to achieve superior performance with reduced computational complexity.
Authors:Xue Wen Tan, Stanley Kok
Abstract:
Every publicly traded U.S. company files an annual 10-K report containing critical insights into financial health and risk. We propose Tiny eXplainable Risk Assessor (TinyXRA), a lightweight and explainable transformer-based model that automatically assesses company risk from these reports. Unlike prior work that relies solely on the standard deviation of excess returns (adjusted for the Fama-French model), which indiscriminately penalizes both upside and downside risk, TinyXRA incorporates skewness, kurtosis, and the Sortino ratio for more comprehensive risk assessment. We leverage TinyBERT as our encoder to efficiently process lengthy financial documents, coupled with a novel dynamic, attention-based word cloud mechanism that provides intuitive risk visualization while filtering irrelevant terms. This lightweight design ensures scalable deployment across diverse computing environments with real-time processing capabilities for thousands of financial documents, which is essential for production systems with constrained computational resources. We employ triplet loss for risk quartile classification, improving over pairwise loss approaches in existing literature by capturing both the direction and magnitude of risk differences. Our TinyXRA achieves state-of-the-art predictive accuracy across seven test years on a dataset spanning 2013-2024, while providing transparent and interpretable risk assessments. We conduct comprehensive ablation studies to evaluate our contributions and assess model explanations both quantitatively, by systematically removing highly attended words and sentences, and qualitatively, by examining explanation coherence. The paper concludes with findings, practical implications, limitations, and future research directions. Our code is available at https://github.com/Chen-XueWen/TinyXRA.
English: TinyXRA is a lightweight, explainable transformer model that assesses company risk from 10-K reports using advanced metrics and visualization, achieving state-of-the-art accuracy with real-time processing capabilities.
Authors:JiaRu Wu, Mingwei Liu
Abstract:
Large language models (LLMs) have shown remarkable performance on various tasks, but existing evaluation benchmarks are often static and insufficient to fully assess their robustness and generalization in realistic scenarios. Prior work using evolutionary or adversarial data augmentation has improved evaluation diversity but lacks systematic control over perturbation types and multi-step complexity, limiting comprehensive robustness analysis. To address these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for close-ended tasks such as multi-choice question answering. AutoEvoEval introduces 22 interpretable atomic evolution operations and supports multi-round compositions, enabling controlled generation of diverse, challenging, and realistic test samples. We conduct extensive experiments addressing four research questions on a broad set of open- and closed-source LLMs. Our results show that atomic operations cause an average accuracy drop of 7.283%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivities vary significantly for the same perturbation, and combining multiple evolution steps amplifies adversarial effects by up to 52.932%. These findings suggest current benchmarks may overestimate true model generalization and emphasize the need for evolution-aware robustness evaluation. Code and resources are available at: https://github.com/SYSUSELab/AutoEvoEval.
English Summary: AutoEvoEval introduces an evolution-based framework with 22 atomic operations to systematically evaluate LLM robustness, revealing significant accuracy drops from perturbations and exposing overestimated generalization in current benchmarks.
Authors:Chang'an Yi, Xiaohui Deng, Guohao Chen, Yan Zhou, Qinghua Lu, Shuaicheng Niu
Abstract:
Test-time Adaptation (TTA) adapts a given model to testing domain data with potential domain shifts through online unsupervised learning, yielding impressive performance. However, to date, existing TTA methods primarily focus on single-model adaptation. In this work, we investigate an intriguing question: how does cross-model knowledge influence the TTA process? Our findings reveal that, in TTA's unsupervised online setting, each model can provide complementary, confident knowledge to the others, even when there are substantial differences in model size. For instance, a smaller model like MobileViT (10.6M parameters) can effectively guide a larger model like ViT-Base (86.6M parameters). In light of this, we propose COCA, a Cross-Model Co-Learning framework for TTA, which mainly consists of two main strategies. 1) Co-adaptation adaptively integrates complementary knowledge from other models throughout the TTA process, reducing individual model biases. 2) Self-adaptation enhances each model's unique strengths via unsupervised learning, enabling diverse adaptation to the target domain. Extensive experiments show that COCA, which can also serve as a plug-and-play module, significantly boosts existing SOTAs, on models with various sizes--including ResNets, ViTs, and Mobile-ViTs--via cross-model co-learned TTA. For example, with Mobile-ViT's guidance, COCA raises ViT-Base's average adaptation accuracy on ImageNet-C from 51.7% to 64.5%. The code is publicly available at https://github.com/ycarobot/COCA.
English: COCA is a cross-model co-learning framework for test-time adaptation that integrates complementary knowledge across models of different sizes while preserving each model's strengths; for example, with Mobile-ViT's guidance it raises ViT-Base's average adaptation accuracy on ImageNet-C from 51.7% to 64.5%.
Authors:Smriti Joshi, Richard Osuala, Lidia Garrucho, Kaisar Kushibar, Dimitri Kessler, Oliver Diaz, Karim Lekadir
Abstract:
Test-time adaptation enables a trained model to adjust to a new domain during inference, making it particularly valuable in clinical settings where such on-the-fly adaptation is required. However, existing techniques depend on large target domain datasets, which are often impractical and unavailable in medical scenarios that demand per-patient, real-time inference. Moreover, current methods commonly focus on two-dimensional images, failing to leverage the volumetric richness of medical imaging data. Bridging this gap, we propose a Patch-Based Multi-View Co-Training method for Single Image Test-Time adaptation. Our method enforces feature and prediction consistency through uncertainty-guided self-training, enabling effective volumetric segmentation in the target domain with only a single test-time image. Validated on three publicly available breast magnetic resonance imaging datasets for tumor segmentation, our method achieves performance close to the upper-bound supervised benchmark while outperforming all existing state-of-the-art methods by an average Dice Similarity Coefficient margin of 3.75%. We publicly share our accessible codebase, readily integrable with the popular nnUNet framework, at https://github.com/smriti-joshi/muvi.git.
中文摘要:该研究提出的基于多视角协同训练的补丁方法,通过不确定性引导的自训练仅需单张测试图像即可实现有效的体积分割,在乳腺MRI数据集上以3.75%的Dice分数超越现有最优方法。
English Summary: The proposed patch-based multi-view co-training method enables effective volumetric segmentation using only a single test image through uncertainty-guided self-training, outperforming existing methods by 3.75% Dice score on breast MRI datasets.
Authors:Lingtong Zhang, Mengdie Song, Xiaohan Hao, Huayu Mai, Bensheng Qiu
Abstract:
Magnetic Resonance Imaging (MRI) reconstruction is essential in medical diagnostics. Diffusion models (DMs), the latest generation of generative models, have struggled to produce high-fidelity images when operating in the image domain due to their stochastic nature. Latent diffusion models (LDMs) yield both compact and detailed prior knowledge in latent domains, which can effectively guide the model towards more effective learning of the original data distribution. Inspired by this, we propose Multi-domain Diffusion Prior Guidance (MDPG), provided by pre-trained LDMs, to enhance data consistency in MRI reconstruction tasks. Specifically, we first construct a Visual-Mamba-based backbone, which enables efficient encoding and reconstruction of under-sampled images. Then pre-trained LDMs are integrated to provide conditional priors in both latent and image domains. A novel Latent Guided Attention (LGA) is proposed for efficient fusion in multi-level latent domains. Simultaneously, to effectively utilize priors in both the k-space and image domains, under-sampled images are fused with generated fully-sampled images by the Dual-domain Fusion Branch (DFB) for self-adaptive guidance. Lastly, to further enhance the data consistency, we propose a k-space regularization strategy based on the non-auto-calibration signal (NACS) set. Extensive experiments on two public MRI datasets fully demonstrate the effectiveness of the proposed methodology. The code is available at https://github.com/Zolento/MDPG.
中文摘要:本研究提出的多域扩散先验引导方法通过整合潜在扩散模型与视觉曼巴骨干网络及双域融合技术,有效提升了磁共振成像重建的数据一致性和图像保真度。
English Summary: The proposed Multi-domain Diffusion Prior Guidance (MDPG) method enhances MRI reconstruction by integrating latent diffusion models with a Visual-Mamba backbone and dual-domain fusion, achieving improved data consistency and image fidelity.
Authors:Junjie Zhang, Jingyi Xi, Zhuoyang Song, Junyu Lu, Yuhua Ke, Ting Sun, Yukun Yang, Jiaxing Zhang, Songxin Zhang, Zejian Xie
Abstract:
Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks poses significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying reinforcement learning in complex environments. We also introduce NB-Agent, the agent scaffold within L0, which operates in a "code-as-action" fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality question-answering benchmarks. Our experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method boosts accuracy on SimpleQA from 30% to 80% and on HotpotQA from 22% to 41%. We have open-sourced the entire L0 system, including our L0 series models, the NB-Agent, a complete training pipeline, and the corresponding training recipes, at https://github.com/cmriat/l0.
中文摘要:L-Zero (L0) 系统提出了一种可扩展的训练框架,通过强化学习使大语言模型具备强大的问题解决能力,在SimpleQA和HotpotQA等事实性基准测试中显著提升了准确率。
English Summary: The L-Zero (L0) system introduces a scalable training pipeline that enables large language models to develop robust problem-solving skills through reinforcement learning, significantly improving accuracy on factuality benchmarks like SimpleQA and HotpotQA.
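The "code-as-action" loop behind NB-Agent can be sketched in a few lines. This is a bare-bones illustration assuming a hypothetical `llm.generate(prompt) -> str` interface and a `FINAL_ANSWER` termination convention; the actual system executes actions inside a sandboxed concurrent worker pool rather than a raw `exec`.

```python
import io, contextlib

def run_action(code, namespace):
    """Execute one code action in a persistent namespace, REPL-style,
    capturing stdout as the observation returned to the agent."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)  # NB: must be sandboxed in any real deployment
    except Exception as e:
        return f"Error: {e!r}"
    return buf.getvalue()

def agent_loop(llm, task, max_turns=8):
    """Minimal code-as-action episode: the model writes code, the REPL runs
    it, and the printed observation is appended back to the context."""
    namespace, transcript = {}, f"Task: {task}\n"
    for _ in range(max_turns):
        code = llm.generate(transcript + "Next Python action:\n")
        obs = run_action(code, namespace)
        transcript += f"Action:\n{code}\nObservation: {obs}\n"
        if "FINAL_ANSWER" in namespace:  # convention: the agent sets this to finish
            return namespace["FINAL_ANSWER"]
    return None
```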
Authors:Mario Koddenbrock, Rudolf Hoffmann, David Brodmann, Erik Rodner
Abstract:
In real-world vision-language applications, practitioners increasingly rely on large, pretrained foundation models rather than custom-built solutions, despite limited transparency regarding their training data and processes. While these models achieve impressive performance on general benchmarks, their effectiveness can decline notably under specialized domain shifts, such as unique imaging conditions or environmental variations. In this work, we introduce Deepbench, a framework designed to assess domain-specific robustness of vision-language models (VLMs). Deepbench leverages a large language model (LLM) to generate realistic, context-aware image corruptions tailored to specific deployment domains without requiring labeled data. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains and observe substantial variability in robustness, highlighting the need for targeted, domain-aware evaluation. Deepbench is released as open-source software to support further research into domain-aware robustness assessment.
中文: Deepbench框架利用大语言模型生成针对特定领域的图像干扰,以评估视觉语言模型的鲁棒性,结果发现在不同现实领域中性能存在显著差异,凸显了针对性评估的必要性。
English: Deepbench is a framework that uses a large language model to generate domain-specific image corruptions for evaluating the robustness of vision-language models, revealing significant performance variations across different real-world domains and emphasizing the need for targeted assessments.
Authors:Arnisa Fazla, Lucas Krauter, David Guzman Piedrahita, Andrianos Michail
Abstract:
We extend BeamAttack, an adversarial attack algorithm designed to evaluate the robustness of text classification systems through word-level modifications guided by beam search. Our extensions include support for word deletions and the option to skip substitutions, enabling the discovery of minimal modifications that alter model predictions. We also integrate LIME to better prioritize word replacements. Evaluated across multiple datasets and victim models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA framework, our approach achieves over a 99% attack success rate while preserving the semantic and lexical similarity of the original texts. Through both quantitative and qualitative analysis, we highlight BeamAttack's effectiveness and its limitations. Our implementation is available at https://github.com/LucK1Y/BeamAttack.
中文摘要:本研究扩展了BeamAttack算法,通过增加单词删除和LIME引导的替换功能,在保持文本相似度的同时实现了超过99%的攻击成功率,并在多数据集和模型上验证了其有效性与局限性。
English Summary: The study enhances BeamAttack by incorporating word deletions and LIME-guided substitutions, achieving over 99% attack success on text classifiers while maintaining text similarity, as validated through comprehensive evaluations.
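A condensed sketch of the extended search, under stated simplifications: `predict(text)` is the victim model returning class probabilities, `candidates(word)` stands in for the LIME-prioritized replacement list, and position indexing ignores the length shift that deletions introduce.

```python
def beam_attack(words, true_label, predict, candidates, beam_width=5):
    """Beam search over word-level edits (substitute, delete, or skip),
    returning the first edit sequence that flips the prediction."""
    beam = [words]
    for i in range(len(words)):
        scored = []
        for seq in beam:
            if i >= len(seq):
                continue
            edits = [seq[:i] + seq[i + 1:]]                          # deletion
            edits += [seq[:i] + [c] + seq[i + 1:] for c in candidates(seq[i])]
            edits.append(seq)                                        # skip substitution
            for e in edits:
                probs = predict(" ".join(e))                         # dict: label -> prob
                if max(probs, key=probs.get) != true_label:
                    return " ".join(e)       # success: minimal modification found
                scored.append((probs[true_label], e))
        # keep the candidates that most reduce the true-class probability
        beam = [e for _, e in sorted(scored, key=lambda t: t[0])[:beam_width]]
    return None
```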
Authors:Zhe Liu, Yuhao Huang, Lian Liu, Chengrui Zhang, Haotian Lin, Tong Han, Zhiyuan Zhu, Yanlin Chen, Yuerui Chen, Dong Ni, Zhongshan Gou, Xin Yang
Abstract:
Color Doppler echocardiography is a crucial tool for diagnosing mitral regurgitation (MR). Recent studies have explored intelligent methods for MR diagnosis to minimize user dependence and improve accuracy. However, these approaches often fail to align with clinical workflow and may lead to suboptimal accuracy and interpretability. In this study, we introduce an automated MR diagnosis model (MReg) developed on the 4-chamber cardiac color Doppler echocardiography video (A4C-CDV). It follows comprehensive feature mining strategies to detect MR and assess its severity, considering clinical realities. Our contribution is threefold. First, we formulate the MR diagnosis as a regression task to capture the continuity and ordinal relationships between categories. Second, we design a feature selection and amplification mechanism to imitate the sonographer's diagnostic logic for accurate MR grading. Third, inspired by the Mixture-of-Experts concept, we introduce a feature summary module to extract the category-level features, enhancing the representational capacity for more accurate grading. We trained and evaluated our proposed MReg on a large in-house A4C-CDV dataset comprising 1868 cases with three graded regurgitation labels. Compared to other weakly supervised video anomaly detection and supervised classification methods, MReg demonstrated superior performance in MR diagnosis. Our code is available at: https://github.com/cskdstz/MReg.
中文: 本研究提出MReg模型,通过四腔心彩色多普勒超声视频自动诊断二尖瓣反流,采用特征挖掘策略和回归方法,在提高诊断准确性的同时更好地契合临床工作流程。
English: This study introduces MReg, an automated model for diagnosing mitral regurgitation using 4-chamber cardiac color Doppler echocardiography videos, which employs feature mining strategies and a regression-based approach to improve accuracy and clinical alignment over existing methods.
Authors:Shaofei Huang, Rui Ling, Tianrui Hui, Hongyu Li, Xu Zhou, Shifeng Zhang, Si Liu, Richang Hong, Meng Wang
Abstract:
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performance on three subsets of the AVSBench dataset. The code is available at https://github.com/spyflying/VCT_AVS.
中文: 提出的视觉中心Transformer(VCT)框架通过视觉驱动查询解决视听分割中的感知模糊和细节丢失问题,能更好区分混合音频中的发声物体并精确勾勒轮廓,在AVSBench数据集上实现了最优性能。
English: The proposed Vision-Centric Transformer (VCT) framework addresses audio-visual segmentation limitations by using vision-derived queries to better distinguish sounding objects from mixed audio and accurately delineate their contours, achieving state-of-the-art performance on AVSBench datasets.
Authors:Min-Yeong Park, Won-Jeong Lee, Seong Tae Kim, Gyeong-Moon Park
Abstract:
Recently, forecasting future abnormal events has emerged as an important scenario to tackle real-world necessities. However, the problem of predicting the specific future time points at which anomalies will occur, known as Anomaly Prediction (AP), remains under-explored. Existing time-series methods fall short on AP, either focusing only on immediate anomalies or failing to provide precise predictions of future ones. To address the AP task, we propose a novel framework called Anomaly to Prompt (A2P), comprised of Anomaly-Aware Forecasting (AAF) and Synthetic Anomaly Prompting (SAP). To enable the forecasting model to forecast abnormal time points, we adopt a strategy to learn the relationships of anomalies. For the robust detection of anomalies, our proposed SAP introduces a learnable Anomaly Prompt Pool (APP) that simulates diverse anomaly patterns using signal-adaptive prompts. Comprehensive experiments on multiple real-world datasets demonstrate the superiority of A2P over state-of-the-art methods, showcasing its ability to predict future anomalies. Our implementation code is available at https://github.com/KU-VGI/AP.
中文: A2P框架通过异常感知预测和合成异常提示,学习异常关联并模拟多样化模式,在多个真实数据集上展现出超越现有方法的未来异常预测能力。
English: The A2P framework, integrating Anomaly-Aware Forecasting and Synthetic Anomaly Prompting, effectively predicts future anomaly occurrences by learning anomaly relationships and simulating diverse patterns, outperforming existing methods in real-world datasets.
Authors:Yawen Zou, Guang Li, Duo Su, Zi Wang, Jun Yu, Chao Zhang
Abstract:
Dataset distillation (DD) condenses large datasets into compact yet informative substitutes, preserving performance comparable to the original dataset while reducing storage, transmission costs, and computational consumption. However, previous DD methods mainly focus on distilling information from images, often overlooking the semantic information inherent in the data. The disregard for context hinders the model's generalization ability, particularly in tasks involving complex datasets, which may result in illogical outputs or the omission of critical objects. In this study, we integrate vision-language methods into DD by introducing text prototypes to distill language information and collaboratively synthesize data with image prototypes, thereby enhancing dataset distillation performance. Notably, the text prototypes utilized in this study are derived from descriptive text information generated by an open-source large language model. This framework demonstrates broad applicability across datasets without pre-existing text descriptions, expanding the potential of dataset distillation beyond traditional image-based approaches. Compared to other methods, the proposed approach generates logically coherent images containing target objects, achieving state-of-the-art validation performance and demonstrating robust generalization. Source code and generated data are available at https://github.com/zou-yawen/Dataset-Distillation-via-Vision-Language-Category-Prototype/
中文: 本研究通过融合视觉语言方法,利用大型语言模型生成的文本原型与图像原型协同合成数据,提升了数据集蒸馏的性能、泛化能力和生成图像的逻辑一致性,实现了最先进的验证效果。
English: This study enhances dataset distillation by integrating vision-language methods, using text prototypes from a large language model to collaboratively synthesize data with image prototypes, which improves performance, generalization, and logical coherence in generated images, achieving state-of-the-art results.
Authors:Nuo Chen, Chao Xiao, Yimian Dai, Shiman He, Miao Li, Wei An
Abstract:
Small object detection (SOD) in anti-UAV tasks is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large target sizes, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce an Event-based Small Object Detection (EVSOD) dataset, EV-UAV, the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 $\times$ 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose the Event-based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the superiority of our method and provide a benchmark for future research in EVSOD. The dataset and code are at https://github.com/ChenYichen9527/Ev-UAV.
中文摘要:本文提出了首个用于反无人机任务的大规模事件型小目标检测数据集EV-UAV,包含极小目标和多样化场景,同时开发了一种基于时空事件点云分割的新网络及相关损失函数,实验证明了该方法的优越性。
English Summary: This paper introduces EV-UAV, the first large-scale event-based dataset for small object detection in anti-UAV tasks, featuring extremely small targets and diverse scenarios, along with a novel segmentation network and spatiotemporal loss function that demonstrate superior performance.
Authors:Mingqian Ji, Jian Yang, Shanshan Zhang
Abstract:
Current multi-view 3D object detection methods typically transfer 2D features into 3D space using depth estimation or a 3D position encoder, but in a fully data-driven and implicit manner, which limits detection performance. Inspired by the success of radiance fields in 3D reconstruction, we assume they can be used to enhance the detector's ability of 3D geometry estimation. However, we observe a decline in detection performance when we directly use them for 3D rendering as an auxiliary task. From our analysis, we find the performance drop is caused by the strong responses on the background when rendering the whole scene. To address this problem, we propose object-centric radiance fields, focusing on modeling foreground objects while discarding background noise. Specifically, we employ Object-centric Radiance Fields (OcRF) to enhance 3D voxel features via an auxiliary task of rendering foreground objects. We further use opacity, a by-product of rendering, to enhance the 2D foreground BEV features via Height-aware Opacity-based Attention (HOA), where attention maps at different height levels are generated separately via multiple networks in parallel. Extensive experiments on the nuScenes validation and test datasets demonstrate that our OcRFDet achieves superior performance, outperforming previous state-of-the-art methods with 57.2$\%$ mAP and 64.8$\%$ NDS on the nuScenes test benchmark. Code will be available at https://github.com/Mingqj/OcRFDet.
中文摘要:现有多视角3D物体检测方法因隐式数据驱动方式存在性能局限,本文提出的以物体为中心的辐射场(OcRF)通过聚焦前景物体建模,结合渲染任务和基于不透明度的注意力机制增强特征,在nuScenes基准测试中实现了最优性能。
English Summary: Current multi-view 3D object detection methods face performance limitations due to implicit data-driven approaches, which the proposed Object-centric Radiance Fields (OcRF) addresses by focusing on foreground objects and enhancing features through rendering and opacity-based attention, achieving state-of-the-art results on nuScenes benchmarks.
Authors:Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao
Abstract:
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.
Chinese: MMReason基准通过提供需要多步骤推理的多样化挑战性问题,结合详细解答和三值评分机制,旨在弥补现有MLLM评估在长链推理能力检测上的不足,实现精准全面的能力评估。
English: The MMReason benchmark is introduced to address the limitations of existing MLLM evaluations by providing diverse, challenging questions that require multi-step reasoning, incorporating detailed solutions and a ternary scoring mechanism to accurately assess long-chain reasoning capabilities.
Authors:Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu
Abstract:
Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high-noise timesteps and small patch sizes for low-noise timesteps; linear projections are learned for each patch size; and Unpatchify is modified accordingly. Unlike Pyramidal Flow, our approach operates over full latent representations rather than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training manners. Training from scratch achieves a $1.6\times$ ($2.0\times$) inference speed-up over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with a small training time. The code and checkpoint are at https://github.com/fudan-generative-vision/PPFlow.
中文: PPFlow方法根据扩散模型中不同噪声水平自适应调整补丁大小,在保持相近生成性能的同时,将推理速度提升高达2倍并降低计算成本。
English: The PPFlow method adapts patch sizes based on noise levels in diffusion transformers, improving inference speed by up to 2x while maintaining similar performance and reducing computational costs.
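A minimal sketch of timestep-dependent patchification with a per-size linear projection; the two-level split at t = 0.5 and the patch sizes are illustrative choices, not the paper's schedule.

```python
import torch
import torch.nn as nn

class PyramidalPatchify(nn.Module):
    """Large patches at high-noise timesteps, small patches at low-noise
    ones, each patch size with its own learned projection."""
    def __init__(self, in_channels=4, dim=768, patch_sizes=(4, 2)):
        super().__init__()
        self.patch_sizes = patch_sizes
        self.proj = nn.ModuleDict(
            {str(p): nn.Linear(in_channels * p * p, dim) for p in patch_sizes})

    def forward(self, x, t):
        """x: (B, C, H, W) latents; t in [0, 1] with 1 = highest noise."""
        p = self.patch_sizes[0] if t > 0.5 else self.patch_sizes[1]
        B, C, H, W = x.shape
        # (B, C, H/p, p, W/p, p) -> (B, H/p * W/p, C*p*p)
        x = x.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)
        return self.proj[str(p)](x)

tokens = PyramidalPatchify()(torch.randn(1, 4, 32, 32), t=0.9)  # 64 tokens, not 256
```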
Authors:Weida Wang, Changyong He, Jin Zeng, Di Qiu
Abstract:
Depth images captured by Time-of-Flight (ToF) sensors are prone to noise, requiring denoising for reliable downstream applications. Previous works either focus on single-frame processing, or perform multi-frame processing without considering depth variations at corresponding pixels across frames, leading to undesirable temporal inconsistency and spatial ambiguity. In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. Specifically, despite depth shifts across frames, graph structures exhibit temporal self-similarity, enabling cross-frame geometric attention for graph fusion. Then, by incorporating an image smoothness prior on the fused graph and a data fidelity term derived from the ToF noise distribution, we formulate a maximum a posteriori problem for ToF denoising. Finally, the solution is unrolled into iterative filters whose weights are adaptively learned from the graph-informed geometric attention, producing a high-performance yet interpretable network. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance in terms of accuracy and consistency on the synthetic DVToF dataset and exhibits robust generalization on the real Kinectv2 dataset. Source code will be released at https://github.com/davidweidawang/GIGA-ToF.
中文: 本文提出了一种基于运动不变图融合的ToF深度去噪网络,通过跨帧几何注意力和噪声分布优化,有效提升时间一致性与空间清晰度,在合成与真实数据集上均达到最优性能。
English: This paper introduces a motion-invariant graph fusion network for ToF depth denoising that enhances temporal stability and spatial sharpness by leveraging cross-frame geometric attention and noise-aware optimization, achieving state-of-the-art performance on synthetic and real datasets.
Authors:Yuhao Huang, Yueyue Xu, Haoran Dou, Jiaxiao Deng, Xin Yang, Hongyu Zheng, Dong Ni
Abstract:
Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage, preterm birth, and an increased risk of pregnancy complications. Compared to traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane, providing a clear visualization of the uterine morphology for assessing CUAs accurately. In this paper, we propose an intelligent system for simultaneous automated plane localization and CUA diagnosis. Our highlights are: 1) we develop a denoising diffusion model with local (plane) and global (volume/text) guidance, using an adaptive weighting strategy to optimize attention allocation to different conditions; 2) we introduce a reinforcement learning-based framework with unsupervised rewards to extract the key slice summary from redundant sequences, fully integrating information across multiple planes to reduce learning difficulty; 3) we provide text-driven uncertainty modeling for coarse prediction, and leverage it to adjust the classification probability for overall performance improvement. Extensive experiments on a large 3D uterine US dataset show the efficacy of our method, in terms of plane localization and CUA diagnosis. Code is available at https://github.com/yuhoo0302/CUA-US.
中文: 本文提出一种智能系统,通过去噪扩散模型和强化学习,自动完成三维超声中的平面定位和先天性子宫异常诊断,实验证明其方法高效可靠。
English: This paper introduces an intelligent system that uses a denoising diffusion model and reinforcement learning to automate both plane localization and diagnosis of congenital uterine anomalies from 3D ultrasound data, demonstrating high effectiveness in experiments.
Authors:Xinyue Li, Zhangkai Ni, Wenhan Yang
Abstract:
Existing learning-based methods effectively reconstruct HDR images from multi-exposure LDR inputs with extended dynamic range and improved detail, but they rely on empirical design rather than theoretical foundations, which can impact their reliability. To address these limitations, we propose the cross-iterative Alignment and Fusion deep Unfolding Network (AFUNet), where HDR reconstruction is systematically decoupled into two interleaved subtasks -- alignment and fusion -- optimized through alternating refinement, achieving synergy between the two subtasks to enhance the overall performance. Our method formulates multi-exposure HDR reconstruction from a Maximum A Posteriori (MAP) estimation perspective, explicitly incorporating spatial correspondence priors across LDR images and naturally bridging the alignment and fusion subproblems through joint constraints. Building on this mathematical foundation, we reimagine traditional iterative optimization through unfolding -- transforming the conventional solution process into an end-to-end trainable AFUNet with carefully designed modules that work progressively. Specifically, each iteration of AFUNet incorporates an Alignment-Fusion Module (AFM) that alternates between a Spatial Alignment Module (SAM) for alignment and a Channel Fusion Module (CFM) for adaptive feature fusion, progressively bridging misaligned content and exposure discrepancies. Extensive qualitative and quantitative evaluations demonstrate AFUNet's superior performance, consistently surpassing state-of-the-art methods. Our code is available at: https://github.com/eezkni/AFUNet
中文: 所提出的AFUNet方法通过深度展开技术将HDR重建系统解耦为对齐与融合子任务,通过交替优化和明确数学基础实现了超越现有方法的性能表现。
English: The proposed AFUNet method systematically decouples HDR reconstruction into alignment and fusion subtasks through deep unfolding, achieving superior performance by alternating refinement with explicit mathematical foundations.
Authors:Yu Zhang, Ruijie Yu, Jidong Tian, Feng Zhu, Jiapeng Liu, Xiaokang Yang, Yaohui Jin, Yanyan Xu
Abstract:
With the increasing interest in robotic synthesis in the context of organic chemistry, the automated extraction of chemical procedures from literature is critical. However, this task remains challenging due to the inherent ambiguity of chemical language and the high cost of the human annotation required for developing reliable computer-aided extraction protocols. Here, we present ChemActor, a fully fine-tuned large language model (LLM), as a chemical executor to convert between unstructured experimental procedures and structured action sequences. We propose a sequential LLM-generated data framework to address the challenges of insufficient and low-quality annotated data. This framework integrates a data selection module, which selects data based on distribution divergence, with a general-purpose LLM to generate machine-executable actions from a single molecule input. Additionally, we introduce a novel multi-round LLM circle-review metric, which reflects the model's advanced understanding of chemical experimental procedures. Extensive experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor, augmented by LLM-generated data, achieves state-of-the-art performance, outperforming the baseline model by 10%. The code is available at: https://github.com/Zhanghahah/ChemActor.
中文:ChemActor是一种经过精细调优的大语言模型,通过序列化数据生成框架将化学实验步骤转化为结构化操作,在多项任务中表现卓越,性能超越基线模型10%。
English: ChemActor is a fine-tuned large language model that converts chemical procedures into structured actions using a sequential data generation framework, achieving state-of-the-art performance with a 10% improvement over baselines.
Authors:Sai Krishna Ghanta, Ramviyas Parasuraman
Abstract:
Relative localization is a crucial capability for multi-robot systems operating in GPS-denied environments. Existing approaches for multi-robot relative localization often depend on costly or short-range sensors like cameras and LiDARs. Consequently, these approaches face challenges such as high computational overhead (e.g., map merging) and difficulties in disjoint environments. To address these limitations, this paper introduces MGPRL, a novel distributed framework for multi-robot relative localization using the convex hull of multiple Wi-Fi access points (APs). To accomplish this, we employ co-regionalized multi-output Gaussian Processes for efficient Received Signal Strength Indicator (RSSI) field prediction and perform uncertainty-aware multi-AP localization, which is further coupled with weighted convex hull-based alignment for robust relative pose estimation. Each robot predicts the RSSI field of the environment by an online scan of the APs in its environment, which is utilized for position estimation of multiple APs. To perform relative localization, each robot aligns the convex hull of its predicted AP locations with those of neighboring robots. This approach is well-suited for devices with limited computational resources and operates solely on widely available Wi-Fi RSSI measurements without necessitating any dedicated pre-calibration or offline fingerprinting. We rigorously evaluate the performance of the proposed MGPRL in ROS simulations and demonstrate it with real-world experiments, comparing it against multiple state-of-the-art approaches. The results showcase that MGPRL outperforms existing methods in terms of localization accuracy and computational efficiency. Finally, we open-source MGPRL as a ROS package at https://github.com/herolab-uga/MGPRL.
中文: 本文提出MGPRL分布式框架,利用Wi-Fi接入点的凸包结构和高斯过程实现高效鲁棒的多机器人相对定位,在无GPS环境中展现出更优的定位精度和计算效率。
English: This paper presents MGPRL, a distributed framework that uses convex hulls of Wi-Fi access points and Gaussian Processes for efficient and robust multi-robot relative localization, demonstrating superior accuracy and computational efficiency in GPS-denied environments.
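The hull-alignment step reduces to a rigid 2D registration once AP positions are estimated. The sketch below uses a plain Kabsch/Procrustes fit and assumes the two hulls expose equal numbers of vertices in corresponding order; MGPRL itself additionally weights the alignment by localization uncertainty.

```python
import numpy as np
from scipy.spatial import ConvexHull

def align_hulls(aps_other, aps_self):
    """Estimate the SE(2) transform mapping a neighbor's AP-position
    estimates into our frame by aligning convex-hull vertices."""
    A = aps_other[ConvexHull(aps_other).vertices]   # (k, 2) source hull
    B = aps_self[ConvexHull(aps_self).vertices]     # (k, 2) target hull
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)                       # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflection
    R = Vt.T @ D @ U.T
    t = cb - R @ ca
    return R, t   # x_self ~= R @ x_other + t
```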
Authors:ZongHan Hsieh, Tzer-Jen Wei, ShengJing Yang
Abstract:
In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) a cross-platform, multi-resolution dataset of 24K examples combined from diverse sources, including mobile, desktop, and web GUI screenshots, to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights ZonUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. ZonUI-3B is available at: https://github.com/Han1018/ZonUI-3B
中文: ZonUI-3B 是一种轻量级视觉语言模型,通过创新的数据策略和两阶段微调方法,在图形用户界面定位任务上达到与更大模型相当的性能,且仅需单个消费级GPU即可完成训练。
English: ZonUI-3B is a lightweight vision-language model that achieves performance comparable to larger models on GUI grounding tasks through innovative data strategies and two-stage fine-tuning, while being trainable on a single consumer GPU.
Authors:Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen
Abstract:
Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users' real-time needs and enhancing personalized experiences. However, due to limited planning and generalization capabilities, existing formulations of LLM-powered interactive recommender agents struggle to effectively address diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. To tackle this challenge, we propose a novel thought-augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks, with its planning capacity strengthened through Thought Pattern Distillation (TPD), a thought-augmentation method that extracts high-level thoughts from the experiences of the agent and of human experts. Moreover, we design a set of user simulation schemes to generate personalized queries of different difficulties and evaluate the recommendations based on specific datasets. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods. Notably, TAIRA shows a greater advantage on more challenging tasks while generalizing effectively on novel tasks, further validating its superiority in managing complex user intents within interactive recommendation systems. The code is publicly available at: https://github.com/Alcein/TAIRA.
中文: 提出的TAIRA系统通过思维模式蒸馏的多智能体框架,有效处理复杂用户意图,在多个数据集上展现出优于现有方法的交互推荐性能。
English: The proposed TAIRA system enhances interactive recommendation by using a multi-agent framework with thought pattern distillation to effectively address complex user intents, demonstrating superior performance across diverse datasets.
Authors:Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu
Abstract:
AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to the above issues. However, the widespread adoption and advancing capabilities of generative image editing tools have amplified malicious tampering risks, while simultaneously posing new challenges to passive tampering detection and watermark robustness. To address these challenges, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises four key modules: a dual-mark joint sampling (DMJS) algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, watermark latent reconstruction (WLR) utilizing reversed DMJS, a dense variation region detector (DVRD) leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and tamper-aware decoding (TAD) guided by localization results. The experimental results demonstrate that TAG-WM achieves state-of-the-art performance in both tampering robustness and localization capability even under distortion, while preserving lossless generation quality and maintaining a watermark capacity of 256 bits. The code is available at: https://github.com/Suchenl/TAG-WM.
中文摘要:本文提出TAG-WM篡改感知生成式水印方法,通过嵌入双重水印保护AI生成图像免受恶意篡改,在保持生成质量的同时实现了最优的鲁棒性。
English Summary: This paper introduces TAG-WM, a tamper-aware generative watermarking method that embeds dual watermarks to protect AI-generated images against malicious tampering while maintaining generation quality and achieving state-of-the-art robustness.
Authors:Xian Zhang, Xiang Cheng
Abstract:
Objectives: The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly enhanced their reasoning capabilities, enabling a wide range of intelligent applications. However, these advancements also raise critical concerns regarding privacy and ethics. MLLMs are now capable of inferring the geographic location of images -- such as those shared on social media or captured from street views -- based solely on visual content, thereby posing serious risks of privacy invasion, including doxxing, surveillance, and other security threats.
Methods: This study provides a comprehensive analysis of existing geolocation techniques based on MLLMs. It systematically reviews the relevant literature and evaluates the performance of state-of-the-art visual reasoning models on geolocation tasks, particularly in identifying the origins of street view imagery.
Results: Empirical evaluation reveals that the most advanced visual large models can successfully localize the origin of street-level imagery with up to $49\%$ accuracy within a 1-kilometer radius. This performance underscores the models' powerful capacity to extract and utilize fine-grained geographic cues from visual data.
Conclusions: Building on these findings, the study identifies key visual elements that contribute to successful geolocation, such as text, architectural styles, and environmental features. Furthermore, it discusses the potential privacy implications associated with MLLM-enabled geolocation and several technical and policy-based countermeasures to mitigate the associated risks. Our code and dataset are available at https://github.com/zxyl1003/MLLM-Geolocation-Evaluation.
中文: 多模态大语言模型在地理定位方面展现出强大能力,能通过视觉线索以高达49%的准确率识别街景图像位置,这引发了严重的隐私风险,亟需技术和政策层面的应对措施。
English: Multimodal Large Language Models (MLLMs) demonstrate significant geolocation capabilities, achieving up to 49% accuracy in pinpointing street-level imagery within a 1-kilometer radius, which raises serious privacy concerns and necessitates countermeasures.
Authors:Zhiwei Lin, Bonan Ruan, Jiahao Liu, Weibo Zhao
Abstract:
The Model Context Protocol (MCP) has recently emerged as a standardized interface for connecting language models with external tools and data. As the ecosystem rapidly expands, the lack of a structured, comprehensive view of existing MCP artifacts presents challenges for research. To bridge this gap, we introduce MCPCorpus, a large-scale dataset containing around 14K MCP servers and 300 MCP clients. Each artifact is annotated with 20+ normalized attributes capturing its identity, interface configuration, GitHub activity, and metadata. MCPCorpus provides a reproducible snapshot of the real-world MCP ecosystem, enabling studies of adoption trends, ecosystem health, and implementation diversity. To keep pace with the rapid evolution of the MCP ecosystem, we provide utility tools for automated data synchronization, normalization, and inspection. Furthermore, to support efficient exploration and exploitation, we release a lightweight web-based search interface. MCPCorpus is publicly available at: https://github.com/Snakinya/MCPCorpus.
中文: MCPCorpus作为一个大规模数据集被提出,包含约1.4万个MCP服务器和300个客户端,标注了20多项属性,旨在提供MCP生态系统的结构化视图,支持对采用趋势和实施多样性的研究。
English: MCPCorpus is introduced as a large-scale dataset with 14K MCP servers and 300 clients, annotated with over 20 attributes, to provide a structured view of the MCP ecosystem and support research on adoption trends and implementation diversity.
Authors:Xuan Yao, Junyu Gao, Changsheng Xu
Abstract:
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene-contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that our method achieves notable performance improvements on popular VLN-CE benchmarks. Code is available at https://github.com/Feliciaxyao/NavMorph.
中文摘要:NavMorph是一个自演进的世界模型框架,通过采用紧凑潜在表征和上下文记忆来增强环境理解与自适应决策,在连续环境中的视觉语言导航任务上实现了显著的性能提升。
English Summary: NavMorph is a self-evolving world model framework that improves navigation in continuous environments by using compact latent representations and contextual memory to enhance environmental understanding and adaptive decision-making, achieving superior performance on VLN-CE benchmarks.
Authors:WonJune Jang
Abstract:
Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by 70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF's ability to adaptively balance informativeness and minimalism across tasks. Our code is available at: https://github.com/torijune/ATF-Adaptive-Table-Filtering-Framework
中文: ATF框架通过LLM生成的列描述和对齐分数自适应地过滤大型表格中的非信息性行列,将表格大小减少70%,在TableQA任务中提升性能,但在需要完整上下文的表格事实核查任务中略有下降。
English: The ATF framework adaptively filters uninformative columns and rows from large tables using LLM-generated descriptions and alignment scores, reducing table size by 70% while improving performance on TableQA tasks but slightly decreasing accuracy in Table Fact Verification where complete context is essential.
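A rough sketch of the pruning idea: columns are dense-scored by embedding similarity between the question and a column description, rows are sparse-scored by lexical overlap, and only the top of each survives. The `embed` function and pre-computed `column_descriptions` are stand-ins for ATF's LLM-generated descriptions and sparse-dense alignment machinery.

```python
import numpy as np

def filter_table(question, table, embed, keep_cols=5, keep_rows=20):
    """Question-aware pruning of a table dict with keys
    'column_descriptions' (col -> text) and 'rows' (list of dicts)."""
    q = embed(question)
    col_scores = {c: float(np.dot(q, embed(desc)))          # dense score
                  for c, desc in table["column_descriptions"].items()}
    cols = sorted(col_scores, key=col_scores.get, reverse=True)[:keep_cols]

    def row_score(row):                                     # sparse score
        return sum(str(row[c]).lower() in question.lower() for c in cols)

    rows = sorted(table["rows"], key=row_score, reverse=True)[:keep_rows]
    return {"columns": cols, "rows": [{c: r[c] for c in cols} for r in rows]}
```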
Authors:Alexander Kolpakov, Aidan Rocke
Abstract:
In the present paper we give a principled derivation of Elias' Omega code by combining a constrained variational formulation of prefix coding with a renormalization flow on codeword distributions. Starting from a Lagrangian that minimizes average code length under the Kraft-McMillan constraint, we show that the implied distribution is a fixed point of a coarse-graining map, yielding the canonical iterated logarithm length, up to an additive constant. This establishes completeness and asymptotic optimality, and connects universal integer coding with coarse-grained entropy, uncertainty-type bounds, and multiplicity relations familiar from statistical physics. The renormalization operator induces a discrete flow that converges to the Elias fixed point for any admissible initialization, up to a bounded error, offering a clean bridge between information-theoretic constraints and RG-style scale invariance.
中文摘要:本文通过结合约束变长编码和重整化流,从物理学原理推导出Elias Omega码,证明了其渐近最优性,并将其与统计物理中的熵关系联系起来。
English summary: This paper derives Elias' Omega code from physics principles by combining constrained variational prefix coding with renormalization flow, establishing its asymptotic optimality and connecting it to statistical physics entropy relations.
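For reference, the code under discussion is short enough to state exactly. The standard Elias omega encoder and decoder below (independent of the paper's variational derivation) make the recursive length groups visible: each group is the binary expansion of the previous group's length, which is what yields the iterated-logarithm codeword length.

```python
def omega_encode(n):
    """Elias omega code of a positive integer n, as a bit string."""
    assert n >= 1
    code = "0"                 # terminating zero
    while n > 1:
        b = bin(n)[2:]         # binary of n, leading bit is 1
        code = b + code        # prepend the group
        n = len(b) - 1         # next group encodes this group's length
    return code

def omega_decode(bits):
    """Inverse: grow n by reading groups until a 0 starts a group."""
    n, i = 1, 0
    while bits[i] == "1":
        group = bits[i : i + n + 1]    # the '1' plus n more bits
        i += n + 1
        n = int(group, 2)
    return n

assert omega_encode(17) == "10100100010"
assert all(omega_decode(omega_encode(k)) == k for k in range(1, 2000))
```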
Authors:Tim Puphal, Vipul Ramtekkar, Kenji Nishimiya
Abstract:
Improving automated vehicle software requires driving data rich in valuable road user interactions. In this paper, we propose a risk-based filtering approach that helps identify such valuable driving situations from large datasets. Specifically, we use a probabilistic risk model to detect high-risk situations. Our method stands out by considering a) first-order situations (where one vehicle directly influences another and induces risk) and b) second-order situations (where influence propagates through an intermediary vehicle). In experiments, we show that our approach effectively selects valuable driving situations in the Waymo Open Motion Dataset. Compared to the two baseline interaction metrics of Kalman difficulty and Tracks-To-Predict (TTP), our filtering approach identifies complex and complementary situations, enriching the quality in automated vehicle testing. The risk data is made open-source: https://github.com/HRI-EU/RiskBasedFiltering.
中文摘要:本文提出一种基于风险的筛选方法,通过概率风险模型从大型数据集中识别包含直接风险和传播风险的高价值驾驶场景,相比传统交互指标能有效发现更复杂的互补情境,从而提升自动驾驶测试质量。
English Summary: This paper introduces a risk-based filtering method that uses a probabilistic model to identify valuable driving scenarios—including both direct and propagated risk interactions—from large datasets, effectively enhancing automated vehicle testing by uncovering complex situations missed by baseline metrics.
Authors:Heitor R. Medeiros, Hossein Sharifi-Noghabi, Gabriel L. Oliveira, Saghar Irandoust
Abstract:
Real-world time series often exhibit a non-stationary nature, degrading the performance of pre-trained forecasting models. Test-Time Adaptation (TTA) addresses this by adjusting models during inference, but existing methods typically update the full model, increasing memory and compute costs. We propose PETSA, a parameter-efficient method that adapts forecasters at test time by only updating small calibration modules on the input and output. PETSA uses low-rank adapters and dynamic gating to adjust representations without retraining. To maintain accuracy despite limited adaptation capacity, we introduce a specialized loss combining three components: (1) a robust term, (2) a frequency-domain term to preserve periodicity, and (3) a patch-wise structural term for structural alignment. PETSA improves the adaptability of various forecasting backbones while requiring fewer parameters than baselines. Experimental results on benchmark datasets show that PETSA achieves competitive or better performance across all horizons. Our code is available at: https://github.com/BorealisAI/PETSA
中文摘要:PETSA提出了一种参数高效的测试时自适应方法,仅通过更新输入输出的微型校准模块,结合低秩适配器和专门设计的损失函数,在减少参数量的同时实现了与现有方法相当或更优的时序预测性能。
English Summary: PETSA introduces a parameter-efficient test-time adaptation method for time series forecasting that updates only small calibration modules using low-rank adapters and a specialized loss function, achieving competitive performance with fewer parameters than existing approaches.
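A minimal sketch of a low-rank, gated calibration module in PETSA's spirit; the rank, the sigmoid gate, and the zero initialization (so the module starts as the identity) are illustrative choices, and the specialized three-part loss is omitted.

```python
import torch
import torch.nn as nn

class Calibrator(nn.Module):
    """Low-rank, gated residual calibration for a frozen forecaster's
    input or output representation."""
    def __init__(self, dim, rank=4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # adapter A
        self.up = nn.Linear(rank, dim, bias=False)     # adapter B
        nn.init.zeros_(self.up.weight)                 # start as identity map
        self.gate = nn.Parameter(torch.zeros(dim))     # dynamic per-feature gate

    def forward(self, x):                              # x: (batch, time, dim)
        return x + torch.sigmoid(self.gate) * self.up(self.down(x))

# Test-time usage: the backbone stays frozen, only the calibrators update.
backbone = nn.Linear(96, 24).requires_grad_(False)     # stand-in forecaster
cal_in, cal_out = Calibrator(96), Calibrator(24)
opt = torch.optim.Adam([*cal_in.parameters(), *cal_out.parameters()], lr=1e-3)
pred = cal_out(backbone(cal_in(torch.randn(8, 7, 96))))
```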
Authors:Nikola Banić, Neven Elezović
Abstract:
Pearson's chi-squared test is widely used to assess the uniformity of discrete histograms, typically relying on a continuous chi-squared distribution to approximate the test statistic, since computing the exact distribution is computationally too costly. While effective in many cases, this approximation allegedly fails when expected bin counts are low or tail probabilities are needed. Here, Zero-disparity Distribution Synthesis is presented, a fast dynamic programming approach for computing the exact distribution, enabling detailed analysis of approximation errors. The results dispel some existing misunderstandings and also reveal subtle, but significant pitfalls in approximation that are only apparent with exact values. The Python source code is available at https://github.com/DiscreteTotalVariation/ChiSquared.
Chinese Summary: 本文提出的零差异分布合成法通过动态规划快速计算皮尔逊卡方检验的精确分布,不仅纠正了现有误解,还揭示了传统近似方法中难以察觉的重要缺陷。
English Summary: The paper introduces Zero-disparity Distribution Synthesis, a dynamic programming method that efficiently computes the exact distribution of Pearson's chi-squared test, overcoming approximation errors and revealing hidden pitfalls in traditional approaches.
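The exact distribution admits a compact dynamic program: for k equiprobable bins, the Pearson statistic is determined by S = sum of squared counts via X^2 = kS/n - n, so it suffices to track (observations placed, S) bin by bin. The sketch below is a straightforward DP in this spirit, using exact rational arithmetic; it is not necessarily the paper's ZDS algorithm.

```python
from collections import defaultdict
from fractions import Fraction
from math import factorial

def exact_chi2_distribution(n, k):
    """Exact distribution of Pearson's X^2 for n samples over k
    equiprobable bins: P(counts) = n!/(prod o_i!) * k^-n, aggregated
    over the sufficient statistic S = sum of squared counts."""
    f = {(0, 0): Fraction(1)}                    # weight = prod 1/o_i!
    for _ in range(k):                           # add one bin at a time
        g = defaultdict(Fraction)
        for (m, s), w in f.items():
            for o in range(n - m + 1):           # count in the new bin
                g[(m + o, s + o * o)] += w / factorial(o)
        f = g
    norm = Fraction(factorial(n), k ** n)
    dist = defaultdict(Fraction)
    for (m, s), w in f.items():
        if m == n:
            dist[Fraction(k * s, n) - n] += w * norm   # X^2 = kS/n - n
    return dict(sorted(dist.items()))

d = exact_chi2_distribution(n=8, k=4)
assert sum(d.values()) == 1                      # exact, thanks to Fraction
```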
Authors:Jiale Zhang, Zichong Wang, Avash Palikhe, Zhipeng Yin, Wenbin Zhang
Abstract:
Despite the growing reliance on fairness benchmarks to evaluate language models, the datasets that underpin these benchmarks remain critically underexamined. This survey addresses that overlooked foundation by offering a comprehensive analysis of the most widely used fairness datasets in language model research. To ground this analysis, we characterize each dataset across key dimensions, including provenance, demographic scope, annotation design, and intended use, revealing the assumptions and limitations baked into current evaluation practices. Building on this foundation, we propose a unified evaluation framework that surfaces consistent patterns of demographic disparities across benchmarks and scoring metrics. Applying this framework to sixteen popular datasets, we uncover overlooked biases that may distort conclusions about model fairness and offer guidance on selecting, combining, and interpreting these resources more effectively and responsibly. Our findings highlight an urgent need for new benchmarks that capture a broader range of social contexts and fairness notions. To support future research, we release all data, code, and results at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/datasets, fostering transparency and reproducibility in the evaluation of language model fairness.
中文: 本研究对语言模型公平性评估数据集进行系统性分析,揭示潜在偏见并提出统一评估框架,发现人口统计差异,呼吁建立涵盖更广社会背景的新基准。
English: This survey critically analyzes widely used fairness datasets for language models, revealing embedded biases and proposing a unified evaluation framework that uncovers demographic disparities while advocating for more comprehensive benchmarks.
Authors:Vikram Rangarajan, Shishira Maiya, Max Ehrlich, Abhinav Shrivastava
Abstract:
Implicit Neural Representations (INRs) offer exceptional fidelity for video compression by learning per-video optimized functions, but their adoption is crippled by impractically slow encoding times. Existing attempts to accelerate INR encoding often sacrifice reconstruction quality or crucial coordinate-level control essential for adaptive streaming and transcoding. We introduce SIEDD (Shared-Implicit Encoder with Discrete Decoders), a novel architecture that fundamentally accelerates INR encoding without these compromises. SIEDD first rapidly trains a shared, coordinate-based encoder on sparse anchor frames to efficiently capture global, low-frequency video features. This encoder is then frozen, enabling massively parallel training of lightweight, discrete decoders for individual frame groups, further expedited by aggressive coordinate-space sampling. This synergistic design delivers a remarkable 20-30X encoding speed-up over state-of-the-art INR codecs on HD and 4K benchmarks, while maintaining competitive reconstruction quality and compression ratios. Critically, SIEDD retains full coordinate-based control, enabling continuous resolution decoding and eliminating costly transcoding. Our approach significantly advances the practicality of high-fidelity neural video compression, demonstrating a scalable and efficient path towards real-world deployment. Our codebase is available at https://github.com/VikramRangarajan/SIEDD.
中文摘要:SIEDD提出了一种新颖架构,通过共享编码器和并行训练轻量解码器的方法,将隐式神经表示的编码速度提升20-30倍,同时保持重建质量和完整的坐标控制能力,显著推进了神经视频压缩技术的实际应用。
English Summary: SIEDD introduces a novel architecture that accelerates implicit neural representation encoding by 20-30 times while maintaining reconstruction quality and full coordinate-based control, making high-fidelity neural video compression practical for real-world deployment.
Authors:Xiao'ao Song, Konstantinos Karydis
Abstract:
Efficient identification of picking points is critical for automated fruit harvesting. Avocados present unique challenges owing to their irregular shape, weight, and less-structured growing environments, which require specific viewpoints for successful harvesting. We propose a geometry-based, semantics-aware viewpoint-planning algorithm to address these challenges. The planning process involves three key steps: viewpoint sampling, evaluation, and execution. Starting from a partially occluded view, the system first detects the fruit, then leverages geometric information to constrain the viewpoint search space to a 1D circle, and uniformly samples four points to balance efficiency and exploration. A new picking-score metric is introduced to evaluate viewpoint suitability and guide the camera to the next-best view. We validate our method in simulation against two state-of-the-art algorithms. Results show a 100% success rate in two case studies with significant occlusions, demonstrating the efficiency and robustness of our approach. Our code is available at https://github.com/lineojcd/GSNBV
中文: 本文提出一种基于几何和语义感知的视点规划算法,通过高效采样和评估视点来克服遮挡问题,在牛油果自动化采摘中实现了100%的成功率。
English: This paper introduces a geometry-based, semantics-aware viewpoint-planning algorithm that achieves a 100% success rate in automated avocado harvesting by efficiently sampling and evaluating viewpoints to overcome occlusion challenges.
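The geometric core, sampling candidate viewpoints uniformly on the 1D circle around the detected fruit, is easy to sketch. The picking-score evaluation that then ranks these views is omitted here, and the circle's plane normal is assumed given by the fruit and occlusion geometry.

```python
import numpy as np

def sample_viewpoints(fruit_center, radius, normal, n_samples=4):
    """Return (position, gaze) pairs spaced uniformly on the circle of the
    given radius around the fruit, in the plane orthogonal to `normal`."""
    normal = normal / np.linalg.norm(normal)
    helper = np.array([1.0, 0.0, 0.0])           # any vector not parallel to normal
    if abs(normal @ helper) > 0.9:
        helper = np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, helper)
    u /= np.linalg.norm(u)
    v = np.cross(normal, u)                      # (u, v): basis of the circle plane
    views = []
    for theta in np.linspace(0.0, 2 * np.pi, n_samples, endpoint=False):
        pos = fruit_center + radius * (np.cos(theta) * u + np.sin(theta) * v)
        views.append((pos, (fruit_center - pos) / radius))  # unit look-at vector
    return views
```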
Authors:Paige Tuttösí, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier, Angelica Lim
Abstract:
We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners had fewer (at least 9.15%) transcription errors when using our clarity mode, and found it more encouraging and respectful than overall slowed-down speech. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, they still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.
中文: 本研究针对二语学习者开发了首个文本转语音系统,通过延长元音时长差异的清晰模式显著减少了转录错误并提升了感知鼓励度,同时揭示了实际可懂度与感知可懂度的差异,以及自动语音识别系统在此类评估中的局限性。
English: This study introduces a specialized text-to-speech system for second language learners, featuring a clarity mode that enhances vowel duration contrast to significantly reduce transcription errors and improve perceived encouragement, while revealing a disconnect between actual and perceived intelligibility and limitations of automated speech recognition for this population.
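A rough sketch of what a duration-based "clarity mode" could look like, assuming access to per-phoneme durations from a TTS duration predictor; the vowel sets and scale factors are illustrative guesses, not the values used in the paper.

```python
# ARPAbet vowel subsets; membership here is an assumption for illustration.
TENSE_VOWELS = {"IY", "EY", "UW", "OW", "AA"}
LAX_VOWELS = {"IH", "EH", "UH", "AH"}

def clarity_durations(phonemes, durations, tense_scale=1.2, lax_scale=0.95):
    """Exaggerate the tense/lax length contrast instead of slowing everything."""
    out = []
    for ph, d in zip(phonemes, durations):
        if ph in TENSE_VOWELS:
            out.append(d * tense_scale)   # lengthen tense vowels
        elif ph in LAX_VOWELS:
            out.append(d * lax_scale)     # keep lax vowels short
        else:
            out.append(d)                 # consonants untouched
    return out

# "sheep" vs "ship" only differ in the vowel; only IY gets lengthened.
print(clarity_durations(["SH", "IY", "P"], [0.08, 0.12, 0.07]))
```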
Authors:Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille
Abstract:
Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, namely how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, remains underexplored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels, as well as control-signal pairs such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus
中文: 本文提出了VideoCus-Factory数据构建流程和OmniVCus扩散Transformer框架,实现了多主体视频定制与可控编辑,在性能上显著超越了现有先进方法。
English: This paper introduces VideoCus-Factory, a data construction pipeline for multi-subject video customization, and OmniVCus, a diffusion Transformer framework that enables instructive editing and surpasses existing methods in performance.
Authors:Yi Liu, Shengqian Li, Zuzeng Lin, Feng Wang, Si Liu
Abstract:
Current conditional autoregressive image generation methods have shown promising results, yet their potential remains largely unexplored in the practical unsupervised image translation domain, which operates without explicit cross-domain correspondences. A critical limitation stems from the discrete quantization inherent in traditional Vector Quantization-based frameworks, which disrupts gradient flow between the Variational Autoencoder decoder and the causal Transformer, impeding end-to-end optimization during adversarial training in image space. To tackle this issue, we propose Softmax Relaxed Quantization, a novel approach that reformulates codebook selection as a continuous probability mixing process via Softmax, thereby preserving gradient propagation. Building upon this differentiable foundation, we introduce CycleVAR, which reformulates image-to-image translation as image-conditional visual autoregressive generation by injecting multi-scale source image tokens as contextual prompts, analogous to prefix-based conditioning in language models. CycleVAR exploits two modes to generate the target image tokens, including (1) serial multi-step generation, enabling iterative refinement across scales, and (2) parallel one-step generation synthesizing all resolution outputs in a single forward pass. Experimental findings indicate that the parallel one-step generation mode attains superior translation quality with quicker inference speed than the serial multi-step mode in unsupervised scenarios. Furthermore, both quantitative and qualitative results indicate that CycleVAR surpasses previous state-of-the-art unsupervised image translation models, e.g., CycleGAN-Turbo.
中文摘要:本文提出CycleVAR方法,通过Softmax松弛量化替代传统离散量化保持梯度传播,将图像翻译重构为条件自回归生成,其并行单步生成模式在无监督场景下实现了优于CycleGAN-Turbo等模型的图像转换质量与推理速度。
English Summary: This paper introduces CycleVAR, a novel unsupervised image translation method that replaces discrete vector quantization with Softmax Relaxed Quantization to maintain gradient flow and enables both serial and parallel generation modes, achieving superior performance over existing models like CycleGAN-Turbo.
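The key differentiability trick is easy to show in isolation. Below is a hedged PyTorch sketch of Softmax Relaxed Quantization: instead of a hard nearest-code lookup, codebook selection becomes a temperature-controlled softmax over negative squared distances, so gradients reach both the encoder output and the codebook. The temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def softmax_relaxed_quantize(z, codebook, tau=1.0):
    """z: (B, D) encoder outputs; codebook: (K, D) embeddings.
    Returns a differentiable convex combination of codebook entries."""
    d2 = torch.cdist(z, codebook) ** 2        # (B, K) squared distances
    probs = F.softmax(-d2 / tau, dim=-1)      # continuous code selection
    return probs @ codebook                   # (B, D), fully differentiable

z = torch.randn(8, 64, requires_grad=True)
codebook = torch.randn(512, 64, requires_grad=True)
zq = softmax_relaxed_quantize(z, codebook, tau=0.5)
zq.sum().backward()                           # gradients reach z and the codebook
```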
Authors:Yitian Gong, Luozhijie Jin, Ruifan Deng, Dong Zhang, Xin Zhang, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu
Abstract:
Speech codecs serve as bridges between speech signals and large language models. An ideal codec for speech language models should not only preserve acoustic information but also capture rich semantic information. However, existing speech codecs struggle to balance high-quality audio reconstruction with ease of modeling by language models. In this study, we analyze the limitations of previous codecs in balancing semantic richness and acoustic fidelity. We propose XY-Tokenizer, a novel codec that mitigates the conflict between semantic and acoustic capabilities through multi-stage, multi-task learning. Experimental results demonstrate that XY-Tokenizer achieves performance in both semantic and acoustic tasks comparable to that of state-of-the-art codecs operating at similar bitrates, even though those existing codecs typically excel in only one aspect. Specifically, XY-Tokenizer achieves strong text alignment, surpassing distillation-based semantic modeling methods such as SpeechTokenizer and Mimi, while maintaining a speaker similarity score of 0.83 between reconstructed and original audio. The reconstruction performance of XY-Tokenizer is comparable to that of BigCodec, the current state-of-the-art among acoustic-only codecs, which achieves a speaker similarity score of 0.84 at a similar bitrate. Code and models are available at https://github.com/gyt1145028706/XY-Tokenizer.
中文: XY-Tokenizer是一种新型语音编解码器,通过多阶段多任务学习有效平衡了语义丰富性和声学保真度,在相近比特率下实现了文本对齐和音频重建的最优性能。
English: The XY-Tokenizer is a novel speech codec that effectively balances semantic richness and acoustic fidelity through multi-stage, multi-task learning, achieving state-of-the-art performance in both text alignment and audio reconstruction at comparable bitrates.
Authors:Yiming Huang, Long Bai, Beilei Cui, Kun Yuan, Guankun Wang, Mobarak I. Hoque, Nicolas Padoy, Nassir Navab, Hongliang Ren
Abstract:
In contemporary surgical research and practice, accurately comprehending 3D surgical scenes with text-promptable capabilities is particularly crucial for surgical planning and real-time intra-operative guidance, where precisely identifying and interacting with surgical tools and anatomical structures is paramount. However, existing works focus on surgical vision-language models (VLMs), 3D reconstruction, and segmentation separately, lacking support for real-time text-promptable 3D queries. In this paper, we present SurgTPGS, a novel text-promptable Gaussian Splatting method to fill this gap. We introduce a 3D semantics feature learning strategy incorporating the Segment Anything model and state-of-the-art vision-language models. We extract the segmented language features for 3D surgical scene reconstruction, enabling a more in-depth understanding of the complex surgical environment. We also propose semantic-aware deformation tracking to capture the seamless deformation of semantic features, providing a more precise reconstruction for both texture and semantic features. Furthermore, we present semantic region-aware optimization, which utilizes region-based semantic information to supervise training, particularly improving reconstruction quality and semantic smoothness. We conduct comprehensive experiments on two real-world surgical datasets to demonstrate the superiority of SurgTPGS over state-of-the-art methods, highlighting its potential to revolutionize surgical practices. SurgTPGS paves the way for developing next-generation intelligent surgical systems by enhancing surgical precision and safety. Our code is available at: https://github.com/lastbasket/SurgTPGS.
中文: 本文提出SurgTPGS方法,通过融合三维语义特征学习和形变跟踪技术,实现了可文本提示的实时三维手术场景理解,显著提升了手术精准度和安全性。
English: This paper introduces SurgTPGS, a text-promptable Gaussian Splatting method that integrates 3D semantic feature learning and deformation tracking to enable real-time 3D surgical scene understanding, demonstrating superior performance in surgical precision and safety.
Authors:Yiming Huang, Long Bai, Beilei Cui, Yanheng Li, Tong Chen, Jie Wang, Jinlin Wu, Zhen Lei, Hongbin Liu, Hongliang Ren
Abstract:
Accurate reconstruction of soft tissue is crucial for advancing automation in image-guided robotic surgery. The recent 3D Gaussian Splatting (3DGS) technique and its variant, 4DGS, achieve high-quality renderings of dynamic surgical scenes in real-time. However, 3DGS-based methods still struggle in scenarios with varying illumination, such as low light and over-exposure. Training 3DGS in such extreme lighting conditions leads to severe optimization problems and badly degraded rendering quality. To address these challenges, we present Endo-4DGX, a novel reconstruction method with illumination-adaptive Gaussian Splatting designed specifically for endoscopic scenes with uneven lighting. By incorporating illumination embeddings, our method effectively models view-dependent brightness variations. We introduce a region-aware enhancement module to model the sub-area lightness at the Gaussian level and a spatial-aware adjustment module to learn the view-consistent brightness adjustment. With the illumination adaptive design, Endo-4DGX achieves superior rendering performance under both low-light and over-exposure conditions while maintaining geometric accuracy. Additionally, we employ an exposure control loss to restore the appearance from adverse exposure to the normal level for illumination-adaptive optimization. Experimental results demonstrate that Endo-4DGX significantly outperforms combinations of state-of-the-art reconstruction and restoration methods in challenging lighting environments, underscoring its potential to advance robot-assisted surgical applications. Our code is available at https://github.com/lastbasket/Endo-4DGX.
中文: Endo-4DGX提出了一种光照自适应的高斯溅射方法,显著提升了内窥镜场景在多变光照下的三维重建效果,实现了卓越的渲染质量和几何精度。
English: Endo-4DGX introduces an illumination-adaptive Gaussian splatting method that enhances 3D reconstruction in endoscopic scenes under varying light conditions, achieving superior rendering quality and geometric accuracy.
Authors:Qi Liu, Can Li, Wanjing Ma
Abstract:
Traditional agent-based urban mobility simulations often rely on rigid rule-based systems that struggle to capture the complexity, adaptability, and behavioral diversity inherent in human travel decision making. Recent advancements in large language models and AI agent technologies present new opportunities to develop agents with enhanced reasoning capabilities, persistent memory, and adaptive learning. We introduce GATSim (Generative-Agent Transport Simulation), a novel framework that leverages these advancements to simulate urban mobility using generative agents with rich, human-like behaviors. Unlike conventional approaches, GATSim agents are characterized by diverse socioeconomic profiles, individual lifestyles, and evolving preferences shaped through psychologically informed memory systems, tool usage, and lifelong learning. The main contributions of this work are: (1) a comprehensive architecture that integrates an urban mobility foundation model with agent cognitive systems and a transport simulation environment; (2) a hierarchical memory designed for efficient retrieval of contextually relevant information, incorporating spatial and temporal associations, keyword matching, and semantic relevance; (3) innovative planning and reactive mechanisms for modeling adaptive mobility behaviors which integrate a multi-scale reflection process to transform specific travel experiences into generalized behavioral insights. We implement a prototype system and conduct systematic validation, demonstrating that generative agents produce believable and coherent travel behaviors. Experimental results indicate that generative agents perform at least as well as human annotators with 92% posterior probability, while naturally producing realistic macroscopic traffic patterns. The code for the prototype implementation is publicly available at https://github.com/qiliuchn/gatsim.
中文: GATSim提出了一种创新的城市交通模拟框架,通过具备类人适应性和多元社会经济特征的生成式智能体,超越了传统基于规则的系统,能生成更真实的出行行为和交通模式。
English: GATSim introduces a novel urban mobility simulation framework using generative AI agents with human-like adaptability and diverse socioeconomic profiles, outperforming traditional rule-based systems by producing realistic travel behaviors and traffic patterns.
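A toy sketch of the hierarchical memory retrieval the abstract describes, combining recency, spatial distance, keyword overlap, and semantic similarity into one score. The weights, the decay constant, and the precomputed embedding vectors are all assumptions for illustration.

```python
import math

def _cos(u, v):
    num = sum(x * y for x, y in zip(u, v))
    den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return num / den if den else 0.0

def retrieval_score(query, mem, now, w=(0.3, 0.2, 0.2, 0.3)):
    recency = math.exp(-(now - mem["time"]) / 3600.0)            # temporal association
    spatial = 1.0 / (1.0 + math.dist(query["loc"], mem["loc"]))  # spatial association
    keywords = len(query["kw"] & mem["kw"]) / (len(query["kw"]) or 1)
    semantic = _cos(query["vec"], mem["vec"])                    # embeddings assumed given
    return w[0] * recency + w[1] * spatial + w[2] * keywords + w[3] * semantic

memories = [
    {"time": 0, "loc": (2.0, 3.0), "kw": {"bus", "commute"}, "vec": [0.9, 0.1]},
    {"time": 3000, "loc": (0.5, 0.5), "kw": {"grocery"}, "vec": [0.2, 0.8]},
]
query = {"loc": (0.0, 0.0), "kw": {"commute"}, "vec": [0.8, 0.2]}
best = max(memories, key=lambda m: retrieval_score(query, m, now=3600.0))
print(best["kw"])
```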
Authors:Xing Shen, Justin Szeto, Mingyang Li, Hengguan Huang, Tal Arbel
Abstract:
Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracies of their predictions, and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs' predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to inference. It then applies this estimation to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets (PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification) demonstrate CALIN's effectiveness at ensuring fair confidence calibration in its predictions, while improving overall prediction accuracy and exhibiting a minimal fairness-utility trade-off. Our codebase can be found at https://github.com/xingbpshen/medical-calibration-fairness-mllm.
中文: 本研究提出的CALIN方法在推理时通过双层校准机制有效减轻了多模态大语言模型在医学图像分类中的校准偏差和人口统计不公平性,在提升准确率的同时确保了预测的公平性。
English: This study introduces CALIN, an inference-time calibration method that effectively mitigates calibration biases and demographic unfairness in multimodal large language models for medical image classification, improving accuracy and fairness across diverse datasets.
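The inference-time mechanics can be sketched simply: a population-level calibration matrix is refined into subgroup-level matrices, and each sample's predicted probabilities are multiplied by its subgroup's matrix and renormalized. The matrices below are hard-coded stand-ins; CALIN estimates them with its bi-level procedure.

```python
import numpy as np

def calibrate(probs, calib_matrix):
    """probs: (N, C) predicted confidences; calib_matrix: (C, C)."""
    adjusted = probs @ calib_matrix
    return adjusted / adjusted.sum(axis=1, keepdims=True)   # renormalize

C = 2
population_M = np.eye(C) * 0.9 + 0.05                 # population-level estimate
subgroup_M = {                                        # refined per demographic subgroup
    "groupA": population_M @ (np.eye(C) * 1.1 - 0.05),
    "groupB": population_M,
}

probs = np.array([[0.92, 0.08], [0.55, 0.45]])        # MLLM confidence scores
groups = ["groupA", "groupB"]                         # each sample's subgroup
calibrated = np.vstack([calibrate(probs[i:i + 1], subgroup_M[g])
                        for i, g in enumerate(groups)])
print(calibrated)
```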
Authors:David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin
Abstract:
As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim
中文: 本研究探讨大型语言模型在多智能体系统中如何平衡自身利益与集体福祉,发现增强推理能力未必促进合作,某些传统模型在维持协作行为方面反而优于专注推理的模型。
English: This study investigates how large language models (LLMs) balance self-interest and collective welfare in multi-agent systems, revealing that enhanced reasoning capabilities do not necessarily foster cooperation, with some traditional models outperforming reasoning-focused ones in sustaining collaborative behavior.
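For readers unfamiliar with the setup, a minimal public goods round with costly sanctioning looks like the following; the endowment, multiplier, fine, and fee are standard illustrative values, and agent_decide is a stub standing in for an LLM policy call.

```python
import random

def agent_decide(agent, action, contributions=None):
    """Stub standing in for an LLM policy call."""
    if action == "contribute":
        return random.uniform(0.0, 10.0)
    # Sanction peers who contributed less than half the average.
    avg = sum(contributions.values()) / len(contributions)
    return [t for t, c in contributions.items() if t != agent and c < 0.5 * avg]

def play_round(agents, endowment=10.0, multiplier=1.6, fine=3.0, fee=1.0):
    contributions = {a: agent_decide(a, "contribute") for a in agents}
    pool = multiplier * sum(contributions.values())
    share = pool / len(agents)                 # everyone gets an equal share
    payoffs = {a: endowment - contributions[a] + share for a in agents}
    # Costly sanctioning: the punisher pays `fee` to impose `fine` on a target.
    for a in agents:
        for target in agent_decide(a, "sanction", contributions):
            payoffs[a] -= fee
            payoffs[target] -= fine
    return payoffs

print(play_round(["model_a", "model_b", "model_c"]))
```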
Authors:Lujun Li, Zhu Qiyuan, Jiacheng Wang, Wei Li, Hao Gu, Sirui Han, Yike Guo
Abstract:
Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods promise greater efficiency by consolidating multiple experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on concatenated expert weights, reducing conflicting parameters by extracting shared U-matrices while enabling effective merging of the expert-specific V components. Specifically, Sub-MoE consists of two innovative phases: (1) Adaptive Expert Clustering, which groups functionally coherent experts via K-means clustering based on cosine similarity of expert outputs; and (2) Subspace Expert Merging, which first enforces Experts Union Decomposition to derive the shared U-matrix across experts in the same group, then pursues frequency-based merging for individual V-matrices, and finalizes expert reconstruction using the merged V-matrix. In this way, our approach aligns and fuses experts in a shared subspace, and can be extended with intra-expert compression for further inference optimization. Extensive experiments on Mixtral, DeepSeek, and Qwen-1.5/3 MoE LLMs demonstrate that our Sub-MoE significantly outperforms existing expert pruning and merging methods. Notably, our Sub-MoE maintains 96%/86% of original performance with 25%/50% expert reduction on Mixtral-8x7B in zero-shot benchmarks. Code will be released at https://github.com/lliai/MoERazor.
Chinese: Sub-MoE是一种新颖的专家混合模型压缩框架,通过专家聚类和基于联合奇异值分解的子空间专家合并,有效解决参数冲突问题,在保持高性能的同时显著提升模型效率。
English: Sub-MoE is a novel compression framework that addresses parameter conflicts in Mixture of Experts LLMs by clustering experts and merging them in a shared subspace using joint SVD, achieving significant efficiency gains while maintaining high performance.
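A hedged sketch of the subspace merging step: experts in one cluster are concatenated along the input dimension and jointly decomposed with SVD so they share a U basis; each expert's V block is then merged by routing frequency. The rank and frequency weights are assumptions, and the real method includes further reconstruction details.

```python
import torch

def merge_experts(expert_weights, freqs, rank=64):
    """expert_weights: list of (d_out, d_in) tensors from one cluster.
    freqs: how often each expert was routed to (used as merge weights)."""
    W_cat = torch.cat(expert_weights, dim=1)           # (d_out, E * d_in)
    U, S, Vh = torch.linalg.svd(W_cat, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank]        # shared low-rank subspace
    d_in = expert_weights[0].shape[1]
    # Per-expert V blocks expressed in the shared basis.
    V_blocks = [Vh[:, i * d_in:(i + 1) * d_in] for i in range(len(expert_weights))]
    w = torch.tensor(freqs, dtype=W_cat.dtype)
    w = w / w.sum()
    V_merged = sum(wi * Vi for wi, Vi in zip(w, V_blocks))
    return U @ torch.diag(S) @ V_merged                # (d_out, d_in) merged expert

experts = [torch.randn(128, 512) for _ in range(4)]
merged = merge_experts(experts, freqs=[0.4, 0.3, 0.2, 0.1])
print(merged.shape)  # torch.Size([128, 512])
```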
Authors:Lunhao Duan, Shanshan Zhao, Xingxing Weng, Jing Zhang, Gui-Song Xia
Abstract:
This paper investigates indoor point cloud semantic segmentation under scene-level annotation, which is less explored compared to methods relying on sparse point-level labels. In the absence of precise point-level labels, current methods first generate point-level pseudo-labels, which are then used to train segmentation models. However, generating accurate pseudo-labels for each point solely based on scene-level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high-quality pseudo-label generation framework by exploring contemporary multi-modal information and region-point semantic consistency. Specifically, with a cross-modal feature guidance module, our method utilizes 2D-3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene-level annotation, we introduce a region-point semantic consistency module. It produces regional semantics through a region-voting strategy derived from point-level semantics, which are subsequently employed to guide the point-level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point-level semantic predictions during training and obtain high-quality pseudo-labels. Significant improvements over previous works on the ScanNet v2 and S3DIS datasets under scene-level annotation demonstrate the effectiveness of our approach. Additionally, comprehensive ablation studies validate the contributions of its individual components. The code is available at https://github.com/LHDuan/WSegPC.
中文摘要:本文针对场景级标注下的室内点云语义分割提出新框架,通过跨模态特征引导和区域-点语义一致性模块生成高质量伪标签,在多个数据集上实现了显著性能提升。
English Summary: This paper proposes a novel framework for indoor point cloud semantic segmentation using scene-level annotations, employing cross-modal feature guidance and region-point consistency modules to generate high-quality pseudo-labels and achieve state-of-the-art performance.
Authors:Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li
Abstract:
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we also extend an existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source codes and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
中文摘要:UrbanLLaVA是一个多模态大语言模型,通过构建城市指令数据集和多阶段训练框架,能够同时处理多种城市数据类型,在各类城市任务中表现优于现有模型。
English Summary: UrbanLLaVA is a multi-modal large language model designed to process diverse urban data types simultaneously, outperforming existing models across various urban tasks through a curated dataset and multi-stage training framework.
Authors:Haoran Li, Muhao Guo, Marija Ilic, Yang Weng, Guangchun Ruan
Abstract:
Accurate residential load forecasting is critical for power system reliability with rising renewable integration and demand-side flexibility. However, most statistical and machine learning models treat external factors, such as weather, calendar effects, and pricing, as extra inputs, ignoring their heterogeneity and thus limiting the extraction of useful external information. We propose a paradigm shift: external data should serve as meta-knowledge to dynamically adapt the forecasting model itself. Based on this idea, we design a meta-representation framework using hypernetworks that modulate selected parameters of a base Deep Learning (DL) model in response to external conditions. This provides both expressivity and adaptability. We further integrate a Mixture-of-Experts (MoE) mechanism to enhance efficiency through selective expert activation, while improving robustness by filtering redundant external inputs. The resulting model, dubbed Meta Mixture of Experts for External data (M2oE2), achieves substantial improvements in accuracy and robustness with limited additional overhead, outperforming existing state-of-the-art methods on diverse load datasets. The dataset and source code are publicly available at https://github.com/haorandd/M2oE2_load_forecast.git.
中文摘要:作者提出创新的M2oE2框架,将外部数据作为元知识,通过超网络和专家混合机制动态调整预测模型,以最小计算开销显著提升了住宅负荷预测的准确性与鲁棒性。
English Summary: The authors propose a novel M2oE2 framework that uses external data as meta-knowledge to dynamically adapt forecasting models through hypernetworks and mixture-of-experts, achieving superior accuracy and robustness in residential load prediction with minimal overhead.
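The meta-representation idea is straightforward to sketch: a small hypernetwork reads external features and emits scale/shift parameters that modulate a hidden layer of the base forecaster, FiLM-style. All dimensions below are placeholders, and the MoE routing over several such experts is omitted for brevity.

```python
import torch
import torch.nn as nn

class HyperModulatedForecaster(nn.Module):
    def __init__(self, hist_len=96, ext_dim=8, hidden=64, horizon=24):
        super().__init__()
        self.backbone_in = nn.Linear(hist_len, hidden)
        self.backbone_out = nn.Linear(hidden, horizon)
        # Hypernetwork: external data -> (scale, shift) for the hidden layer.
        self.hyper = nn.Sequential(
            nn.Linear(ext_dim, 32), nn.ReLU(), nn.Linear(32, 2 * hidden))

    def forward(self, load_history, external):
        h = torch.relu(self.backbone_in(load_history))
        scale, shift = self.hyper(external).chunk(2, dim=-1)
        h = h * (1 + scale) + shift          # external data adapts the model itself
        return self.backbone_out(h)

model = HyperModulatedForecaster()
forecast = model(torch.randn(4, 96), torch.randn(4, 8))   # (4, 24) horizon
```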
Authors:Gabriel Iturra-Bocaz, Felipe Bravo-Marquez
Abstract:
Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models are limited by their static nature, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this problem, incremental word embedding algorithms are introduced, capable of dynamically updating word representations in response to new language patterns and processing continuous data streams.
This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training. We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results. Our open-source library is available at https://github.com/dccuchile/rivertext.
中文摘要:RiverText是一个Python库,用于从文本数据流中训练和评估增量词嵌入,通过集成Skip-gram和CBOW等动态更新技术,解决了传统静态词嵌入模型无法适应语言演变的问题,并提供了标准化评估框架。
English Summary: RiverText is a Python library designed for training and evaluating incremental word embeddings from text streams, addressing the limitations of static models by dynamically updating word representations using techniques like Skip-gram and CBOW within a PyTorch framework.
Authors:Vladislav Bargatin, Egor Chistov, Alexander Yakovenko, Dmitriy Vatolin
Abstract:
Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at https://github.com/msu-video-group/memfof.
中文:MEMFOF是一种内存高效的多帧光流方法,在显著降低GPU内存消耗的同时,于多个基准测试中达到最先进精度,并实现了无需裁剪或下采样的全1080p训练。
English: MEMFOF is a memory-efficient multi-frame optical flow method that achieves state-of-the-art accuracy on multiple benchmarks while significantly reducing GPU memory consumption, enabling full 1080p training without cropping or downsampling.
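One of the memory levers is easy to illustrate: computing a RAFT-style all-pairs correlation volume on spatially downsampled features shrinks its footprint by the fourth power of the scale factor. The sketch below uses an assumed 2x reduction; it is a simplified stand-in, not MEMFOF's exact design.

```python
import torch
import torch.nn.functional as F

def reduced_corr_volume(f1, f2, scale=0.5):
    """f1, f2: (B, C, H, W) feature maps from two frames."""
    f1 = F.interpolate(f1, scale_factor=scale, mode="bilinear", align_corners=False)
    f2 = F.interpolate(f2, scale_factor=scale, mode="bilinear", align_corners=False)
    b, c, h, w = f1.shape
    # Similarity of every pixel pair, normalized by feature dimension.
    corr = torch.einsum("bchw,bcuv->bhwuv", f1, f2) / c ** 0.5
    return corr  # (B, h, w, h, w): 1/16 the memory of the full-res volume

vol = reduced_corr_volume(torch.randn(1, 128, 64, 64), torch.randn(1, 128, 64, 64))
print(vol.shape)  # torch.Size([1, 32, 32, 32, 32])
```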
Authors:Shahad Hardan, Darya Taratynova, Abdelmajid Essofi, Karthik Nandakumar, Mohammad Yaqub
Abstract:
Privacy preservation in AI is crucial, especially in healthcare, where models rely on sensitive patient data. In the emerging field of machine unlearning, existing methodologies struggle to remove patient data from trained multimodal architectures, which are widely used in healthcare. We propose Forget-MI, a novel machine unlearning method for multimodal medical data, by establishing loss functions and perturbation techniques. Our approach unlearns unimodal and joint representations of the data requested to be forgotten while preserving knowledge from the remaining data and maintaining comparable performance to the original model. We evaluate our results using performance on the forget dataset, performance on the test dataset, and Membership Inference Attack (MIA), which measures the attacker's ability to distinguish the forget dataset from the training dataset. Our model outperforms existing approaches at reducing MIA and degrading performance on the forget dataset, while keeping performance on the test set equivalent. Specifically, our approach reduces MIA by 0.202 and decreases AUC and F1 scores on the forget set by 0.221 and 0.305, respectively. Additionally, our performance on the test set matches that of the retrained model, while allowing forgetting. Code is available at https://github.com/BioMedIA-MBZUAI/Forget-MI.git
Chinese: 提出的Forget-MI方法通过消除特定表征,在保持剩余数据性能的同时,有效从医疗多模态AI模型中移除敏感患者数据,并在降低成员推理攻击和遗忘集指标方面优于现有方法。
English: The proposed Forget-MI method effectively removes sensitive patient data from multimodal AI models in healthcare by unlearning specific representations while maintaining performance on retained data and outperforming existing approaches in reducing membership inference attacks and forget set metrics.
Authors:Siyuan Li, Ruitong Liu, Yan Wen, Te Sun, Andi Zhang, Yanbiao Ma, Xiaoshuai Hao
Abstract:
Knowledge graph completion demands effective modeling of multifaceted semantic relationships between entities. Yet prevailing methods, which rely on static scoring functions over learned embeddings, struggle to simultaneously capture rich semantic context and the dynamic nature of relations. To overcome this limitation, we propose the Flow-Modulated Scoring (FMS) framework, conceptualizing a relation as a dynamic evolutionary process governed by its static semantic environment. FMS operates in two stages: it first learns context-aware entity embeddings via a Semantic Context Learning module, and then models a dynamic flow between them using a Conditional Flow-Matching module. This learned flow dynamically modulates a base static score for the entity pair. By unifying context-rich static representations with a conditioned dynamic flow, FMS achieves a more comprehensive understanding of relational semantics. Extensive experiments demonstrate that FMS establishes a new state of the art across both canonical knowledge graph completion tasks: relation prediction and entity prediction. On the standard relation prediction benchmark FB15k-237, FMS achieves a near-perfect MRR of 99.8% and Hits@1 of 99.7% using a mere 0.35M parameters, while also attaining a 99.9% MRR on WN18RR. Its dominance extends to entity prediction, where it secures a 25.2% relative MRR gain in the transductive setting and substantially outperforms all baselines in challenging inductive settings. By unifying a dynamic flow mechanism with rich static contexts, FMS offers a highly effective and parameter-efficient new paradigm for knowledge graph completion. Code published at: https://github.com/yuanwuyuan9/FMS.
Chinese: 提出的流调制评分(FMS)框架通过结合上下文感知的静态嵌入与动态流机制,在关系和实体预测任务中实现了最先进的性能,并具有高效的参数利用率。
English: The proposed Flow-Modulated Scoring (FMS) framework enhances knowledge graph completion by integrating context-aware static embeddings with a dynamic flow mechanism, achieving state-of-the-art performance in both relation and entity prediction tasks with high parameter efficiency.
Authors:Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, Yong Li
Abstract:
World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. The code is available at: https://github.com/tsinghua-fib-lab/RoboScape.
中文:RoboScape提出了一种统一的物理感知世界模型,通过整合三维几何与运动动力学来提升视频生成的真实性,并有效支持机器人策略的训练与评估。
English: RoboScape introduces a unified physics-informed world model that enhances video generation by integrating 3D geometry and motion dynamics, improving realism and enabling effective robotic policy training and evaluation.
Authors:Chi Chiu So, Yueyue Sun, Jun-Min Wang, Siu Pang Yung, Anthony Wai Keung Loh, Chun Pong Chau
Abstract:
How far can Large Language Models (LLMs) go in performing deep relational reasoning? In this paper, we evaluate and compare the reasoning capabilities of three cutting-edge LLMs, namely, DeepSeek-R1, DeepSeek-V3 and GPT-4o, through a suite of carefully designed benchmark tasks in family tree and general graph reasoning. Our experiments reveal that DeepSeek-R1 consistently achieves the highest F1-scores across multiple tasks and problem sizes, demonstrating strong aptitude in logical deduction and relational inference. However, all evaluated models, including DeepSeek-R1, struggle significantly as problem complexity increases, largely due to token length limitations and incomplete output structures. A detailed analysis of DeepSeek-R1's long Chain-of-Thought responses uncovers its unique planning and verification strategies, but also highlights instances of incoherent or incomplete reasoning, calling attention to the need for deeper scrutiny into LLMs' internal inference dynamics. We further discuss key directions for future work, including the role of multimodal reasoning and the systematic examination of reasoning failures. Our findings provide both empirical insights and theoretical implications for advancing LLMs' reasoning abilities, particularly in tasks that demand structured, multi-step logical inference. Our code repository will be publicly available at https://github.com/kelvinhkcs/Deep-Relational-Reasoning.
中文: 本研究评估了三种领先大语言模型在关系推理任务中的表现,发现DeepSeek-R1表现最佳但所有模型均受限于标记长度和推理不完整问题,揭示了改进推理机制的必要性。
English: This study evaluates three leading LLMs on relational reasoning tasks, finding that DeepSeek-R1 outperforms others but all models struggle with complex problems due to token limits and flawed reasoning, highlighting the need for improved inference mechanisms.
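A miniature version of the benchmark logic: generate a random parent relation, derive gold grandparent pairs by relational composition, and compute F1 against a model's predicted pairs. The prompting of the LLM itself is elided; pred below is a stand-in for parsed model output.

```python
import random

def make_family(n=8, seed=0):
    """Random acyclic parent relation: each person's parent has a smaller id."""
    rng = random.Random(seed)
    return {(rng.randrange(child), child) for child in range(1, n)}

def grandparents(parent):
    """Compose the relation with itself to get the gold grandparent pairs."""
    return {(a, c) for (a, b1) in parent for (b2, c) in parent if b1 == b2}

def f1(pred, gold):
    tp = len(pred & gold)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = grandparents(make_family())
pred = set(list(gold)[:-1]) | {(0, 0)}   # stand-in for parsed LLM output
print(round(f1(pred, gold), 3))
```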
Authors:Xinlei Yu, Changmiao Wang, Hui Jin, Ahmed Elazab, Gangyong Jia, Xiang Wan, Changqing Zou, Ruiquan Ge
Abstract:
Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2 with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: https://github.com/YU-deep/CRISP_SAM2.git.
中文摘要:CRISP-SAM2是一种新颖的多器官分割模型,通过跨模态交互和语义提示技术提升医学图像分析的细节精度并摆脱对几何提示的依赖,在多个公开数据集上展现出优越性能。
English Summary: CRISP-SAM2 is a novel multi-organ segmentation model that enhances medical image analysis through cross-modal interaction and semantic prompting, achieving superior performance by improving detail accuracy and eliminating reliance on geometric prompts.
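The cross-modal injection step can be sketched with a single attention call: visual tokens attend to the textual organ description, and the fused semantics are added back into the visual stream. Dimensions and the residual form are assumptions, and the paper's mechanism is progressive rather than the single-shot version shown here.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
img_tokens = torch.randn(1, 1024, 256)   # flattened image encoder features
txt_tokens = torch.randn(1, 16, 256)     # embedded organ description

# Visual stream queries the text to form cross-modal contextualized semantics.
fused, _ = attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
img_tokens = img_tokens + fused          # inject semantics into the image encoder
```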
Authors:Jian Shi, Tianqi You, Pingping Zhang, Hongli Zhang, Rui Xu, Haojie Li
Abstract:
Automated and accurate segmentation of individual vertebra in 3D CT and MRI images is essential for various clinical applications. Due to the limitations of current imaging techniques and the complexity of spinal structures, existing methods still struggle with reducing the impact of image blurring and distinguishing similar vertebrae. To alleviate these issues, we introduce a Frequency-enhanced Multi-granularity Context Network (FMC-Net) to improve the accuracy of vertebrae segmentation. Specifically, we first apply wavelet transform for lossless downsampling to reduce the feature distortion in blurred images. The decomposed high and low-frequency components are then processed separately. For the high-frequency components, we apply a High-frequency Feature Refinement (HFR) to amplify the prominence of key features and filter out noises, restoring fine-grained details in blurred images. For the low-frequency components, we use a Multi-granularity State Space Model (MG-SSM) to aggregate feature representations with different receptive fields, extracting spatially-varying contexts while capturing long-range dependencies with linear complexity. The utilization of multi-granularity contexts is essential for distinguishing similar vertebrae and improving segmentation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches on both CT and MRI vertebrae segmentation datasets. The source code is publicly available at https://github.com/anaanaa/FMCNet.
中文:FMC-Net模型通过小波变换分别处理图像的高频和低频成分,优化细节并捕捉长距离依赖,从而提升三维CT和MRI中椎骨分割的准确性。
English: The FMC-Net model enhances vertebrae segmentation in 3D CT and MRI by using wavelet transforms to process high and low frequencies separately, refining details and capturing long-range dependencies to improve accuracy.
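The lossless downsampling step is a standard single-level 2D Haar transform, sketched below in PyTorch: it halves resolution while splitting the signal into one low-frequency band (for the multi-granularity branch) and three high-frequency detail bands (for the refinement branch), and it is exactly invertible. FMC-Net's HFR and MG-SSM modules that consume these bands are not reproduced.

```python
import torch

def haar_dwt2d(x):
    """x: (B, C, H, W) with even H, W. Returns (low, (lh, hl, hh))."""
    a = x[:, :, 0::2, 0::2]; b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]; d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2          # low-frequency approximation
    lh = (a - b + c - d) / 2          # horizontal detail
    hl = (a + b - c - d) / 2          # vertical detail
    hh = (a - b - c + d) / 2          # diagonal detail
    return ll, (lh, hl, hh)

x = torch.randn(1, 16, 64, 64)
low, highs = haar_dwt2d(x)
print(low.shape)   # torch.Size([1, 16, 32, 32]): half resolution, no information loss
```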
Authors:Suofei Zhang, Xinxin Wang, Xiaofu Wu, Quan Zhou, Haifeng Hu
Abstract:
Existing deep learning-based cross-view geo-localization methods primarily focus on improving the accuracy of cross-domain image matching, rather than enabling models to comprehensively capture contextual information around the target and minimize the cost of localization errors. To support systematic research into this Distance-Aware Cross-View Geo-Localization (DACVGL) problem, we construct Distance-Aware Campus (DA-Campus), the first benchmark that pairs multi-view imagery with precise distance annotations across three spatial resolutions. Based on DA-Campus, we formulate DACVGL as a hierarchical retrieval problem across different domains. Our study further reveals that, due to the inherent complexity of spatial relationships among buildings, this problem can only be addressed via a contrastive learning paradigm, rather than conventional metric learning. To tackle this challenge, we propose Dynamic Contrastive Learning (DyCL), a novel framework that progressively aligns feature representations according to hierarchical spatial margins. Extensive experiments demonstrate that DyCL is highly complementary to existing multi-scale metric learning methods and yields substantial improvements in both hierarchical retrieval performance and overall cross-view geo-localization accuracy. Our code and benchmark are publicly available at https://github.com/anocodetest1/DyCL.
中文摘要:现有跨视角地理定位方法侧重匹配精度而忽略环境上下文,为此我们构建了首个距离感知基准DA-Campus并提出动态对比学习框架DyCL,通过分层特征对齐显著提升了跨域层次检索与定位性能。
English Summary: Current cross-view geo-localization methods prioritize matching accuracy over contextual understanding, prompting the development of the DA-Campus benchmark and Dynamic Contrastive Learning (DyCL) framework to address hierarchical distance-aware localization through progressive feature alignment.
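A hedged sketch of a contrastive loss with hierarchical spatial margins: each image pair receives a margin determined by the distance band of its ground-truth locations, so nearer pairs must end up proportionally closer in feature space. The band edges and margins are invented for illustration, and DyCL's progressive alignment schedule is omitted.

```python
import torch
import torch.nn.functional as F

def hierarchical_margin_loss(f_a, f_b, gt_dist, bands=(25.0, 100.0),
                             margins=(0.2, 0.5, 0.9)):
    """f_a, f_b: (N, D) features of paired views; gt_dist: (N,) meters."""
    d = 1 - F.cosine_similarity(f_a, f_b)        # feature distance in [0, 2]
    # Assign each pair the margin of its spatial band (near, mid, far).
    m = torch.full_like(gt_dist, margins[-1])
    m[gt_dist < bands[1]] = margins[1]
    m[gt_dist < bands[0]] = margins[0]
    return F.relu(d - m).mean()                  # penalize only violations

loss = hierarchical_margin_loss(
    F.normalize(torch.randn(8, 128), dim=1),
    F.normalize(torch.randn(8, 128), dim=1),
    torch.rand(8) * 200,
)
```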
Authors:Yu Zheng, Boyang Gong, Fanye Kong, Yueqi Duan, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jiwen Lu, Jie Zhou
Abstract:
In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks. Source code: https://github.com/yzheng97/CDAL.
中文摘要:本文提出反事实解耦注意力学习(CDAL)方法,通过因果建模分离判别性模型特征与混杂偏差,在计算开销极小的前提下显著提升开放世界模型溯源性能,尤其针对未知攻击具有卓越泛化能力。
English Summary: This paper introduces Counterfactually Decoupled Attention Learning (CDAL), a novel method that uses causal modeling to separate discriminative model artifacts from confounding biases, significantly improving open-world model attribution performance for unseen attacks with minimal computational cost.
Authors:Zhengren Wang, Bozhou Li, Dongwen Yao, Wentao Zhang
Abstract:
While Text-to-SQL enables natural language interaction with structured databases, its effectiveness diminishes with unstructured data or ambiguous queries due to rigid syntax and limited expressiveness. Concurrently, vector search has emerged as a powerful paradigm for semantic retrieval, particularly for unstructured data. However, existing VectorSQL implementations still rely heavily on manual crafting and lack tailored evaluation frameworks, leaving a significant gap between theoretical potential and practical deployment. To bridge these complementary paradigms, we introduce Text2VectorSQL, a novel framework unifying Text-to-SQL and vector search to overcome expressiveness constraints and support more diverse and holistic natural language queries. Specifically, Text2VectorSQL enables semantic filtering, multi-modal matching, and retrieval acceleration. For evaluation, we build vector indexes on appropriate columns, extend user queries with semantic search, and annotate ground truths via an automatic pipeline with expert review. Furthermore, we develop dedicated Text2VectorSQL models with synthetic data, demonstrating significant performance improvements over baseline methods. Our work establishes the foundation for the Text2VectorSQL task, paving the way for more versatile and intuitive database interfaces. The repository will be publicly available at https://github.com/Open-DataFlow/Text2VectorSQL.
Chinese: Text2VectorSQL 是一个创新框架,将 Text-to-SQL 与向量搜索相结合,以克服表达限制并支持更多样化的自然语言查询,通过定制模型和评估方法展现出显著的性能提升。
English: Text2VectorSQL is a novel framework that integrates Text-to-SQL and vector search to enhance query expressiveness and support diverse natural language interactions, demonstrating significant performance improvements through tailored models and evaluation methods.
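To make the unified query language concrete, here is the kind of output a Text2VectorSQL model might produce for an ambiguous request, composed in Python. The <=> cosine-distance operator follows pgvector-style syntax, and the embed() helper is hypothetical; the actual execution backend is an assumption, not fixed by the paper.

```python
question = "cheap laptops that feel good for travel"

# A Text2VectorSQL model would map the question above to something like:
vector_sql = """
SELECT name, price
FROM products
WHERE category = 'laptop' AND price < 800            -- structured filter
ORDER BY description_embedding <=> embed('lightweight, long battery life')
LIMIT 10;                                            -- semantic ranking
"""
print(vector_sql)
```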
Authors:Jiazhen Liu, Yuchuan Deng, Long Chen
Abstract:
Empowering Small-scale Vision-Language Models (SVLMs) with reliable thinking capabilities remains fundamentally challenging due to their limited parameter capacity and weak instruction-following abilities. Existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capabilities of SVLMs. Consequently, directly applying these paradigms to SVLMs often suffers from severe pseudo thinking traces and advantage collapse, ultimately undermining both thinking reliability and task performance. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. However, the widely adopted two-stage training paradigm still performs poorly on SVLMs, as their tendency toward sub-optimal convergence hinders the trade-off and limits the benefits of the combination. To address this, we propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) modes at each optimization step, ensuring that every update contributes to the trade-off. Extensive experiments across diverse domains demonstrate that DyME consistently achieves this balance, and thus delivers substantial performance improvements. These results establish DyME as a practical and effective solution for empowering SVLMs with reliable thinking capabilities. GitHub: https://github.com/HKUST-LongGroup/DyME
中文摘要:DyME训练范式通过在优化过程中动态切换记忆与探索模式,有效解决了小规模视觉语言模型的能力限制,在不同领域均实现了显著的性能提升。
English Summary: The proposed DyME training paradigm dynamically alternates between memorization and exploration modes during optimization to overcome the limitations of small-scale vision-language models, achieving significant performance improvements across various domains.
Authors:Xiang Zhuang, Bin Wu, Jiyu Cui, Kehua Feng, Xiaotong Li, Huabin Xing, Keyan Ding, Qiang Zhang, Huajun Chen
Abstract:
Molecular structure elucidation involves deducing a molecule's structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs' limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs' coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at https://github.com/HICAI-ZJU/K-MSE.
Chinese: 本文提出的K-MSE框架通过引入外部化学知识库和专用分子谱评估器,显著提升了大型语言模型在分子结构解析任务中的性能,在GPT-4系列模型上实现了超过20%的性能提升。
English: This paper introduces K-MSE, a knowledge-enhanced framework that significantly improves molecular structure elucidation in large language models by integrating external chemical knowledge and a specialized reward mechanism, achieving over 20% performance gains on GPT-4 models.
Authors:Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen
Abstract:
In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.
中文: Ovis-U1 是一个 30 亿参数的统一模型,通过创新的统一训练方法,在多模态理解、文生图和图像编辑任务中均实现了领先性能,并在多个基准测试中取得最优成绩。
English: Ovis-U1 is a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing, achieving state-of-the-art performance across multiple benchmarks through a novel unified training approach.
Authors:Jie Liu, Jiayi Shen, Pan Zhou, Jan-Jakob Sonke, Efstratios Gavves
Abstract:
Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision-language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi-modal prototypes learning. However, existing prototype-based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, we propose FewCLIP, a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP, thus providing more adaptive prototype learning for GFSS. Specifically, FewCLIP first introduces a prototype calibration mechanism, which refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, FewCLIP introduces distribution regularization over these calibration prototypes. This probabilistic formulation ensures structured and uncertainty-aware prototype learning, effectively mitigating overfitting to limited novel class data while enhancing generalization. Extensive experimental results on PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate that our proposed FewCLIP significantly outperforms state-of-the-art approaches across both GFSS and class-incremental setting. The code is available at https://github.com/jliu4ai/FewCLIP.
中文:FewCLIP提出了一种概率原型校准框架,通过视觉校准和分布正则化优化多模态原型,在广义少样本语义分割中显著提升了适应性和泛化能力,在基准数据集上取得了领先性能。
English: FewCLIP introduces a probabilistic prototype calibration framework that enhances adaptability and generalization in Generalized Few-Shot Semantic Segmentation by refining multi-modal prototypes with visual calibration and distribution regularization, achieving state-of-the-art results on benchmark datasets.
Authors:Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu
Abstract:
Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional SLMs and propose a unified framework encompassing both existing models and novel variants. We conduct a comprehensive empirical evaluation of all the variants in our framework across language modeling, syntactic generalization, summarization, dialogue, and inference efficiency. Based on the experimental results, we make multiple recommendations on the design of compositional SLMs. Our code is released at https://github.com/zhaoyd1/compositional_SLMs.
中文摘要:本文提出了一个基于成分句法树的组合式句法语言模型统一框架,通过多任务实验评估模型性能,并根据实验结果提出了多项设计建议。
English Summary: This paper introduces a unified framework for compositional syntactic language models that incorporate constituency parse trees, evaluates their performance across various tasks, and provides design recommendations based on comprehensive experiments.
Authors:Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, Maria Heuss
Abstract:
We present a reproduction study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information, the dominance of attention blocks in mechanism competition, and the specialization of attention heads in handling competing information. We reproduce their results on both GPT-2 (Radford et al., 2019) and Pythia 6.9B (Biderman et al., 2023). We extend their work in three significant directions. First, we explore the generalizability of these findings to even larger models by replicating the experiments on Llama 3.1 8B (Grattafiori et al., 2024), discovering greatly reduced attention head specialization. Second, we investigate the impact of prompt structure by introducing variations where we avoid repeating the counterfactual statement verbatim or we change the premise word, observing a marked decrease in the logit for the counterfactual token. Finally, we test the validity of the authors' claims for prompts of specific domains, discovering that certain categories of prompts skew the results by providing the factual prediction token as part of the subject of the sentence. Overall, we find that the attention head ablation proposed in Ortu et al. (2024) is ineffective for domains that are underrepresented in their dataset, and that the effectiveness varies based on model architecture, prompt structure, domain and task.
中文: 这项复制研究证实了Ortu等人关于语言模型中注意力机制处理事实与反事实信息的发现,同时扩展研究揭示了更大模型中注意力头专门化程度降低、对提示结构的敏感性增强,以及所提出消融方法在不同领域有效性存在差异的局限性。
English: This reproduction study confirms Ortu et al.'s findings about attention mechanisms handling factual and counterfactual information in language models, while extending the research to reveal limitations in attention head specialization across larger models, sensitivity to prompt structures, and domain-specific effectiveness of their proposed ablation method.
Authors:AmirHossein Naghi Razlighi, Elaheh Badali Golezani, Shohreh Kasaei
Abstract:
3D Gaussian Splatting enables high-quality real-time rendering but often produces millions of splats, resulting in excessive storage and computational overhead. We propose a novel lossy compression method based on learnable confidence scores modeled as Beta distributions. Each splat's confidence is optimized through reconstruction-aware losses, enabling pruning of low-confidence splats while preserving visual fidelity. The proposed approach is architecture-agnostic and can be applied to any Gaussian Splatting variant. In addition, the average confidence values serve as a new metric to assess the quality of the scene. Extensive experiments demonstrate favorable trade-offs between compression and fidelity compared to prior work. Our code and data are publicly available at https://github.com/amirhossein-razlighi/Confident-Splatting
中文摘要:本文提出了一种基于可学习置信度分数的创新性有损压缩方法,通过优化高斯分布的置信度实现高效场景压缩,在保持视觉质量的同时显著提升了压缩性能。
English Summary: This paper introduces a novel lossy compression method for 3D Gaussian Splatting that uses learnable confidence scores to prune unnecessary splats while maintaining visual quality, achieving superior compression-performance trade-offs compared to existing methods.
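To make the confidence-based pruning concrete, here is a minimal PyTorch sketch of per-splat confidences modeled with learnable Beta parameters; the class name, the softplus parameterization, and the pruning threshold are illustrative assumptions rather than the authors' implementation.

```python
import torch

class SplatConfidence(torch.nn.Module):
    """Learnable Beta(alpha, beta) confidence per splat (illustrative)."""
    def __init__(self, num_splats: int):
        super().__init__()
        # Free parameters mapped through softplus to keep alpha, beta > 0.
        self.raw_alpha = torch.nn.Parameter(torch.zeros(num_splats))
        self.raw_beta = torch.nn.Parameter(torch.zeros(num_splats))

    def mean_confidence(self) -> torch.Tensor:
        alpha = torch.nn.functional.softplus(self.raw_alpha) + 1e-4
        beta = torch.nn.functional.softplus(self.raw_beta) + 1e-4
        return alpha / (alpha + beta)  # Beta mean, always in (0, 1)

    def prune_mask(self, threshold: float) -> torch.Tensor:
        # Keep only splats whose expected confidence clears the threshold;
        # during training the confidences would be shaped by
        # reconstruction-aware losses so low-confidence splats are safe to drop.
        return self.mean_confidence() > threshold

conf = SplatConfidence(num_splats=1_000_000)
keep = conf.prune_mask(threshold=0.3)
print(f"kept {int(keep.sum())} of {keep.numel()} splats")
```

The mean confidence also doubles as the scene-quality statistic the abstract mentions: averaging `mean_confidence()` over retained splats yields a single scalar per scene.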
Authors:Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
Abstract:
As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness, which refers to an LLM's ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions (reasoning patterns, linguistic style, and alignment preferences) and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments. Our code is open-sourced at https://github.com/younwoochoi/InterlocutorAwarenessLLM.
Chinese: 本研究提出对话者意识作为大型语言模型识别和适应对话伙伴的关键能力,揭示了其在提升多智能体协作效率的同时,也带来了奖励破解和越狱风险等新型安全隐患。
English: This study introduces interlocutor awareness as a critical capability for large language models (LLMs) to identify and adapt to dialogue partners, demonstrating its dual role in enhancing multi-agent collaboration while introducing new safety vulnerabilities like reward hacking and jailbreak risks.
Authors:David Rodríguez-Martínez, Dave van der Meer, Junlin Song, Abishek Bera, C. J. Pérez-del-Pulgar, Miguel Angel Olivares-Mendez
Abstract:
Exploring high-latitude lunar regions presents an extremely challenging visual environment for robots. The low sunlight elevation angle and minimal light scattering result in a visual field dominated by a high dynamic range featuring long, dynamic shadows. Reproducing these conditions on Earth requires sophisticated simulators and specialized facilities. We introduce a unique dataset recorded at the LunaLab from the SnT - University of Luxembourg, an indoor test facility designed to replicate the optical characteristics of multiple lunar latitudes. Our dataset includes images, inertial measurements, and wheel odometry data from robots navigating seven distinct trajectories under multiple illumination scenarios, simulating high-latitude lunar conditions from dawn to night time with and without the aid of headlights, resulting in 88 distinct sequences containing a total of 1.3M images. Data was captured using a stereo RGB-inertial sensor, a monocular monochrome camera, and for the first time, a novel single-photon avalanche diode (SPAD) camera. We recorded both static and dynamic image sequences, with robots navigating at slow (5 cm/s) and fast (50 cm/s) speeds. All data is calibrated, synchronized, and timestamped, providing a valuable resource for validating perception tasks from vision-based autonomous navigation to scientific imaging for future lunar missions targeting high-latitude regions or those intended for robots operating across perceptually degraded environments. The dataset can be downloaded from https://zenodo.org/records/13970078?preview=1, and a visual overview is available at https://youtu.be/d7sPeO50_2I. All supplementary material can be found at https://github.com/spaceuma/spice-hl3.
中文: 研究人员创建了一个模拟月球高纬度光照条件的独特数据集,包含130万张通过多种传感器(包括新型SPAD相机)采集的图像,旨在为月球任务的自主导航研究提供支持。
English: Researchers have developed a novel dataset simulating high-latitude lunar lighting conditions, featuring 1.3 million images captured with multiple sensor types including a SPAD camera, to support autonomous navigation research for lunar missions.
Authors:Marc Bara Iniesta
Abstract:
The ambiguity function is fundamental to radar waveform design, characterizing range and Doppler resolution capabilities. However, its traditional formulation involves non-differentiable operations, preventing integration with gradient-based optimization methods and modern machine learning frameworks. This paper presents the first complete mathematical framework and computational implementation for differentiable radar ambiguity functions. Our approach addresses the fundamental technical challenges that have prevented the radar community from leveraging automatic differentiation: proper handling of complex-valued gradients using Wirtinger calculus, efficient computation through parallelized FFT operations, numerical stability throughout cascaded operations, and composability with arbitrary differentiable operations. We term this approach GRAF (Gradient-based Radar Ambiguity Functions), which reformulates the ambiguity function computation to maintain mathematical equivalence while enabling gradient flow through the entire pipeline. The resulting implementation provides a general-purpose differentiable ambiguity function compatible with modern automatic differentiation frameworks, enabling new research directions including neural network-based waveform generation with ambiguity constraints, end-to-end optimization of radar systems, and integration of classical radar theory with modern deep learning. We provide complete implementation details and demonstrate computational efficiency suitable for practical applications. This work establishes the mathematical and computational foundation for applying modern machine learning techniques to radar waveform design, bridging classical radar signal processing with automatic differentiation frameworks.
中文摘要:本文提出了GRAF这一新型可微分雷达模糊函数框架,实现了基于梯度的优化方法,并能与机器学习结合用于先进雷达波形设计。
English Summary: This paper introduces GRAF, a novel differentiable radar ambiguity function framework that enables gradient-based optimization and integration with machine learning for advanced radar waveform design.
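To illustrate what a differentiable ambiguity function buys you, the sketch below builds a discrete delay-Doppler surface with FFTs in PyTorch and backpropagates through it using PyTorch's native complex-number (Wirtinger) autograd. The circular-shift delay model and the Python loop over lags are simplifications for clarity, not the GRAF implementation.

```python
import torch

def ambiguity_surface(x: torch.Tensor, max_lag: int) -> torch.Tensor:
    """|A(lag, doppler)| of a complex baseband waveform x (1-D tensor).
    Each row is the FFT over time of x[n] * conj(x[n - lag]); the FFT
    axis serves as the Doppler axis."""
    rows = []
    for lag in range(-max_lag, max_lag + 1):
        shifted = torch.roll(x, shifts=lag)  # circular delay (simplified)
        rows.append(torch.fft.fft(x * shifted.conj()))
    return torch.stack(rows).abs()

x = torch.randn(256, dtype=torch.complex64, requires_grad=True)
amb = ambiguity_surface(x, max_lag=32)
loss = amb.mean()    # a sidelobe-suppression objective would go here
loss.backward()      # gradients flow back to the waveform samples
print(x.grad.shape)  # torch.Size([256])
```

Because the whole pipeline is composed of differentiable operations, the waveform (or a neural network that generates it) can be optimized directly against ambiguity-shape objectives.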
Authors:Sina Tabakhi, Haiping Lu
Abstract:
A key challenge in learning from multimodal biological data is missing modalities, where all data from some modalities are missing for some patients. Current fusion methods address this by excluding patients with missing modalities, imputing missing modalities, or making predictions directly with partial modalities. However, they often struggle with diverse missing-modality patterns and the exponential growth of the number of such patterns as the number of modalities increases. To address these limitations, we propose MAGNET (Missing-modality-Aware Graph neural NETwork) for direct prediction with partial modalities, which introduces a patient-modality multi-head attention mechanism to fuse lower-dimensional modality embeddings based on their importance and missingness. MAGNET's complexity increases linearly with the number of modalities while adapting to missing-pattern variability. To generate predictions, MAGNET further constructs a patient graph with fused multimodal embeddings as node features and the connectivity determined by the modality missingness, followed by a conventional graph neural network. Experiments on three public multiomics datasets for cancer classification, with real-world instead of artificial missingness, show that MAGNET outperforms the state-of-the-art fusion methods. The data and code are available at https://github.com/SinaTabakhi/MAGNET.
中文: MAGNET是一种新型图神经网络,通过患者-模态注意力机制和图结构融合有效处理多模态生物数据中的缺失模态问题,在癌症分类任务中以线性增长的复杂度实现了优于现有方法的性能。
English: MAGNET is a novel graph neural network that effectively handles diverse missing-modality patterns in multimodal biological data by using patient-modality attention and graph-based fusion, achieving superior performance in cancer classification with linearly scalable complexity.
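The core fusion idea can be sketched as attention over modality embeddings with missing modalities masked out before the softmax. The single-head form and the learnable patient query below are simplifying assumptions; MAGNET uses a multi-head patient-modality attention.

```python
import torch

def fuse_modalities(mod_embs: torch.Tensor, present: torch.Tensor,
                    query: torch.Tensor) -> torch.Tensor:
    """mod_embs: (num_modalities, d) embeddings; present: boolean mask of
    observed modalities; query: (d,) patient-level query vector."""
    d = mod_embs.shape[-1]
    scores = (mod_embs @ query) / d ** 0.5
    scores = scores.masked_fill(~present, float("-inf"))  # drop missing
    weights = torch.softmax(scores, dim=0)  # missing modalities get weight 0
    return weights @ mod_embs               # fused patient embedding

embs = torch.randn(3, 32)                    # e.g. mRNA, methylation, miRNA
present = torch.tensor([True, False, True])  # second modality missing
fused = fuse_modalities(embs, present, torch.randn(32))
print(fused.shape)  # torch.Size([32])
```

Because the computation is per patient and per modality, cost grows linearly in the number of modalities rather than enumerating missingness patterns.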
Authors:Mai A. Shaaban, Tausifa Jan Saleem, Vijay Ram Papineni, Mohammad Yaqub
Abstract:
Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at https://github.com/BioMedIA-MBZUAI/MOTOR.
Chinese: MOTOR提出了一种多模态检索与重排序方法,利用基础描述和最优传输融合视觉与文本信息,显著提升了医疗视觉问答中检索内容的临床相关性,平均准确率比现有最优方法高出6.45%。
English: MOTOR introduces a multimodal retrieval and re-ranking method that integrates visual and textual information through grounded captions and optimal transport, significantly enhancing the relevance of retrieved contexts for medical visual question answering and achieving a 6.45% average accuracy improvement over existing methods.
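For intuition about optimal-transport re-ranking, the sketch below scores a retrieved candidate by entropic OT alignment between query and candidate feature sets using plain Sinkhorn iterations. The cosine cost, uniform marginals, and hyperparameters are generic assumptions, not MOTOR's exact formulation.

```python
import torch

def ot_relevance(q_feats: torch.Tensor, c_feats: torch.Tensor,
                 eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Negative entropic-OT cost between query features (n, d) and
    candidate features (m, d); higher means better aligned."""
    q = torch.nn.functional.normalize(q_feats, dim=-1)
    c = torch.nn.functional.normalize(c_feats, dim=-1)
    cost = 1.0 - q @ c.T                      # cosine distance matrix
    n, m = cost.shape
    mu, nu = torch.full((n,), 1 / n), torch.full((m,), 1 / m)
    K = torch.exp(-cost / eps)
    u = torch.ones(n)
    for _ in range(iters):                    # Sinkhorn fixed point
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    plan = u[:, None] * K * v[None, :]        # transport plan
    return -(plan * cost).sum()

# Re-rank retrieved contexts by their OT relevance to the multimodal query.
scores = [ot_relevance(torch.randn(8, 64), torch.randn(12, 64))
          for _ in range(3)]
print(sorted(range(3), key=lambda i: -float(scores[i])))
```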
Authors:Senkang Hu, Yihang Tao, Guowen Xu, Xinyuan Qian, Yiqin Deng, Xianhao Chen, Sam Tak Wu Kwong, Yuguang Fang
Abstract:
Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability-agnostic, and adaptive framework, namely, CP-uniGuard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict with an ego agent's perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define a collaborative consistency loss (CCLoss) for the object detection and bird's eye view (BEV) segmentation tasks to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. In addition, we propose an online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the system in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code will be released at https://github.com/CP-Security/CP-uniGuard.
Chinese: 协作感知(CP)通过共享感知数据提升多智能体系统性能,但易受恶意攻击;为此提出的CP-uniGuard框架采用共识机制和自适应阈值,能有效检测并排除恶意智能体,确保系统在动态环境中的可靠性。
English: Collaborative Perception (CP) enhances multi-agent systems by sharing perception data but is vulnerable to malicious attacks, leading to the development of CP-uniGuard, a framework that detects and eliminates such threats through consensus-based methods and adaptive thresholds to ensure system reliability.
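A toy rendering of the sample-consensus idea, with scalar predictions standing in for perception outputs, averaging standing in for feature fusion, and an absolute difference standing in for CCLoss; the real method operates on detection/BEV outputs and uses the adaptive dual-sliding-window threshold.

```python
import random

def pasac(ego_pred, collab_preds, threshold=0.5, rounds=20, k=2, seed=0):
    """Probability-agnostic sample consensus (schematic sketch): repeatedly
    sample collaborator subsets, fuse them, and trust subsets whose fused
    output is consistent with the ego agent's own prediction."""
    rng = random.Random(seed)
    trusted = set()
    ids = list(collab_preds)
    for _ in range(rounds):
        subset = rng.sample(ids, k)
        fused = sum(collab_preds[i] for i in subset) / k  # toy fusion
        if abs(fused - ego_pred) < threshold:  # stand-in for CCLoss
            trusted.update(subset)             # subset reached consensus
    return sorted(trusted)

# Toy example: collaborator 'm' is malicious and never enters a
# consensus-passing subset, so it is excluded from fusion.
preds = {"a": 1.0, "b": 1.1, "c": 0.9, "m": 9.0}
print(pasac(ego_pred=1.0, collab_preds=preds))  # ['a', 'b', 'c']
```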
Authors:Dang Jisheng, Wu Xudong, Wang Bimei, Lv Ning, Chen Jiayu, Jingwen Zhao, Yichu Liu, Jizhao Liu, Juncheng Li, Teng Wang
Abstract:
Existing video segmentation and grounding approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. Specifically, we first devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model's semantic grounding capabilities. Next, we employ linear projection to disentangle the hidden states generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy synergistically combines these decoupled features through triple supervision from predicted text/visual masks and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our codes are available at https://github.com/longmalongma/DeSa2VA.
中文总结:提出的DeSa2VA方法通过文本预训练和线性解耦模块的增强提示方案,有效解决了现有模型中视觉与语义信息纠缠的问题,在多项视觉语言任务中实现了最优性能。
English Summary: The proposed DeSa2VA method introduces a decoupling-enhanced prompting scheme with text pre-training and linear feature disentanglement to overcome the visual-semantic entanglement in existing models, achieving state-of-the-art performance across multiple vision-language tasks.
Authors:Kamil Faber, Marcin Pietroń, Dominik Żurek, Roberto Corizzo
Abstract:
The recently proposed xLSTM is a powerful model that leverages expressive multiplicative gating and residual connections, providing the temporal capacity needed for long-horizon forecasting and representation learning. This architecture has demonstrated success in time series forecasting, lossless compression, and even large-scale language modeling tasks, where its linear memory footprint and fast inference make it a viable alternative to Transformers. Despite its growing popularity, no prior work has explored xLSTM for anomaly detection. In this work, we fill this gap by proposing xLSTMAD, the first anomaly detection method that integrates a full encoder-decoder xLSTM architecture, purpose-built for multivariate time series data. Our encoder processes input sequences to capture historical context, while the decoder comes in two variants: a forecasting variant, xLSTMAD-F, which iteratively generates forecasted future values, and a reconstruction variant, xLSTMAD-R, which reconstructs the input time series from its encoded counterpart. We investigate the performance of two loss functions: Mean Squared Error (MSE) and Soft Dynamic Time Warping (SoftDTW), to consider local reconstruction fidelity and global sequence alignment, respectively. We evaluate our method on the comprehensive TSB-AD-M benchmark, which spans 17 real-world datasets, using state-of-the-art challenging metrics such as VUS-PR. In our results, xLSTMAD showcases state-of-the-art accuracy, outperforming 23 popular anomaly detection baselines. Our paper is the first work revealing the powerful modeling capabilities of xLSTM for anomaly detection, paving the way for exciting new developments on this subject. Our code is available at: https://github.com/Nyderx/xlstmad
中文: 本文提出了xLSTMAD,这是首个采用完整编码器-解码器xLSTM架构的多变量时间序列异常检测方法,在基准测试中实现了最先进的精度,并超越了23种基线方法。
English: This paper introduces xLSTMAD, the first anomaly detection method using a full encoder-decoder xLSTM architecture for multivariate time series, which achieves state-of-the-art accuracy on benchmark datasets and outperforms 23 baseline methods.
Authors:Ramya Hebbalaguppe, Tamoghno Kandar, Abhinav Nagpal, Chetan Arora
Abstract:
Vision-language models (VLM) have demonstrated impressive performance in image recognition by leveraging self-supervised training on large datasets. Their performance can be further improved by adapting to the test sample using test-time prompt tuning (TPT). Unfortunately, the singular focus of TPT approaches on improving the accuracy suffers from tunnel vision, and leads to degradation in confidence calibration. This limits the applicability of TPT in critical applications.
We make three contributions in this work. (1) We posit that random or naive initialization of prompts leads to overfitting on a particular test sample, and is the main reason for miscalibration of the VLM after TPT. To mitigate the problem, we propose careful initialization of the test-time prompt using prior knowledge about the target label attributes from a large language model (LLM); (2) to further maintain the quality of prompts during TPT, we propose a novel regularization loss to reduce intra-class distance and increase inter-class distance between the learnt prompts; (3) through extensive experiments on different CLIP architectures and 15 datasets, we show that our approach can effectively improve the calibration after TPT. We report an average expected calibration error (ECE) of 4.11 with our method, TCA, compared to 11.7 for vanilla TPT, 6.12 for C-TPT (ICLR'24), 6.78 for DiffTPT (CVPR'23), and 8.43 for PromptAlign (NeurIPS'23). The code is publicly accessible at: https://github.com/rhebbalaguppe/TCA_PromptWithoutPanic.
中文: 视觉语言模型通过测试时提示调优可提升性能,但易导致过拟合和置信度校准不佳;我们提出的方法利用大语言模型初始化提示并结合新型正则化损失,有效改善了多个数据集的校准效果。
English: Vision-language models can be enhanced through test-time prompt tuning, but this often leads to overfitting and poor confidence calibration, which our proposed method mitigates by using large language model initialization and a novel regularization loss to significantly improve calibration across multiple datasets.
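For reference, the expected calibration error (ECE) quoted in the results is conventionally computed by binning predictions on confidence and averaging the per-bin accuracy/confidence gap, as in this standalone sketch (15 equal-width bins is a common default; the paper's exact binning may differ).

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: sum over bins of (bin fraction) * |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return ece

# An overconfident model: confidences near 1 but only 50% accuracy.
print(expected_calibration_error([0.9, 0.95, 0.85, 0.99], [1, 0, 1, 0]))
```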
Authors:Jianhui Wei, Zijie Meng, Zikai Xiao, Tianxiang Hu, Yang Feng, Zhijie Zhou, Jian Wu, Zuozhu Liu
Abstract:
While Medical Large Language Models (MedLLMs) have demonstrated remarkable potential in clinical tasks, their ethical safety remains insufficiently explored. This paper introduces MedEthicsQA, a comprehensive benchmark comprising 5,623 multiple-choice questions and 5,351 open-ended questions for the evaluation of medical ethics in LLMs. We systematically establish a hierarchical taxonomy integrating global medical ethical standards. The benchmark encompasses widely used medical datasets, authoritative question banks, and scenarios derived from PubMed literature. Rigorous quality control involving multi-stage filtering and multi-faceted expert validation ensures the reliability of the dataset with a low error rate (2.72%). Evaluation of state-of-the-art MedLLMs exhibits declined performance in answering medical ethics questions compared to their foundation counterparts, elucidating the deficiencies of medical ethics alignment. The dataset, registered under the CC BY-NC 4.0 license, is available at https://github.com/JianhuiWei7/MedEthicsQA.
中文摘要:本文提出了MedEthicsQA这一全面评估大语言模型医疗伦理能力的基准,发现尽管经过严格质量把控,现有医疗大模型在伦理问题上的表现仍逊于其基础版本。
English Summary: The paper introduces MedEthicsQA, a comprehensive benchmark for evaluating medical ethics in large language models, revealing that current MedLLMs perform worse on ethical questions compared to their base versions despite rigorous dataset validation.
Authors:Yueyang Li, Shengyu Gong, Weiming Zeng, Nizhuan Wang, Wai Ting Siok
Abstract:
Electroencephalography (EEG) serves as a reliable and objective signal for emotion recognition in affective brain-computer interfaces, offering unique advantages through its high temporal resolution and ability to capture authentic emotional states that cannot be consciously controlled. However, cross-subject generalization remains a fundamental challenge due to individual variability, cognitive traits, and emotional responses. We propose FreqDGT, a frequency-adaptive dynamic graph transformer that systematically addresses these limitations through an integrated framework. FreqDGT introduces frequency-adaptive processing (FAP) to dynamically weight emotion-relevant frequency bands based on neuroscientific evidence, employs adaptive dynamic graph learning (ADGL) to learn input-specific brain connectivity patterns, and implements multi-scale temporal disentanglement network (MTDN) that combines hierarchical temporal transformers with adversarial feature disentanglement to capture both temporal dynamics and ensure cross-subject robustness. Comprehensive experiments demonstrate that FreqDGT significantly improves cross-subject emotion recognition accuracy, confirming the effectiveness of integrating frequency-adaptive, spatial-dynamic, and temporal-hierarchical modeling while ensuring robustness to individual differences. The code is available at https://github.com/NZWANG/FreqDGT.
中文: 针对脑电情绪识别中的跨被试泛化难题,FreqDGT模型通过频率自适应处理、动态脑连接学习和多尺度时序解耦,有效提升了识别精度与个体差异鲁棒性。
English: EEG-based emotion recognition faces cross-subject generalization challenges, which FreqDGT addresses through frequency-adaptive processing, dynamic brain connectivity learning, and multi-scale temporal modeling to significantly improve accuracy and robustness.
Authors:Byung Hyun Lee, Sungjin Lim, Seunggyu Lee, Dong Un Kang, Se Young Chun
Abstract:
Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding linear modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding nonlinear Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent the forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit contents demonstrated that the proposed CPE outperforms prior arts by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts. Code is available at https://github.com/Hyun1A/CPE
中文摘要:本文提出概念精准擦除器(CPE),通过非线性残差注意力门控和对抗性训练,在扩散模型中有效擦除目标概念的同时保持其余概念的多样性,并具备抗攻击鲁棒性。
English summary: This paper introduces Concept Pinpoint Eraser (CPE), a novel framework that uses nonlinear Residual Attention Gates and adversarial training to effectively erase target concepts from diffusion models while preserving diverse remaining concepts and maintaining robustness against attacks.
Authors:Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Lin Mei, Peiyi Shen, Liang Zhang
Abstract:
Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at the sample level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Source code is available at: https://github.com/XiGuaBo/CBM-HNMU.
Chinese: 本文提出了增强人机互理解的概念瓶颈模型(CBM-HNMU),通过自动识别并优化有害概念来提升模型可解释性与准确率,在多个数据集上最高实现了2.64%的精度提升。
English: This paper introduces the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU), which automatically identifies and refines detrimental concepts to improve both model interpretability and accuracy, achieving up to a 2.64% accuracy gain across multiple datasets.
Authors:Sicong Du, Jiarun Liu, Qifeng Chen, Hao-Xiang Chen, Tai-Jiang Mu, Sheng Yang
Abstract:
A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expansion a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjusts Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods. Extensive evaluations on publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source code will be made publicly available at https://github.com/CN-ADLab/RGE-GS.
Chinese: 提出的RGE-GS框架通过将扩散先验与奖励引导的高斯优化相结合,克服了场景扩展中的局限性,借助选择性模式保留和自适应训练策略实现了最先进的重建质量。
English: The proposed RGE-GS framework overcomes limitations in scene expansion by integrating diffusion priors with reward-guided Gaussian optimization, achieving state-of-the-art reconstruction quality through selective pattern retention and adaptive training strategies.
Authors:Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali
Abstract:
Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. This highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and an open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and precise localization beyond existing models, offering a scalable solution.
中文: 深度伪造攻击日益严重,但现有数据集缺乏真实性,因此我们提出了PhonemeFake,通过语言推理操纵关键语音段,显著降低人类感知和基准准确率,并推出高效检测模型,提升性能与速度。
English: Deepfake attacks are advancing, but current datasets lack realism, prompting the introduction of PhonemeFake, which manipulates speech segments to significantly reduce human perception and benchmark accuracy, alongside an efficient detection model that improves performance and speed.
Authors:Yanran Wu, Inez Hua, Yi Ding
Abstract:
Water consumption is an increasingly critical dimension of computing sustainability, especially as AI workloads rapidly scale. However, current water impact assessment often overlooks where and when water stress is more severe. To fill in this gap, we present SCARF, the first general framework that evaluates water impact of computing by factoring in both spatial and temporal variations in water stress. SCARF calculates an Adjusted Water Impact (AWI) metric that considers both consumption volume and local water stress over time. Through three case studies on LLM serving, datacenters, and semiconductor fabrication plants, we show the hidden opportunities for reducing water impact by optimizing location and time choices, paving the way for water-sustainable computing. The code is available at https://github.com/jojacola/SCARF.
中文: SCARF作为首个考虑时空水资源压力变化的计算水影响评估框架,通过案例研究揭示了优化时空选择以实现水资源可持续计算的潜力。
English: SCARF is the first framework to assess computing's water impact by incorporating spatial and temporal water stress variations, revealing optimization opportunities for water-sustainable computing through case studies.
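The adjusted metric can be pictured as consumption weighted by when and where it happens; the sketch below is our schematic reading of AWI, with the weighting scheme and units as illustrative assumptions rather than the paper's exact definition.

```python
def adjusted_water_impact(consumption_liters, stress_weights):
    """Schematic AWI: per-slot water consumption scaled by the local
    water-stress weight for that time slot and location."""
    assert len(consumption_liters) == len(stress_weights)
    return sum(c * w for c, w in zip(consumption_liters, stress_weights))

# Same total consumption, different siting/scheduling choices:
hourly_use = [100.0, 100.0, 100.0]
print(adjusted_water_impact(hourly_use, [0.9, 0.9, 0.9]))  # high-stress hours
print(adjusted_water_impact(hourly_use, [0.2, 0.2, 0.2]))  # low-stress hours
```

The case studies exploit exactly this asymmetry: shifting load across time or datacenter sites lowers the stress-adjusted impact even when raw consumption is unchanged.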
Authors:Havvanur Dervişoğlu, Ruşen Halepmollası, Elif Eyvaz
Abstract:
Bug severity prediction is a critical task in software engineering as it enables more efficient resource allocation and prioritization in software maintenance. While AI-based analyses and models typically require access to extensive datasets, industrial applications face challenges due to data-sharing constraints and the limited availability of labeled data. In this study, we investigate method-level bug severity prediction using source code metrics and Large Language Models (LLMs) with two widely used datasets. We compare the performance of models trained using centralized learning, federated learning, and synthetic data generation. Our experimental results, obtained using two widely recognized software defect datasets, indicate that models trained with federated learning and synthetic data achieve comparable results to centrally trained models without data sharing. Our finding highlights the potential of privacy-preserving approaches such as federated learning and synthetic data generation to enable effective bug severity prediction in industrial contexts where data sharing is a major challenge.
The source code and dataset are available at our GitHub repository: https://github.com/drvshavva/EASE2025-Privacy-Preserving-Methods-for-Bug-Severity-Prediction.
中文: 本研究证明,联邦学习和合成数据生成能够在无需共享数据的情况下实现有效的缺陷严重性预测,其性能与集中式训练模型相当,同时解决了工业应用中面临的数据隐私难题。
English: This study demonstrates that federated learning and synthetic data generation enable effective bug severity prediction without data sharing, achieving performance comparable to centralized models while addressing privacy concerns in industrial applications.
Authors:Jiang Yuan, Ji Ma, Bo Wang, Guanzhou Ke, Weiming Hu
Abstract:
Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide HR detail restoration. Although IDE-BSR has shown potential in dealing with noise interference and complex degradations, existing methods ignore the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve effectiveness, resulting in a significant increase in the model's parameters and computations. In this paper, we focus on the discriminability optimization of IDR and propose a new powerful and lightweight BSR model termed LightBSR. Specifically, we employ a knowledge distillation-based learning framework. We first introduce a well-designed degradation-prior-constrained contrastive learning technique during the teacher stage to make the model more focused on distinguishing different degradation types. Then we utilize a feature alignment technique to transfer the degradation-related knowledge acquired by the teacher to the student for practical inference. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. The proposed LightBSR can achieve outstanding performance with minimal complexity across a range of blind SR tasks. Our code is accessible at: https://github.com/MJ-NCEPU/LightBSR.
中文摘要:本文提出了一种轻量级盲超分辨率模型LightBSR,通过知识蒸馏和特征对齐技术优化隐式退化表示的可区分性,在保持最低复杂度的同时实现了卓越的性能。
English Summary: The paper introduces LightBSR, a lightweight blind super-resolution model that enhances implicit degradation representation discriminability through knowledge distillation and feature alignment, achieving high performance with minimal complexity.
Authors:Brian Mak, Jeffrey Flanigan
Abstract:
The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at https://github.com/bmac3/residual-matrix-transformer.
中文: 残差矩阵变换器(RMT)用外积记忆矩阵替代了标准残差流,以更少的资源实现了更高的效率和性能,同时允许残差流独立扩展。
English: The Residual Matrix Transformer (RMT) replaces the standard residual stream with an outer product memory matrix, achieving superior efficiency and performance with fewer resources while enabling independent scaling of the residual stream.
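The underlying associative-memory primitive dates to Kohonen (1972) and Anderson (1972) and is easy to sketch: write value-key outer products into a matrix, then read by key. The dimensions and helper names below are illustrative, not the RMT architecture itself.

```python
import torch

d_key, d_val = 64, 256          # value width plays the residual-stream role
memory = torch.zeros(d_val, d_key)

def write(memory, key, value):
    # Accumulate the association value x key^T into the matrix.
    return memory + torch.outer(value, key)

def read(memory, key):
    # A matrix-vector product retrieves the value stored under the key.
    return memory @ key

k = torch.nn.functional.normalize(torch.randn(d_key), dim=0)
v = torch.randn(d_val)
memory = write(memory, k, v)
print(torch.allclose(read(memory, k), v, atol=1e-4))  # near-exact recall
```

Decoupling the stored-value width from the key width loosely mirrors how the RMT scales its residual stream independently of model size.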
Authors:Anh Bui, Trang Vu, Trung Le, Junae Kim, Tamas Abraham, Rollin Omari, Amar Kaur, Dinh Phung
Abstract:
In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only collapses the semantic richness of complex input prompts like "a photo of $V$ wearing glasses and playing guitar" into simpler, less contextually rich forms such as "a photo of $V$" but also leads to simplified output images that fail to capture the intended concept. We identify the root cause as unconstrained optimization, which allows the learned embedding $V$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of the pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at https://github.com/tuananhbui89/Embedding-Adjustment
中文摘要:本文针对生成式个性化中的语义坍缩问题,提出无需训练的嵌入调整方法,有效防止学习到的视觉概念主导多概念提示,从而保持语义丰富性并提升图文对齐效果。
English Summary: This paper addresses the semantic collapsing issue in generative personalization where learned visual concepts dominate multi-concept prompts, proposing a training-free embedding adjustment method to maintain semantic richness and improve text-image alignment.
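A training-free magnitude/direction correction can be as simple as re-scaling the learned embedding whenever its norm drifts too far from that of its initializer token; the cap factor and rule below are illustrative assumptions, not the paper's exact adjustment.

```python
import torch

def adjust_embedding(v_learned: torch.Tensor, v_init: torch.Tensor,
                     max_scale: float = 1.5) -> torch.Tensor:
    """Clamp the learned concept embedding's norm to a multiple of the
    initializer embedding's norm, preserving its direction."""
    cap = max_scale * v_init.norm()
    n = v_learned.norm()
    return v_learned * (cap / n) if n > cap else v_learned

v_init = torch.randn(768)        # embedding of the initializer token
v_star = torch.randn(768) * 5.0  # drifted learned embedding for V*
v_adj = adjust_embedding(v_star, v_init)
print(f"{v_star.norm():.1f} -> {v_adj.norm():.1f}")
```

Applied at inference time, such a correction leaves the personalization weights untouched, which is what makes the approach portable across personalization backbones.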
Authors:Haoxuan Wang, Zhenghao Zhao, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan
Abstract:
The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce Condition-aware Optimization with Objective-guided Sampling (CaO$_2$), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. CaO$_2$ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy.
中文摘要:CaO$_2$框架通过两阶段优化方法解决了基于扩散的数据集蒸馏中的目标与条件不一致问题,在ImageNet数据集上实现了最优性能,准确率比现有最佳方法平均提升2.3%。
English Summary: The CaO$_2$ framework addresses inconsistencies in diffusion-based dataset distillation by aligning the process with evaluation objectives through a two-stage approach, achieving state-of-the-art performance on ImageNet with a 2.3% accuracy improvement over baselines.
Authors:Arunkumar Kannan, Martin A. Lindquist, Brian Caffo
Abstract:
Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, sparking significant interest in the neuroimaging community. However, existing approaches, primarily based on convolutional neural networks or transformer architectures, often struggle to model the complex relationships inherent in fMRI data, limited by their inability to capture long-range spatial and temporal dependencies. To overcome these shortcomings, we introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships across the deep features processed by the Mamba block. Extensive experiments on two large-scale public datasets, UK Biobank and the Human Connectome Project, demonstrate that BrainMT achieves state-of-the-art performance on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks, outperforming existing methods by a significant margin. Our code and implementation details will be made publicly available at https://github.com/arunkumar-kannan/BrainMT-fMRI
中文:BrainMT框架通过结合双向Mamba模块和Transformer,有效捕捉fMRI数据中的长程时空特征,在大型神经影像数据集上的分类与回归任务中均实现了最先进的性能表现。
English: The BrainMT framework introduces a hybrid approach using bidirectional Mamba blocks and transformers to effectively capture long-range spatiotemporal dependencies in fMRI data, achieving state-of-the-art results in both classification and regression tasks on major neuroimaging datasets.
Authors:Vasilis Siomos, Jonathan Passerat-Palmbach, Giacomo Tarroni
Abstract:
Federated learning is a decentralized training approach that keeps data under stakeholder control while achieving superior performance over isolated training. While inter-institutional feature discrepancies pose a challenge in all federated settings, medical imaging is particularly affected due to diverse imaging devices and population variances, which can diminish the global model's effectiveness. Existing aggregation methods generally fail to adapt across varied circumstances. To address this, we propose FedCLAM, which integrates client-adaptive momentum terms derived from each client's loss reduction during local training, as well as a personalized dampening factor to curb overfitting. We further introduce a novel intensity alignment loss that matches predicted and ground-truth foreground distributions to handle heterogeneous image intensity profiles across institutions and devices. Extensive evaluations on two datasets show that FedCLAM surpasses eight cutting-edge methods in medical segmentation tasks, underscoring its efficacy. The code is available at https://github.com/siomvas/FedCLAM.
中文: FedCLAM通过整合客户端自适应动量项和个性化抑制因子来缓解联邦学习中的特征差异和过拟合问题,同时引入强度对齐损失处理异构图像数据,在医学分割任务中显著优于现有方法。
English: FedCLAM enhances federated learning in medical imaging by incorporating client-adaptive momentum and a personalized dampening factor to mitigate feature discrepancies and overfitting, while introducing an intensity alignment loss to handle heterogeneous image data, achieving superior performance in segmentation tasks compared to existing methods.
Authors:Mrunmayi Mungekar, Sanjith Menon, M. Ravi Shankar, M. Khalid Jawed
Abstract:
We present a simple, accessible method for autonomously transforming flat plastic sheets into intricate three-dimensional structures using only uniform heating and common tools such as household ovens and scissors. Our approach combines heat-shrinkable thermoplastics with Kirigami patterns tailored to the target 3D shape, creating bilayer composites that morph into a wide range of complex structures, e.g., bowls, pyramids, and even custom ergonomic surfaces like mouse covers. Critically, the transformation is driven by a low-information stimulus (uniform heat) yet produces highly intricate shapes through programmed geometric design. The morphing behavior, confirmed by finite element simulations, arises from strain mismatch between the contracting thermoplastic layer and the constraining Kirigami layer. By decoupling material composition from mechanical response, this method avoids detailed process control and enables a broad class of self-morphing structures, offering a versatile platform for adaptive design and scalable manufacturing.
Chinese: 本研究提出了一种简便方法,通过均匀加热和日常工具,利用热收缩热塑性塑料和定制剪纸图案将平面塑料片转化为复杂三维结构,实现了无需精细过程控制的多样化、可扩展自变形设计。
English: This study introduces a straightforward method using uniform heating and common tools to transform flat plastic sheets into complex 3D structures through heat-shrinkable thermoplastics and tailored Kirigami patterns, enabling versatile and scalable self-morphing designs without detailed process control.
Authors:Hang Xu, Jie Huang, Linjiang Huang, Dong Li, Yidi Liu, Feng Zhao
Abstract:
Domain Adaptation (DA) for dense prediction tasks is an important topic, which enhances a dense prediction model's performance when tested on an unseen domain. Recently, with the development of Diffusion-based Dense Prediction (DDP) models, DA designs tailored to this framework are worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. Our motivation arises from the observation that the exposure bias (e.g., noise statistics bias) in diffusion brings domain shift, and different domains in the conditions of DDP models can also be effectively captured by the noise prediction statistics. Based on this, we propose a training-free Domain Noise Alignment (DNA) approach, which alleviates the variation of noise statistics induced by domain changes during the diffusion sampling process, thereby achieving domain adaptation. Specifically, when the source domain is available, we directly adopt the DNA method to achieve domain adaptation by aligning the noise statistics of the target domain with those of the source domain. For the more challenging source-free DA, inspired by the observation that regions closer to the source domain exhibit higher confidence under variations of sampling noise, we progressively utilize the statistics from high-confidence regions to guide the noise statistic adjustment during the sampling process. Notably, our method demonstrates the effectiveness of enhancing the DA capability of DDP models across four common dense prediction tasks. Code is available at https://github.com/xuhang07/FreeDNA.
Chinese: 本研究提出了一种无需训练的域噪声对齐(DNA)方法,通过在扩散采样过程中对齐域间的噪声统计特性,有效增强了基于扩散的密集预测模型的域适应能力,无需额外训练即可提升多种任务的性能。
English: This study introduces a training-free Domain Noise Alignment (DNA) method that enhances domain adaptation in diffusion-based dense prediction models by aligning noise statistics between domains during the sampling process, effectively improving performance across various tasks without requiring additional training.
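The alignment step itself is lightweight: standardize the predicted noise, then re-scale it toward reference (source or high-confidence) statistics at each sampling step. The momentum-style blending below is our simplified reading; the paper applies the adjustment progressively and region-wise in the source-free setting.

```python
import torch

def align_noise_stats(eps: torch.Tensor, ref_mean: float, ref_std: float,
                      momentum: float = 0.9) -> torch.Tensor:
    """Shift the predicted noise statistics toward reference statistics."""
    mu_t, std_t = eps.mean(), eps.std()
    standardized = (eps - mu_t) / (std_t + 1e-8)
    mu = momentum * ref_mean + (1 - momentum) * mu_t
    std = momentum * ref_std + (1 - momentum) * std_t
    return standardized * std + mu

eps = torch.randn(4, 3, 64, 64) * 1.3 + 0.2  # biased target-domain noise
aligned = align_noise_stats(eps, ref_mean=0.0, ref_std=1.0)
print(f"mean {aligned.mean():.3f}, std {aligned.std():.3f}")
```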
Authors:Chenyang Shao, Tianxing Li, Chenhao Pu, Fengli Xu, Yong Li
Abstract:
In today's digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization framework. First, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at https://github.com/tsinghua-fib-lab/AgentStealth.
中文摘要:AgentStealth是一种基于本地部署小型语言模型的文本匿名化框架,通过对抗性学习和强化训练,在保护隐私的同时显著提升了数据可用性。
English Summary: AgentStealth is a novel framework using locally deployed small language models to anonymize text, achieving superior privacy protection and utility through adversarial learning and reinforcement techniques.
Authors:Hassan Baker, Matthew S. Emigh, Austin J. Brockmeier
Abstract:
As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach outperforms previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality, we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The codebase for this work can be found on our GitHub page: https://github.com/bakerhassan/WSOS.
Chinese: 本文提出了一种弱监督的二元目标分割方法,利用图像级目标存在标签并通过将分割目标与聚类背景融合生成反事实图像,在声纳等专业领域和自然图像上实现了优于无监督基线的性能,且无需预训练网络或对抗训练。
English: This paper introduces a weakly supervised method for binary object segmentation that uses image-level object presence labels and creates counterfactual images by blending segmented objects with clustered backgrounds, achieving improved performance on specialized domains like sonar and natural images without relying on pretrained networks or adversarial training.
Authors:Hassan Baker, Austin J. Brockmeier
Abstract:
Detecting brain lesions as abnormalities observed in magnetic resonance imaging (MRI) is essential for diagnosis and treatment. In the search for abnormalities, such as tumors and malformations, radiologists may benefit from computer-aided diagnostics that use computer vision systems trained with machine learning to segment normal tissue from abnormal brain tissue. While supervised learning methods require annotated lesions, we propose a new unsupervised approach (Patch2Loc) that learns from normal patches taken from structural MRI. We train a neural network model to map a patch back to its spatial location within a slice of the brain volume. During inference, abnormal patches are detected by the relatively higher error and/or variance of the location prediction. This generates a heatmap that can be integrated into pixel-wise methods to achieve finer-grained segmentation. We demonstrate the ability of our model to segment abnormal brain tissues by applying our approach to the detection of tumor tissues in MRI on T2-weighted images from the BraTS2021 and MSLUB datasets and T1-weighted images from the ATLAS and WMH datasets. We show that it outperforms the state of the art in unsupervised segmentation. The codebase for this work can be found on our GitHub page: https://github.com/bakerhassan/Patch2Loc.
中文: 本研究提出Patch2Loc这一无监督方法,通过训练神经网络预测脑部MRI图像中斑块的位置,并利用预测误差识别病变区域,在性能上超越了现有最先进的无监督分割技术。
English: This study introduces Patch2Loc, an unsupervised method that detects brain lesions in MRI by training a neural network to predict patch locations and identifying abnormalities through prediction errors, outperforming current state-of-the-art techniques.
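The detection principle fits in a few lines: regress each patch's location in the slice, then score patches by their location-prediction error. The tiny MLP below is a stand-in for the paper's architecture; only the error-as-anomaly-score logic is the point.

```python
import torch

class LocNet(torch.nn.Module):
    """Illustrative patch-to-location regressor (not the paper's network)."""
    def __init__(self, patch_dim: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(patch_dim, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, 2),  # predicted (row, col), normalized
        )

    def forward(self, patch):
        return self.net(patch)

def anomaly_score(model, patches, true_locs):
    # Higher location error -> more likely abnormal tissue; the per-patch
    # scores form the heatmap described in the abstract.
    with torch.no_grad():
        pred = model(patches)
    return torch.norm(pred - true_locs, dim=-1)

model = LocNet(patch_dim=256)        # e.g. flattened 16x16 patches
patches = torch.randn(8, 256)
locs = torch.rand(8, 2)              # normalized ground-truth positions
print(anomaly_score(model, patches, locs))
```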
Authors:Weiyi Zhao, Xiaoyu Tan, Liang Liu, Sijia Li, Youwei Song, Xihe Qiu
Abstract:
Surgical risk identification is critical for patient safety and reducing preventable medical errors. While multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, they often exhibit visual-semantic knowledge conflicts (VS-KC), failing to identify visual safety violations despite understanding textual rules. To address this, we introduce a dataset comprising over 34,000 synthetic images generated by diffusion models, depicting operating room scenes containing entities that violate established safety rules. These images were created to alleviate data scarcity and examine MLLMs' vulnerabilities. In addition, the dataset includes 214 human-annotated images that serve as a gold-standard reference for validation. This comprehensive dataset, spanning diverse perspectives, stages, and configurations, is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs' detection of trained conflict entities and generalizes well to new viewpoints for these entities, but performance on untrained entity types remains poor, highlighting learning specificity and the need for comprehensive training. The main contributions of this work include: (1) a data generation methodology tailored for rule-violation scenarios; (2) the release of the OR-VSKC dataset and its associated benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at https://github.com/zgg2577/VS-KC.
中文: 本研究提出OR-VSKC数据集,包含手术室安全违规场景的合成图像和人工标注图像,旨在解决多模态大语言模型的视觉-语义知识冲突问题,实验表明微调能提升模型对已知违规实体的检测能力,但对未训练实体类型的泛化能力仍显不足。
English: This study introduces the OR-VSKC dataset, comprising synthetic and human-annotated images of operating room safety violations, to address visual-semantic knowledge conflicts in multimodal large language models, improving their detection of trained risks while revealing limitations in generalizing to unseen entities.
Authors:Muhammad Ahmed Mohsin, Muhammad Umer, Ahsan Bilal, Muhammad Ali Jamshed, John M. Cioffi
Abstract:
Modern 5G/6G deployments routinely face cross-configuration handovers (users traversing cells with different antenna layouts, carrier frequencies, and scattering statistics), which inflate channel-prediction NMSE by $37.5\%$ on average when models are naively fine-tuned. The proposed improvement frames this mismatch as a continual-learning problem and benchmarks three adaptation families: replay with loss-aware reservoirs, synaptic-importance regularization, and memory-free learning-without-forgetting. Across three representative 3GPP urban micro scenarios, the best replay and regularization schemes cut the high-SNR error floor by up to 2 dB ($\approx 35\%$), while even the lightweight distillation recovers up to $30\%$ improvement over baseline handover prediction schemes. These results show that targeted rehearsal and parameter anchoring are essential for handover-robust CSI prediction and suggest a clear migration path for embedding continual-learning hooks into current channel prediction efforts in 3GPP NR and O-RAN. The full codebase can be found at https://github.com/ahmd-mohsin/continual-learning-channel-prediction.git.
中文摘要:现代5G/6G网络在跨配置切换时会产生显著信道预测误差,而采用基于回放和正则化的持续学习方法,相比基准方案可将误差降低高达35%。
English Summary: Modern 5G/6G networks experience significant channel prediction errors during cross-configuration handovers, but the proposed continual-learning approach using replay and regularization methods reduces these errors by up to 35% compared to baseline schemes.
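For readers unfamiliar with loss-aware replay, a minimal sketch of a reservoir that biases retention toward high-loss CSI samples follows; the class name, priority jitter, and eviction policy are our own illustration under stated assumptions, not the repository's implementation.

```python
import random
import heapq

class LossAwareReservoir:
    """Replay buffer that prefers high-loss samples (a sketch of the
    'loss-aware reservoir' idea; the paper's exact policy may differ)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.heap = []      # min-heap of (priority, counter, sample)
        self.counter = 0    # tie-breaker so samples are never compared

    def add(self, sample, loss: float):
        # Jitter the loss so ties break randomly, as in weighted reservoirs.
        priority = loss * random.random()
        self.counter += 1
        item = (priority, self.counter, sample)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif priority > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)  # evict lowest-priority sample

    def sample(self, k: int):
        return random.sample([s for _, _, s in self.heap], min(k, len(self.heap)))
```

In use, a few replayed CSI windows from earlier cell configurations would be mixed into each fine-tuning batch after a handover, which is the rehearsal mechanism the abstract credits for the error-floor reduction.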
Authors:Weizhi Gao, Zhichao Hou, Junqi Yin, Feiyi Wang, Linyu Peng, Xiaorui Liu
Abstract:
Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at https://github.com/WeizhiGao/MoDiff.
中文: 本文提出MoDiff这一创新框架,通过调制量化和误差补偿加速扩散模型,在训练后量化中将激活位从8位降至3位且不损失性能。
English: This paper introduces MoDiff, a novel framework that accelerates diffusion models through modulated quantization and error compensation, effectively reducing activation bits from 8 to 3 without performance loss in post-training quantization.
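A minimal sketch of the modulated-quantization idea, assuming a symmetric per-tensor uniform quantizer (real PTQ uses calibrated scales and the paper's formulation is more involved): quantize the step-to-step change in activations, whose dynamic range is far smaller than the raw activations, and carry the quantization residual into the next step.

```python
import torch

def uniform_quant(x: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Symmetric uniform quantizer (illustrative; real PTQ calibrates scales)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

class ModulatedQuantizer:
    """Sketch of modulated quantization with error compensation across
    denoising steps. Names here are our own, not MoDiff's API."""
    def __init__(self, bits: int = 3):
        self.bits = bits
        self.prev = None   # previously transmitted (dequantized) activation
        self.err = 0.0     # running quantization error for compensation

    def __call__(self, act: torch.Tensor) -> torch.Tensor:
        if self.prev is None:
            self.prev = uniform_quant(act, bits=8)  # first step at higher precision
            return self.prev
        delta = act - self.prev + self.err          # modulate + compensate
        q_delta = uniform_quant(delta, self.bits)
        self.err = delta - q_delta                  # carry residual forward
        self.prev = self.prev + q_delta
        return self.prev
```

The design intuition is that consecutive denoising steps produce highly correlated activations, so the modulated signal (the delta) tolerates 3-bit quantization where the raw activations would not.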
Authors:Petr Pechman, Milan Straka, Jana Straková, Jakub Náplava
Abstract:
We present a grammar error correction (GEC) system that achieves state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture, and its key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors by introducing both language-agnostic and Czech-specific errors. We conduct a comprehensive series of experiments, investigating the Czech GEC corpora as bases for synthetic error introduction, several error generation strategies, domain balancing, tokenization granularity, model size, and data scaling during fine-tuning. Additionally, we evaluate the performance of large language models (LLMs) on Czech GEC in both end-user and expert fine-tuning scenarios. Our best-performing model is superior both in performance and computational efficiency. The source code and links to the trained models are available at https://github.com/ufal/tsd2025-gec.
中文: 我们推出了一种基于Transformer架构的捷克语语法纠错系统,通过实时合成错误生成流水线和全面实验,在性能和效率上均优于现有模型。
English: We introduce a state-of-the-art Czech grammar error correction system using a Transformer-based neural network, featuring a real-time synthetic error generation pipeline and comprehensive experiments that outperform existing models in both performance and efficiency.
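A toy version of the language-agnostic half of such a synthetic-error pipeline (the Czech-specific errors, e.g. diacritic and morphological confusions, are omitted; operation names and rates are our own assumptions):

```python
import random

def corrupt(sentence: str, p: float = 0.1, rng=random.Random(42)) -> str:
    """Inject artificial errors into a clean sentence to create a
    (corrupted, clean) GEC training pair. A simplified sketch of a
    language-agnostic error channel."""
    ops = ("swap", "drop", "dup", "char")
    words = sentence.split()
    out, i = [], 0
    while i < len(words):
        if rng.random() < p:
            op = rng.choice(ops)
            if op == "swap" and i + 1 < len(words):     # transpose two words
                out.extend([words[i + 1], words[i]]); i += 2; continue
            if op == "drop":                            # delete a word
                i += 1; continue
            if op == "dup":                             # duplicate a word
                out.extend([words[i], words[i]]); i += 1; continue
            if op == "char" and len(words[i]) > 1:      # swap adjacent chars
                w = words[i]; j = rng.randrange(len(w) - 1)
                out.append(w[:j] + w[j + 1] + w[j] + w[j + 2:]); i += 1; continue
        out.append(words[i]); i += 1
    return " ".join(out)
```

Because the pipeline runs in real time during training, each epoch sees freshly corrupted versions of the same clean corpus, which is what makes this form of augmentation effectively unlimited.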
Authors:Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei
Abstract:
Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. OptScale employs a language model-based predictor to estimate probabilistic prior parameters, enabling the decision of the minimal number of samples needed that satisfy predefined performance thresholds and confidence levels. Extensive experiments on mathematical reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that OptScale significantly reduces sampling overhead while remaining better or on par with state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning. The source code is publicly available at https://github.com/Albertwyk/OptScale.
中文摘要:本文提出了OptScale概率框架,首次从理论上推导出实现目标推理性能所需的最小采样数量,在显著降低计算开销的同时保持最优性能,为LLM的高效推理提供了原理性指导。
English Summary: This paper introduces OptScale, a probabilistic framework that theoretically determines the minimum number of parallel samples needed for efficient inference-time scaling in LLMs, significantly reducing computational overhead while maintaining state-of-the-art reasoning performance.
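Under the abstract's i.i.d. assumption, the flavor of the sample-count bound can be reproduced in a few lines. This simplified version assumes a perfect Best-of-N selector (any correct sample, if present, is picked); OptScale additionally models the selector's distribution and estimates the prior with an LM-based predictor.

```python
import math

def min_samples(p_correct: float, target: float) -> int:
    """Smallest N such that P(at least one of N i.i.d. samples is correct)
    >= target, i.e. 1 - (1 - p)^N >= target. A simplified instance of the
    kind of lower bound the paper derives."""
    if p_correct <= 0:
        raise ValueError("need p_correct > 0")
    if p_correct >= 1 or target <= 0:
        return 1
    return math.ceil(math.log(1 - target) / math.log(1 - p_correct))

# e.g. the predictor estimates p = 0.3 for a problem; 95% coverage needs:
print(min_samples(0.3, 0.95))  # -> 9 samples instead of a fixed large budget
```

The practical payoff is exactly what the abstract claims: easy problems (high estimated p) terminate after a handful of samples, and the full budget is spent only where it is needed.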
Authors:Adiba Ejaz, Elias Bareinboim
Abstract:
Greedy Equivalence Search (GES) is a classic score-based algorithm for causal discovery from observational data. In the sample limit, it recovers the Markov equivalence class of graphs that describe the data. Still, it faces two challenges in practice: computational cost and finite-sample accuracy. In this paper, we develop Less Greedy Equivalence Search (LGES), a variant of GES that retains its theoretical guarantees while partially addressing these limitations. LGES modifies the greedy step: rather than always applying the highest-scoring insertion, it avoids edge insertions between variables for which the score implies some conditional independence. This more targeted search yields up to a \(10\)-fold speed-up and a substantial reduction in structural error relative to GES. Moreover, LGES can guide the search using prior assumptions, while correcting these assumptions when contradicted by the data. Finally, LGES can exploit interventional data to refine the learned observational equivalence class. We prove that LGES recovers the true equivalence class in the sample limit from observational and interventional data, even with misspecified prior assumptions. Experiments demonstrate that LGES outperforms GES and other baselines in speed, accuracy, and robustness to misspecified assumptions. Our code is available at https://github.com/CausalAILab/lges.
Chinese: 较少贪心等价搜索(LGES)是GES的改进版本,通过避免不必要的边插入来加速因果发现,在保持理论保证的同时实现了高达10倍的速度提升和更高的准确性。
English: Less Greedy Equivalence Search (LGES) is an improved variant of GES that speeds up causal discovery by avoiding unnecessary edge insertions, achieving up to a 10-fold acceleration and higher accuracy while maintaining theoretical guarantees.
Authors:Yuliang Huang, Imraj Singh, Thomas Joyce, Kris Thielemans, Jamie R. McClelland
Abstract:
3D Cone-Beam CT (CBCT) is widely used in radiotherapy but suffers from motion artifacts due to breathing. A common clinical approach mitigates this by sorting projections into respiratory phases and reconstructing images per phase, but this does not account for breathing variability. Dynamic CBCT instead reconstructs images at each projection, capturing continuous motion without phase sorting. Recent advancements in 4D Gaussian Splatting (4DGS) offer powerful tools for modeling dynamic scenes, yet their application to dynamic CBCT remains underexplored. Existing 4DGS methods, such as HexPlane, use implicit motion representations, which are computationally expensive. While explicit low-rank motion models have been proposed, they lack spatial regularization, leading to inconsistencies in Gaussian motion. To address these limitations, we introduce a free-form deformation (FFD)-based spatial basis function and a deformation-informed framework that enforces consistency by coupling the temporal evolution of each Gaussian's mean position, scale, and rotation under a unified deformation field. We evaluate our approach on six CBCT datasets, demonstrating superior image quality with a 6x speedup over HexPlane. These results highlight the potential of deformation-informed 4DGS for efficient, motion-compensated CBCT reconstruction. The code is available at https://github.com/Yuliang-Huang/DIGS.
中文: 针对三维锥束CT中呼吸运动导致的伪影问题,本研究提出基于自由形变的4D高斯溅射框架,通过统一高斯运动场实现运动补偿重建,在提升图像质量的同时较现有方法提速六倍。
English: To address motion artifacts in 3D CBCT caused by breathing variability, this study introduces a deformation-informed 4D Gaussian Splatting framework that uses free-form deformation to unify Gaussian motion, achieving higher image quality and a sixfold speed improvement over existing methods.
Authors:Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt
Abstract:
Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur, and what their spatial configuration should be. Do Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCO) dataset and test to what extent VLMs rely on scene context to refer to objects under different degrees of scene-object congruency, and different perturbations. Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise. In particular, models rely more on context under high target-scene congruence or when objects are degraded. Attention analysis reveals that successful object categorisation involves increased focus on the target in mid-level layers, especially under moderate noise, suggesting that VLMs dynamically balance local and contextual information for reference generation. We make our dataset, code and models available at https://github.com/cs-nlp-uu/scenereg.
中文摘要:本研究探讨视觉语言模型是否像人类一样利用场景上下文进行物体指代,发现模型会根据物体与场景的语义关联度及干扰程度自适应地依赖上下文,其注意力机制能动态平衡局部与整体信息。
English Summary: This study investigates whether Vision-Language Models utilize scene context for object reference like humans do, finding they adaptively rely on context based on object-scene congruence and noise levels, with attention mechanisms dynamically balancing local and contextual information.
Authors:Evgeny Dedov
Abstract:
Efficiently ranking relevant items from large candidate pools is a cornerstone of modern information retrieval systems -- such as web search, recommendation, and retrieval-augmented generation. Listwise rerankers, which improve relevance by jointly considering multiple candidates, are often limited in practice: either by model input size constraints, or by degraded quality when processing large sets. We propose a model-agnostic method for fast reranking of large candidate sets that exceed a model's input limit. The method first partitions candidate items into overlapping blocks, each of which is ranked independently in parallel. Implicit pairwise comparisons are then derived from these local rankings. Finally, these comparisons are aggregated to construct a global ranking using algorithms such as Winrate or PageRank. Experiments on TREC DL-2019 show that our method achieves an nDCG@10 of 70.88, compared to 57.68 for the full-context listwise approach using gpt-4.1-mini as the long-context model, while reducing latency from 21 to 8 seconds.
The implementation of the algorithm and the experiments is available in the repository: https://github.com/V3RGANz/jointrank
中文摘要:本文提出了一种模型无关的大规模重排方法,通过将候选集划分为重叠块进行并行排序,从中推导隐式成对比较,并聚合生成全局排名,相比全上下文方法在TREC DL-2019上实现了更高的nDCG@10指标并显著降低了延迟。
English Summary: This paper introduces a model-agnostic method for efficient large-scale reranking by partitioning candidates into overlapping blocks for parallel ranking, deriving implicit pairwise comparisons, and aggregating them into a global ranking, achieving superior nDCG@10 scores with reduced latency compared to full-context approaches.
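A compact sketch of the pipeline: overlapping block construction plus Winrate aggregation. In the real system each block would be ranked by a parallel listwise-reranker call, and the block design guarantees sufficient pair coverage; both are simplified here.

```python
from itertools import combinations
from collections import defaultdict

def blocks(items, block_size, overlap):
    """Cover `items` with overlapping blocks (illustrative partitioning;
    the paper uses a principled overlap so pairs co-occur often enough)."""
    step = block_size - overlap
    return [items[i:i + block_size]
            for i in range(0, max(len(items) - overlap, 1), step)]

def winrate_aggregate(local_rankings):
    """Derive implicit pairwise comparisons from each local ranking
    (earlier beats later) and aggregate by win rate; PageRank is the
    paper's alternative aggregator."""
    wins, games = defaultdict(int), defaultdict(int)
    for ranking in local_rankings:            # ranking: best-to-worst item ids
        for a, b in combinations(ranking, 2):
            wins[a] += 1
            games[a] += 1; games[b] += 1
    return sorted(games, key=lambda x: wins[x] / games[x], reverse=True)

# Two fake local rankings over overlapping blocks of document ids:
print(winrate_aggregate([[2, 0, 1], [2, 3, 1]]))  # -> [2, 0, 3, 1]
```

Since the blocks are ranked independently, total latency is one reranker round-trip plus cheap aggregation, which is where the 21s-to-8s reduction comes from.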
Authors:Ajay Mittal, Raghav Mehta, Omar Todd, Philipp Seeböck, Georg Langs, Ben Glocker
Abstract:
Automatic detection and classification of Cardiovascular disease (CVD) from Computed Tomography (CT) images play an important part in facilitating better-informed clinical decisions. However, most of the recent deep learning based methods either work directly on raw CT data or use it in tandem with anatomical cardiac structure segmentation by training an end-to-end classifier. As such, these approaches become much more difficult to interpret from a clinical perspective. To address this challenge, in this work, we break down the CVD classification pipeline into three components: (i) image segmentation, (ii) image registration, and (iii) downstream CVD classification. Specifically, we utilize the Atlas-ISTN framework and recent segmentation foundational models to generate anatomical structure segmentation and a normative healthy atlas. These are further utilized to extract clinically interpretable radiomic features as well as deformation field based geometric features (through atlas registration) for CVD classification. Our experiments on the publicly available ASOCA dataset show that utilizing these features leads to better CVD classification accuracy (87.50\%) when compared against a classification model trained directly on raw CT images (67.50\%). Our code is publicly available: https://github.com/biomedia-mira/grc-net
中文: 本研究提出一个三阶段心血管疾病分类流程,通过分割和图谱配准提取可解释的影像组学与几何特征,在ASOCA数据集上达到87.50%的准确率,显著优于直接使用原始CT图像的67.50%。
English: This study introduces a three-component pipeline for cardiovascular disease classification that uses interpretable radiomic and geometric features derived from segmentation and atlas registration, achieving 87.50% accuracy on the ASOCA dataset compared to 67.50% with raw CT images.
Authors:Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yue Wang, Yuzhi Zhang
Abstract:
Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), an efficient variant of PPO that lowers RL's computational cost, still faces limited exploration, low sample efficiency and instability, constraining its performance on complex reasoning tasks. To address these limitations, we introduce EFRame, an Exploration-Filter-Replay framework that systematically augments GRPO along three critical dimensions. EFRame performs additional rollouts to explore high-quality trajectories, applies online filtering to eliminate low-quality samples that introduce noise and variance, and leverages experience replay to repeatedly exploit rare but informative samples. EFRame establishes a complete and stable learning cycle, guiding the model through a structured transition from exploration to convergence. Our experiments across a variety of reasoning benchmarks demonstrate that EFRame not only improves the robustness and efficiency of training, but also enables access to deeper reasoning capabilities that remain unattainable under vanilla GRPO. Furthermore, EFRame not only enables fine-grained categorization of training samples for deeper insight into their contributions, but also introduces an efficient and precise mechanism for entropy control, which is critical for balancing exploration and convergence in RL training. Our code is available at https://github.com/597358816/EFRame.
中文: EFRame框架通过增强探索、过滤低质量样本和重放关键经验,提升了GRPO在复杂推理任务中的性能,在Geometry3K基准上相对改进37.9%,为语言模型的深度推理提供了稳健解决方案。
English: EFRame, an Exploration-Filter-Replay framework, enhances GRPO by enabling targeted exploration, stabilizing training through sample filtering, and amplifying rare trajectories, achieving a 37.9% improvement on Geometry3K and advancing reasoning in LLMs.
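A rough sketch of the filter-and-replay portion of the loop, operating on GRPO rollout groups (one group per prompt). The thresholds and buffer policy below are invented for illustration; the released code should be consulted for the real criteria.

```python
import torch

def filter_and_replay(groups, replay_buffer, max_replay=2):
    """Sketch of EFRame's filter + replay step. Each group holds the
    rewards of G rollouts for one prompt. Zero-variance groups yield
    zero group-relative advantage and are dropped; rare successes are
    buffered and re-used. Details are ours, not the paper's code."""
    kept = []
    for prompt, rewards in groups:
        r = torch.tensor(rewards, dtype=torch.float)
        if r.std() == 0:                        # all-pass or all-fail: no signal
            continue
        kept.append((prompt, rewards))
        if r.mean() < 0.25 and r.max() > 0:     # rare success: worth replaying
            replay_buffer.append((prompt, rewards))
    kept.extend(replay_buffer[-max_replay:])    # replay a few informative groups
    return kept
```

The split of responsibilities mirrors the abstract: extra rollouts widen exploration, filtering removes the variance that zero-advantage groups inject into the gradient, and replay keeps rare informative trajectories in circulation until the policy absorbs them.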
Authors:Ronald Fecso, José Morano, Ursula Schmidt-Erfurth, Hrvoje Bogunović
Abstract:
The rise of imaging techniques such as optical coherence tomography (OCT) and advances in deep learning (DL) have enabled clinicians and researchers to streamline retinal disease staging. A popular DL approach is self-supervised learning (SSL), where models learn from vast amounts of unlabeled data, avoiding costly annotation. SSL has allowed the development of foundation models (FMs), large models that can be used for a variety of downstream tasks. However, existing FMs for OCT, trained solely on image data, lack a comprehensive and robust semantic understanding of images, as evidenced by their downstream performance (especially for complex tasks), and thus require supervised fine-tuning (which may be unfeasible) to better adapt to specific applications and populations. To address this, we propose RetFiner, an SSL vision-language refinement scheme that improves the representations of existing FMs and enables their efficient and direct adaptation to specific populations for improved downstream performance. Our method uses a diverse set of training objectives which take advantage of the rich supervisory signal found in textual data. We tested RetFiner on the retinal FMs RETFound, UrFound, and VisionFM, showing significant improvements in linear probing performance on seven highly diverse OCT classification tasks, with an average increase of 5.8, 3.9, and 2.1 percentage points over their baselines, respectively. Our code and model weights are publicly available at https://github.com/ronnief1/RetFiner.
Chinese: RetFiner是一种自监督视觉语言优化方案,通过利用文本数据增强现有OCT基础模型,在无需监督微调的情况下显著提升了多种视网膜疾病分类任务的表现。
English: RetFiner is a self-supervised vision-language refinement scheme that enhances existing foundation models for OCT imaging by leveraging textual data, significantly improving their performance across diverse retinal disease classification tasks without requiring supervised fine-tuning.
Authors:Zhengyun Cheng, Ruizhe Zhang, Guanwen Zhang, Yi Xu, Xiangyang Ji, Wei Zhou
Abstract:
Higher-order tensors are well-suited for representing multi-dimensional data, such as images and videos, which typically characterize low-rank structures. Low-rank tensor decomposition has become essential in machine learning and computer vision, but existing methods like Tucker decomposition offer flexibility at the expense of interpretability. The CANDECOMP/PARAFAC (CP) decomposition provides a natural and interpretable structure, but obtaining sparse solutions remains challenging. Leveraging the rich properties of CP decomposition, we propose a CP-based low-rank tensor function parameterized by neural networks (NN) for implicit neural representation. This approach can model the tensor both on-grid and beyond grid, fully utilizing the non-linearity of NN with theoretical guarantees on excess risk bounds. To achieve sparser CP decomposition, we introduce a variational Schatten-p quasi-norm to prune redundant rank-1 components and prove that it serves as a common upper bound for the Schatten-p quasi-norms of arbitrary unfolding matrices. For smoothness, we propose a regularization term based on the spectral norm of the Jacobian and Hutchinson's trace estimator. The proposed smoothness regularization is SVD-free and avoids explicit chain rule derivations. It can serve as an alternative to Total Variation (TV) regularization in image denoising tasks and is naturally applicable to implicit neural representation. Extensive experiments on multi-dimensional data recovery tasks, including image inpainting, denoising, and point cloud upsampling, demonstrate the superiority and versatility of our method compared to state-of-the-art approaches. The code is available at https://github.com/CZY-Code/CP-Pruner.
Chinese: 本文提出了一种基于CP分解的低秩张量函数,通过神经网络参数化进行隐式神经表示,利用变分Schatten-p拟范数修剪冗余成分和谱范数正则化提升平滑性,在多维数据恢复任务中优于现有先进方法。
English: This paper introduces a CP-based low-rank tensor function using neural networks for implicit neural representation, enhancing sparsity through variational Schatten-p quasi-norm pruning and smoothness via spectral norm regularization, outperforming state-of-the-art methods in multi-dimensional data recovery tasks.
Authors:Noora Sassali, Roel Pieters
Abstract:
Pointing gestures are a common interaction method used in Human-Robot Collaboration for various tasks, ranging from selecting targets to guiding industrial processes. This study introduces a method for localizing pointed targets within a planar workspace. The approach employs pose estimation and a simple geometric model based on shoulder-wrist extension to extract gesturing data from an RGB-D stream. The study proposes a rigorous methodology and comprehensive analysis for evaluating pointing gestures and target selection in typical robotic tasks. In addition to the accuracy evaluation, the tool is integrated into a proof-of-concept robotic system, which includes object detection, speech transcription, and speech synthesis to demonstrate the integration of multiple modalities in a collaborative application. Finally, a discussion of tool limitations and performance is provided to understand its role in multimodal robotic systems. All developments are available at: https://github.com/NMKsas/gesture_pointer.git.
中文摘要:本研究提出了一种基于RGB-D数据流、通过姿态估计和肩腕几何模型定位人机协作中指向目标的方法,并通过集成多模态的机器人系统进行了全面评估与验证。
English Summary: This study presents a method for localizing pointed targets in human-robot collaboration using pose estimation and geometric modeling from RGB-D data, with comprehensive evaluation and multimodal integration demonstrated through a robotic system.
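The shoulder-wrist geometric model reduces to a ray-plane intersection. A self-contained sketch, assuming the pose estimator has already expressed the shoulder and wrist keypoints in the workspace coordinate frame:

```python
import numpy as np

def pointed_target(shoulder, wrist, plane_point, plane_normal):
    """Intersect the shoulder->wrist ray with a planar workspace (the
    simple geometric model described in the abstract; calibration and
    frame transforms are assumed handled upstream)."""
    shoulder = np.asarray(shoulder, float)
    wrist = np.asarray(wrist, float)
    n = np.asarray(plane_normal, float)
    d = wrist - shoulder                     # pointing direction
    denom = n @ d
    if abs(denom) < 1e-9:
        return None                          # ray parallel to the plane
    t = n @ (np.asarray(plane_point, float) - shoulder) / denom
    if t < 0:
        return None                          # plane is behind the user
    return shoulder + t * d

# Table plane z = 0, user pointing down and forward:
print(pointed_target([0, 0, 1.5], [0.3, 0.1, 1.2], [0, 0, 0], [0, 0, 1]))
# -> [1.5 0.5 0. ]
```

The simplicity of the model is the point: it needs only two 3D keypoints per frame, which is why it can run live on an RGB-D stream alongside the other modalities.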
Authors:Hyeongji Kim, Stine Hansen, Michael Kampffmeyer
Abstract:
Common prototype-based medical image few-shot segmentation (FSS) methods model foreground and background classes using class-specific prototypes. However, given the high variability of the background, a more promising direction is to focus solely on foreground modeling, treating the background as an anomaly -- an approach introduced by ADNet. Yet, ADNet faces three key limitations: dependence on a single prototype per class, a focus on binary classification, and fixed thresholds that fail to adapt to patient and organ variability. To address these shortcomings, we propose the Tied Prototype Model (TPM), a principled reformulation of ADNet with tied prototype locations for foreground and background distributions. Building on its probabilistic foundation, TPM naturally extends to multiple prototypes and multi-class segmentation while effectively separating non-typical background features. Notably, both extensions lead to improved segmentation accuracy. Finally, we leverage naturally occurring class priors to define an ideal target for adaptive thresholds, boosting segmentation performance. Taken together, TPM provides a fresh perspective on prototype-based FSS for medical image segmentation. The code can be found at https://github.com/hjk92g/TPM-FSS.
中文: TPM模型通过绑定前景和背景的原型位置,支持多类别分割和自适应阈值,有效解决了医学图像中因患者和器官差异导致的背景多变问题。
English: The Tied Prototype Model (TPM) improves upon ADNet by introducing tied prototypes for foreground and background, enabling multi-class segmentation and adaptive thresholds to address variability in medical images.
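A minimal reading of the tied-prototype idea: model foreground and background features as isotropic Gaussians sharing one mean (the prototype) but with different spreads, so the foreground posterior is a function of prototype distance alone. The spreads and prior below are placeholder values, and the paper's multi-prototype, multi-class, and adaptive-threshold extensions are omitted.

```python
import torch

def tied_prototype_fg_prob(feats, prototype, sigma_f=1.0, sigma_b=2.0,
                           prior_f=0.3):
    """Foreground posterior under a sketch of the Tied Prototype Model:
    two isotropic Gaussians with a shared mean (the prototype), where
    the tighter one models foreground and the wider one absorbs
    non-typical background. Hyperparameters here are ours."""
    d2 = ((feats - prototype) ** 2).sum(-1)   # squared distance per pixel
    D = feats.shape[-1]
    log_f = (torch.log(torch.tensor(prior_f))
             - D * torch.log(torch.tensor(sigma_f)) - d2 / (2 * sigma_f ** 2))
    log_b = (torch.log(torch.tensor(1 - prior_f))
             - D * torch.log(torch.tensor(sigma_b)) - d2 / (2 * sigma_b ** 2))
    return torch.sigmoid(log_f - log_b)       # p(foreground | feature)

feats = torch.randn(8, 8, 64)                 # toy feature map
proto = torch.zeros(64)
mask = tied_prototype_fg_prob(feats, proto) > 0.5
```

Because both densities peak at the same location, the decision boundary is a sphere around the prototype whose radius follows from the priors, which is what lets the paper replace ADNet's fixed threshold with a principled, prior-driven one.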
Authors:Zipei Ma, Junzhe Jiang, Yurui Chen, Li Zhang
Abstract:
The realistic reconstruction of street scenes is critical for developing real-world simulators in autonomous driving. Most existing methods rely on object pose annotations, using these poses to reconstruct dynamic objects and move them during the rendering process. This dependence on high-precision object annotations limits large-scale and extensive scene reconstruction. To address this challenge, we propose Bézier curve Gaussian splatting (BézierGS), which represents the motion trajectories of dynamic objects using learnable Bézier curves. This approach fully leverages the temporal information of dynamic objects and, through learnable curve modeling, automatically corrects pose errors. By introducing additional supervision on dynamic object rendering and inter-curve consistency constraints, we achieve reasonable and accurate separation and reconstruction of scene elements. Extensive experiments on the Waymo Open Dataset and the nuPlan benchmark demonstrate that BézierGS outperforms state-of-the-art alternatives in both dynamic and static scene components reconstruction and novel view synthesis.
中文: 本文提出BézierGS方法,通过可学习的贝塞尔曲线建模动态物体运动轨迹,无需依赖精确位姿标注即可实现精确街景重建,在自动驾驶基准测试中展现出优越性能。
English: This paper introduces BézierGS, a method that uses learnable Bézier curves to model dynamic object motion, enabling accurate street scene reconstruction without relying on precise pose annotations and demonstrating superior performance on autonomous driving benchmarks.
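The core trajectory model is a Bernstein-basis curve per Gaussian. A small sketch follows; the degree and control points here are arbitrary, whereas in the method they would be learnable and coupled with the Gaussian's other attributes.

```python
import numpy as np
from math import comb

def bezier_position(control_points: np.ndarray, t: float) -> np.ndarray:
    """Evaluate a Bezier curve at normalized time t in [0, 1] using the
    Bernstein basis. In BezierGS each dynamic Gaussian would carry its
    own learnable control points; here they are a plain numpy array."""
    n = len(control_points) - 1                   # curve degree
    basis = np.array([comb(n, i) * t**i * (1 - t)**(n - i)
                      for i in range(n + 1)])
    return basis @ control_points                 # convex combination

# Cubic trajectory for one Gaussian center, sampled at a frame timestamp:
ctrl = np.array([[0, 0, 0], [1, 2, 0], [3, 2, 1], [4, 0, 1]], dtype=float)
print(bezier_position(ctrl, 0.5))   # -> [2.  1.5 0.5], midpoint of the motion
```

Since the curve is differentiable in its control points, trajectory fitting reduces to ordinary gradient descent through the rendering loss, which is how the method can correct noisy or missing pose annotations automatically.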
Authors:Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Lu Yin, Can Yang
Abstract:
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.
中文摘要:本文提出梯度保持激活缩放(GPAS)方法,通过缩放中间激活值同时保持梯度不变,解决Pre-LayerNorm Transformer中激活方差指数增长问题,在不同规模模型中均实现了性能提升。
English Summary: The paper introduces Gradient-Preserving Activation Scaling (GPAS), a technique that addresses the exponential activation variance growth in Pre-LayerNorm Transformers by scaling down intermediate activations while preserving gradients, achieving consistent performance improvements across various model sizes.
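The abstract's description ("scale down intermediate activations while keeping their gradients unchanged") maps onto a stop-gradient identity. A sketch, with a sigmoid gate as our assumed parameterization; the paper's exact form may differ.

```python
import torch
import torch.nn as nn

class GPAS(nn.Module):
    """Sketch of Gradient-Preserving Activation Scaling: the forward
    pass emits s * x (damping variance growth across layers), but x is
    detached inside the scaled term, so dL/dx equals that of the
    identity function while the gate s still receives a gradient."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(()))  # sigmoid(0) = 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = torch.sigmoid(self.gate)
        # Value: x + (s - 1) * x = s * x.  Gradient wrt x: 1 (second term
        # sees a detached x).  Gradient wrt the gate: flows through s.
        return x + (s - 1.0) * x.detach()
```

Inserted after each residual sub-layer, this keeps the shortcut from dominating without the vanishing gradients that naive downscaling would cause.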
Authors:Hong Nie, Fuyuan Cao, Lu Chen, Fengxin Chen, Yuefeng Zou, Jun Yu
Abstract:
Reconstruction and rendering-based talking head synthesis methods achieve high-quality results with strong identity preservation but are limited by their dependence on identity-specific models. Each new identity requires training from scratch, incurring high computational costs and reduced scalability compared to generative model-based approaches. To overcome this limitation, we propose FIAG, a novel 3D speaking head synthesis framework that enables efficient identity-specific adaptation using only a small amount of training footage. FIAG incorporates Global Gaussian Field, which supports the representation of multiple identities within a shared field, and Universal Motion Field, which captures the common motion dynamics across diverse identities. Benefiting from the shared facial structure information encoded in the Global Gaussian Field and the general motion priors learned in the motion field, our framework enables rapid adaptation from canonical identity representations to specific ones with minimal data. Extensive comparative and ablation experiments demonstrate that our method outperforms existing state-of-the-art approaches, validating both the effectiveness and generalizability of the proposed framework. Code is available at: https://github.com/gme-hong/FIAG.
中文摘要:提出的FIAG框架通过共享的全局场和运动场,仅需少量训练数据即可实现高效身份适配,突破了传统说话头合成的局限性。
English Summary: The proposed FIAG framework overcomes the limitations of traditional talking head synthesis by enabling efficient identity adaptation with minimal training data through shared global and motion fields.
Authors:Lu Han, Yu Liu, Qiwen Deng, Jian Jiang, Yinbo Sun, Zhe Yu, Binfeng Wang, Xingyu Lu, Lintao Ma, Han-Jia Ye, De-Chuan Zhan
Abstract:
Time Series Foundation Models (TSFMs) have achieved remarkable success through large-scale pretraining. However, their design primarily targets real-valued series, limiting their ability to handle general forecasting tasks involving diverse and often heterogeneous covariates--such as categorical variables and multimodal data (e.g., images, text)--which are typically task-specific and difficult to leverage during pretraining. To address this gap, we propose Unified Covariate Adaptation (UniCA), a framework to bridge TSFMs with general covariate-aware forecasting. UniCA first performs covariate homogenization to transform heterogeneous covariates into high-level homogeneous series representations and then fuses them via a unified attention-based fusion mechanism. UniCA is compatible and universal for adaptation with both homogeneous and heterogeneous covariates, incorporating extra covariate information while preserving the generalization ability of TSFMs. Extensive experiments on multiple unimodal and multimodal covariate-aware forecasting benchmarks demonstrate the superiority of UniCA, highlighting the promise of covariate-aware TSFM adaptation in real-world forecasting scenarios. Code is released at https://github.com/hanlu-nju/UniCA.
中文: 时间序列基础模型虽在大规模预训练中表现出色,但难以处理多样化协变量,为此提出的UniCA框架通过同质化处理和融合机制,有效整合异构数据以提升预测性能,同时保持模型的泛化能力。
English: Time Series Foundation Models (TSFMs) excel in large-scale pretraining but struggle with diverse covariates, prompting the development of UniCA, a framework that homogenizes and fuses heterogeneous data to enhance forecasting while preserving TSFMs' generalization.
Authors:Ossi Parikka, Roel Pieters
Abstract:
Modern industry is increasingly moving away from mass manufacturing, towards more specialized and personalized products. As manufacturing tasks become more complex, full automation is not always an option, and human involvement may be required. This has increased the need for advanced human-robot collaboration (HRC), and with it, improved methods for interaction, such as voice control. Recent advances in natural language processing, driven by artificial intelligence (AI), have the potential to answer this demand. Large language models (LLMs) have rapidly developed very impressive general reasoning capabilities, and many methods of applying this to robotics have been proposed, including through the use of code generation. This paper presents Language Model Program Voice Control (LMPVC), an LLM-based prototype voice control architecture with integrated policy programming and teaching capabilities, built for use with Robot Operating System 2 (ROS2) compatible robots. The architecture builds on prior works using code generation for voice control by implementing an additional programming and teaching system, the Policy Bank. We find this system can compensate for the limitations of the underlying LLM, and allow LMPVC to adapt to different downstream tasks without a slow and costly training process. The architecture and additional results are released on GitHub (https://github.com/ozzyuni/LMPVC).
中文摘要:现代工业向个性化生产转型需要先进的人机协作,因此开发了LMPVC这一基于人工智能的语音控制系统,它集成了编程与教学功能,可实现机器人任务的自适应调整。
English Summary: Modern industry's shift towards personalized production requires advanced human-robot collaboration, leading to the development of LMPVC, an AI-powered voice control system that integrates programming and teaching capabilities for adaptable robotics applications.
Authors:Han Wang, Shengyang Li, Jian Yang, Yuxuan Liu, Yixuan Lv, Zhuang Zhou
Abstract:
Detecting and tracking ground objects using earth observation imagery remains a significant challenge in the field of remote sensing. Continuous maritime ship tracking is crucial for applications such as maritime search and rescue, law enforcement, and shipping analysis. However, most current ship tracking methods rely on geostationary satellites or video satellites. The former offer low resolution and are susceptible to weather conditions, while the latter have short filming durations and limited coverage areas, making them less suitable for the real-world requirements of ship tracking. To address these limitations, we present the Hybrid Optical and Synthetic Aperture Radar (SAR) Ship Re-Identification Dataset (HOSS ReID dataset), designed to evaluate the effectiveness of ship tracking using low-Earth orbit constellations of optical and SAR sensors. This approach ensures shorter re-imaging cycles and enables all-weather tracking. HOSS ReID dataset includes images of the same ship captured over extended periods under diverse conditions, using different satellites of different modalities at varying times and angles. Furthermore, we propose a baseline method for cross-modal ship re-identification, TransOSS, which is built on the Vision Transformer architecture. It refines the patch embedding structure to better accommodate cross-modal tasks, incorporates additional embeddings to introduce more reference information, and employs contrastive learning to pre-train on large-scale optical-SAR image pairs, ensuring the model's ability to extract modality-invariant features. Our dataset and baseline method are publicly available on https://github.com/Alioth2000/Hoss-ReID.
中文摘要:HOSS ReID数据集通过融合光学与合成孔径雷达卫星影像,解决了现有船舶追踪技术受天气和覆盖范围限制的问题,同时提出的TransOSS基准方法改进了视觉Transformer架构,利用对比学习提升跨模态船舶重识别能力。
English Summary: The HOSS ReID dataset addresses limitations in maritime ship tracking by combining optical and SAR satellite imagery for all-weather monitoring with shorter re-imaging cycles, while the TransOSS baseline method enhances cross-modal ship re-identification through modified Vision Transformer architecture and contrastive learning.
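The contrastive pretraining stage described for TransOSS is, at its core, symmetric InfoNCE over matched optical-SAR pairs. A sketch, with the temperature and normalization choices being ours:

```python
import torch
import torch.nn.functional as F

def infonce(opt_emb: torch.Tensor, sar_emb: torch.Tensor,
            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched optical/SAR embeddings:
    each optical image must pick out its paired SAR image among the
    batch, and vice versa, pushing the encoder toward modality-invariant
    features. A simplified sketch of this style of pretraining."""
    opt = F.normalize(opt_emb, dim=-1)
    sar = F.normalize(sar_emb, dim=-1)
    logits = opt @ sar.t() / temperature      # [B, B] similarity matrix
    targets = torch.arange(opt.size(0), device=opt.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

After pretraining on large-scale optical-SAR pairs, the shared embedding space is what makes cross-modal re-identification a nearest-neighbor search rather than a per-modality matching problem.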
Authors:Qi Gao, Zhihao Chen, Dong Zeng, Junping Zhang, Jianhua Ma, Hongming Shan
Abstract:
The generalization of deep learning-based low-dose computed tomography (CT) reconstruction models to doses unseen in the training data is important and remains challenging. Previous efforts heavily rely on paired data to improve the generalization performance and robustness through collecting either diverse CT data for re-training or a few test data for fine-tuning. Recently, diffusion models have shown promising and generalizable performance in low-dose CT (LDCT) reconstruction, however, they may produce unrealistic structures due to the CT image noise deviating from Gaussian distribution and imprecise prior information from the guidance of noisy LDCT images. In this paper, we propose a noise-inspired diffusion model for generalizable LDCT reconstruction, termed NEED, which tailors diffusion models for noise characteristics of each domain. First, we propose a novel shifted Poisson diffusion model to denoise projection data, which aligns the diffusion process with the noise model in pre-log LDCT projections. Second, we devise a doubly guided diffusion model to refine reconstructed images, which leverages LDCT images and initial reconstructions to more accurately locate prior information and enhance reconstruction fidelity. By cascading these two diffusion models for dual-domain reconstruction, our NEED requires only normal-dose data for training and can be effectively extended to various unseen dose levels during testing via a time step matching strategy. Extensive qualitative, quantitative, and segmentation-based evaluations on two datasets demonstrate that our NEED consistently outperforms state-of-the-art methods in reconstruction and generalization performance. Source code is made available at https://github.com/qgao21/NEED.
中文摘要:本文提出NEED模型,一种针对低剂量CT重建的噪声启发扩散方法,通过移位泊松扩散处理投影数据噪声和双重引导扩散优化重建图像,仅需常规剂量训练数据即可泛化至不同未知剂量水平,并在重建性能上超越现有技术。
English Summary: The paper introduces NEED, a noise-inspired diffusion model for generalizable low-dose CT reconstruction that uses a shifted Poisson diffusion for projection denoising and a doubly guided diffusion for image refinement, requiring only normal-dose training data and outperforming existing methods across unseen dose levels.
Authors:Tianyu Zhang, Xin Luo, Li Li, Dong Liu
Abstract:
Diffusion-based image compression has shown remarkable potential for achieving ultra-low bitrate coding (less than 0.05 bits per pixel) with high realism, by leveraging the generative priors of large pre-trained text-to-image diffusion models. However, current approaches require a large number of denoising steps at the decoder to generate realistic results under extreme bitrate constraints, limiting their application in real-time compression scenarios. Additionally, these methods often sacrifice reconstruction fidelity, as diffusion models typically fail to guarantee pixel-level consistency. To address these challenges, we introduce StableCodec, which enables one-step diffusion for high-fidelity and high-realism extreme image compression with improved coding efficiency. To achieve ultra-low bitrates, we first develop an efficient Deep Compression Latent Codec to transmit a noisy latent representation for a single-step denoising process. We then propose a Dual-Branch Coding Structure, consisting of a pair of auxiliary encoder and decoder, to enhance reconstruction fidelity. Furthermore, we adopt end-to-end optimization with joint bitrate and pixel-level constraints. Extensive experiments on the CLIC 2020, DIV2K, and Kodak datasets demonstrate that StableCodec outperforms existing methods in terms of FID, KID and DISTS by a significant margin, even at bitrates as low as 0.005 bits per pixel, while maintaining strong fidelity. Additionally, StableCodec achieves inference speeds comparable to mainstream transform coding schemes. All source code is available at https://github.com/LuizScarlet/StableCodec.
中文:StableCodec提出了一种单步扩散方法,用于极低码率下的图像压缩,在显著提升真实感和保真度的同时,实现了与主流编码方案相当的推理速度。
English: StableCodec introduces a one-step diffusion method for extreme image compression that enhances both realism and fidelity at ultra-low bitrates, outperforming existing techniques in speed and quality metrics.
Authors:Junho Myung, Yeon Su Park, Sunwoo Kim, Shin Yoo, Alice Oh
Abstract:
Evaluating the performance and biases of large language models (LLMs) through role-playing scenarios is becoming increasingly common, as LLMs often exhibit biased behaviors in these contexts. Building on this line of research, we introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate LLMs' decision-making in prioritizing various levels of human needs. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. These narratives are constructed using the Existence, Relatedness, and Growth (ERG) theory, which categorizes human needs into three hierarchical levels. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences. Additionally, our evaluation of the impact of incorporating social identities into the narratives shows varying responsiveness based on both motivational needs and identity cues, with some models exhibiting higher denial rates for marginalized identities. All data is publicly available at https://github.com/yeonsuuuu28/papers-please.
中文摘要:PapersPlease基准通过基于ERG理论的3700个道德困境发现,大型语言模型在扮演移民官员时表现出系统性偏见,既呈现决策偏好模式,又对边缘化身份显示更高拒绝率。
English Summary: The PapersPlease benchmark uses 3,700 moral dilemmas based on ERG theory to reveal how large language models exhibit systematic biases when role-playing as immigration inspectors, showing preferential treatment patterns and heightened denial rates for marginalized identities.
Authors:Liu Yang, Huiyu Duan, Jiarui Wang, Jing Liu, Menghan Hu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Abstract:
With the rapid advancement of Artificial Intelligence Generated Content (AIGC) techniques, AI generated images (AIGIs) have attracted widespread attention, among which AI generated omnidirectional images (AIGODIs) hold significant potential for Virtual Reality (VR) and Augmented Reality (AR) applications. However, AI-generated omnidirectional images exhibit unique quality issues, and research on their quality assessment and optimization is still lacking. To this end, this work first studies the quality assessment and distortion-aware saliency prediction problems for AIGODIs, and further presents a corresponding optimization process. Specifically, we first establish a comprehensive database to reflect human feedback for AI-generated omnidirectional images, termed OHF2024, which includes both subjective quality ratings evaluated from three perspectives and distortion-aware salient regions. Based on the constructed OHF2024 database, we propose two models with shared encoders based on the BLIP-2 model to evaluate the human visual experience and predict distortion-aware saliency for AI-generated omnidirectional images, which are named as BLIP2OIQA and BLIP2OISal, respectively. Finally, based on the proposed models, we present an automatic optimization process that utilizes the predicted visual experience scores and distortion regions to further enhance the visual quality of an AI-generated omnidirectional image. Extensive experiments show that our BLIP2OIQA model and BLIP2OISal model achieve state-of-the-art (SOTA) results in the human visual experience evaluation task and the distortion-aware saliency prediction task for AI generated omnidirectional images, and can be effectively used in the optimization process. The database and codes will be released on https://github.com/IntMeGroup/AIGCOIQA to facilitate future research.
中文: 本研究针对AI生成的全景图像,建立了OHF2024数据库并开发了BLIP2OIQA和BLIP2OISal模型,在人类视觉体验评估和失真感知显著性预测方面取得领先性能,有效提升了图像视觉质量优化效果。
English: This study addresses the quality assessment and optimization of AI-generated omnidirectional images (AIGODIs) by establishing the OHF2024 database and developing BLIP2OIQA and BLIP2OISal models, which achieve state-of-the-art performance in evaluating human visual experience and predicting distortion-aware saliency for enhanced visual quality.
Authors:Juming Xiong, Ruining Deng, Jialin Yue, Siqi Lu, Junlin Guo, Marilyn Lionts, Tianyuan Yao, Can Cui, Junchao Zhu, Chongyu Qu, Mengmeng Yin, Haichun Yang, Yuankai Huo
Abstract:
Histological analysis plays a crucial role in understanding tissue structure and pathology. While recent advancements in registration methods have improved 2D histological analysis, they often struggle to preserve critical 3D spatial relationships, limiting their utility in both clinical and research applications. Specifically, constructing accurate 3D models from 2D slices remains challenging due to tissue deformation, sectioning artifacts, variability in imaging techniques, and inconsistent illumination. Deep learning-based registration methods have demonstrated improved performance but suffer from limited generalizability and require large-scale training data. In contrast, non-deep-learning approaches offer better generalizability but often compromise on accuracy. In this study, we introduced ZeroReg3D, a novel zero-shot registration pipeline tailored for accurate 3D reconstruction from serial histological sections. By combining zero-shot deep learning-based keypoint matching with optimization-based affine and non-rigid registration techniques, ZeroReg3D effectively addresses critical challenges such as tissue deformation, sectioning artifacts, staining variability, and inconsistent illumination without requiring retraining or fine-tuning. The code has been made publicly available at https://github.com/hrlblab/ZeroReg3D
Chinese: 本研究提出ZeroReg3D这一零样本配准流程,结合深度学习关键点匹配与优化技术,无需重新训练即可解决组织变形和成像差异问题,实现从二维组织切片到三维模型的精准重建。
English: This study introduces ZeroReg3D, a zero-shot registration pipeline that combines deep learning keypoint matching with optimization techniques to accurately reconstruct 3D models from 2D histological sections while addressing tissue deformation and imaging inconsistencies without requiring retraining.
Authors:Justin Reinman, Sunwoong Choi
Abstract:
CERBERUS is a synthetic benchmark designed to help train and evaluate AI models for detecting cracks and other defects in infrastructure. It includes a crack image generator and realistic 3D inspection scenarios built in Unity. The benchmark features two types of setups: a simple Fly-By wall inspection and a more complex Underpass scene with lighting and geometry challenges. We tested a popular object detection model (YOLO) using different combinations of synthetic and real crack data. Results show that combining synthetic and real data improves performance on real-world images. CERBERUS provides a flexible, repeatable way to test defect detection systems and supports future research in automated infrastructure inspection. CERBERUS is publicly available at https://github.com/justinreinman/Cerberus-Defect-Generator.
中文:CERBERUS是一个用于训练和评估基础设施缺陷检测AI模型的合成基准,包含裂缝图像生成器和3D场景,结合真实数据可提升性能,并已公开可用。
English: CERBERUS is a synthetic benchmark for training and evaluating AI models in infrastructure defect detection, featuring a crack image generator and 3D scenarios that improve performance when combined with real data, and it is publicly accessible.
Authors:Mingquan Liu
Abstract:
Fine-Grained Visual Categorization (FGVC) remains a challenging task in computer vision due to subtle inter-class differences and fragile feature representations. Existing methods struggle in fine-grained scenarios, especially when labeled data is scarce. We propose a semi-supervised method combining Mamba-based feature modeling, region attention, and Bayesian uncertainty. Our approach enhances local-to-global feature modeling while focusing on key areas during learning. Bayesian inference selects high-quality pseudo-labels for stability. Experiments show strong performance on FGVC benchmarks with occlusions, demonstrating robustness when labeled data is limited. Code is available at https://github.com/wxqnl/RAUM-Net.
中文: 本文提出一种半监督方法,结合Mamba特征建模、区域注意力和贝叶斯不确定性,通过增强局部到全局特征表示并筛选高质量伪标签,在标注数据有限的情况下显著提升了细粒度视觉分类的鲁棒性。
English: This paper introduces a semi-supervised method that integrates Mamba-based feature modeling, region attention, and Bayesian uncertainty to improve fine-grained visual categorization by enhancing local-to-global feature representation and selecting high-quality pseudo-labels, demonstrating robust performance on benchmarks with limited labeled data.
Authors:Umihiro Kamoto, Tatsuya Ishibashi, Noriyuki Kugo
Abstract:
In this report, we present the winning solution that achieved the 1st place in the Complex Video Reasoning & Robustness Evaluation Challenge 2025. This challenge evaluates the ability to generate accurate natural language answers to questions about diverse, real-world video clips. It uses the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) benchmark, which consists of 214 unique videos and 2,400 question-answer pairs spanning 11 categories. Our method, DIVE (Deep-search Iterative Video Exploration), adopts an iterative reasoning approach, in which each input question is semantically decomposed and solved through stepwise reasoning and progressive inference. This enables our system to provide highly accurate and contextually appropriate answers to even the most complex queries. Applied to the CVRR-ES benchmark, our approach achieves 81.44% accuracy on the test set, securing the top position among all participants. This report details our methodology and provides a comprehensive analysis of the experimental results, demonstrating the effectiveness of our iterative reasoning framework in achieving robust video question answering. The code is available at https://github.com/PanasonicConnect/DIVE
中文摘要:我们的DIVE方法在2025年复杂视频推理挑战赛中荣获第一,通过迭代推理对复杂视频问题进行逐步分解,在CVRR-ES基准测试中实现了81.44%的准确率。
English Summary: Our DIVE method won first place in the 2025 Complex Video Reasoning Challenge by using iterative reasoning to achieve 81.44% accuracy on the CVRR-ES benchmark through stepwise decomposition of complex video questions.
Authors:Yuansheng Li, Yunhao Zou, Linwei Chen, Ying Fu
Abstract:
Interferometric Hyperspectral Imaging (IHI) is a critical technique for large-scale remote sensing tasks due to its advantages in flux and spectral resolution. However, IHI is susceptible to complex errors arising from imaging steps, and its quality is limited by existing signal processing-based reconstruction algorithms. Two key challenges hinder performance enhancement: 1) the lack of training datasets; 2) the difficulty in eliminating IHI-specific degradation components through learning-based methods. To address these challenges, we propose a novel IHI reconstruction pipeline. First, based on imaging physics and radiometric calibration data, we establish a simplified yet accurate IHI degradation model and a parameter estimation method. This model enables the synthesis of realistic IHI training datasets from hyperspectral images (HSIs), bridging the gap between IHI reconstruction and deep learning. Second, we design the Interferometric Hyperspectral Reconstruction Unfolding Transformer (IHRUT), which achieves effective spectral correction and detail restoration through a stripe-pattern enhancement mechanism and a spatial-spectral transformer architecture. Experimental results demonstrate the superior performance and generalization capability of our method. The code is available at https://github.com/bit1120203554/IHRUT.
中文摘要:本研究提出了一种新型干涉高光谱成像重建流程,通过建立基于物理的精确退化模型生成训练数据,并设计专用的IHRUT变换器网络,有效解决了训练数据匮乏问题,实现了卓越的光谱空间重建性能。
English Summary: This study introduces a novel Interferometric Hyperspectral Imaging (IHI) reconstruction pipeline that combines a physics-based degradation model for realistic dataset synthesis with a specialized transformer network (IHRUT) to overcome training data limitations and achieve enhanced spectral-spatial reconstruction.
Authors:Yanguang Sun, Jiexi Yan, Jianjun Qian, Chunyan Xu, Jian Yang, Lei Luo
Abstract:
Automatically segmenting objects from optical remote sensing images (ORSIs) is an important task. Most existing models are primarily based on either convolutional or Transformer features, each offering distinct advantages. Exploiting both advantages is valuable research, but it presents several challenges, including the heterogeneity between the two types of features, high complexity, and large parameters of the model. However, these issues are often overlooked in existing ORSI methods, causing sub-optimal segmentation. To address this, we propose a novel Dual-Perspective United Transformer (DPU-Former) with a unique structure designed to simultaneously integrate long-range dependencies and spatial details. In particular, we design the global-local mixed attention, which captures diverse information through two perspectives and introduces a Fourier-space merging strategy to obviate deviations for efficient fusion. Furthermore, we present a gated linear feed-forward network to increase the expressive ability. Additionally, we construct a DPU-Former decoder to aggregate and strengthen features at different layers. Consequently, the DPU-Former model outperforms the state-of-the-art methods on multiple datasets. Code: https://github.com/CSYSI/DPU-Former.
中文: 提出的DPU-Former模型通过全局-局部混合注意力等创新组件,有效融合了卷积和Transformer特征的远程依赖与空间细节,在光学遥感图像分割任务中实现了最优性能。
English: The proposed DPU-Former model effectively integrates long-range dependencies and spatial details from both convolutional and Transformer features through innovative components like global-local mixed attention and a gated feed-forward network, achieving superior segmentation performance on optical remote sensing images.
Authors:Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun
Abstract:
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
中文: 原生多模态大语言模型将语音与文本生成整合于单一模型内,保留了副语言特征并降低了延迟,但受限于配对数据不足导致性能下降;DeepTalk通过自适应模态专家学习框架,显著减少了性能损失并保持了流畅的交互体验。
English: Native multimodal large language models integrate speech and text generation directly within a single model, preserving paralinguistic features and reducing latency, but face performance issues due to limited paired data, which DeepTalk addresses through adaptive modality expert learning to minimize performance degradation and maintain efficient interaction.
Authors:Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
Abstract:
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Differently, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multi-choices benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
Chinese: 本文提出LLaVA-Scissor,一种无需训练的视频多模态令牌压缩方法,通过语义连通组件实现全面语义覆盖,在多种视频理解基准测试中表现出优越性能。
English: This paper introduces LLaVA-Scissor, a training-free token compression method for video multimodal models that uses Semantic Connected Components to achieve comprehensive semantic coverage and superior performance across video understanding benchmarks.
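To make the SCC step described above concrete, here is a minimal sketch of grouping tokens into semantic connected components, assuming components are formed by thresholding pairwise cosine similarity and each component is summarized by its mean token; the threshold value and mean-pooling are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def scc_compress(tokens: np.ndarray, tau: float = 0.8) -> np.ndarray:
    """Group tokens into semantic connected components and mean-pool each.

    tokens: (N, D) array of token features.
    tau: cosine-similarity threshold for connecting two tokens (assumed).
    Returns one representative token per component, shape (K, D).
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T                       # pairwise cosine similarity
    adj = csr_matrix(sim >= tau)                  # threshold -> adjacency graph
    k, labels = connected_components(adj, directed=False)
    return np.stack([tokens[labels == c].mean(axis=0) for c in range(k)])

# Toy usage: 100 tokens of dim 32 compress to a handful of representatives.
reps = scc_compress(np.random.randn(100, 32).astype(np.float32))
print(reps.shape)
```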
Authors:Jiho Choi, Sang Jun Lee
Abstract:
In this paper, we propose a method that learns a general representation of periodic signals from unlabeled facial videos by capturing subtle changes in skin tone over time. The proposed framework employs the video masked autoencoder to learn a high-dimensional spatio-temporal representation of the facial region through self-supervised learning. Capturing quasi-periodic signals in the video is crucial for remote photoplethysmography (rPPG) estimation. To account for signal periodicity, we apply frame masking during video sampling, which allows the model to capture resampled quasi-periodic signals during the pre-training stage. Moreover, the framework incorporates physiological bandlimit constraints, leveraging the property that physiological signals are sparse within their frequency bandwidth to provide pulse cues to the model. The pre-trained encoder is then transferred to the rPPG task, where it is used to extract physiological signals from facial videos. We evaluate the proposed method through extensive experiments on the PURE, UBFC-rPPG, MMPD, and V4V datasets. Our results demonstrate significant performance improvements, particularly in challenging cross-dataset evaluations. Our code is available at https://github.com/ziiho08/Periodic-MAE.
中文: 本文提出一种通过视频掩码自编码器的自监督框架,从无标签面部视频中学习周期性生理信号,结合生理频带约束与帧掩码技术,在跨数据集rPPG评估中取得显著性能提升。
English: This paper introduces a self-supervised framework using video masked autoencoders to learn periodic facial signals from unlabeled videos, achieving superior cross-dataset rPPG estimation performance through physiological constraints and frame masking techniques.
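The bandlimit constraint can be made concrete with a small sketch: penalize predicted-signal power that falls outside the plausible heart-rate band. The band edges and loss form below are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def bandlimit_penalty(signal: np.ndarray, fs: float,
                      lo: float = 0.66, hi: float = 3.0) -> float:
    """Fraction of spectral power outside the plausible pulse band.

    signal: 1-D predicted rPPG waveform; fs: sampling rate in Hz.
    lo/hi bound the heart-rate band (~40-180 bpm); values are assumptions.
    Minimizing this encourages energy to concentrate inside the band.
    """
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    in_band = (freqs >= lo) & (freqs <= hi)
    return float(power[~in_band].sum() / (power.sum() + 1e-8))

# A 30 fps clip with a clean 1.2 Hz (72 bpm) pulse incurs almost no penalty.
t = np.arange(300) / 30.0
print(bandlimit_penalty(np.sin(2 * np.pi * 1.2 * t), fs=30.0))
```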
Authors:Kunjal Panchal, Sunav Choudhary, Yuriy Brun, Hui Guan
Abstract:
Forward-mode automatic differentiation (FmAD) and zero-order (ZO) optimization have been proposed as memory-efficient alternatives to backpropagation (BP) for gradient computation, especially in low-resource settings. However, their practical benefits remain unclear due to two key gaps: a lack of comparison against memory-efficient BP variants, such as activation checkpointing, and a lack of a unified theoretical analysis. This work presents a comprehensive theoretical and empirical comparison of BP, FmAD, and ZO methods. Our theoretical analysis shows that while FmAD and ZO can reduce memory usage, they incur significant costs in accuracy, convergence speed, and computation compared to BP with checkpointing. These drawbacks worsen with larger models or constrained perturbation budgets. Empirical experiments on large language and vision-language models show that BP with checkpointing outperforms FmAD and ZO variants, including those enhanced with variance reduction, achieving up to 31.1% higher accuracy, 34.8% faster convergence, and 3.8x fewer computations at comparable memory usage. Our results highlight fundamental limitations of FmAD and ZO, and reaffirm BP with checkpointing as the most effective strategy for model training under memory-constrained settings. Our code is available at https://github.com/Astuary/The_Cost_of_Avoiding_Backpropagation.
中文: 研究表明,尽管前向自动微分和零阶优化能降低内存使用,但与采用检查点的反向传播相比,它们在精度、收敛速度和计算效率上存在显著不足,后者仍是内存受限场景下最优的训练策略。
English: This study demonstrates that while forward-mode automatic differentiation and zero-order optimization reduce memory usage, they significantly compromise accuracy, convergence speed, and computational efficiency compared to backpropagation with checkpointing, which remains superior for memory-constrained model training.
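A toy numpy illustration of the two memory-efficient estimators being compared, on a quadratic where the true gradient is known: the zeroth-order (ZO) estimator uses central finite differences along random directions, while the forward-gradient (FmAD-style) estimator multiplies an exact directional derivative by the direction. For clarity the JVP is computed analytically here rather than with dual numbers, so this is a sketch of the estimators' structure, not of a real FmAD implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = A @ A.T + 5 * np.eye(5)

f = lambda x: 0.5 * x @ A @ x          # toy quadratic objective
grad = lambda x: A @ x                 # exact (BP-style) gradient

def zo_grad(x, eps=1e-4, n=64):
    """Zeroth-order estimate: central differences along random directions."""
    g = np.zeros_like(x)
    for _ in range(n):
        v = rng.standard_normal(x.shape)
        g += (f(x + eps * v) - f(x - eps * v)) / (2 * eps) * v
    return g / n

def fmad_grad(x, n=64):
    """Forward-gradient estimate: directional derivative (JVP) times v."""
    g = np.zeros_like(x)
    for _ in range(n):
        v = rng.standard_normal(x.shape)
        g += (grad(x) @ v) * v         # JVP computed analytically here
    return g / n

x = rng.standard_normal(5)
for name, g in [("ZO", zo_grad(x)), ("FmAD", fmad_grad(x))]:
    err = np.linalg.norm(g - grad(x)) / np.linalg.norm(grad(x))
    print(f"{name} relative error: {err:.2f}")  # variance shrinks as n grows
```

Both estimators are noisy for small n, which is one concrete face of the accuracy/convergence costs the abstract reports.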
Authors:Rafael Sterzinger, Marco Peer, Robert Sablatnig
Abstract:
As rich sources of history, maps provide crucial insights into historical changes, yet their diverse visual representations and limited annotated data pose significant challenges for automated processing. We propose a simple yet effective approach for few-shot segmentation of historical maps, leveraging the rich semantic embeddings of large vision foundation models combined with parameter-efficient fine-tuning. Our method outperforms the state-of-the-art on the Siegfried benchmark dataset in vineyard and railway segmentation, achieving +5% and +13% relative improvements in mIoU in 10-shot scenarios and around +20% in the more challenging 5-shot setting. Additionally, it demonstrates strong performance on the ICDAR 2021 competition dataset, attaining a mean PQ of 67.3% for building block segmentation, despite not being optimized for this shape-sensitive metric, underscoring its generalizability. Notably, our approach maintains high performance even in extremely low-data regimes (10- & 5-shot), while requiring only 689k trainable parameters - just 0.21% of the total model size. Our approach enables precise segmentation of diverse historical maps while drastically reducing the need for manual annotations, advancing automated processing and analysis in the field. Our implementation is publicly available at: https://github.com/RafaelSterzinger/few-shot-map-segmentation.
Chinese: 本研究提出了一种简单有效的历史地图少样本分割方法,通过结合大型视觉基础模型与参数高效微调,在基准数据集上实现了最优性能,同时大幅减少了对标注数据和可训练参数的需求。
English: This study introduces a simple yet effective few-shot segmentation method for historical maps by combining large vision foundation models with parameter-efficient fine-tuning, achieving state-of-the-art performance on benchmark datasets while requiring minimal annotated data and trainable parameters.
Authors:Fuying Wang, Jiacheng Xu, Lequan Yu
Abstract:
Electrocardiograms (ECGs) play a vital role in monitoring cardiac health and diagnosing heart diseases. However, traditional deep learning approaches for ECG analysis rely heavily on large-scale manual annotations, which are both time-consuming and resource-intensive to obtain. To overcome this limitation, self-supervised learning (SSL) has emerged as a promising alternative, enabling the extraction of robust ECG representations that can be efficiently transferred to various downstream tasks. While previous studies have explored SSL for ECG pretraining and multi-modal ECG-language alignment, they often fail to capture the multi-scale nature of ECG signals. As a result, these methods struggle to learn generalized representations due to their inability to model the hierarchical structure of ECG data. To address this gap, we introduce MELP, a novel Multi-scale ECG-Language Pretraining model that fully leverages hierarchical supervision from ECG-text pairs. MELP first pretrains a cardiology-specific language model to enhance its understanding of clinical text. It then applies cross-modal supervision at three levels (token, beat, and rhythm) to align ECG signals with textual reports, capturing structured information across different time scales. We evaluate MELP on three public ECG datasets across multiple tasks, including zero-shot ECG classification, linear probing, and transfer learning. Experimental results demonstrate that MELP outperforms existing SSL methods, underscoring its effectiveness and adaptability across diverse clinical applications. Our code is available at https://github.com/HKU-MedAI/MELP.
中文:MELP提出了一种新颖的多尺度心电-语言预训练模型,通过分层跨模态监督捕捉结构化心电信号与文本的对齐关系,在多种临床任务中优于现有自监督方法。
English: MELP introduces a novel multi-scale ECG-language pretraining model that leverages hierarchical cross-modal supervision to capture structured ECG-text alignments, outperforming existing self-supervised methods across diverse clinical tasks.
Authors:Yifan Liu, Xishun Liao, Haoxuan Ma, Jonathan Liu, Rohan Jadhav, Jiaqi Ma
Abstract:
Understanding and modeling human mobility patterns is crucial for effective transportation planning and urban development. Despite significant advances in mobility research, there remains a critical gap in simulation platforms that allow for algorithm development, policy implementation, and comprehensive evaluation at scale. Traditional activity-based models require extensive data collection and manual calibration, machine learning approaches struggle with adaptation to dynamic conditions, and trending agent-based Large Language Model (LLM) implementations face computational constraints with large-scale simulations. To address these challenges, we propose MobiVerse, a hybrid framework that combines the efficiency of a lightweight domain-specific generator for producing base activity chains with the adaptability of LLMs for context-aware modifications. A case study was conducted in Westwood, Los Angeles, where we efficiently generated and dynamically adjusted schedules for the whole population of approximately 53,000 agents on a standard PC. Our experiments demonstrate that MobiVerse successfully enables agents to respond to environmental feedback, including road closures, large gathering events like football games, and congestion, through our hybrid framework. Its modular design facilitates testing various mobility algorithms at both transportation system and agent levels. Results show our approach maintains computational efficiency while enhancing behavioral realism. MobiVerse bridges the gap in mobility simulation by providing a customizable platform for mobility systems planning and operations with benchmark algorithms. Code and videos are available at https://github.com/ucla-mobility/MobiVerse.
中文:MobiVerse是一个结合高效领域专用生成器与适应性大语言模型的混合框架,通过西洛杉矶案例研究证明其能在保持计算效率的同时,实现大规模人群移动的逼真模拟和环境响应,为交通系统规划提供可定制平台。
English: MobiVerse is a hybrid framework combining efficient domain-specific generators with adaptable Large Language Models to enable scalable and realistic human mobility simulations, addressing computational and adaptability limitations in existing approaches while maintaining efficiency.
Authors:Alexandru Dumitru, V Venktesh, Adam Jatowt, Avishek Anand
Abstract:
Large Language Models (LLMs) have demonstrated immense advances in a wide range of natural language tasks. However, these models are susceptible to hallucinations and errors, particularly on temporal understanding tasks whose answers involve multiple entities. In such tasks, they fail to associate entities with accurate time intervals, to generate complete lists of entities, or to reason about events associated with specific temporal bounds. Existing works do not extensively evaluate the abilities of models to perform implicit and explicit temporal understanding in a list-answer construction setup. To bridge this gap, we propose the Time-referenced List-based Question Answering (TLQA) benchmark, which requires structured answers in list format aligned with corresponding time periods. Our TLQA benchmark requires both list construction and temporal understanding simultaneously, which, to the best of our knowledge, has not been explored in prior benchmarks. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings. Our findings reveal significant shortcomings in current models, particularly their inability to provide complete answers and temporally align facts in the closed-book setup, and the need to improve retrieval in the open-domain setup, providing clear future directions for research on TLQA. The benchmark and code are available at https://github.com/elixir-research-group/TLQA.
中文: 大语言模型在时间理解和列表构建方面存在明显不足,为此提出的TLQA基准测试揭示了当前模型在闭卷和开放域设置中的重大缺陷,为未来研究指明了方向。
English: Large Language Models struggle with temporal understanding and list construction in question answering, leading to the creation of the TLQA benchmark which reveals significant model shortcomings in both closed-book and open-domain settings.
Authors:Tianrong Chen, Huangjie Zheng, David Berthelot, Jiatao Gu, Josh Susskind, Shuangfei Zhai
Abstract:
Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images but typically suffer from inefficient sampling. Many solver designs and noise scheduling strategies have been proposed to dramatically improve sampling speeds. In this paper, we introduce a new sampling method that is up to 186% faster than the current state-of-the-art solver at comparable FID on ImageNet512. This new sampling method is training-free and uses an ordinary differential equation (ODE) solver. The key to our method resides in using higher-dimensional initial noise, allowing it to produce more detailed samples with fewer function evaluations from existing pretrained diffusion models. In addition, by design our solver allows control of the level of detail through a simple hyper-parameter at no extra computational cost. We show how our approach leverages momentum dynamics by establishing a fundamental equivalence between momentum diffusion models and conventional diffusion models with respect to their training paradigms. Moreover, we observe that the use of higher-dimensional noise naturally exhibits characteristics similar to those of stochastic differential equations (SDEs). Finally, we demonstrate strong performance on a set of representative pretrained diffusion models, including EDM, EDM2, and Stable-Diffusion 3, which cover models in both pixel and latent spaces, as well as class- and text-conditional settings. The code is available at https://github.com/apple/ml-tada.
中文摘要:本文提出一种无需训练的采样方法,通过使用高维初始噪声和常微分方程求解器,将扩散模型的图像生成速度最高提升186%,并在多种预训练模型上实现优异性能且不增加计算成本。
English Summary: This paper introduces a training-free sampling method that accelerates diffusion model image generation by up to 186% through higher-dimensional initial noise and ODE solver techniques, achieving superior performance across multiple pretrained models without additional computational cost.
Authors:Eivind Morris Bakke, Nora Winger Heggelund
Abstract:
Automatic fact verification systems increasingly rely on large language models (LLMs). We investigate how parametric knowledge biases in these models affect fact-checking outcomes of the HerO system (baseline for FEVER-25). We examine how the system is affected by: (1) potential bias in Llama 3.1's parametric knowledge and (2) intentionally injected bias. When prompted directly to perform fact-verification, Llama 3.1 labels nearly half the claims as "Not Enough Evidence". Using only its parametric knowledge, it is able to reach a verdict on the remaining half of the claims. In the second experiment, we prompt the model to generate supporting, refuting, or neutral fact-checking documents. These prompts significantly influence retrieval outcomes, with approximately 50% of retrieved evidence being unique to each perspective. Notably, the model sometimes refuses to generate supporting documents for claims it believes to be false, creating an inherent negative bias. Despite differences in retrieved evidence, final verdict predictions show stability across prompting strategies. The code is available at: https://github.com/eibakke/FEVER-8-Shared-Task
中文: 本研究探讨了Llama 3.1模型的参数知识偏差对HerO事实核查系统的影响,发现直接提示会导致大量“证据不足”判定,而人为注入的偏差虽显著改变证据检索结果,却未对最终判定产生实质性影响。
English: This study examines how parametric knowledge biases in Llama 3.1 affect the HerO fact-verification system, revealing that direct prompting leads to high rates of "Not Enough Evidence" classifications while injected bias significantly alters evidence retrieval without substantially changing final verdicts.
Authors:Remco F. Leijenaar, Hamidreza Kasaei
Abstract:
Learning semantically meaningful representations from unstructured 3D point clouds remains a central challenge in computer vision, especially in the absence of large-scale labeled datasets. While masked point modeling (MPM) is widely used in self-supervised 3D learning, its reconstruction-based objective can limit its ability to capture high-level semantics. We propose AsymDSD, an Asymmetric Dual Self-Distillation framework that unifies masked modeling and invariance learning through prediction in the latent space rather than the input space. AsymDSD builds on a joint embedding architecture and introduces several key design choices: an efficient asymmetric setup, disabling attention between masked queries to prevent shape leakage, multi-mask sampling, and a point cloud adaptation of multi-crop. AsymDSD achieves state-of-the-art results on ScanObjectNN (90.53%) and further improves to 93.72% when pretrained on 930k shapes, surpassing prior methods.
中文:AsymDSD框架通过潜在空间预测整合掩码建模与不变性学习,在ScanObjectNN上实现90.53%的领先性能,经93万形状预训练后进一步提升至93.72%。
English: The AsymDSD framework integrates masked modeling and invariance learning through latent space prediction, achieving state-of-the-art performance on ScanObjectNN with 90.53% accuracy and 93.72% when pretrained on extensive data.
Authors:Yash Akhauri, Bryan Lewandowski, Cheng-Hsi Lin, Adrian N. Reyes, Grant C. Forbes, Arissa Wongpanich, Bangding Yang, Mohamed S. Abdelfattah, Sagi Perel, Xingyou Song
Abstract:
In many industries, predicting metric outcomes of large systems is a fundamental problem, driven largely by traditional tabular regression. However, such methods struggle on complex systems data in the wild such as configuration files or system logs, where feature engineering is often infeasible. We propose text-to-text regression as a general, scalable alternative. For predicting resource efficiency on Borg, Google's massive compute cluster scheduling system, a 60M parameter encoder-decoder, trained from random initialization, achieves up to a near-perfect 0.99 (0.9 average) rank correlation across the entire fleet, and 100x lower MSE than tabular approaches. The model also adapts easily to new tasks with only 500 few-shot examples and captures the densities of complex outcome distributions. Ablation studies highlight the importance of using encoders, increasing sequence length, and the model's inherent uncertainty quantification. These findings pave the way for universal simulators of real-world outcomes.
Chinese: 文本到文本回归作为一种可扩展且高效的替代方案,在预测如谷歌Borg等复杂系统的资源效率方面实现了近乎完美的准确性,并能以少量数据轻松适应新任务。
English: Text-to-text regression offers a scalable and effective alternative to traditional tabular methods, achieving near-perfect accuracy in predicting resource efficiency on complex systems like Google's Borg and adapting easily to new tasks with minimal data.
Authors:Oron Nir, Jay Tenenbaum, Ariel Shamir
Abstract:
Density-based clustering methods often surpass centroid-based counterparts when addressing data with noise or arbitrary data distributions common in real-world problems. In this study, we reveal a key property intrinsic to density-based clustering methods regarding the relation between the number of clusters and the neighborhood radius of core points: we empirically show that this relation is nearly unimodal, and support this claim theoretically in a specific setting. We leverage this property to devise new strategies for finding appropriate values for the radius more efficiently, based on the Ternary Search algorithm. This is especially important for large-scale data that is high-dimensional, where parameter tuning is computationally intensive. We validate our methodology through extensive applications across a range of high-dimensional, large-scale NLP, Audio, and Computer Vision tasks, demonstrating its practical effectiveness and robustness. This work not only offers a significant advancement in parameter control for density-based clustering but also broadens the understanding of the relations between their guiding parameters. Our code is available at https://github.com/oronnir/UnimodalStrategies.
中文摘要:本研究揭示了密度聚类中簇数量与核心点邻域半径存在近似单峰关系,基于此提出利用三分搜索算法高效确定最佳半径参数的方法,并在大规模高维NLP、音频和视觉任务中验证了其有效性与鲁棒性。
English Summary: This study identifies a nearly unimodal relationship between the number of clusters and neighborhood radius in density-based clustering, enabling more efficient parameter tuning via Ternary Search for large-scale high-dimensional data across NLP, audio, and vision tasks.
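A minimal sklearn sketch of the strategy: since the cluster-count vs. radius curve is nearly unimodal, a ternary search can home in on a good eps with few clustering runs. The search objective here (maximizing the number of clusters) and the bounds are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=6, cluster_std=0.6, random_state=0)

def n_clusters(eps: float) -> int:
    """Number of clusters found by DBSCAN at a given radius (noise excluded)."""
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    return len(set(labels)) - (1 if -1 in labels else 0)

def ternary_search_eps(lo: float, hi: float, iters: int = 20) -> float:
    """Exploit the (near-)unimodal cluster-count vs. radius relation:
    shrink [lo, hi] toward the eps that maximizes the cluster count."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if n_clusters(m1) < n_clusters(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

eps = ternary_search_eps(0.05, 5.0)
print(f"eps={eps:.3f} -> {n_clusters(eps)} clusters")
```

Each ternary-search iteration discards a third of the interval with only two clustering runs, which is where the savings over a dense grid sweep come from.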
Authors:Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao
Abstract:
Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky. Common issues include a drop in performance on general tasks and the generation of overly detailed or "overthinking" reasoning. Our work investigates how the KL penalty and overthinking affect RL training in MLLMs. We propose Asymmetric Policy Optimization (APO) to address these issues, which divides the sampled responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) is introduced to dynamically adjust the KL divergence weight based on their difficulty. This method prevents policy entropy from dropping sharply, improves training stability, utilizes samples better, and preserves the model's existing knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) is proposed to penalize overly long responses. This helps mitigate overthinking and encourages more concise reasoning while preserving the model's explorative capacity. We apply our method to Qwen2.5-VL-3B, creating View-R1-3B. View-R1-3B significantly enhances reasoning capabilities, showing an average 7% gain over the base model and outperforming larger MLLMs (7-11B) on various reasoning benchmarks. Importantly, unlike other reasoning-tuned MLLMs that often degrade on general tasks, View-R1-3B maintains consistent improvement, demonstrating superior generalization. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. The code will be made available at https://github.com/Indolent-Kawhi/View-R1.
中文: 本研究提出非对称策略优化(APO)方法,结合难度自适应散度塑形(DADS)和次优轨迹复杂度正则化(STCR),有效提升了多模态大语言模型的推理能力,在保持泛化性能的同时显著优于基准模型及更大规模模型。
English: This research introduces Asymmetric Policy Optimization (APO) with Difficulty-Adaptive Divergence Shaping (DADS) and Suboptimal Trajectory Complexity Regularization (STCR) to enhance multimodal reasoning in MLLMs, achieving significant performance gains while maintaining generalization across tasks.
Authors:Haiping Yang, Huaxing Liu, Wei Wu, Zuohui Chen, Ning Wu
Abstract:
Unmanned aerial vehicles (UAVs) are increasingly employed in diverse applications such as land surveying, material transport, and environmental monitoring. Following missions like data collection or inspection, UAVs must land safely at docking stations for storage or recharging, which is an essential requirement for ensuring operational continuity. However, accurate landing remains challenging due to factors like GPS signal interference. To address this issue, we propose a deviation warning system for UAV landings, powered by a novel vision-based model called AeroLite-MDNet. This model integrates a multiscale fusion module for robust cross-scale object detection and incorporates a segmentation branch for efficient orientation estimation. We introduce a new evaluation metric, Average Warning Delay (AWD), to quantify the system's sensitivity to landing deviations. Furthermore, we contribute a new dataset, UAVLandData, which captures real-world landing deviation scenarios to support training and evaluation. Experimental results show that our system achieves an AWD of 0.7 seconds with a deviation detection accuracy of 98.6%, demonstrating its effectiveness in enhancing UAV landing reliability. Code will be available at https://github.com/ITTTTTI/Maskyolo.git
中文:本文提出的AeroLite-MDNet系统通过基于视觉的偏差预警机制提升无人机着陆安全性,该系统以98.6%的检测精度和仅0.7秒的延迟实现精准预警,并配套开发了新型评估指标与专用数据集。
English: The proposed AeroLite-MDNet system enhances UAV landing safety through a vision-based deviation warning system that achieves 98.6% detection accuracy with only 0.7-second delay, supported by a new evaluation metric and dataset.
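The abstract introduces Average Warning Delay (AWD) by name only; a plausible reading is the mean lag between a deviation's onset and the first warning raised, sketched below under that assumption.

```python
def average_warning_delay(events):
    """Average Warning Delay (AWD) over deviation events, assumed here to be
    the mean lag between a deviation's onset and the first warning.

    events: list of (deviation_onset_s, first_warning_s) pairs; pairs with
    no warning are excluded in this sketch (a real metric may penalize them).
    """
    delays = [w - t0 for t0, w in events if w is not None]
    return sum(delays) / len(delays) if delays else float("inf")

# Three detected deviations, each warned 0.5-0.9 s after onset.
print(average_warning_delay([(10.0, 10.5), (42.1, 43.0), (77.3, 78.0)]))
```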
Authors:Yixin Sun, Li Li, Wenke E, Amir Atapour-Abarghouei, Toby P. Breckon
Abstract:
Detecting traversable pathways in unstructured outdoor environments remains a significant challenge for autonomous robots, especially in critical applications such as wide-area search and rescue, as well as incident management scenarios like forest fires. Existing datasets and models primarily target urban settings or wide, vehicle-traversable off-road tracks, leaving a substantial gap in addressing the complexity of narrow, trail-like off-road scenarios. To address this, we introduce the Trail-based Off-road Multimodal Dataset (TOMD), a comprehensive dataset specifically designed for such environments. TOMD features high-fidelity multimodal sensor data -- including 128-channel LiDAR, stereo imagery, GNSS, IMU, and illumination measurements -- collected through repeated traversals under diverse conditions. We also propose a dynamic multiscale data fusion model for accurate traversable pathway prediction. The study analyzes the performance of early, cross, and mixed fusion strategies under varying illumination levels. Results demonstrate the effectiveness of our approach and the relevance of illumination in segmentation performance. We publicly release TOMD at https://github.com/yyyxs1125/TMOD to support future research in trail-based off-road navigation.
中文摘要:本研究针对非结构化户外环境中可通行路径检测的难题,推出了专门设计的越野小径多模态数据集TOMD和动态多尺度融合模型,通过不同光照条件下的实验验证了方法的有效性。
English Summary: The study introduces the Trail-based Off-road Multimodal Dataset (TOMD) and a dynamic multiscale fusion model to address autonomous robots' challenges in detecting traversable pathways in unstructured outdoor environments, demonstrating improved performance under varying illumination conditions.
Authors:Chenhao Zhang, Yezhi Shen, Fengqing Zhu
Abstract:
In recent years, neural rendering methods such as NeRFs and 3D Gaussian Splatting (3DGS) have made significant progress in scene reconstruction and novel view synthesis. However, they heavily rely on preprocessed camera poses and 3D structural priors from structure-from-motion (SfM), which are challenging to obtain in outdoor scenarios. To address this challenge, we propose to incorporate Iterative Closest Point (ICP) with optimization-based refinement to achieve accurate camera pose estimation under large camera movements. Additionally, we introduce a voxel-based scene densification approach to guide the reconstruction in large-scale scenes. Experiments demonstrate that our approach ICP-3DGS outperforms existing methods in both camera pose estimation and novel view synthesis across indoor and outdoor scenes of various scales. Source code is available at https://github.com/Chenhao-Z/ICP-3DGS.
Chinese: 提出的ICP-3DGS方法结合迭代最近点算法与基于优化的位姿细化及体素密度化技术,无需预处理相机位姿即可实现更优的相机定位和新视角合成,在各类室内外场景中均超越现有方法。
English: The proposed ICP-3DGS method integrates Iterative Closest Point with optimization-based pose refinement and voxel-based densification to achieve superior camera pose estimation and novel view synthesis without relying on preprocessed camera poses, outperforming existing approaches across diverse indoor and outdoor scenes.
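The ICP component described above can be sketched in a few lines of numpy: alternate nearest-neighbour matching with a closed-form Kabsch (SVD) alignment. This is the textbook point-to-point variant, not the paper's full pipeline with optimization-based refinement and outlier handling.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(src: np.ndarray, dst: np.ndarray, iters: int = 20):
    """Point-to-point ICP: nearest-neighbour matching + Kabsch alignment.

    src, dst: (N, 3) and (M, 3) point clouds. Returns R (3x3) and t (3,)
    mapping src onto dst.
    """
    R, t = np.eye(3), np.zeros(3)
    tree, cur = cKDTree(dst), src.copy()
    for _ in range(iters):
        nn = dst[tree.query(cur)[1]]              # closest dst point per src point
        mu_s, mu_d = cur.mean(0), nn.mean(0)
        H = (cur - mu_s).T @ (nn - mu_d)          # cross-covariance of centered sets
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        dR = Vt.T @ D @ U.T                       # optimal incremental rotation
        dt = mu_d - dR @ mu_s
        cur = cur @ dR.T + dt
        R, t = dR @ R, dR @ t + dt                # compose cumulative transform
    return R, t

# Recover a known rigid motion of a random cloud.
rng = np.random.default_rng(1)
P = rng.standard_normal((500, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
Q = P @ R_true.T + np.array([0.5, -0.2, 0.1])
R_est, _ = icp(P, Q)
print("rotation error:", np.linalg.norm(R_est - R_true))
```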
Authors:Junhao Liu, Zhenhao Xu, Yuxin Fang, Yichuan Chen, Zuobin Ying, Wenhan Chang
Abstract:
Recently, there have been notable advancements in large language models (LLMs), demonstrating their growing abilities in complex reasoning. However, existing research largely overlooks a thorough and systematic comparison of these models' reasoning processes and outputs, particularly regarding their self-reflection pattern (also termed "Aha moment") and the interconnections across diverse domains. This paper proposes a novel framework for analyzing the reasoning characteristics of four cutting-edge large reasoning models (GPT-o1, DeepSeek-R1, Kimi-k1.5, and Grok-3) using keywords statistic and LLM-as-a-judge paradigm. Our approach connects their internal thinking processes with their final outputs. A diverse dataset consists of real-world scenario-based questions covering logical deduction, causal inference, and multi-step problem-solving. Additionally, a set of metrics is put forward to assess both the coherence of reasoning and the accuracy of the outputs. The research results uncover various patterns of how these models balance exploration and exploitation, deal with problems, and reach conclusions during the reasoning process. Through quantitative and qualitative comparisons, disparities among these models are identified in aspects such as the depth of reasoning, the reliance on intermediate steps, and the degree of similarity between their thinking processes and output patterns and those of GPT-o1. This work offers valuable insights into the trade-off between computational efficiency and reasoning robustness and provides practical recommendations for enhancing model design and evaluation in practical applications. We publicly release our project at: https://github.com/ChangWenhan/FromThinking2Output
中文: 尽管大语言模型在复杂推理方面取得显著进展,但现有研究缺乏对其推理过程与输出的系统比较;本文通过关键词统计和LLM评估框架分析四大前沿模型,揭示了它们在推理深度与效率上的差异,并为模型优化提供了实用建议。
English: Recent advances in large language models show enhanced reasoning capabilities, yet a systematic comparison of their reasoning processes and outputs is lacking, prompting this study to analyze four top models using keyword statistics and LLM evaluation, revealing differences in reasoning depth and efficiency while offering insights for model improvement.
Authors:Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong
Abstract:
Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities still remains challenges. Previous evaluations are commonly limited by the diversity of memory levels and interactive scenarios. They also lack comprehensive metrics to reflect the memory capabilities from multiple aspects. To address these problems, in this paper, we construct a more comprehensive dataset and benchmark to evaluate the memory capability of LLM-based agents. Our dataset incorporates factual memory and reflective memory as different levels, and proposes participation and observation as various interactive scenarios. Based on our dataset, we present a benchmark, named MemBench, to evaluate the memory capability of LLM-based agents from multiple aspects, including their effectiveness, efficiency, and capacity. To benefit the research community, we release our dataset and project at https://github.com/import-myself/Membench.
中文: 本文提出了MemBench,这是一个全面的数据集和基准,旨在从多个方面评估基于LLM的智能体在不同记忆层次和交互场景下的记忆能力,解决了以往评估在多样性和指标上的不足。
English: This paper introduces MemBench, a comprehensive dataset and benchmark designed to evaluate the memory capabilities of LLM-based agents across different memory levels and interactive scenarios, addressing previous limitations in diversity and metrics.
Authors:Duong Bach
Abstract:
Multi-vector document retrieval systems, such as ColPali, excel in fine-grained matching for complex queries but incur significant storage and computational costs due to their reliance on high-dimensional patch embeddings and late-interaction scoring. To address these challenges, we propose HPC-ColPali, a Hierarchical Patch Compression framework that enhances the efficiency of ColPali while preserving its retrieval accuracy. Our approach integrates three innovative techniques: (1) K-Means quantization, which compresses patch embeddings into 1-byte centroid indices, achieving up to 32$\times$ storage reduction; (2) attention-guided dynamic pruning, utilizing Vision-Language Model attention weights to retain only the top-$p\%$ most salient patches, reducing late-interaction computation by up to 60% with less than 2% nDCG@10 loss; and (3) optional binary encoding of centroid indices into $b$-bit strings ($b=\lceil\log_2 K\rceil$), enabling rapid Hamming distance-based similarity search for resource-constrained environments. Evaluated on the ViDoRe and SEC-Filings datasets, HPC-ColPali achieves 30-50% lower query latency under HNSW indexing while maintaining high retrieval precision. When integrated into a Retrieval-Augmented Generation pipeline for legal summarization, it reduces hallucination rates by 30% and halves end-to-end latency. These advancements establish HPC-ColPali as a scalable and efficient solution for multi-vector document retrieval across diverse applications. Code is available at https://github.com/DngBack/HPC-ColPali.
Chinese: HPC-ColPali提出了一种分层补丁压缩框架,通过量化、剪枝和二进制编码技术,在保持高精度的同时显著降低了多向量文档检索的存储和计算成本。
English: HPC-ColPali introduces a hierarchical patch compression framework that significantly reduces storage and computational costs for multi-vector document retrieval while maintaining high accuracy through quantization, pruning, and binary encoding techniques.
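A small sketch of the two storage-side ideas, assuming standard K-Means from scikit-learn: patch embeddings are replaced by 1-byte centroid indices (K <= 256), and documents can then be compared by Hamming distance over the code bits. Dimensions and data below are synthetic placeholders, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

K = 256
rng = np.random.default_rng(0)
patches = rng.standard_normal((4096, 128)).astype(np.float32)

# Quantize: each patch becomes a single byte instead of a full float vector.
kmeans = KMeans(n_clusters=K, n_init=1, random_state=0).fit(patches)
codes = kmeans.predict(patches).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Bitwise Hamming distance between two equal-length uint8 code arrays."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Compare two "documents" of 512 patch codes each.
doc_a, doc_b = codes[:512], codes[512:1024]
print(hamming(doc_a, doc_b))
```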
Authors:Yingzhi He, Xiaohao Liu, An Zhang, Yunshan Ma, Tat-Seng Chua
Abstract:
Sequential recommendation aims to predict users' future interactions by modeling collaborative filtering (CF) signals from historical behaviors of similar users or items. Traditional sequential recommenders predominantly rely on ID-based embeddings, which capture CF signals through high-order co-occurrence patterns. However, these embeddings depend solely on past interactions, lacking transferable knowledge to generalize to unseen domains. Recent advances in large language models (LLMs) have motivated text-based recommendation approaches that derive item representations from textual descriptions. While these methods enhance generalization, they fail to encode CF signals-i.e., latent item correlations and preference patterns-crucial for effective recommendation. We argue that an ideal embedding model should seamlessly integrate CF signals with rich semantic representations to improve both in-domain and out-of-domain recommendation performance.
To this end, we propose LLM2Rec, a novel embedding model tailored for sequential recommendation, integrating the rich semantic understanding of LLMs with CF awareness. Our approach follows a two-stage training framework: (1) Collaborative Supervised Fine-tuning, which adapts LLMs to infer item relationships based on historical interactions, and (2) Item-level Embedding Modeling, which refines these specialized LLMs into structured item embedding models that encode both semantic and collaborative information. Extensive experiments on real-world datasets demonstrate that LLM2Rec effectively improves recommendation quality across both in-domain and out-of-domain settings. Our findings highlight the potential of leveraging LLMs to build more robust, generalizable embedding models for sequential recommendation. Our codes are available at https://github.com/HappyPointer/LLM2Rec.
Chinese: 提出的LLM2Rec模型通过两阶段训练框架,将协同过滤信号与大型语言模型的语义表征相结合,有效提升了顺序推荐在领域内和跨域场景中的性能表现。
English: The proposed LLM2Rec model integrates collaborative filtering signals with semantic representations from large language models through a two-stage training framework, enhancing sequential recommendation performance in both in-domain and out-of-domain scenarios.
Authors:Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Casper Hansen, Julien Fauqueur
Abstract:
We propose STRuCT-LLM, a unified framework for training large language models (LLMs) to perform structured reasoning over both relational and graph-structured data. Our approach jointly optimizes Text-to-SQL and Text-to-Cypher tasks using reinforcement learning (RL) combined with Chain-of-Thought (CoT) supervision. To support fine-grained optimization in graph-based parsing, we introduce a topology-aware reward function based on graph edit distance. Unlike prior work that treats relational and graph formalisms in isolation, STRuCT-LLM leverages shared abstractions between SQL and Cypher to induce cross-formalism transfer, enabling SQL training to improve Cypher performance and vice versa - even without shared schemas. Our largest model (QwQ-32B) achieves substantial relative improvements across tasks: on semantic parsing, Spider improves by 13.5% and Text2Cypher by 73.1%. The model also demonstrates strong zero-shot generalization, improving performance on downstream tabular QA (TableBench: 8.5%) and knowledge graph QA (CR-LT-KGQA: 1.7%) without any QA-specific supervision. These results demonstrate both the effectiveness of executable queries as scaffolds for structured reasoning and the synergistic benefits of jointly training on SQL and Cypher (code available at https://github.com/bouv/STRuCT-LLM).
中文:STRuCT-LLM是一个统一框架,通过强化学习和思维链监督联合优化Text-to-SQL与Text-to-Cypher任务,训练大语言模型对关系型和图结构数据进行结构化推理,实现了显著性能提升和强大的零样本泛化能力。
English: STRuCT-LLM is a unified framework that trains large language models for structured reasoning across relational and graph data by jointly optimizing Text-to-SQL and Text-to-Cypher tasks with reinforcement learning and Chain-of-Thought supervision, achieving significant performance improvements and strong zero-shot generalization.
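The topology-aware reward can be sketched with networkx's exact graph edit distance; the normalization by total graph size is an assumption for illustration, and real training would likely use an approximate GED for speed on larger query graphs.

```python
import networkx as nx

def topology_reward(pred: nx.Graph, gold: nx.Graph) -> float:
    """Topology-aware reward from graph edit distance (GED), in [0, 1].

    Normalizing GED by the combined size of both graphs is an illustrative
    choice; the paper's exact shaping may differ. Higher is better.
    """
    ged = nx.graph_edit_distance(pred, gold)   # exact GED; fine for small graphs
    size = (pred.number_of_nodes() + pred.number_of_edges()
            + gold.number_of_nodes() + gold.number_of_edges())
    return max(0.0, 1.0 - ged / max(size, 1))

# A predicted query graph missing one hop scores below, but close to, 1.
gold = nx.Graph([("Person", "WORKS_AT"), ("WORKS_AT", "Company")])
pred = nx.Graph([("Person", "WORKS_AT")])
print(round(topology_reward(pred, gold), 3))
```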
Authors:Jianshuo Dong, Yujia Fu, Chuanrui Hu, Chao Zhang, Han Qiu
Abstract:
Large Reasoning Models (LRMs), which autonomously produce a reasoning Chain of Thought (CoT) before producing final responses, offer a promising approach to interpreting and monitoring model behaviors. Inspired by the observation that certain CoT patterns -- e.g., "Wait, did I miss anything?" -- consistently emerge across tasks, we explore whether LRMs exhibit human-like cognitive habits. Building on Habits of Mind, a well-established framework of cognitive habits associated with successful human problem-solving, we introduce CogTest, a principled benchmark designed to evaluate LRMs' cognitive habits. CogTest includes 16 cognitive habits, each instantiated with 25 diverse tasks, and employs an evidence-first extraction method to ensure reliable habit identification. With CogTest, we conduct a comprehensive evaluation of 16 widely used LLMs (13 LRMs and 3 non-reasoning ones). Our findings reveal that LRMs, unlike conventional LLMs, not only exhibit human-like habits but also adaptively deploy them according to different tasks. Finer-grained analyses further uncover patterns of similarity and difference in LRMs' cognitive habit profiles, particularly certain inter-family similarity (e.g., Qwen-3 models and DeepSeek-R1). Extending the study to safety-related tasks, we observe that certain habits, such as Taking Responsible Risks, are strongly associated with the generation of harmful responses. These findings suggest that studying persistent behavioral patterns in LRMs' CoTs is a valuable step toward deeper understanding of LLM misbehavior. The code is available at: https://github.com/jianshuod/CogTest.
中文摘要:大型推理模型(LRMs)通过CogTest基准测试展现出类人的认知习惯,这些习惯能根据不同任务自适应调整,且某些习惯(如“承担风险”)与生成有害内容密切相关。
English Summary: Large Reasoning Models (LRMs) demonstrate human-like cognitive habits that adapt to different tasks, as revealed by the CogTest benchmark, which also links certain habits to safety risks in model responses.
Authors:Baqer M. Merzah, Tania Taami, Salman Asoudeh, Saeed Mirzaee, Amir reza Hossein pour, Amir Ali Bengari
Abstract:
Large Language Models (LLMs) have recently gained attention in the life sciences due to their capacity to model, extract, and apply complex biological information. Beyond their classical use as chatbots, these systems are increasingly used for complex analysis and problem-solving in specialized fields, including bioinformatics. First, we introduce BIOPARS-BENCH, a dataset from over 10,000 scientific articles, textbooks, and medical websites. We also introduce BioParsQA, a dataset of 5,231 Persian medical questions and answers, to evaluate the proposed model. This study then introduces BioPars, a simple but accurate measure designed to assess LLMs for three main abilities: acquiring subject-specific knowledge, interpreting and synthesizing such knowledge, and demonstrating proper evidence. Comparing ChatGPT, Llama, and Galactica, our study highlights their ability to remember and retrieve learned knowledge but also reveals shortcomings in addressing higher-level, real-world questions and fine-grained inferences. These findings indicate the need for further fine-tuning to improve LLM capabilities in bioinformatics tasks. To our knowledge, BioPars is the first application of LLMs to Persian medical QA, especially for generating long answers. Evaluation on four selected medical QA datasets shows that BioPars achieves remarkable results compared to comparative approaches. On BioParsQA, the model achieved a ROUGE-L score of 29.99, an improvement over GPT-4 1.0. The model achieved a BERTScore of 90.87 with the MMR method, and its MoverScore and BLEURT values, at MoverScore=60.43 and BLEURT=50.78, were also higher than those of the other three models. BioPars is an ongoing project and all resources related to its development will be made available via the following GitHub repository: https://github.com/amirap80/BioPars.
中文摘要:本研究提出BioPars框架用于评估大型语言模型在波斯语医学问答中的表现,结果显示其优于现有模型,同时揭示了模型在处理复杂生物医学推理方面的局限性。
English Summary: This study introduces BioPars, a framework for evaluating large language models in Persian medical question-answering, demonstrating superior performance over existing models while highlighting current limitations in handling complex biomedical reasoning.
Authors:Jiyan Liu, Youzheng Liu, Taihang Wang, Xiaoman Xu, Yimin Wang, Ye Jiang
Abstract:
This paper describes the participation of QUST_NLP in SemEval-2025 Task 7. We propose a three-stage retrieval framework specifically designed for fact-checked claim retrieval. Initially, we evaluate the performance of several retrieval models and select the one that yields the best results for candidate retrieval. Next, we employ multiple re-ranking models to enhance the candidate results, with each model selecting the Top-10 outcomes. In the final stage, we utilize weighted voting to determine the final retrieval outcomes. Our approach achieved 5th place in the monolingual track and 7th place in the crosslingual track. We release our system code at: https://github.com/warmth27/SemEval2025_Task7.
中文: 本文介绍了QUST_NLP团队为事实核查声明检索设计的三阶段检索框架,在SemEval-2025任务7的单语和跨语种赛道中分别获得第五和第七名。
English: This paper presents QUST_NLP's three-stage retrieval framework for fact-checked claim retrieval, which achieved 5th and 7th places in SemEval-2025 Task 7's monolingual and crosslingual tracks respectively.
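The final-stage fusion can be sketched as weighted positional voting over each re-ranker's Top-10 list; the vote function below (model weight times reversed rank) is an illustrative choice, since the abstract does not spell out the exact formula.

```python
from collections import defaultdict

def weighted_vote(rankings: dict, weights: dict, top_k: int = 10) -> list:
    """Fuse Top-10 lists from several re-rankers by weighted positional votes.

    rankings: model name -> ranked list of document ids (best first).
    weights: model name -> vote weight (assumed tuned on dev data).
    """
    scores = defaultdict(float)
    for model, ranked in rankings.items():
        for rank, doc in enumerate(ranked[:10]):
            scores[doc] += weights[model] * (10 - rank)  # earlier rank, more votes
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

rankings = {"reranker_a": ["d3", "d1", "d7"], "reranker_b": ["d1", "d3", "d9"]}
print(weighted_vote(rankings, {"reranker_a": 0.6, "reranker_b": 0.4}, top_k=3))
```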
Authors:Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, Hao Chen
Abstract:
We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. WorldVLA integrates a Vision-Language-Action (VLA) model and a world model in a single framework. The world model predicts future images by leveraging both action and image understanding, with the purpose of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates the subsequent actions based on image observations, aiding in visual understanding and in turn helping the visual generation of the world model. We demonstrate that WorldVLA outperforms standalone action and world models, highlighting the mutual enhancement between the world model and the action model. In addition, we find that the performance of the action model deteriorates when generating sequences of actions in an autoregressive manner. This phenomenon can be attributed to the model's limited generalization capability for action prediction, leading to the propagation of errors from earlier actions to subsequent ones. To address this issue, we propose an attention mask strategy that selectively masks prior actions during the generation of the current action, which shows significant performance improvement in the action chunk generation task.
中文摘要:WorldVLA是一个将视觉-语言-动作模型与世界模型相融合的自回归动作世界模型,通过注意力掩码策略有效解决了动作序列生成中的误差传播问题,实现了世界建模与动作生成的相互增强。
English Summary: WorldVLA is an autoregressive action world model that integrates vision, language, and action capabilities within a unified framework, demonstrating mutual enhancement between world modeling and action generation while addressing error propagation through an attention mask strategy.
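The attention mask strategy can be sketched as a modified causal mask in which earlier action tokens are hidden from later positions, so errors in early actions cannot propagate through the chunk. The token-type encoding and the exact masking rule below are assumptions for illustration, not the paper's precise design.

```python
import torch

def action_chunk_mask(token_types: torch.Tensor) -> torch.Tensor:
    """Causal attention mask that additionally hides earlier *action* tokens.

    token_types: (L,) tensor, 0 = observation/text token, 1 = action token.
    Returns an (L, L) boolean mask where True means attention is allowed.
    """
    L = token_types.shape[0]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    is_action = token_types.bool()
    # Block queries from attending to earlier action tokens (self kept visible).
    hide = is_action.unsqueeze(0).expand(L, L) & ~torch.eye(L, dtype=torch.bool)
    return causal & ~hide

types = torch.tensor([0, 0, 1, 0, 1])  # obs, obs, act, obs, act
print(action_chunk_mask(types).int())
```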
Authors:Mohammed Baharoon, Jun Ma, Congyu Fang, Augustin Toma, Bo Wang
Abstract:
Multimodal Large Language Models (MLLMs) have emerged as a promising way to automate Radiology Report Generation (RRG). In this work, we systematically investigate the design space of 3D MLLMs, including visual input representation, projectors, Large Language Models (LLMs), and fine-tuning techniques for 3D CT report generation. We also introduce two knowledge-based report augmentation methods that improve performance on the GREEN score by up to 10%, achieving the 2nd place on the MICCAI 2024 AMOS-MM challenge. Our results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely independent of the size of LLM under the same training protocol. We also show that larger volume size does not always improve performance if the original ViT was pre-trained on a smaller volume size. Lastly, we show that using a segmentation mask along with the CT volume improves performance. The code is publicly available at https://github.com/bowang-lab/AMOS-MM-Solution
中文摘要:本研究系统探索了用于自动化CT报告生成的3D多模态大语言模型设计,提出的基于知识的报告增强方法在MICCAI 2024挑战赛中荣获第二名,并证明模型性能更取决于训练方案而非大语言模型规模。
English Summary: This study systematically explores the design of 3D multimodal large language models for automated CT report generation, introducing knowledge-based augmentation methods that achieved second place in a MICCAI 2024 challenge while demonstrating that model performance depends more on training protocols than LLM size.
Authors:Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal
Abstract:
People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to achieve HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: https://github.com/yahskapar/HealthChat
中文摘要:本研究通过分析1.1万条真实医疗对话数据,揭示了用户与AI交互时存在信息不完整、诱导性提问等风险,强调需提升医疗对话AI的辅助能力。
English Summary: This study analyzes 11,000 real-world health conversations with AI chatbots, revealing user interaction patterns and risks like incomplete context and sycophancy that highlight the need for improved healthcare AI capabilities.
Authors:Yihan Wang, Jia Deng
Abstract:
We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring, Sintel, and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being up to 4.1x faster than methods with similar performance. Code and model weights are available at https://github.com/princeton-vl/WAFT.
中文: WAFT是一种创新的光流方法,通过用高分辨率扭曲替代代价体积,在降低内存占用的同时实现了最优的基准测试性能,且速度比同类方法更快。
English: WAFT is a novel optical flow method that replaces cost volumes with high-resolution warping, achieving top benchmark performance with lower memory usage and faster speed than comparable approaches.
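The warping operator at the heart of this design is standard backward warping with bilinear sampling, sketched below with torch.nn.functional.grid_sample; the actual module in WAFT may differ in details such as padding and resolution handling.

```python
import torch
import torch.nn.functional as F

def warp(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `image` by `flow` with bilinear sampling.

    image: (B, C, H, W); flow: (B, 2, H, W) in pixels, channels (dx, dy).
    warped(x) = image(x + flow(x)), the generic operator applied at high
    resolution in place of a cost volume.
    """
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image)     # (2, H, W)
    coords = grid.unsqueeze(0) + flow                         # sample coordinates
    # Normalize to [-1, 1] for grid_sample (x by W, y by H).
    coords_x = 2 * coords[:, 0] / (W - 1) - 1
    coords_y = 2 * coords[:, 1] / (H - 1) - 1
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)   # (B, H, W, 2)
    return F.grid_sample(image, sample_grid, align_corners=True)

# Shifting flow by +2 px in x reproduces the image shifted left by 2 px.
img = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64); flow[:, 0] = 2.0
print(torch.allclose(warp(img, flow)[..., :-2], img[..., 2:], atol=1e-5))
```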
Authors:Mohammed Rakib, Arunkumar Bagavathi
Abstract:
Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D on multiple real-world datasets and show that G$^{2}$D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at https://github.com/rAIson-Lab/G2D.
Chinese: 本文提出梯度引导蒸馏(G²D)框架,通过定制化损失函数和动态顺序模态优先级技术解决多模态学习中的模态不平衡问题,有效增强弱模态并提升分类和回归任务的性能。
English: The paper introduces Gradient-Guided Distillation (G²D), a knowledge distillation framework that addresses modality imbalance in multimodal learning by using a custom loss function and dynamic sequential modality prioritization to enhance weak modalities and improve performance across tasks.
Authors:Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
Abstract:
In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hopes of fostering reproducibility and drive future research in Slovak NLU.
Chinese: 本文介绍了skLEP,这是首个专为评估斯洛伐克自然语言理解模型设计的综合基准,包含九项多样化任务和原始数据集,通过对多种模型进行全面评估并公开所有资源,旨在推动该领域的未来研究。
English: This paper introduces skLEP, the first comprehensive benchmark for evaluating Slovak natural language understanding models, featuring nine diverse tasks and original datasets, along with an extensive evaluation of various models and the release of all resources to promote future research.
Authors:Tin Dizdarević, Ravi Hammond, Tobias Gessler, Anisoara Calinescu, Jonathan Cook, Matteo Gallici, Andrei Lupu, Darius Muglich, Johannes Forkel, Jakob Nicolaus Foerster
Abstract:
Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is a cooperative card game featuring imperfect information, constrained communication, theory of mind requirements, and coordinated action -- making it an ideal testbed for human-AI coordination. However, its use for human-AI interaction has been limited by the challenges of human evaluation. In this work, we introduce the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to overcome the constraints of costly and difficult-to-reproduce human evaluations. We develop human proxy agents on a large-scale human dataset that serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. We present baseline results for both two- and three-player Hanabi scenarios. To ensure fair evaluation, we host the proxy agents through a controlled evaluation system rather than releasing them publicly. The code is available at https://github.com/FLAIROx/ah2ac2.
中文: 本文提出Ad-Hoc Human-AI协调挑战(AH2AC2),通过开发人类代理智能体并提供有限数据集,以解决《花火》游戏中人类评估的局限性,促进数据高效的人机协调方法发展。
English: This paper introduces the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to address the limitations of human evaluations in Hanabi by developing human proxy agents and providing a limited dataset to promote data-efficient methods for human-AI coordination.
Authors:Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Rutger H. J. Fick, Thomas Conrad, Jonas Ammeling, Nils Porsche, Robert Klopfleisch, Christopher Kaltenecker, Katharina Breininger, Marc Aubreville, Christof A. Bertram
Abstract:
Atypical mitosis marks a deviation in the cell division process that has been shown to be an independent prognostic marker for tumor malignancy. However, atypical mitosis classification remains challenging due to low prevalence, at times subtle morphological differences from normal mitotic figures, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including end-to-end trained deep learning models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new held-out AMF datasets - AtNorM-Br, a dataset of mitotic figures from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitotic figures from a subset of the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7788, and 0.7723 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively. Our work shows that atypical mitotic figure classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. We make all code and data used in this paper available in this github repository: https://github.com/DeepMicroscopy/AMi-Br_Benchmark.
中文摘要:本研究针对乳腺癌非典型有丝分裂图像分类难题,通过迁移学习和微调技术建立了深度学习基准方法,在引入新数据集验证下取得了较高分类准确率。
English Summary: This study benchmarks deep learning methods for classifying atypical mitotic figures in breast cancer, achieving high accuracy through transfer learning and fine-tuning techniques while introducing new datasets for rigorous evaluation.
Authors:Samuel Joutard, Marijn Stollenga, Marc Balle Sanchez, Mohammad Farid Azampour, Raphael Prevost
Abstract:
Medical imaging datasets often contain heterogeneous biases ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact deep segmentation networks' performance. Yet, identifying and characterizing such biases is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework using a hyper-network predicting UNets' parameters from latent vectors representing both the image and annotation variability. The hyper-network parameters and the latent vector collection corresponding to each data sample from the training set are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters where low density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: first a synthetically perturbed version of the AMOS dataset, and TotalSegmentator, a large scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters performing the segmentation task in accordance with the underlying learned systematic bias. The code and our analysis of the TotalSegmentator dataset are made available: https://github.com/ImFusionGmbH/HyperSORT
中文: HyperSORT提出了一种超网络框架,通过学习UNet参数的分布来识别和表征数据集中的偏差,从而在医学影像中实现稳健的分割和系统性错误检测。
English: HyperSORT introduces a hyper-network framework that learns a distribution of UNet parameters to identify and characterize dataset biases, enabling robust segmentation and systematic error detection in medical imaging.
Authors:Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu
Abstract:
Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control of specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.
中文:提出的XVerse模型通过将参考图像转换为文本流偏移量,实现了对文本到图像生成中多主体的精确独立控制,在不破坏图像特征的情况下完成高保真且可编辑的多主体合成。
English: The proposed XVerse model enables precise and independent control over multiple subjects in text-to-image generation by converting reference images into text-stream offsets, achieving high-fidelity and editable multi-subject synthesis without disrupting image features.
Authors:Zhirui Gao, Renjiao Yi, Yaqiao Dai, Xuening Zhu, Wei Chen, Chenyang Zhu, Kai Xu
Abstract:
This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential ``edge point cloud reconstruction and parametric curve fitting'' pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, \textbf{CurveGaussian}, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.
中文摘要:本文提出CurveGaussian单阶段框架,通过参数曲线与高斯组件的双向耦合机制及自适应拓扑优化,直接从多视角边缘图重建三维参数曲线,相比两阶段方法实现了更高效率和更优重建质量。
English Summary: This paper introduces CurveGaussian, a one-stage framework that directly reconstructs 3D parametric curves from multi-view edge maps through a bi-directional coupling mechanism and adaptive topology optimization, achieving superior efficiency and reconstruction quality over two-stage methods.
Authors:Can Liu, Chunlin Da, Xiaoxiao Long, Yuxiao Yang, Yu Zhang, Yong Wang
Abstract:
Current multimodal large language models (MLLMs), while effective in natural image understanding, struggle with visualization understanding due to their inability to decode the data-to-visual mapping and extract structured information. To address these challenges, we propose SimVec, a novel simplified vector format that encodes chart elements such as mark type, position, and size. The effectiveness of SimVec is demonstrated by using MLLMs to reconstruct chart information from SimVec formats. Then, we build a new visualization dataset, SimVecVis, to enhance the performance of MLLMs in visualization understanding, which consists of three key dimensions: bitmap images of charts, their SimVec representations, and corresponding data-centric question-answering (QA) pairs with explanatory chain-of-thought (CoT) descriptions. We finetune state-of-the-art MLLMs (e.g., MiniCPM and Qwen-VL), using SimVecVis with different dataset dimensions. The experimental results show that it leads to substantial performance improvements of MLLMs with good spatial perception capabilities (e.g., MiniCPM) in data-centric QA tasks. Our dataset and source code are available at: https://github.com/VIDA-Lab/SimVecVis.
Chinese: 当前多模态大语言模型在可视化理解方面存在不足,因此我们提出了SimVec这一简化向量格式来编码图表元素,并构建了SimVecVis数据集,显著提升了模型在数据问答任务中的表现。
English: Current multimodal large language models (MLLMs) struggle with visualization understanding, so we propose SimVec, a simplified vector format that encodes chart elements, and build the SimVecVis dataset to significantly enhance MLLMs' performance in data-centric question-answering tasks.
Authors:Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno
Abstract:
Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding. Code is available at https://github.com/Ody-trek/LLaVA-Pose.
中文: 当前视觉语言模型因缺乏专业数据而在复杂人体姿态和动作任务上表现不佳,但我们通过整合人体关键点与视觉特征的方法构建了专用数据集,微调后使模型性能显著提升了33.2%。
English: Current vision-language models struggle with complex human pose and action tasks due to a lack of specialized data, but our method integrating human keypoints with visual features creates a dataset that significantly improves model performance by 33.2% when fine-tuned.
Authors:Martin Lange, Patricia Guerra-Balboa, Javier Parra-Arnau, Thorsten Strufe
Abstract:
Privacy risks in differentially private (DP) systems increase significantly when data is correlated, as standard DP metrics often underestimate the resulting privacy leakage, leaving sensitive information vulnerable. Given the ubiquity of dependencies in real-world databases, this oversight poses a critical challenge for privacy protections. Bayesian differential privacy (BDP) extends DP to account for these correlations, yet current BDP mechanisms incur notable utility loss, limiting its adoption.
In this work, we address whether BDP can be realistically implemented in common data structures without sacrificing utility -- a key factor for its applicability. By analyzing arbitrary and structured correlation models, including Gaussian multivariate distributions and Markov chains, we derive practical utility guarantees for BDP. Our contributions include theoretical links between DP and BDP and a novel methodology for adapting DP mechanisms to meet the BDP requirements. Through evaluations on real-world databases, we demonstrate that our novel theorems enable the design of BDP mechanisms that maintain competitive utility, paving the way for practical privacy-preserving data practices in correlated settings.
中文: 本研究通过理论关联和创新适配方法,验证了贝叶斯差分隐私可在相关数据结构中实际应用并保持良好效用,为现实场景中的隐私保护实践开辟了新途径。
English: This work demonstrates that Bayesian differential privacy can be practically implemented for correlated data structures while maintaining competitive utility, through theoretical connections and novel adaptation methods validated on real-world databases.
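For orientation, the contrast between the two guarantees can be written down directly. The display below uses the textbook form of epsilon-DP and one common formulation of Bayesian DP; the notation is generic and not necessarily the paper's exact definitions or assumptions.

```latex
% Standard \varepsilon-DP: indistinguishability over worst-case neighboring
% databases D, D' for every measurable event S.
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S].
\]
% Bayesian DP instead bounds what an adversary with a prior over the
% (possibly correlated) remaining records can infer about record i:
\[
  \sup_{a,\,b,\,S}\; \log
  \frac{\Pr[\mathcal{M}(D) \in S \mid x_i = a]}
       {\Pr[\mathcal{M}(D) \in S \mid x_i = b]}
  \;\le\; \varepsilon,
\]
% where the probabilities average over the other records drawn from the joint
% data distribution, so correlations between records enter the guarantee.
```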
Authors:Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar
Abstract:
Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness (generating responses strictly supported by the context) is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at: https://github.com/chandarlab/Hallucinate-less
中文摘要:增强大型语言模型的外部上下文可提升其自然语言处理性能,但确保回答严格基于上下文仍具挑战;本研究提出一种轻量级编码器模型,在昂贵的LLM生成答案前检测查询是否基于文档,能以极低延迟和资源消耗达到与先进LLMs相当的检测精度。
English Summary: Augmenting LLMs with external context boosts NLP performance, but ensuring grounded responses remains challenging; this study introduces a lightweight encoder model that detects query grounding in documents before costly LLM processing, achieving comparable accuracy to advanced LLMs while drastically cutting latency and resource use.
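A minimal sketch of the kind of encoder-based groundedness check the abstract describes, using the Hugging Face transformers API. The checkpoint, 2-label setup, and threshold are illustrative assumptions; the classification head below is freshly initialized, whereas the paper fine-tunes on curated groundedness datasets.

```python
# Hedged sketch: pair-encode (query, document) with a RoBERTa classifier to
# predict groundedness before invoking a costly LLM.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

def is_grounded(query: str, document: str, threshold: float = 0.5) -> bool:
    # Encode the query and the context document as one sequence pair.
    inputs = tokenizer(query, document, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Convention assumed here (not the paper's): index 1 = "grounded".
    return torch.softmax(logits, dim=-1)[0, 1].item() >= threshold

# Gate the expensive LLM call on the cheap groundedness verdict.
if is_grounded("Who wrote the report?", "The report was written by the audit team."):
    pass  # proceed to LLM answer generation
```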
Authors:Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
Abstract:
While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment", their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique. Our code and data are available at https://github.com/XinXU-USTC/DoubleChecker
中文: 本文提出的Double-Checker框架通过让慢思考大语言模型进行迭代式自我批判和答案优化,显著提升了推理能力,在AIME等基准测试中的通过率从4.4%提升至18.2%。
English: This paper introduces Double-Checker, a framework that enhances slow-thinking LLMs' reasoning by enabling iterative self-critique and refinement of solutions, significantly improving performance on reasoning benchmarks like AIME from 4.4% to 18.2%.
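A hedged sketch of the critique-and-refine inference loop the abstract describes. The prompt templates and the acceptance check are placeholders, not Double-Checker's actual prompts; `llm` is any prompt-in/text-out callable.

```python
from typing import Callable

def double_check(problem: str, llm: Callable[[str], str], max_rounds: int = 4) -> str:
    """Iteratively self-critique and refine until the model accepts its answer."""
    solution = llm(f"Solve step by step:\n{problem}")
    for _ in range(max_rounds):
        critique = llm(f"Critique this solution to '{problem}':\n{solution}")
        if "correct" in critique.lower():  # toy acceptance check
            break
        solution = llm(
            f"Problem: {problem}\nPrevious solution: {solution}\n"
            f"Critique: {critique}\nProduce a refined solution."
        )
    return solution

# Usage with any prompt-in/text-out model wrapper:
# answer = double_check("What is 17 * 24?", llm=my_model_call)
```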
Authors:Jiayi Zheng, Xiaodong Cun
Abstract:
We propose FairyGen, an automatic system for generating story-driven cartoon videos from a single child's drawing, while faithfully preserving its unique artistic style. Unlike previous storytelling methods that primarily focus on character consistency and basic motion, FairyGen explicitly disentangles character modeling from stylized background generation and incorporates cinematic shot design to support expressive and coherent storytelling. Given a single character sketch, we first employ an MLLM to generate a structured storyboard with shot-level descriptions that specify environment settings, character actions, and camera perspectives. To ensure visual consistency, we introduce a style propagation adapter that captures the character's visual style and applies it to the background, faithfully retaining the character's full visual identity while synthesizing style-consistent scenes. A shot design module further enhances visual diversity and cinematic quality through frame cropping and multi-view synthesis based on the storyboard. To animate the story, we reconstruct a 3D proxy of the character to derive physically plausible motion sequences, which are then used to fine-tune an MMDiT-based image-to-video diffusion model. We further propose a two-stage motion customization adapter: the first stage learns appearance features from temporally unordered frames, disentangling identity from motion; the second stage models temporal dynamics using a timestep-shift strategy with frozen identity weights. Once trained, FairyGen directly renders diverse and coherent video scenes aligned with the storyboard. Extensive experiments demonstrate that our system produces animations that are stylistically faithful and narratively structured, with natural motion, highlighting its potential for personalized and engaging story animation. The code will be available at https://github.com/GVCLab/FairyGen
中文摘要:FairyGen是一个从单张儿童绘画自动生成故事驱动卡通视频的系统,通过风格传播适配器、镜头设计模块和运动定制技术,在保持原画独特艺术风格的同时实现连贯的叙事动画。
English Summary: FairyGen is an automated system that generates story-driven cartoon videos from a single child's drawing while preserving its unique artistic style through style propagation, cinematic shot design, and motion customization.
Authors:Xianghan Meng, Zhengyu Tong, Zhiyuan Huang, Chun-Guang Li
Abstract:
Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment the sequences of frames in video. Specifically, the structured representations learned by $\text{TR}^2\text{C}$ enjoy temporal consistency and align well with a UoS structure, which is favorable for addressing the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performances with different feature extractors. The code is available at: https://github.com/mengxianghan123/TR2C.
Chinese Summary: 本文提出时间速率降低聚类(TR²C)方法,通过联合学习结构化表示和关联性来分割视频帧序列,其学得的表示具有时间一致性且符合子空间联合结构,在多个基准测试中实现了最优性能。
English Summary: This paper introduces Temporal Rate Reduction Clustering (TR²C), a novel method for human motion segmentation that learns structured representations aligned with Union-of-Subspaces to effectively partition video frames into distinct motions, achieving state-of-the-art results across multiple benchmarks.
Authors:Xiwei Xuan, Ziquan Deng, Kwan-Liu Ma
Abstract:
Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of the models they rely on or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available at https://github.com/xiweix/ReME.
Chinese: 本研究提出一种以数据为中心的框架,通过构建具有对齐片段-文本嵌入的高质量参考集,无需模型微调即可在十个基准测试中实现卓越性能,显著提升了免训练开放词汇语义分割的效果。
English: This study introduces a data-centric framework that enhances training-free open-vocabulary semantic segmentation by constructing a high-quality reference set with aligned segment-text embeddings, achieving superior performance across ten benchmarks without model fine-tuning.
Authors:Yihong Cao, Jiaming Zhang, Xu Zheng, Hao Shi, Kunyu Peng, Hang Liu, Kailun Yang, Hui Zhang
Abstract:
Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.
中文: 本文提出UNLOCK方法,无需源数据或目标标签即可实现全景图像的无缝分割,有效应对扭曲和遮挡问题,其性能与依赖源数据的方法相当,达到领先水平。
English: This paper introduces UNLOCK, a source-free method for panoramic image segmentation that overcomes distortions and occlusions without requiring source data or target labels, achieving state-of-the-art performance comparable to source-dependent approaches.
Authors:Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Ruiping Liu, Fei Teng, Kai Luo, Zhiyong Li, Kailun Yang
Abstract:
3D Semantic Occupancy Prediction is crucial for autonomous driving, providing a dense, semantically rich environmental representation. However, existing methods focus on in-distribution scenes, making them susceptible to Out-of-Distribution (OoD) objects and long-tail distributions, which increases the risk of undetected anomalies and misinterpretations, posing safety hazards. To address these challenges, we introduce Out-of-Distribution Semantic Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360. We introduce OccOoD, a novel framework integrating OoD detection into 3D semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF) leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, while maintaining competitive occupancy prediction performance. The established datasets and source code will be made publicly available at https://github.com/7uHeng/OccOoD.
Chinese: 本研究提出了分布外语义占据预测方法和OccOoD框架,通过合成数据集增强自动驾驶系统在三维体素空间中的异常检测能力,在保持竞争力的占据预测性能同时实现了最先进的分布外检测效果。
English: This study introduces Out-of-Distribution Semantic Occupancy Prediction and the OccOoD framework to enhance autonomous driving safety by detecting anomalies in 3D voxel space, supported by synthetic datasets and achieving state-of-the-art OoD detection performance.
Authors:Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
Abstract:
The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB's continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results' generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb
中文: 本文阐述了提升大规模文本嵌入基准(MTEB)可复现性和可扩展性的工程实践,包括稳健的持续集成流程以及处理社区贡献与扩展数据集的方法。
English: This paper details the engineering practices that enhance the reproducibility and extensibility of the Massive Text Embedding Benchmark (MTEB), including robust continuous integration pipelines and strategies for community contributions and dataset expansion.
Authors:Longkun Zou, Kangjun Liu, Ke Chen, Kailing Guo, Kui Jia, Yaowei Wang
Abstract:
Learning semantic representations from point sets of 3D object shapes is often challenged by significant geometric variations, primarily due to differences in data acquisition methods. Typically, training data is generated using point simulators, while testing data is collected with distinct 3D sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits the generalization ability of point classifiers. Current unsupervised domain adaptation (UDA) techniques struggle with this gap, as they often lack robust, domain-insensitive descriptors capable of capturing global topological information, resulting in overfitting to the limited semantic patterns of the source domain. To address this issue, we introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures, and by modeling the topological relations of local geometric features through a novel self-supervised learning task. Additionally, we propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training, effectively reducing the impact of noisy pseudo-labels and enhancing the robustness of the adaptation process. Experimental results on three public Sim2Real benchmarks validate the effectiveness of our TAM framework, showing consistent improvements over state-of-the-art methods across all evaluated tasks. The source code of this work will be available at https://github.com/zou-longkun/TAG.git.
中文摘要:本研究提出的拓扑感知建模框架通过利用全局空间拓扑结构和新型自监督学习任务,有效解决了三维点云分类中的仿真到现实领域差异问题,在多个基准测试中展现出优越性能。
English Summary: The proposed Topology-Aware Modeling framework addresses the Sim2Real domain gap in 3D point cloud classification by leveraging global spatial topology and novel self-supervised learning, demonstrating superior performance on benchmarks.
Authors:He Li, Haoang Chi, Mingyu Liu, Wanrong Huang, Liyang Xu, Wenjing Yang
Abstract:
The real world naturally has dimensions of time and space. Therefore, estimating the counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attributes using the Transformer, which exhibits stronger estimation ability. Under mild assumptions, the proposed estimator within this framework is consistent and asymptotically normal. To validate the effectiveness of our approach, we conduct simulation experiments and real data experiments. Simulation experiments show that our estimator has a stronger estimation capability than baseline methods. Real data experiments provide a valuable conclusion about the causal effect of conflicts on forest loss in Colombia. The source code is available at https://github.com/lihe-maxsize/DeppSTCI_Release_Version-master.
中文: 本文提出了一种基于Transformer的新颖框架,用于估计具有时空特征的反事实结果,通过模拟实验和对哥伦比亚冲突影响的真实数据分析,证明了该方法优于基线模型的性能。
English: This paper introduces a novel Transformer-based framework for estimating counterfactual outcomes with spatial-temporal attributes, demonstrating superior performance over baseline methods through simulations and real-world experiments on conflict impacts in Colombia.
Authors:Ziwei Wang, Hongbin Wang, Tianwang Jia, Xingyi He, Siyang Li, Dongrui Wu
Abstract:
Electroencephalography (EEG)-based brain-computer interfaces (BCIs) transform spontaneous/evoked neural activity into control commands for external communication. While convolutional neural networks (CNNs) remain the mainstream backbone for EEG decoding, their inherently short receptive field makes it difficult to capture long-range temporal dependencies and global inter-channel relationships. Recent CNN-Transformer (Conformer) hybrids partially address this issue, but most adopt a serial design, resulting in suboptimal integration of local and global features, and often overlook explicit channel-wise modeling. To address these limitations, we propose DBConformer, a dual-branch convolutional Transformer network tailored for EEG decoding. It integrates a temporal Conformer to model long-range temporal dependencies and a spatial Conformer to extract inter-channel interactions, capturing both temporal dynamics and spatial patterns in EEG signals. A lightweight channel attention module further refines spatial representations by assigning data-driven importance to EEG channels. Extensive experiments under four evaluation settings on three paradigms, including motor imagery, seizure detection, and steady-state visual evoked potential, demonstrated that DBConformer consistently outperformed 13 competitive baseline models, with over an eight-fold reduction in parameters than current high-capacity EEG Conformer architecture. Furthermore, the visualization results confirmed that the features extracted by DBConformer are physiologically interpretable and aligned with prior knowledge. The superior performance and interpretability of DBConformer make it reliable for accurate, robust, and explainable EEG decoding. Code is publicly available at https://github.com/wzwvv/DBConformer.
中文:提出的DBConformer模型通过结合时序与空间Transformer及通道注意力机制,显著提升了EEG解码性能,在多种实验范式中均表现出优越的准确性和生理可解释性,同时大幅减少了参数量。
English: The proposed DBConformer model enhances EEG decoding by combining temporal and spatial Transformers with channel attention, achieving superior performance and interpretability across multiple paradigms while significantly reducing parameters.
Authors:Hai Jiang, Binhao Guan, Zhen Liu, Xiaohong Liu, Jian Yu, Zheng Liu, Songchen Han, Shuaicheng Liu
Abstract:
Learning-based methods have made promising advances in low-light RAW image enhancement, but their capability in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains unexplored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/JianghaiSCU/SIED.
中文摘要:本研究提出了一个名为SIED的新数据集,用于增强极暗环境下的RAW图像,并开发了一种基于扩散的框架,能够从极低信噪比的RAW输入中有效恢复高质量的视觉效果。
English Summary: This study introduces a new dataset called SIED for enhancing extremely low-light RAW images and proposes a diffusion-based framework that effectively restores high-quality visuals from very dark inputs.
Authors:Luosheng Xu, Dalin Zhang, Zhaohui Song
Abstract:
Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD (the name suggests "a quick flick, then great results"), pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Efficient Global Self-Attention (EGSA) to effectively capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off. The implementation code is publicly available at https://github.com/xulsh8/FlickCD.
中文: 本研究提出轻量级遥感变化检测模型FlickCD,通过增强差异模块和局部-全局融合块在显著降低计算与存储开销的同时保持高精度,以最小资源消耗实现了最优性能。
English: This study introduces FlickCD, a lightweight remote sensing change detection model that significantly reduces computational and storage demands while maintaining high accuracy through its Enhanced Difference Module and Local-Global Fusion Blocks, achieving state-of-the-art performance with minimal resource consumption.
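One plausible reading of an "enhanced difference" step for bi-temporal features, sketched below under our own assumptions; FlickCD's actual EDM is specified in the paper and repository, and this module is only a generic illustration of amplifying change-relevant differences while down-weighting regions that barely changed (e.g., lighting variation).

```python
import torch
import torch.nn as nn

class DifferenceModule(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        diff = torch.abs(f1 - f2)   # change-sensitive signal
        gate = torch.tanh(diff)     # grows with change magnitude, in [0, 1)
        # Emphasize changed regions in both phases, then fuse with the raw diff.
        return self.mix(torch.cat([f1 * gate, f2 * gate, diff], dim=1))

# f1, f2: (B, C, H, W) features extracted from the two temporal phases.
```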
Authors:Tim Lawson, Laurence Aitchison
Abstract:
Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
中文: 本文提出了一种新型Transformer架构,通过学习的门控机制动态跳过中间层的可变数量,但与层数更少的基线模型相比,该方法未能在效率与准确性的权衡中取得改进。
English: This paper introduces a novel Transformer architecture that dynamically skips variable numbers of middle layers using a learned gating mechanism, though it fails to improve the efficiency-accuracy trade-off compared to fewer-layer baselines.
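A minimal PyTorch sketch of the central idea: a learned gate deciding, per input, whether to bypass a symmetric span of middle blocks. The soft gate and mean-pooled gating signal are our assumptions for illustration; the paper's architecture adds gated attention, sandwich norms, and an adaptive sparsity loss. Assumes span < len(blocks) // 2.

```python
import torch
import torch.nn as nn

class SkipMiddle(nn.Module):
    def __init__(self, blocks: nn.ModuleList, span: int, d_model: int):
        super().__init__()
        self.blocks = blocks
        mid = len(blocks) // 2
        self.skip_range = range(mid - span, mid + span)  # symmetric central span
        self.gate = nn.Linear(d_model, 1)                # learned skip gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One gate per sequence: 1 = execute the central span, 0 = bypass it.
        g = torch.sigmoid(self.gate(x.mean(dim=1))).unsqueeze(1)  # (B, 1, 1)
        for i, block in enumerate(self.blocks):
            if i in self.skip_range:
                x = g * block(x) + (1 - g) * x  # soft skip, differentiable
            else:
                x = block(x)
        return x

# Example wiring with standard encoder layers:
# blocks = nn.ModuleList(nn.TransformerEncoderLayer(64, 4, batch_first=True)
#                        for _ in range(8))
# model = SkipMiddle(blocks, span=2, d_model=64)
```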
Authors:Yann Kerzreho
Abstract:
This paper introduces a new approach for approximating the learning dynamics of multiple reinforcement learning (RL) agents interacting in a finite-state Markov game. The idea is to rescale the learning process by simultaneously reducing the learning rate and increasing the update frequency, effectively treating the agent's parameters as a slow-evolving variable influenced by the fast-mixing game state. Under mild assumptions (ergodicity of the state process and continuity of the updates), we prove the convergence of this rescaled process to an ordinary differential equation (ODE). This ODE provides a tractable, deterministic approximation of the agent's learning dynamics. An implementation of the framework is available at: https://github.com/yannKerzreho/MarkovGameApproximation
中文: 本文提出一种重标度学习方法,通过将智能体参数视为慢变变量来近似马尔可夫博弈中的多智能体强化学习动态,并在遍历性和连续性假设下证明了该方法会收敛到确定性常微分方程。
English: This paper presents a rescaled learning method that approximates multi-agent reinforcement learning dynamics in Markov games by treating agent parameters as slow variables, with proven convergence to a deterministic ODE under ergodicity and continuity assumptions.
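The rescaling argument follows the classical ODE method of stochastic approximation; the generic form below is for orientation and may differ from the paper's exact notation.

```latex
% With learning rate \eta and update function g, the parameters evolve as
\[
  \theta_{t+1} = \theta_t + \eta\, g(\theta_t, s_t),
\]
% where s_t is the fast-mixing game state. Shrinking \eta while increasing the
% update frequency averages g over the state's stationary distribution
% \mu_\theta, so the rescaled trajectory converges to the ODE
\[
  \dot{\theta} = \mathbb{E}_{s \sim \mu_\theta}\!\left[ g(\theta, s) \right].
\]
```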
Authors:Demin Zhang, Jiahao Lyu, Zhijie Shen, Yu Zhou
Abstract:
Document understanding and analysis have received a lot of attention due to their widespread application. However, existing document analysis solutions, such as document layout analysis and key information extraction, are only suitable for fixed category definitions and granularities, and cannot achieve flexible applications customized by users. Therefore, this paper defines a new task named ``Class-Agnostic Region-of-Interest Matching'' (``RoI-Matching'' for short), which aims to match the customized regions in a flexible, efficient, multi-granularity, and open-set manner. The visual prompt of the reference document and target document images are fed into our model, while the output is the corresponding bounding boxes in the target document images. To meet the above requirements, we construct a benchmark RoI-Matching-Bench, which sets three levels of difficulties following real-world conditions, and propose the macro and micro metrics to evaluate. Furthermore, we also propose a new framework RoI-Matcher, which employs a siamese network to extract multi-level features both in the reference and target domains, and cross-attention layers to integrate and align similar semantics in different domains. Experiments show that our method with a simple procedure is effective on RoI-Matching-Bench, and serves as the baseline for further research. The code is available at https://github.com/pd162/RoI-Matching.
中文摘要:本文提出了一种名为“类别无关兴趣区域匹配”的新任务,通过构建基准测试和采用孪生网络框架,实现了灵活、多粒度的文档区域自定义匹配,为文档分析提供了有效的解决方案。
English Summary: This paper introduces a new task called "Class-Agnostic Region-of-Interest Matching" (RoI-Matching) to enable flexible, user-customized document analysis, proposing both a benchmark and a siamese network-based framework that effectively handles multi-granularity matching across documents.
Authors:Shangbo Wu, Yu-an Tan, Ruinan Ma, Wencong Ma, Dehua Zhu, Yuanzhang Li
Abstract:
The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. In previous work, these features ubiquitously come from supervised learning. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA -- a generative dual self-supervised ViT features attack, that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we achieve remarkable black-box transferability to models of various architectures, outperforming the state of the art. Code available at https://github.com/spencerwooo/dSVA.
中文: 本文提出dSVA攻击方法,利用自监督视觉变换器的全局与局部特征生成对抗样本,在多种架构的黑盒模型中展现出卓越的迁移性能。
English: This paper introduces dSVA, a generative attack that leverages both global and local features from self-supervised Vision Transformers to craft adversarial examples with superior black-box transferability across diverse model architectures.
Authors:Boyong He, Yuxiang Ji, Zhuoyue Tan, Liaoni Wu
Abstract:
Detectors often suffer from performance drop due to domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on object. We also apply consistency loss to align the auxiliary and ordinary branch, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on COCO generalization benchmark demonstrate that our method maintains significant advantages and show remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at \href{https://github.com/heboyong/Fitness-Generalization-Transferability}{https://github.com/heboyong/Fitness-Generalization-Transferability}.
中文: 本文提出一种方法,利用单步扩散过程的中间特征,将推理时间减少75%的同时提升源域性能,并通过构建以物体为中心的辅助分支和一致性损失,有效提高跨域检测任务的泛化性和迁移能力。
English: This paper introduces a method that leverages intermediate features from a single-step diffusion process to significantly reduce inference time by 75% while enhancing performance on source domains, and employs an object-centered auxiliary branch with consistency loss to improve generalization and transferability in cross-domain detection tasks.
Authors:Lei Hao, Lina Xu, Chang Liu, Yanni Dong
Abstract:
Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at https://github.com/leileilei2000/LASFNet.
Chinese: 本研究提出了一种轻量级多模态目标检测网络LASFNet,通过单一融合单元和注意力机制,在多个数据集上实现检测精度提升1%-3%的同时,将计算成本最高降低90%。
English: This study introduces LASFNet, a lightweight multimodal object detection network that uses a single fusion unit and attention mechanisms to significantly reduce computational costs by up to 90% while improving detection accuracy by 1%-3% across multiple datasets.
Authors:Tyler Ward, Xiaoqin Wang, Braxton McFarland, Md Atik Ahamed, Sahar Nozad, Talal Arshad, Hafsa Nebbache, Jin Chen, Abdullah Imran
Abstract:
Complete removal of cancer tumors with a negative specimen margin during lumpectomy is essential in reducing breast cancer recurrence. However, 2D specimen radiography (SR), the current method used to assess intraoperative specimen margin status, has limited accuracy, resulting in nearly a quarter of patients requiring additional surgery. To address this, we propose a novel deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), a pre-training strategy leveraging both local and global contrastive learning for patch-level classification of SR images. After annotating SR images with regions of known malignancy, non-malignant tissue, and pathology-confirmed margins, we pre-train a ResNet-18 backbone with FFCL to classify margin status, then reconstruct coarse binary masks to prompt SAM for refined tumor margin segmentation. Our approach achieved an AUC of 0.8455 for margin classification and segmented margins with a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. These results demonstrate that FFCL-SAM significantly enhances both the speed and accuracy of intraoperative margin assessment, with strong potential to reduce re-excision rates and improve surgical outcomes in breast cancer treatment. Our code is available at https://github.com/tbwa233/FFCL-SAM/.
中文: 一种结合Segment Anything模型与前向对比学习的新型深度学习框架,显著提高了乳腺癌手术中切缘评估的速度和准确性,其改进的分类和分割效果有望降低再切除率。
English: A novel deep learning framework combining the Segment Anything Model with Forward-Forward Contrastive Learning significantly improves the speed and accuracy of intraoperative margin assessment in breast cancer surgery, achieving enhanced classification and segmentation results that could reduce re-excision rates.
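A small sketch of the patch-classification-to-coarse-mask step described above, under our own assumptions: `classify_patch` stands in for the FFCL-pretrained ResNet-18 classifier, and the coarse binary mask it produces would then be used to prompt SAM for refined segmentation.

```python
import numpy as np

def coarse_mask(image: np.ndarray, classify_patch, patch: int = 64) -> np.ndarray:
    """Tile the radiograph, classify each patch, and assemble a binary mask."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            if classify_patch(image[y:y + patch, x:x + patch]) == 1:  # malignant
                mask[y:y + patch, x:x + patch] = 1
    return mask
```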
Authors:Qiuyi Qi, Xin Li, Ming Kong, Zikang Xu, Bingdi Chen, Qiang Zhu, S Kevin Zhou
Abstract:
Challenges such as the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles pose significant obstacles to training neural networks to detect abnormal cells in cytopathology robustly. This paper proposes a style-aligned image composition (SAIC) method that composes high-fidelity and style-preserved pathological images to enhance the effectiveness and robustness of detection models. Without additional training, SAIC first selects an appropriate candidate from the abnormal cell bank based on attribute guidance. Then, it employs a high-frequency feature reconstruction to achieve a style-aligned and high-fidelity composition of abnormal cells and pathological backgrounds. Finally, it introduces a large vision-language model to filter high-quality synthesis images. Experimental results demonstrate that incorporating SAIC-synthesized images effectively enhances the performance and robustness of abnormal cell detection for tail categories and styles, thereby improving overall detection performance. The comprehensive quality evaluation further confirms the generalizability and practicality of SAIC in clinical application scenarios. Our code will be released at https://github.com/Joey-Qi/SAIC.
中文摘要:本文提出的风格对齐图像合成(SAIC)方法通过生成高保真病理图像,有效解决数据质量不足和染色差异问题,显著提升了异常细胞检测模型的鲁棒性和综合性能,在临床应用中展现出良好泛化能力。
English Summary: This paper introduces the Style-Aligned Image Composition (SAIC) method, which generates high-fidelity pathological images to enhance abnormal cell detection by addressing data limitations and staining inconsistencies, thereby improving model robustness and performance across diverse clinical scenarios.
Authors:Wenjie Xuan, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao
Abstract:
Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, which remain underexplored, owing to their simplicity and shape-agnostic nature. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Codes will be available at https://github.com/DREAMXFAR/SP-Ctrl.
中文摘要:本文提出SP-Ctrl方法,通过增强稀疏姿态信号实现更有效的姿态引导图像生成,在保持简单性的同时达到与密集信号方法相当的性能,并展现出优异的跨物种生成能力。
English Summary: This paper introduces SP-Ctrl, a novel method that enhances sparse pose signals for more effective pose-guided image generation, achieving performance comparable to dense signal approaches while offering improved simplicity and cross-species generation capabilities.
Authors:Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Weigang Lu
Abstract:
Real-world networks usually have a property of node heterophily, that is, the connected nodes usually have different features or different labels. This heterophily issue has been extensively studied in homogeneous graphs but remains under-explored in heterogeneous graphs, where there are multiple types of nodes and edges. Capturing node heterophily in heterogeneous graphs is very challenging since both node/edge heterogeneity and node heterophily should be carefully taken into consideration. Existing methods typically convert heterogeneous graphs into homogeneous ones to learn node heterophily, which will inevitably lose the potential heterophily conveyed by heterogeneous relations. To bridge this gap, we propose Relation-Aware Separation of Homophily and Heterophily (RASH), a novel contrastive learning framework that explicitly models high-order semantics of heterogeneous interactions and adaptively separates homophilic and heterophilic patterns. Particularly, RASH introduces dual heterogeneous hypergraphs to encode multi-relational bipartite subgraphs and dynamically constructs homophilic graphs and heterophilic graphs based on relation importance. A multi-relation contrastive loss is designed to align heterogeneous and homophilic/heterophilic views by maximizing mutual information. In this way, RASH simultaneously resolves the challenges of heterogeneity and heterophily in heterogeneous graphs. Extensive experiments on benchmark datasets demonstrate the effectiveness of RASH across various downstream tasks. The code is available at: https://github.com/zhengziyu77/RASH.
中文摘要:RASH框架通过引入双重异质超图和对比学习,在保留关系语义的同时自适应分离同质与异质模式,解决了异质图中节点异质性研究不足的挑战。
English Summary: The proposed RASH framework addresses the under-explored challenge of node heterophily in heterogeneous graphs by introducing dual hypergraphs and contrastive learning to adaptively separate homophilic and heterophilic patterns while preserving relational semantics.
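For reference, the InfoNCE-style objective typically used to maximize mutual information between two views; RASH's multi-relation contrastive loss builds on this general form (the notation here is generic, not the paper's).

```latex
% z_i and z_i^{+} are embeddings of the same node under two views
% (e.g., heterogeneous vs. homophilic/heterophilic graphs), sim is cosine
% similarity, and \tau is a temperature.
\[
  \mathcal{L} \;=\; -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\!\left( \mathrm{sim}(z_i, z_i^{+}) / \tau \right)}
            {\sum_{j=1}^{N} \exp\!\left( \mathrm{sim}(z_i, z_j) / \tau \right)}.
\]
```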
Authors:Naihe Feng, Yi Sui, Shiyi Hou, Jesse C. Cresswell, Ga Wu
Abstract:
Existing research on Retrieval-Augmented Generation (RAG) primarily focuses on improving overall question-answering accuracy, often overlooking the quality of sub-claims within generated responses. Recent methods that attempt to improve RAG trustworthiness, such as through auto-evaluation metrics, lack probabilistic guarantees or require ground truth answers. To address these limitations, we propose Conformal-RAG, a novel framework inspired by recent applications of conformal prediction (CP) on large language models (LLMs). Conformal-RAG leverages CP and internal information from the RAG mechanism to offer statistical guarantees on response quality. It ensures group-conditional coverage spanning multiple sub-domains without requiring manual labelling of conformal sets, making it suitable for complex RAG applications. Compared to existing RAG auto-evaluation methods, Conformal-RAG offers statistical guarantees on the quality of refined sub-claims, ensuring response reliability without the need for ground truth answers. Additionally, our experiments demonstrate that by leveraging information from the RAG system, Conformal-RAG retains up to 60\% more high-quality sub-claims from the response compared to direct applications of CP to LLMs, while maintaining the same reliability guarantee.
中文: Conformal-RAG提出了一种利用保形预测的新框架,为RAG生成回答中的子声明质量提供统计保证,无需真实答案即可确保可靠性,同时保留更多高质量内容。
English: Conformal-RAG introduces a novel framework using conformal prediction to provide statistical guarantees on the quality of sub-claims in RAG-generated responses, eliminating the need for ground truth answers while maintaining reliability and retaining more high-quality content.
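A generic split-conformal calibration sketch for claim filtering: this is the standard CP recipe rather than Conformal-RAG's specific construction, which additionally uses internal RAG signals and provides group-conditional coverage. Scores here are hypothetical nonconformity values, higher meaning a sub-claim is more likely unsupported.

```python
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    n = len(cal_scores)
    # Finite-sample-corrected quantile level from split conformal prediction.
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q, method="higher"))

cal = np.random.rand(500)          # stand-in calibration scores
tau = conformal_threshold(cal)     # keep claims scoring at or below tau
new_claims = np.random.rand(20)    # stand-in scores for a fresh response
kept = new_claims[new_claims <= tau]
```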
Authors:Tian-Yu Xiang, Ao-Qun Jin, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Sheng-Bin Duan, Fu-Chao Xie, Wen-Kai Wang, Si-Cheng Wang, Ling-Yun Li, Tian Tu, Zeng-Guang Hou
Abstract:
Vision-language-action (VLA) models extend vision-language models (VLM) by integrating action generation modules for robotic manipulation. Leveraging strengths of VLM in vision perception and instruction understanding, VLA models exhibit promising generalization across diverse manipulation tasks. However, applications demanding high precision and accuracy reveal performance gaps without further adaptation. Evidence from multiple domains highlights the critical role of post-training to align foundational models with downstream applications, spurring extensive research on post-training VLA models. VLA model post-training aims to address the challenge of improving an embodiment's ability to interact with the environment for the given tasks, analogous to the process of human motor skill acquisition. Accordingly, this paper reviews post-training strategies for VLA models through the lens of human motor learning, focusing on three dimensions: environments, embodiments, and tasks. A structured taxonomy is introduced aligned with human learning mechanisms: (1) enhancing environmental perception, (2) improving embodiment awareness, (3) deepening task comprehension, and (4) multi-component integration. Finally, key challenges and trends in post-training VLA models are identified, establishing a conceptual framework to guide future research. This work delivers both a comprehensive overview of current VLA model post-training methods from a human motor learning perspective and practical insights for VLA model development. (Project website: https://github.com/AoqunJin/Awesome-VLA-Post-Training)
中文: 视觉-语言-动作模型通过整合动作生成模块增强机器人操作能力,但需借鉴人类运动学习机制进行后训练以提升任务精度和适应性。
English: Vision-language-action (VLA) models enhance robotic manipulation by integrating action generation but require post-training to improve precision and adaptability, drawing inspiration from human motor learning strategies.
Authors:Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou
Abstract:
Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large language models (LLMs) by structuring retrieval over an external corpus. However, existing approaches typically assume a static corpus, requiring expensive full-graph reconstruction whenever new documents arrive, limiting their scalability in dynamic, evolving environments. To address these limitations, we introduce EraRAG, a novel multi-layered Graph-RAG framework that supports efficient and scalable dynamic updates. Our method leverages hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize the original corpus into hierarchical graph structures, enabling efficient and localized insertions of new data without disrupting the existing topology. The design eliminates the need for retraining or costly recomputation while preserving high retrieval accuracy and low latency. Experiments on large-scale benchmarks demonstrate that EraRAG achieves up to an order of magnitude reduction in update time and token consumption compared to existing Graph-RAG systems, while providing superior accuracy. This work offers a practical path forward for RAG systems that must operate over continually growing corpora, bridging the gap between retrieval efficiency and adaptability. Our code and data are available at https://github.com/EverM0re/EraRAG-Official.
中文摘要:EraRAG提出了一种动态图检索增强生成框架,通过分层图结构和局部敏感哈希分区实现对不断更新的文档库的高效维护,在无需全图重建的情况下显著提升了更新速度与检索精度。
English Summary: EraRAG introduces a dynamic Graph-RAG framework that enables efficient updates to evolving document corpora through hierarchical graph structures and LSH partitioning, achieving significant improvements in update speed and accuracy without full-graph reconstruction.
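The hyperplane-LSH idea at the core of the localized updates can be sketched in a few lines (a simplified illustration, not EraRAG's full hierarchical construction): a chunk embedding is hashed by its sign pattern against fixed random hyperplanes, so a newly arriving document lands in one bucket and only the affected part of the graph needs rebuilding.

```python
import numpy as np

def make_hasher(dim, n_planes=8, seed=0):
    """Hyperplane LSH: nearby embeddings tend to share sign patterns
    against random hyperplanes, hence the same bucket."""
    planes = np.random.default_rng(seed).normal(size=(n_planes, dim))
    def bucket(vec):
        bits = (planes @ vec > 0).astype(int)
        return int("".join(map(str, bits)), 2)   # integer bucket id
    return bucket

hasher = make_hasher(dim=768)
# Inserting a new chunk touches only buckets[hasher(embedding)],
# not the whole graph.
```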
Authors:Jiameng Chen, Xiantao Cai, Jia Wu, Wenbin Hu
Abstract:
Antibody design remains a critical challenge in therapeutic and diagnostic development, particularly for complex antigens with diverse binding interfaces. Current computational methods face two main limitations: (1) capturing geometric features while preserving symmetries, and (2) generalizing to novel antigen interfaces. Despite recent advancements, these methods often fail to accurately capture molecular interactions and maintain structural integrity. To address these challenges, we propose \textbf{AbMEGD}, an end-to-end framework integrating \textbf{M}ulti-scale \textbf{E}quivariant \textbf{G}raph \textbf{D}iffusion for antibody sequence and structure co-design. Leveraging advanced geometric deep learning, AbMEGD combines atomic-level geometric features with residue-level embeddings, capturing local atomic details and global sequence-structure interactions. Its E(3)-equivariant diffusion method ensures geometric precision, computational efficiency, and robust generalizability for complex antigens. Furthermore, experiments using the SAbDab database demonstrate a 10.13\% increase in amino acid recovery, a 3.32\% rise in improvement percentage, and a 0.062~Å reduction in root mean square deviation within the critical CDR-H3 region compared to DiffAb, a leading antibody design model. These results highlight AbMEGD's ability to balance structural integrity with improved functionality, establishing a new benchmark for sequence-structure co-design and affinity optimization. The code is available at: https://github.com/Patrick221215/AbMEGD.
中文: 提出的AbMEGD框架通过整合多尺度等变图扩散技术,实现了抗体序列与结构的协同设计,在氨基酸恢复率和结构精度方面均优于现有模型,为复杂抗原的抗体设计确立了新标准。
English: The proposed AbMEGD framework integrates multi-scale equivariant graph diffusion to overcome current limitations in antibody design by co-designing sequences and structures, achieving superior performance in amino acid recovery and structural accuracy compared to existing models.
Authors:Shubhankar Borse, Seokeon Choi, Sunghyun Park, Jeongho Kim, Shreya Kadambi, Risheek Garrepalli, Sungrack Yun, Munawar Hayat, Fatih Porikli
Abstract:
Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation. The dataset and evaluation codes will be available at https://github.com/Qualcomm-AI-research/MultiHuman-Testbench.
中文: 本文提出了MultiHuman-Testbench这一包含1800个样本和5550张多样化人脸的基准测试,用于评估多人图像生成模型,通过创新的评估指标和技术显著提升了身份相似度,为该领域研究提供了标准化工具。
English: This paper introduces MultiHuman-Testbench, a comprehensive benchmark with 1,800 samples and 5,550 diverse face images to evaluate multi-human image generation models, proposing novel evaluation metrics and techniques that significantly improve identity similarity and provide standardized tools for advancing the field.
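The Hungarian-matching step for ID similarity can be written directly with SciPy; the face-embedding model and cosine-similarity choice are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_identities(ref_embs, gen_embs):
    """Assign each generated face to a reference identity by maximizing
    total cosine similarity, then report the mean matched similarity."""
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    sim = ref @ gen.T
    rows, cols = linear_sum_assignment(-sim)   # negate to maximize
    return list(zip(rows, cols)), float(sim[rows, cols].mean())
```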
Authors:Milad Hasanzadeh, Amin Kargarian
Abstract:
\textit{DPLib} is an open-source MATLAB-based benchmark library created to support research and development in distributed and decentralized power system analysis and optimization. Distributed and decentralized methods offer scalability, privacy preservation, and resilience to single points of failure, making them increasingly important for modern power systems. However, unlike centralized tools such as MATPOWER, no general-purpose, reproducible data library package currently exists for distributed power system studies. DPLib, available at \href{https://github.com/LSU-RAISE-LAB/DPLib.git}{GitHub}, fills this gap by providing a standard power system library featuring over 20 multi-region benchmark test cases of varying sizes, along with a graph-based partitioning toolkit that decomposes any MATPOWER test system into multiple electrically coherent regions. The partitioning toolkit, an easy-to-use MATLAB code, generates standardized \texttt{.mat} and \texttt{.m} files, along with region visualizations for intuitive understanding. We also provide modular, easy-to-use distributed optimal power flow (OPF) solvers: an alternating direction method of multipliers (ADMM)-based DC-OPF solver implemented in YALMIP, and an ADMM-based AC-OPF solver leveraging IPOPT. These solvers validate the generated test systems for distributed optimization applications. Numerical results validate the generated test cases, establishing DPLib as a foundation for reproducible distributed power system research.
中文: DPLib是一个开源的MATLAB基准库,通过提供标准化的多区域电力系统测试案例和分区工具,填补了分布式电力系统研究中缺乏通用数据包的空白,支持可重复性研究。
English: DPLib is an open-source MATLAB library that provides standardized multi-region power system test cases and partitioning tools to support reproducible research in distributed and decentralized power system optimization.
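DPLib's solvers are MATLAB/YALMIP implementations; for intuition, the consensus-ADMM loop such distributed OPF solvers instantiate looks like the Python skeleton below, where `solve_region` (a stand-in name) is each region's local OPF subproblem.

```python
import numpy as np

def consensus_admm(solve_region, n_regions, dim, rho=1.0, iters=200):
    """Generic consensus ADMM: regions solve local subproblems, boundary
    variables are averaged, and scaled duals are updated."""
    x = np.zeros((n_regions, dim))   # local copies of boundary variables
    z = np.zeros(dim)                # consensus values
    u = np.zeros((n_regions, dim))   # scaled dual variables
    for _ in range(iters):
        for r in range(n_regions):
            x[r] = solve_region(r, z - u[r], rho)  # local OPF step
        z = (x + u).mean(axis=0)                   # consensus update
        u += x - z                                 # dual update
    return z
```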
Authors:Ali Tourani, Fatemeh Nazary, Yashar Deldjoo
Abstract:
This paper addresses the challenge of developing multimodal recommender systems for the movie domain, where limited metadata (e.g., title, genre) often hinders the generation of robust recommendations. We introduce a resource that combines LLM-generated plot descriptions with trailer-derived visual embeddings in a unified pipeline supporting both Retrieval-Augmented Generation (RAG) and collaborative filtering. Central to our approach is a data augmentation step that transforms sparse metadata into richer textual signals, alongside fusion strategies (e.g., PCA, CCA) that integrate visual cues. Experimental evaluations demonstrate that CCA-based fusion significantly boosts recall compared to unimodal baselines, while an LLM-driven re-ranking step further improves NDCG, particularly in scenarios with limited textual data. By releasing this framework, we invite further exploration of multi-modal recommendation techniques tailored to cold-start, novelty-focused, and domain-specific settings. All code, data, and detailed documentation are publicly available at: https://github.com/RecSys-lab/RAG-VisualRec
中文摘要:本文提出了一种多模态电影推荐系统,通过融合大模型生成的剧情描述与预告片视觉特征,结合跨模态融合与重排序策略,显著提升了冷启动和稀疏数据场景下的推荐效果。
English Summary: This paper introduces a multimodal movie recommender system that enhances sparse metadata with LLM-generated plots and visual trailer embeddings, achieving superior performance through cross-modal fusion and LLM-driven re-ranking.
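The CCA fusion step can be sketched with scikit-learn; the component count and variable names are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fuse(text_embs, vis_embs, n_components=64):
    """Project LLM-text and trailer-visual embeddings into a shared
    correlated space, then concatenate the aligned views as the fused
    item representation used for retrieval."""
    cca = CCA(n_components=n_components)
    t_c, v_c = cca.fit_transform(text_embs, vis_embs)
    return np.concatenate([t_c, v_c], axis=1)
```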
Authors:Lucius Bushnaq, Dan Braun, Lee Sharkey
Abstract:
A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition -- a framework that has been proposed to resolve several issues with current decomposition methods -- decomposes neural network parameters into a sum of sparsely used vectors in parameter space. However, the current main method in this framework, Attribution-based Parameter Decomposition (APD), is impractical on account of its computational cost and sensitivity to hyperparameters. In this work, we introduce \textit{Stochastic Parameter Decomposition} (SPD), a method that is more scalable and robust to hyperparameters than APD, which we demonstrate by decomposing models that are slightly larger and more complex than was possible to decompose with APD. We also show that SPD avoids other issues, such as shrinkage of the learned parameters, and better identifies ground truth mechanisms in toy models. By bridging causal mediation analysis and network decomposition methods, this demonstration opens up new research possibilities in mechanistic interpretability by removing barriers to scaling linear parameter decomposition methods to larger models. We release a library for running SPD and reproducing our experiments at https://github.com/goodfire-ai/spd/tree/spd-paper.
中文: 本文提出随机参数分解(SPD)方法,相比现有技术更具可扩展性和鲁棒性,能够分解更复杂的神经网络模型,为机制可解释性研究开辟了新途径。
English: This paper introduces Stochastic Parameter Decomposition (SPD), a more scalable and robust method than previous approaches, enabling decomposition of larger neural networks and advancing mechanistic interpretability research.
Authors:Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand
Abstract:
Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA's Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch -- spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight-activation precision combinations -- we consistently observe that training in MX formats exhibits sharp, stochastic instabilities in the loss, particularly at larger compute scales. To explain this phenomenon, we conduct controlled experiments and ablations on a smaller proxy model that exhibits similar behavior as the language model, sweeping across architectural settings, hyperparameters, and precision formats. These experiments motivate a simple model in which multiplicative gradient bias introduced by the quantization of layer-norm affine parameters and a small fraction of activations can trigger runaway divergence. Through \emph{in situ} intervention experiments on our proxy model, we demonstrate that instabilities can be averted or delayed by modifying precision schemes mid-training. Guided by these findings, we evaluate stabilization strategies in the LLM setting and show that certain hybrid configurations recover performance competitive with full-precision training. We release our code at https://github.com/Hither1/systems-scaling.
Chinese: 采用块缩放精度格式(如微缩放MX)训练大型语言模型会因量化参数引入梯度偏差,导致损失出现随机不稳定性,但通过混合精度策略可有效稳定训练,实现与全精度方法相媲美的性能。
English: Training large language models with block-scaled precision formats like Microscaling (MX) introduces stochastic instabilities in loss, primarily due to gradient bias from quantized parameters, but hybrid precision strategies can stabilize training and achieve performance comparable to full-precision methods.
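For intuition about what block scaling does, here is a toy fake-quantizer in PyTorch: each block of values shares one power-of-two scale, and elements are rounded onto a coarse grid. Real MX formats fix specific element datatypes (e.g., FP8/FP6/FP4), and this sketch ignores the gradient path entirely.

```python
import torch

def mx_fake_quantize(x, block=32, mbits=7):
    """Toy block-scaled quantization (assumes x.numel() % block == 0).
    A shared power-of-two scale per block extends representable range;
    rounding to `mbits` levels models the reduced element precision."""
    flat = x.reshape(-1, block)
    amax = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-30)
    scale = torch.exp2(torch.ceil(torch.log2(amax)))  # shared block scale
    levels = 2 ** mbits
    q = torch.round(flat / scale * levels).clamp(-levels, levels)
    return (q * scale / levels).reshape_as(x)
```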
Authors:Qin Ren, Yifan Wang, Ruogu Fang, Haibin Ling, Chenyu You
Abstract:
Survival prediction using whole slide images (WSIs) can be formulated as a multiple instance learning (MIL) problem. However, existing MIL methods often fail to explicitly capture pathological heterogeneity within WSIs, both globally, through long-tailed morphological distributions, and locally, through tile-level prediction uncertainty. Optimal transport (OT) provides a principled way of modeling such heterogeneity by incorporating marginal distribution constraints. Building on this insight, we propose OTSurv, a novel MIL framework from an optimal transport perspective. Specifically, OTSurv formulates survival prediction as a heterogeneity-aware OT problem with two constraints: (1) a global long-tail constraint that models prior morphological distributions to avert both mode collapse and excessive uniformity by regulating transport mass allocation, and (2) a local uncertainty-aware constraint that prioritizes high-confidence patches while suppressing noise by progressively raising the total transport mass. We then recast the initial OT problem, augmented by these constraints, into an unbalanced OT formulation that can be solved with an efficient, hardware-friendly matrix scaling algorithm. Empirically, OTSurv sets new state-of-the-art results across six popular benchmarks, achieving an absolute 3.6% improvement in average C-index. In addition, OTSurv achieves statistical significance in log-rank tests and offers high interpretability, making it a powerful tool for survival prediction in digital pathology. Our codes are available at https://github.com/Y-Research-SBU/OTSurv.
中文:OTSurv提出了一种基于最优传输的新型多示例学习框架,通过全局长尾约束和局部不确定性约束解决全切片图像中的病理异质性问题,在生存预测中实现了最先进的性能与高可解释性。
English: OTSurv introduces a novel multiple instance learning framework using optimal transport to address pathological heterogeneity in whole slide images, achieving state-of-the-art survival prediction with improved accuracy and interpretability.
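The "matrix scaling" solver family the abstract refers to is Sinkhorn-style iteration; the standard balanced entropic-OT version below conveys the mechanics, while OTSurv's unbalanced formulation relaxes the marginal updates to encode its two constraints.

```python
import torch

def sinkhorn(cost, a, b, eps=0.05, iters=100):
    """Entropic OT by alternately rescaling rows and columns of
    K = exp(-cost/eps) so the plan's marginals match a and b."""
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan
```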
Authors:Yiming Wang, Arthur N. Montanari, Adilson E. Motter
Abstract:
Nonlinear networks are often multistable, exhibiting coexisting stable states with competing regions of attraction (ROAs). As a result, ROAs can have complex "tentacle-like" morphologies that are challenging to characterize analytically or computationally. In addition, the high dimensionality of the state space prohibits the automated construction of Lyapunov functions using state-of-the-art optimization methods, such as sum-of-squares (SOS) programming. In this letter, we propose a distributed approach for the construction of Lyapunov functions based solely on local information. To this end, we establish an augmented comparison lemma that characterizes the existence conditions of partial Lyapunov functions, while also accounting for residual effects caused by the associated dimensionality reduction. These theoretical results allow us to formulate an SOS optimization that iteratively constructs such partial functions, whose aggregation forms a composite Lyapunov function. The resulting composite function provides accurate convex approximations of both the volumes and shapes of the ROAs. We validate our method on networks of van der Pol and Ising oscillators, demonstrating its effectiveness in characterizing high-dimensional systems with non-convex ROAs.
Chinese: 本研究提出了一种基于局部信息的分布式李雅普诺夫函数构造方法,通过平方和优化技术有效逼近高维非线性网络中具有复杂形态的吸引域。
English: This study introduces a distributed method for constructing composite Lyapunov functions using sum-of-squares optimization, enabling accurate approximation of complex, high-dimensional attraction regions in multistable nonlinear networks.
Authors:Hoa La, Ahan Gupta, Alex Morehead, Jianlin Cheng, Minjia Zhang
Abstract:
Protein structure prediction models such as AlphaFold3 (AF3) push the frontier of biomolecular modeling by incorporating science-informed architectural changes to the transformer architecture. However, these advances come at a steep system cost, introducing compute- and memory-intensive operators, 2D attention mechanisms, and retrieval-augmented data pipelines, which collectively hinder the scalability of AF3 training. In this work, we present MegaFold, a cross-platform system to accelerate AF3 training. MegaFold tackles key bottlenecks through ahead-of-time caching to eliminate GPU idle time from the retrieval-augmented data pipeline, Triton-based kernels for memory-efficient EvoAttention on heterogeneous devices, and deep fusion for common and critical small operators in AF3. Evaluation on both NVIDIA H200 and AMD MI250 GPUs shows that MegaFold reduces peak memory usage of AF3 training by up to 1.23$\times$ and improves per-iteration training time by up to 1.73$\times$ and 1.62$\times$, respectively. More importantly, MegaFold enables training on 1.35$\times$ longer sequence lengths compared to PyTorch baselines without running out-of-memory, significantly improving the scalability of modern protein folding models. We open source our code at https://github.com/Supercomputing-System-AI-Lab/MegaFold/.
中文:MegaFold是一个跨平台系统,通过优化数据管道、注意力机制和算子融合来加速AlphaFold3训练,在英伟达和AMD GPU上显著提升了内存效率和训练速度。
English: MegaFold is a cross-platform system that accelerates AlphaFold3 training by optimizing data pipelines, attention mechanisms, and operator fusion, achieving significant memory and speed improvements on both NVIDIA and AMD GPUs.
Authors:Alexander Selivanov, Philip Müller, Özgün Turgut, Nil Stolt-Ansó, Daniel Rückert
Abstract:
An electrocardiogram (ECG) is a widely used, cost-effective tool for detecting electrical abnormalities in the heart. However, it cannot directly measure functional parameters, such as ventricular volumes and ejection fraction, which are crucial for assessing cardiac function. Cardiac magnetic resonance (CMR) is the gold standard for these measurements, providing detailed structural and functional insights, but is expensive and less accessible. To bridge this gap, we propose PTACL (Patient and Temporal Alignment Contrastive Learning), a multimodal contrastive learning framework that enhances ECG representations by integrating spatio-temporal information from CMR. PTACL uses global patient-level contrastive loss and local temporal-level contrastive loss. The global loss aligns patient-level representations by pulling ECG and CMR embeddings from the same patient closer together, while pushing apart embeddings from different patients. Local loss enforces fine-grained temporal alignment within each patient by contrasting encoded ECG segments with corresponding encoded CMR frames. This approach enriches ECG representations with diagnostic information beyond electrical activity and transfers more insights between modalities than global alignment alone, all without introducing new learnable weights. We evaluate PTACL on paired ECG-CMR data from 27,951 subjects in the UK Biobank. Compared to baseline approaches, PTACL achieves better performance in two clinically relevant tasks: (1) retrieving patients with similar cardiac phenotypes and (2) predicting CMR-derived cardiac function parameters, such as ventricular volumes and ejection fraction. Our results highlight the potential of PTACL to enhance non-invasive cardiac diagnostics using ECG. The code is available at: https://github.com/alsalivan/ecgcmr
中文: PTACL是一种多模态对比学习框架,通过整合心脏磁共振的时空数据来增强心电图表征,无需额外参数即可提升心脏表型检索和功能参数预测的性能。
English: PTACL is a multimodal contrastive learning framework that enhances ECG representations by integrating spatio-temporal CMR data, enabling improved cardiac phenotype retrieval and function parameter prediction without additional parameters.
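The two-level objective reduces to the same symmetric InfoNCE applied at two granularities; the temperature, weighting, and exact pairing of segments to frames are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, tau=0.07):
    """Symmetric InfoNCE over two batches of aligned embeddings."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0), device=q.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def ptacl_loss(ecg_pat, cmr_pat, ecg_seg, cmr_frm, lam=0.5):
    """Global patient-level alignment plus local temporal alignment;
    ecg_seg/cmr_frm hold matched segment/frame embeddings."""
    return info_nce(ecg_pat, cmr_pat) + lam * info_nce(ecg_seg, cmr_frm)
```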
Authors:Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu
Abstract:
Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs, and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
中文:MMSearch-R1是一种创新的强化学习框架,它使大型多模态模型能够通过整合图像和文本工具进行高效、按需的搜索,在减少30%以上搜索调用的同时,性能超越了现有方法。
English: MMSearch-R1 is a novel reinforcement learning framework that enables large multimodal models to perform efficient, on-demand searches using integrated image and text tools, significantly reducing search calls while outperforming existing methods.
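An outcome-based reward with a search penalty is only a few lines; the judge function and penalty weight are illustrative assumptions.

```python
def outcome_reward(pred, gold, n_search_calls, penalty=0.1, judge=None):
    """Reward = answer correctness minus a per-call search cost, nudging
    the policy toward on-demand rather than habitual tool use."""
    correct = judge(pred, gold) if judge else pred.strip() == gold.strip()
    return float(correct) - penalty * n_search_calls
```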
Authors:Jacopo Dapueto, Vito Paolo Pastore, Nicoletta Noceti, Francesca Odone
Abstract:
Microscopy image analysis is fundamental for different applications, from diagnosis to synthetic engineering and environmental monitoring. Modern acquisition systems have granted the possibility to acquire an escalating amount of images, requiring a consequent development of a large collection of deep learning-based automatic image analysis methods. Although deep neural networks have demonstrated great performance in this field, interpretability, an essential requirement for microscopy image analysis, remains an open challenge.
This work proposes a Disentangled Representation Learning (DRL) methodology to enhance model interpretability for microscopy image classification. Exploiting benchmark datasets from three different microscopic image domains (plankton, yeast vacuoles, and human cells), we show how a DRL framework, based on transferring a representation learnt from synthetic data, can provide a good trade-off between accuracy and interpretability in this domain.
中文摘要:本研究提出了一种解耦表示学习方法,通过三个显微图像数据集验证了该方法能在保持分类精度的同时有效提升模型可解释性。
English Summary: This study introduces a Disentangled Representation Learning approach to improve interpretability in microscopy image classification, demonstrating through three datasets that it effectively balances accuracy with explainability.
Authors:Sijie Li, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang
Abstract:
Large language model-based machine learning (ML) agents have shown great promise in automating ML research. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to assess an agent's ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, a novel agent that excels at exchanging insights and developing novel solutions within a community context. CoMind achieves state-of-the-art performance on MLE-Live and outperforms 79.2% of human competitors on average across four ongoing Kaggle competitions. Our code is released at https://github.com/comind-ml/CoMind.
中文: MLE-Live是一个评估机器学习代理在模拟研究社区中协作能力的框架,基于该框架开发的CoMind新型代理在Kaggle竞赛中表现出色,平均超越79.2%的人类参赛者。
English: MLE-Live is a framework for evaluating ML agents' ability to collaborate with a simulated research community, and CoMind, a novel agent developed on this framework, demonstrates superior performance by outperforming most human competitors in Kaggle competitions.
Authors:Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang
Abstract:
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, \textbf{DiffuCoder}, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose \textbf{coupled-GRPO}, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4\% on EvalPlus) and reduces reliance on AR bias during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder.
中文: 扩散大语言模型(dLLMs)通过全局规划和迭代优化为代码生成提供了新途径,本研究提出了基于1300亿代码标记训练的7B模型DiffuCoder及耦合GRPO强化学习方法,显著提升代码生成性能并降低自回归依赖。
English: Diffusion large language models (dLLMs) offer a novel approach to code generation with global planning and iterative refinement, and this study introduces DiffuCoder, a 7B model trained on 130B code tokens, along with a coupled-GRPO reinforcement learning method that enhances performance and reduces autoregressive bias.
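The complementary-mask trick can be isolated in a short sketch: sample a random token mask together with its complement, so that across the pair every token's log-likelihood is estimated exactly once, reducing estimator variance. Its integration into the GRPO objective is omitted here.

```python
import torch

def complementary_masks(batch, seq_len, p=0.5, device="cpu"):
    """Return a Bernoulli(p) token mask and its complement; a completion
    scored under both gives each token exactly one unmasked evaluation."""
    m = torch.rand(batch, seq_len, device=device) < p
    return m, ~m
```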
Authors:Ji Qi, Xinchang Zhang, Dingqi Ye, Yongjia Ruan, Xin Guo, Shaowen Wang, Haifeng Li
Abstract:
The rapid advancement of generative artificial intelligence is producing fake remote sensing imagery (RSI) that is increasingly difficult to detect, potentially leading to erroneous intelligence, fake news, and even conspiracy theories. Existing forgery detection methods typically rely on single visual features to capture predefined artifacts, such as spatial-domain cues to detect forged objects like roads or buildings in RSI, or frequency-domain features to identify artifacts from up-sampling operations in generative adversarial networks (GANs). However, the nature of artifacts can significantly differ depending on geographic terrain, land cover types, or specific features within the RSI. Moreover, these complex artifacts evolve as generative models become more sophisticated. In short, over-reliance on a single visual cue makes existing forgery detectors struggle to generalize across diverse remote sensing data. This paper proposes a novel forgery detection framework called SFNet, designed to identify fake images in diverse remote sensing data by leveraging both spatial- and frequency-domain features. Specifically, to obtain rich and comprehensive visual information, SFNet employs two independent feature extractors to capture spatial- and frequency-domain features from input RSIs. To fully utilize the complementary domain features, the domain feature mapping module and the hybrid domain feature refinement module (CBAM attention) of SFNet are designed to successively align and fuse the multi-domain features while suppressing redundant information. Experiments on three datasets show that SFNet achieves an accuracy improvement of 4%-15.18% over state-of-the-art RS forgery detection methods and exhibits robust generalization capabilities. The code is available at https://github.com/GeoX-Lab/RSTI/tree/main/SFNet.
Chinese: 提出的SFNet框架通过融合空间和频率域特征,能有效检测多样化遥感图像中的伪造内容,在多个数据集上实现了显著精度提升并展现出强大泛化能力。
English: The proposed SFNet framework effectively detects fake remote sensing images by integrating spatial and frequency domain features, achieving significant accuracy improvements and robust generalization across diverse datasets.
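One common way to realize a frequency-domain branch of this kind (the paper's exact extractor may differ) is to feed log-magnitude FFT spectra to a CNN:

```python
import torch

def frequency_features(img):
    """Per-channel 2-D FFT, center-shifted and log-scaled; up-sampling
    artifacts often show up as periodic peaks in this representation."""
    spec = torch.fft.fft2(img, norm="ortho")
    return torch.log1p(torch.abs(torch.fft.fftshift(spec, dim=(-2, -1))))
```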
Authors:Zhonghao Shi, Enyu Zhao, Nathaniel Dennler, Jingzhen Wang, Xinyang Xu, Kaleen Shrestha, Mengxue Fu, Daniel Seita, Maja Matarić
Abstract:
Real-time human perception is crucial for effective human-robot interaction (HRI). Large vision-language models (VLMs) offer promising generalizable perceptual capabilities but often suffer from high latency, which negatively impacts user experience and limits VLM applicability in real-world scenarios. To systematically study VLM capabilities in human perception for HRI and performance-latency trade-offs, we introduce HRIBench, a visual question-answering (VQA) benchmark designed to evaluate VLMs across a diverse set of human perceptual tasks critical for HRI. HRIBench covers five key domains: (1) non-verbal cue understanding, (2) verbal instruction understanding, (3) human-robot object relationship understanding, (4) social navigation, and (5) person identification. To construct HRIBench, we collected data from real-world HRI environments to curate questions for non-verbal cue understanding, and leveraged publicly available datasets for the remaining four domains. We curated 200 VQA questions for each domain, resulting in a total of 1000 questions for HRIBench. We then conducted a comprehensive evaluation of both state-of-the-art closed-source and open-source VLMs (N=11) on HRIBench. Our results show that, despite their generalizability, current VLMs still struggle with core perceptual capabilities essential for HRI. Moreover, none of the models within our experiments demonstrated a satisfactory performance-latency trade-off suitable for real-time deployment, underscoring the need for future research on developing smaller, low-latency VLMs with improved human perception capabilities. HRIBench and our results can be found in this Github repository: https://github.com/interaction-lab/HRIBench.
中文: HRIBench是一个为评估人机交互中人类感知任务而设计的视觉问答基准,结果表明现有模型在感知能力和性能-延迟权衡方面均未达到实时部署的要求。
English: HRIBench is introduced as a VQA benchmark to evaluate VLMs on human perception tasks for HRI, revealing that current models lack both sufficient perceptual capabilities and satisfactory performance-latency trade-offs for real-time deployment.
Authors:Lei Zhu, Jun Zhou, Rick Siow Mong Goh, Yong Liu
Abstract:
Vision Transformers have recently gained tremendous popularity in medical image segmentation tasks due to their superior capability in capturing long-range dependencies. However, transformers require a large amount of labeled data to be effective, which hinders their applicability in the annotation-scarce semi-supervised learning scenario where only limited labeled data is available. State-of-the-art semi-supervised learning methods propose combinatorial CNN-Transformer learning to cross-teach a transformer with a convolutional neural network, which achieves promising results. However, it remains challenging to effectively train the transformer with limited labeled data. In this paper, we propose an adversarial masked image modeling method to fully unleash the potential of the transformer for semi-supervised medical image segmentation. The key challenge in semi-supervised learning with a transformer lies in the lack of a sufficient supervision signal. To this end, we propose to construct an auxiliary masked domain from the original domain with masked image modeling and train the transformer to predict the entire segmentation mask from masked inputs to increase the supervision signal. We leverage the original labels from labeled data and pseudo-labels from unlabeled data to learn the masked domain. To further benefit the original domain from the masked domain, we provide a theoretical analysis of our method from a multi-domain learning perspective and devise a novel adversarial training loss to reduce the domain gap between the original and masked domains, which boosts semi-supervised learning performance. We also extend adversarial masked image modeling to CNN networks. Extensive experiments on three public medical image segmentation datasets demonstrate the effectiveness of our method, which outperforms existing methods significantly. Our code is publicly available at https://github.com/zlheui/AdvMIM.
Chinese: 视觉Transformer在医学图像分割中表现出色,但在标注数据有限时效果不佳,因此本文提出一种对抗性掩码图像建模方法,通过增强监督信号和缩小域间差异,显著提升了半监督学习的性能。
English: Vision Transformer excels in medical image segmentation but struggles with limited labeled data, so this paper introduces an adversarial masked image modeling method to enhance supervision signals and bridge domain gaps, significantly improving semi-supervised learning performance.
Authors:Manyi Li, Renshuai Tao, Yufan Liu, Chuangchuang Tan, Haotong Qin, Bing Li, Yunchao Wei, Yao Zhao
Abstract:
With the rapid advancement of deep learning, particularly through generative adversarial networks (GANs) and diffusion models (DMs), AI-generated images, or ``deepfakes'', have become nearly indistinguishable from real ones. These images are widely shared across Online Social Networks (OSNs), raising concerns about their misuse. Existing deepfake detection methods overlook the ``block effects'' introduced by compression in OSNs, which obscure deepfake artifacts, and primarily focus on raw images, rarely encountered in real-world scenarios. To address these challenges, we propose PLADA (Pay Less Attention to Deceptive Artifacts), a novel framework designed to tackle the lack of paired data and the ineffective use of compressed images. PLADA consists of two core modules: Block Effect Eraser (B2E), which uses a dual-stage attention mechanism to handle block effects, and Open Data Aggregation (ODA), which processes both paired and unpaired data to improve detection. Extensive experiments across 26 datasets demonstrate that PLADA achieves a remarkable balance in deepfake detection, outperforming SoTA methods in detecting deepfakes on OSNs, even with limited paired data and compression. More importantly, this work introduces the ``block effect'' as a critical factor in deepfake detection, providing a robust solution for open-world scenarios. Our code is available at https://github.com/ManyiLee/PLADA.
中文:提出的PLADA框架通过消除块效应并利用配对与非配对数据,有效解决了在线社交网络中压缩图像深度伪造检测的难题,在多种数据集上实现了卓越性能。
English: The proposed PLADA framework effectively addresses deepfake detection challenges in compressed images from online social networks by eliminating block effects and leveraging both paired and unpaired data, achieving superior performance across diverse datasets.
Authors:Hongzhen Huang, Kunming Zhang, Hanlong Liao, Kui Wu, Guoming Tang
Abstract:
The rapid advancement of AI, particularly large language models (LLMs), has raised significant concerns about the energy use and carbon emissions associated with model training and inference. However, existing tools for measuring and reporting such impacts are often fragmented, lacking systematic metric integration and offering limited support for correlation analysis among them. This paper presents WattsOnAI, a comprehensive software toolkit for the measurement, analysis, and visualization of energy use, power draw, hardware performance, and carbon emissions across AI workloads. By seamlessly integrating with existing AI frameworks, WattsOnAI offers standardized reports and exports fine-grained time-series data to support benchmarking and reproducibility in a lightweight manner. It further enables in-depth correlation analysis between hardware metrics and model performance and thus facilitates bottleneck identification and performance enhancement. By addressing critical limitations in existing tools, WattsOnAI encourages the research community to weigh environmental impact alongside raw performance of AI workloads and advances the shift toward more sustainable "Green AI" practices. The code is available at https://github.com/SusCom-Lab/WattsOnAI.
中文: 本文介绍了WattsOnAI工具包,它通过集成测量、分析和可视化AI工作负载的能耗与碳排放,解决了现有工具零散的问题,并借助标准化报告和关联分析推动可持续的绿色AI实践。
English: This paper introduces WattsOnAI, a comprehensive toolkit that addresses the fragmentation in existing tools by providing integrated measurement, analysis, and visualization of energy use and carbon emissions in AI workloads, promoting sustainable practices through standardized reporting and correlation analysis.
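The kind of fine-grained power time series such a toolkit reports can be collected with NVML; the polling loop below uses the real pynvml API, while the sampling interval and rectangle-rule energy integration are choices of this sketch.

```python
import time
import pynvml

def sample_power(duration_s=10.0, interval_s=0.1, gpu=0):
    """Poll instantaneous GPU power draw and integrate to energy."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu)
    watts = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        watts.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return watts, sum(watts) * interval_s   # series, energy in joules
```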
Authors:Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Abstract:
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation stems from reliance on outdated API knowledge in their training data and persists even with access to current documentation, impeding reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode to various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
中文摘要:ReCode是一种新颖的强化学习框架,通过基于版本迁移数据训练大语言模型,显著提升其在动态API环境中的代码生成可靠性,同时对其通用编程能力影响较小。
English Summary: ReCode is a novel reinforcement learning framework that significantly enhances large language models' ability to generate reliable code in dynamic API environments by training them on version migration data while minimizing impact on their general coding capabilities.
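The paper's reward is a modified string-similarity metric; as a stand-in, difflib's ratio shows the shape of such a reward signal.

```python
import difflib

def code_reward(generated: str, reference: str) -> float:
    """Similarity in [0, 1] between migrated code and the reference
    implementation, usable directly as an RL reward."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()
```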
Authors:Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Huijiang Wang, Jiayu Chen, Xin Jin, Bo Li, Hua Chen, Wei Zhang, Wenjun Zeng
Abstract:
Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pre-training to learn reusable primitive skills and broad behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and long-term list of BFM papers and projects to facilitate more subsequent research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.
中文: 人形机器人在全身控制上面临挑战,但行为基础模型(BFMs)通过预训练可复用技能实现广泛任务适应,本文综述了其发展、应用与前景,并提供了相关资源以推动后续研究。
English: Humanoid robots face challenges in whole-body control, but behavior foundation models (BFMs) offer a promising solution by enabling adaptable, pre-trained skills for diverse tasks, as outlined in this comprehensive review of their development, applications, and future potential.
Authors:Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping
Abstract:
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model's abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models, and (3) Layer merging. Our experiments demonstrate that this approach leads to competitive model pruning, for example, for the Llama2-13B model families, our compressed models maintain approximately 97.3\% of the original performance while removing $\sim25\%$ of parameters, significantly outperforming previous state-of-the-art methods. The code is available at https://github.com/Guinan-Su/auto-merge-llm.
中文: 本研究提出了一种新颖的大语言模型分层压缩策略,通过分层移除、选择和合并的组合操作,在减少约25%参数的同时保持约97.3%的原始性能,显著优于现有方法。
English: This study introduces a novel layer-based compression strategy for large language models that combines layer removal, selection, and merging to reduce model size by approximately 25% while retaining about 97.3% of original performance, significantly outperforming existing methods.
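The three search-space operations act on aligned per-layer weights of the finetuned variants; the plan encoding below is an illustrative assumption, and the zero-order search that proposes plans is omitted.

```python
import copy

def apply_plan(candidates, plan):
    """Build a compressed layer stack from finetuned variants.
    candidates[m][d]: state dict of model m's layer at depth d.
    plan[d] is ("drop",), ("pick", model_idx), or ("merge", weights)."""
    layers = []
    for depth, op in enumerate(plan):
        if op[0] == "drop":
            continue                                  # layer removal
        if op[0] == "pick":                           # layer selection
            layers.append(copy.deepcopy(candidates[op[1]][depth]))
        else:                                         # layer merging
            w = op[1]
            layers.append({k: sum(wi * c[depth][k]
                                  for wi, c in zip(w, candidates))
                           for k in candidates[0][depth]})
    return layers
```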
Authors:Tianyao Shi, Ritbik Kumar, Inez Hua, Yi Ding
Abstract:
Biodiversity loss is a critical planetary boundary, yet its connection to computing remains largely unexamined. Prior sustainability efforts in computing have focused on carbon and water, overlooking biodiversity due to the lack of appropriate metrics and modeling frameworks. This paper presents the first end-to-end analysis of biodiversity impact from computing systems. We introduce two new metrics--Embodied Biodiversity Index (EBI) and Operational Biodiversity Index (OBI)--to quantify biodiversity impact across the lifecycle, and present FABRIC, a modeling framework that links computing workloads to biodiversity impacts. Our evaluation highlights the need to consider biodiversity alongside carbon and water in sustainable computing design and optimization. The code is available at https://github.com/TianyaoShi/FABRIC.
中文摘要:本文首次对计算系统的生物多样性影响进行全面分析,提出了两个新指标和FABRIC建模框架,用以量化计算工作负载在整个生命周期中对生物多样性的影响。
English Summary: This paper introduces the first comprehensive analysis of computing's biodiversity impact, proposing two novel metrics and the FABRIC modeling framework to quantify and link computing workloads with biodiversity effects throughout their lifecycle.
Authors:Fangyijie Wang, Yuan Liang, Sourav Bhattacharjee, Abey Campbell, Kathleen M. Curran, Guénolé Silvestre
Abstract:
Accurate gestational age (GA) estimation, ideally through fetal ultrasound measurement, is a crucial aspect of providing excellent antenatal care. However, deriving GA from manual fetal biometric measurements depends on the operator and is time-consuming. Hence, automatic computer-assisted methods are demanded in clinical practice. In this paper, we present a novel feature fusion framework to estimate GA using fetal ultrasound images without any measurement information. We adopt a deep learning model to extract deep representations from ultrasound images. We extract radiomic features to reveal patterns and characteristics of fetal brain growth. To harness the interpretability of radiomics in medical imaging analysis, we estimate GA by fusing radiomic features and deep representations. Our framework estimates GA with a mean absolute error of 8.0 days across three trimesters, outperforming current machine learning-based methods at these gestational ages. Experimental results demonstrate the robustness of our framework across different populations in diverse geographical regions. Our code is publicly available at \href{https://github.com/13204942/RadiomicsImageFusion_FetalUS}{GitHub}.
中文摘要:本文提出了一种新颖的特征融合框架,通过结合胎儿超声图像的深度学习表征和影像组学特征来准确估算孕周,在不同地理区域的人群中实现了仅8.0天的平均绝对误差,性能优于现有方法。
English summary: This paper introduces a novel feature fusion framework that combines deep learning representations with radiomic features from fetal ultrasound images to accurately estimate gestational age, achieving superior performance with a mean absolute error of 8.0 days across diverse populations.
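Late fusion by concatenation is the simplest reading of such a framework; the regressor below is an assumption of this sketch, not the paper's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fuse_and_fit(deep_feats, radiomic_feats, ga_days):
    """Concatenate deep image representations with interpretable
    radiomic features and regress gestational age in days."""
    X = np.concatenate([deep_feats, radiomic_feats], axis=1)
    return RandomForestRegressor(n_estimators=300, random_state=0).fit(X, ga_days)
```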
Authors:Francesco Carzaniga, Michael Hersche, Abu Sebastian, Kaspar Schindler, Abbas Rahimi
Abstract:
Learning from multi-variate time-series with heterogeneous channel configurations remains a fundamental challenge for deep neural networks, particularly in clinical domains such as intracranial electroencephalography (iEEG), where channel setups vary widely across subjects. In this work, we introduce multi-variate parallel attention (MVPA), a novel self-attention mechanism that disentangles content, temporal, and spatial attention, enabling flexible, generalizable, and efficient modeling of time-series data with varying channel counts and configurations. We use MVPA to build MVPFormer, a generative foundation model for human electrophysiology, trained to predict the evolution of iEEG signals across diverse subjects. To support this and future efforts by the community, we release the SWEC iEEG dataset, the largest publicly available iEEG dataset to date, comprising nearly 10,000 hours of recordings from heterogeneous clinical sources. MVPFormer leverages MVPA to achieve strong generalization across subjects, demonstrating expert-level performance in several iEEG tasks. MVPFormer surpasses state-of-the-art Transformer baselines in seizure detection across the SWEC, the MAYO, and the FNUSA datasets, while also achieving state-of-the-art performance on four Brain TreeBank iEEG decoding tasks. We further validate MVPA on standard time-series forecasting and classification tasks, where it matches or exceeds the performance of existing attention-based models. Together, our contributions establish MVPA as a general-purpose attention mechanism for heterogeneous time-series and MVPFormer as the first open-source, open-weights, and open-data iEEG foundation model with SOTA clinical performance. The code is available at https://github.com/IBM/multi-variate-parallel-transformer. The SWEC iEEG dataset is available at https://huggingface.co/datasets/NeuroTec/SWEC_iEEG_Dataset.
中文: 本研究提出多元并行注意力机制和MVPFormer模型,能有效处理异构时间序列数据如颅内脑电图,在多项临床任务和数据集上实现顶尖性能,并发布了最大的公开iEEG数据集以推动相关研究。
English: The study introduces a multi-variate parallel attention (MVPA) mechanism and MVPFormer model, which effectively handle heterogeneous time-series data like iEEG, achieving state-of-the-art performance across multiple clinical tasks and datasets while releasing the largest public iEEG dataset to support further research.
Authors:Andrej Lúčny, Matilde Antonj, Carlo Mazzola, Hana Hornáčková, Igor Farkaš
Abstract:
We introduce a neural network approach for generating and customizing the trajectory of a robotic arm that guarantees precision and repeatability. To highlight the potential of this novel method, we describe the design and implementation of the technique and show its application in an experimental setting of cognitive robotics. In this scenario, the NICO robot pointed to specific points in space with precise linear movements, increasing the predictability of the robotic action during its interaction with humans. To achieve this goal, a neural network computes the forward kinematics of the robot arm. By integrating it with a generator of joint angles, another neural network was developed and trained on an artificial dataset created from suitable start and end poses of the robotic arm. Through the computation of angular velocities, the robot performed the movement, and the quality of its action was evaluated in terms of shape and accuracy. Thanks to its broad applicability, our approach successfully generates precise trajectories that can be customized in shape and adapted to different settings.
中文摘要:本研究提出一种基于神经网络的机械臂轨迹生成与定制方法,通过认知机器人实验验证了NICO机器人能够执行精确的线性指向动作,展现出良好的轨迹精度和适应性。
English Summary: This study presents a neural network-based method for generating and customizing precise robotic arm trajectories, validated through cognitive robotics experiments where the NICO robot demonstrated accurate pointing movements.
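A minimal sketch of the forward-kinematics network described, with layer sizes and activations as assumptions:

```python
import torch.nn as nn

class ForwardKinematicsNet(nn.Module):
    """Maps joint angles to the end-effector position; trained on pairs
    sampled from the arm's simulated/analytic forward kinematics."""
    def __init__(self, n_joints=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints, 128), nn.Tanh(),
            nn.Linear(128, 128), nn.Tanh(),
            nn.Linear(128, 3),   # x, y, z of the end effector
        )

    def forward(self, angles):
        return self.net(angles)
```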
Authors:Kun Yuan, Tingxuan Chen, Shi Li, Joel L. Lavanchy, Christian Heiliger, Ege Özsoy, Yiming Huang, Long Bai, Nassir Navab, Vinkle Srivastav, Hongliang Ren, Nicolas Padoy
Abstract:
The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data. Code is available at https://github.com/CAMMA-public/SPA
中文: SPA框架通过少量样本的空间对齐、时序一致性建模和动态测试时自适应,以轻量级方式提升手术基础模型在跨机构环境中的性能,实现了仅需少量标注即可达到最优的手术阶段识别效果。
English: The SPA framework introduces a lightweight adaptation method that enhances surgical foundation models' performance in cross-institutional settings through few-shot spatial alignment, temporal consistency modeling, and dynamic test-time adaptation, achieving state-of-the-art phase recognition with minimal annotations.
Authors:Kejia Chen, Jiawen Zhang, Jiacong Hu, Yu Wang, Jian Lou, Zunlei Feng, Mingli Song
Abstract:
Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. However, emerging studies on a few calibration dataset-free quantization methods suggest that quantization may compromise the safety capabilities of LLMs, underscoring the urgent need for systematic safety evaluations and effective mitigation strategies. In this paper, we present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets, utilizing widely accepted safety benchmarks. To address the identified safety vulnerabilities, we propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs while minimizing any adverse impact on utility. Extensive experimental results demonstrate that Q-resafe successfully re-aligns the safety of quantized LLMs with their pre-quantization counterparts, even under challenging evaluation scenarios. Project page is available at: https://github.com/Thecommonirin/Qresafe.
中文: 本文对量化大语言模型进行了全面的安全评估,并提出Q-resafe这一量化感知安全补丁框架,能在不影响实用性的前提下有效恢复模型的安全防护能力。
English: This paper introduces comprehensive safety evaluations of quantized large language models (LLMs) and proposes Q-resafe, a quantization-aware safety patching framework that effectively restores safety capabilities without compromising utility.
Authors:Siqiao Li, Chen Hui, Wei Zhang, Rui Liang, Chenyue Song, Feng Jiang, Haiqi Zhu, Zhixuan Li, Hong Huang, Xiang Li
Abstract:
Positron Emission Tomography / Computed Tomography (PET/CT) plays a critical role in medical imaging, combining functional and anatomical information to aid in accurate diagnosis. However, image quality degradation due to noise, compression and other factors could potentially lead to diagnostic uncertainty and increase the risk of misdiagnosis. When evaluating the quality of a PET/CT image, both low-level features like distortions and high-level features like organ anatomical structures affect the diagnostic value of the image. However, existing medical image quality assessment (IQA) methods are unable to account for both feature types simultaneously. In this work, we propose MS-IQA, a novel multi-scale feature fusion network for PET/CT IQA, which utilizes multi-scale features from various intermediate layers of ResNet and Swin Transformer, enhancing its ability of perceiving both local and global information. In addition, a multi-scale feature fusion module is also introduced to effectively combine high-level and low-level information through a dynamically weighted channel attention mechanism. Finally, to fill the blank of PET/CT IQA dataset, we construct PET-CT-IQA-DS, a dataset containing 2,700 varying-quality PET/CT images with quality scores assigned by radiologists. Experiments on our dataset and the publicly available LDCTIQAC2023 dataset demonstrate that our proposed model has achieved superior performance against existing state-of-the-art methods in various IQA metrics. This work provides an accurate and efficient IQA method for PET/CT. Our code and dataset are available at https://github.com/MS-IQA/MS-IQA/.
中文:本研究提出的MS-IQA模型通过动态融合多尺度特征,有效提升了PET/CT图像质量评估能力,在新构建的数据集上展现出优越性能。
English: The proposed MS-IQA model effectively enhances PET/CT image quality assessment by integrating multi-scale features through a dynamic fusion mechanism, demonstrating superior performance on a newly constructed dataset.
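A squeeze-and-excitation-style sketch of dynamically weighted channel attention for fusing the two branches; the shapes and gating design are illustrative rather than MS-IQA's exact module.

```python
import torch
import torch.nn as nn

class WeightedChannelFusion(nn.Module):
    """Concatenate low-level (ResNet) and high-level (Swin) feature maps,
    then reweight channels with a learned, input-dependent gate."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(),
            nn.Linear(2 * channels // reduction, 2 * channels), nn.Sigmoid(),
        )

    def forward(self, low, high):            # both (B, C, H, W)
        x = torch.cat([low, high], dim=1)    # (B, 2C, H, W)
        w = self.gate(x)[..., None, None]    # per-channel weights
        return x * w
```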
Authors:Deepak Ghimire, Kilho Lee, Seong-heum Kim
Abstract:
Structured pruning is a well-established technique for compressing neural networks, making it suitable for deployment in resource-limited edge devices. This paper presents an efficient Loss-Aware Automatic Selection of Structured Pruning Criteria (LAASP) for slimming and accelerating deep neural networks. The majority of pruning methodologies employ a sequential process consisting of three stages: 1) training, 2) pruning, and 3) fine-tuning, whereas the proposed pruning technique adopts a pruning-while-training approach that eliminates the first stage and integrates the second and third stages into a single cycle. The automatic selection of magnitude or similarity-based filter pruning criteria from a specified pool of criteria and the specific pruning layer at each pruning iteration is guided by the network's overall loss on a small subset of the training data. To mitigate the abrupt accuracy drop due to pruning, the network is retrained briefly after each reduction of a predefined number of floating-point operations (FLOPs). The optimal pruning rates for each layer in the network are automatically determined, eliminating the need for manual allocation of fixed or variable pruning rates for each layer. Experiments on the VGGNet and ResNet models on the CIFAR-10 and ImageNet benchmark datasets demonstrate the effectiveness of the proposed method. In particular, the ResNet56 and ResNet110 models on the CIFAR-10 dataset significantly improve the top-1 accuracy compared to state-of-the-art methods while reducing the network FLOPs by 52%. Furthermore, the ResNet50 model on the ImageNet dataset reduces FLOPs by more than 42% with a negligible 0.33% drop in top-5 accuracy. The source code of this paper is publicly available online - https://github.com/ghimiredhikura/laasp.
Chinese: 本文提出LAASP,一种高效的损失感知自动结构化剪枝方法,通过将剪枝与训练相结合来精简和加速深度神经网络,在基准数据集上实现了显著的计算量减少且精度损失极小。
English: The paper introduces LAASP, an efficient loss-aware automatic structured pruning method that integrates pruning with training to slim and accelerate deep neural networks, achieving significant FLOPs reduction with minimal accuracy loss on benchmark datasets.
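A minimal sketch of LAASP's loss-guided criterion selection, assuming a two-criterion pool (L1 magnitude and filter similarity) and zeroing filters as a stand-in for true structured removal; the function names and the small calibration loader are illustrative, not the authors' API.

import torch
import torch.nn.functional as F

def l1_scores(conv):
    # magnitude criterion: L1 norm of each output filter
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def similarity_scores(conv):
    # similarity criterion: filters most similar to the rest are most prunable
    w = F.normalize(conv.weight.detach().flatten(1), dim=1)
    return -(w @ w.t()).sum(dim=1)

def subset_loss(model, loader):
    model.eval()
    with torch.no_grad():
        return sum(F.cross_entropy(model(x), y).item() for x, y in loader)

def prune_one_step(model, convs, loader, frac=0.05):
    # try every (criterion, layer) pair; commit the tentative prune that
    # hurts the loss on the small calibration subset the least
    best = (None, None, float("inf"))
    for score_fn in (l1_scores, similarity_scores):
        for conv in convs:
            k = max(1, int(frac * conv.out_channels))
            idx = score_fn(conv).argsort()[:k]
            saved = conv.weight.data[idx].clone()
            conv.weight.data[idx] = 0.0  # tentative prune
            loss = subset_loss(model, loader)
            conv.weight.data[idx] = saved  # undo
            if loss < best[2]:
                best = (conv, idx, loss)
    conv, idx, _ = best
    conv.weight.data[idx] = 0.0  # zeroing stands in for removing the filters
    # ...brief fine-tuning would follow each FLOPs-budget reduction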
Authors:Haipeng Fan, Shiyuan Zhang, Baohunesitu, Zihang Guo, Huaiwen Zhang
Abstract:
Autoregressive (AR) models have achieved unified and strong performance across both visual understanding and image generation tasks. However, removing undesired concepts from AR models while maintaining overall generation quality remains an open challenge. In this paper, we propose Erasure Autoregressive Model (EAR), a fine-tuning method for effective and utility-preserving concept erasure in AR models. Specifically, we introduce a Windowed Gradient Accumulation (WGA) strategy to align patch-level decoding with erasure objectives, and a Thresholded Loss Masking (TLM) strategy to protect content unrelated to the target concept during fine-tuning. Furthermore, we propose a novel benchmark, Erase Concept Generator and Visual Filter (ECGVF), aimed at providing a more rigorous and comprehensive foundation for evaluating concept erasure in AR models. To construct it, we first employ structured templates across diverse large language models (LLMs) to pre-generate a large-scale corpus of target-replacement concept prompt pairs. Subsequently, we generate images from these prompts and subject them to rigorous filtering via a visual classifier to ensure concept fidelity and alignment. Extensive experimental results conducted on the ECGVF benchmark with the AR model Janus-Pro demonstrate that EAR achieves marked improvements in both erasure effectiveness and model utility preservation. Code is available at: https://github.com/immc-lab/ear/
Chinese: 本文提出擦除自回归模型(EAR),一种通过窗口梯度累积和阈值损失掩码等策略,在自回归模型中有效消除指定概念同时保持生成质量的微调方法,并在新基准测试中验证了其显著提升。
English: This paper introduces the Erasure Autoregressive Model (EAR), a fine-tuning method that effectively removes unwanted concepts from autoregressive models while preserving generation quality through novel strategies like Windowed Gradient Accumulation and Thresholded Loss Masking, validated on a new benchmark showing significant improvements.
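One plausible reading of the two fine-tuning strategies, sketched under assumptions: the threshold tau, the per-window stepping scheme, and both function names are illustrative rather than the paper's exact formulation.

import torch

def thresholded_loss_masking(per_token_loss, tau):
    # TLM (as read here): only tokens whose loss exceeds tau contribute
    # gradient, shielding content presumed unrelated to the erased concept
    mask = (per_token_loss > tau).float()
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)

def windowed_gradient_accumulation(window_loss_fn, n_windows, optimizer):
    # WGA (as read here): accumulate and step once per window of patch-level
    # decoding positions instead of once per full image
    for w in range(n_windows):
        optimizer.zero_grad()
        window_loss_fn(w).backward()  # forward over the patches in window w
        optimizer.step()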
Authors:Songsoo Kim, Seungtae Lee, See Young Lee, Joonho Kim, Keechan Kan, Dukyong Yoon
Abstract:
Background: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. Purpose: To assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. Materials and Methods: A retrospective analysis was performed on 1,000 consecutive radiology reports (250 each: radiography, ultrasonography, CT, MRI) from the MIMIC-III database. Two external datasets (CheXpert and Open-i) were validation sets. Three LLM frameworks were tested: (1) single-prompt detector; (2) extractor plus detector; and (3) extractor, detector, and false-positive verifier. Precision was measured by PPV and absolute true positive rate (aTPR). Efficiency was calculated from model inference charges and reviewer remuneration. Statistical significance was tested using cluster bootstrap, exact McNemar tests, and Holm-Bonferroni correction. Results: Framework PPV increased from 0.063 (95% CI, 0.036-0.101, Framework 1) to 0.079 (0.049-0.118, Framework 2), and significantly to 0.159 (0.090-0.252, Framework 3; P<.001 vs. baselines). aTPR remained stable (0.012-0.014; P>=.84). Operational costs per 1,000 reports dropped to USD 5.58 (Framework 3) from USD 9.72 (Framework 1) and USD 6.85 (Framework 2), reflecting reductions of 42.6% and 18.5%, respectively. Human-reviewed reports decreased from 192 to 88. External validation supported Framework 3's superior PPV (CheXpert 0.133, Open-i 0.105) and stable aTPR (0.007). Conclusion: A three-pass LLM framework significantly enhanced PPV and reduced operational costs, maintaining detection performance, providing an effective strategy for AI-assisted radiology report quality assurance.
中文:三阶段大语言模型框架显著提高了放射学报告校对中的阳性预测值并降低了运营成本,同时保持稳定的检测性能,为人工智能辅助的质量保证提供了有效策略。
English: A three-pass LLM framework significantly improves the positive predictive value and reduces operational costs for radiology report proofreading while maintaining stable detection performance, offering an effective AI-assisted quality assurance strategy.
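A sketch of how the three passes compose, with `llm` standing in for any text-in/text-out model call; the prompts here are illustrative, not the study's.

def three_pass_proofread(report, llm):
    # Pass 1 (extractor): pull out discrete factual statements
    facts = llm("List each factual statement in this radiology report:\n" + report)
    # Pass 2 (detector): flag candidate errors among the statements
    flags = llm("Report:\n" + report + "\nStatements:\n" + facts +
                "\nFlag statements that are internally inconsistent or erroneous.")
    # Pass 3 (verifier): re-check each flag to discard false positives,
    # which is what lifts PPV and shrinks the human review queue
    return llm("Report:\n" + report + "\nFlags:\n" + flags +
               "\nKeep only flags that are genuine errors; drop the rest.")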
Authors:Jiahui Wu, Tiecheng Sun, Fucai Luo, Haiyan Wang, Weizhe Zhang
Abstract:
Multi-Key Homomorphic Encryption (MKHE), proposed by Lopez-Alt et al. (STOC 2012), allows for performing arithmetic computations directly on ciphertexts encrypted under distinct keys. Subsequent works by Chen and Dai et al. (CCS 2019) and Kim and Song et al. (CCS 2023) extended this concept by proposing multi-key BFV/CKKS variants, referred to as the CDKS scheme. These variants incorporate asymptotically optimal techniques to facilitate secure computation across multiple data providers. In this paper, we identify a critical security vulnerability in the CDKS scheme when applied to multiparty secure computation tasks, such as privacy-preserving federated learning (PPFL). In particular, we show that CDKS may inadvertently leak plaintext information from one party to others. To mitigate this issue, we propose a new scheme, SMHE (Secure Multi-Key Homomorphic Encryption), which incorporates a novel masking mechanism into the multi-key BFV and CKKS frameworks to ensure that plaintexts remain confidential throughout the computation. We implement a PPFL application using SMHE and demonstrate that it provides significantly improved security with only a modest overhead in homomorphic evaluation. For instance, our PPFL model based on multi-key CKKS incurs less than a $2\times$ increase in runtime and communication traffic compared to the CDKS-based PPFL model. The code is publicly available at https://github.com/JiahuiWu2022/SMHE.git.
Chinese: CDKS多密钥同态加密方案被发现存在泄露各方明文的漏洞,因此提出了SMHE方案,通过引入掩蔽机制在保证安全的同时仅带来轻微性能损耗。
English: The CDKS multi-key homomorphic encryption scheme is found to have a security flaw that leaks plaintexts between parties, prompting the development of SMHE with a masking mechanism that enhances security with minimal performance overhead.
Authors:Hsiang-Wei Huang, Wenhao Chai, Kuang-Ming Chen, Cheng-Yen Yang, Jenq-Neng Hwang
Abstract:
Token merging has emerged as an effective strategy to accelerate Vision Transformers (ViT) by reducing computational costs. However, existing methods primarily rely on the visual token's feature similarity for token merging, overlooking the potential of integrating spatial information, which can serve as a reliable criterion for token merging in the early layers of ViT, where the visual tokens only possess weak visual information. In this paper, we propose ToSA, a novel token merging method that combines both semantic and spatial awareness to guide the token merging process. ToSA leverages the depth image as input to generate pseudo spatial tokens, which serve as auxiliary spatial information for the visual token merging process. With the introduced spatial awareness, ToSA achieves a more informed merging strategy that better preserves critical scene structure. Experimental results demonstrate that ToSA outperforms previous token merging methods across multiple benchmarks on visual and embodied question answering while largely reducing the runtime of the ViT, making it an efficient solution for ViT acceleration. The code will be available at: https://github.com/hsiangwei0903/ToSA
Chinese Summary: ToSA是一种新颖的令牌合并方法,通过结合语义和空间感知来指导令牌合并过程,在保持关键场景结构的同时显著提升视觉Transformer的运行效率,并在多个基准测试中优于现有方法。
English Summary: ToSA is a novel token merging method that enhances Vision Transformer acceleration by integrating both semantic and spatial awareness, achieving superior performance on visual tasks while significantly reducing computational runtime.
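A toy, ToMe-style version of spatially aware merging: pairs are ranked by the similarity of concatenated semantic and (pseudo) spatial features, and the r most similar pairs are averaged. The bipartite split and the averaging rule are simplifying assumptions; ToSA's matching and its depth-derived pseudo spatial tokens are more involved.

import torch
import torch.nn.functional as F

def tosa_style_merge(tokens, spatial, r):
    # tokens: (n, d) semantic tokens; spatial: (n, ds) spatial cues; n even
    feats = F.normalize(torch.cat([tokens, spatial], dim=-1), dim=-1)
    fa, fb = feats[::2], feats[1::2]           # bipartite split of the sequence
    ta, tb = tokens[::2], tokens[1::2].clone()
    sim, partner = (fa @ fb.t()).max(dim=1)    # best partner in B for each A token
    order = sim.argsort(descending=True)
    for i in order[:r]:                        # fold the r most similar A tokens into B
        tb[partner[i]] = (tb[partner[i]] + ta[i]) / 2
    return torch.cat([ta[order[r:]], tb], dim=0)

# e.g. tosa_style_merge(torch.randn(196, 384), torch.randn(196, 32), 32)
# returns 164 tokens, the similarity having used both semantic and spatial cues.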
Authors:Hirad Daneshvar, Reza Samavi
Abstract:
Graph Neural Networks (GNNs) have shown remarkable performance in the healthcare domain. However, what remains challenging is quantifying the predictive uncertainty of GNNs, which is an important aspect of trustworthiness in clinical settings. While Bayesian and ensemble methods can be used to quantify uncertainty, they are computationally expensive. Additionally, the disagreement metric used by ensemble methods to compute uncertainty cannot capture the diversity of models in an ensemble network. In this paper, we propose a novel method, based on knowledge distillation, to quantify GNNs' uncertainty more efficiently and with higher precision. We apply self-distillation, where the same network serves as both the teacher and student models, thereby avoiding the need to train several networks independently. To ensure the impact of self-distillation, we develop an uncertainty metric that captures the diverse nature of the network by assigning different weights to each GNN classifier. We experimentally evaluate the precision, performance, and ability of our approach in distinguishing out-of-distribution data on two graph datasets: MIMIC-IV and Enzymes. The evaluation results demonstrate that the proposed method can effectively capture the predictive uncertainty of the model while having performance similar to that of the MC Dropout and ensemble methods. The code is publicly available at https://github.com/tailabTMU/UQ_GNN.
中文: 本文提出了一种基于知识蒸馏的新方法,通过自蒸馏和加权不确定性指标来高效量化图神经网络的预测不确定性,在医疗数据集上取得了与现有方法相当的性能。
English: This paper introduces a knowledge distillation-based method that efficiently quantifies Graph Neural Networks' predictive uncertainty using self-distillation and a weighted uncertainty metric, achieving comparable performance to existing methods on healthcare datasets.
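A sketch of the kind of weighted disagreement metric described, assuming softmax outputs from k classifier heads and externally supplied head weights (the paper learns them):

import torch

def weighted_disagreement(probs, weights, eps=1e-12):
    # probs: (k, c) softmax outputs of k classifier heads for one sample;
    # weights: (k,) per-head weights. Uncertainty is the weighted
    # divergence of each head from the weighted consensus distribution.
    w = weights / weights.sum()
    consensus = (w[:, None] * probs).sum(dim=0)
    kl = (probs * (probs.clamp_min(eps).log()
                   - consensus.clamp_min(eps).log())).sum(dim=1)
    return (w * kl).sum()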
Authors:Salva Rühling Cachay, Miika Aittala, Karsten Kreis, Noah Brenowitz, Arash Vahdat, Morteza Mardani, Rose Yu
Abstract:
Diffusion models are a powerful tool for probabilistic forecasting, yet most applications in high-dimensional chaotic systems predict future snapshots one-by-one. This common approach struggles to model complex temporal dependencies and fails to explicitly account for the progressive growth of uncertainty inherent to such systems. While rolling diffusion frameworks, which apply increasing noise to forecasts at longer lead times, have been proposed to address this, their integration with state-of-the-art, high-fidelity diffusion techniques remains a significant challenge. We tackle this problem by introducing Elucidated Rolling Diffusion Models (ERDM), the first framework to successfully unify a rolling forecast structure with the principled, performant design of Elucidated Diffusion Models (EDM). To do this, we adapt the core EDM components (its noise schedule, network preconditioning, and Heun sampler) to the rolling forecast setting. The success of this integration is driven by three key contributions: (i) a novel loss weighting scheme that focuses model capacity on the mid-range forecast horizons where determinism gives way to stochasticity; (ii) an efficient initialization strategy using a pre-trained EDM for the initial window; and (iii) a bespoke hybrid sequence architecture for robust spatiotemporal feature extraction under progressive denoising. On 2D Navier-Stokes simulations and ERA5 global weather forecasting at $1.5^\circ$ resolution, ERDM consistently outperforms key diffusion-based baselines, including conditional autoregressive EDM. ERDM offers a flexible and powerful general framework for tackling diffusion-based sequence generation problems where modeling escalating uncertainty is paramount. Code is available at: https://github.com/salvaRC/erdm
Chinese: 作者提出了Elucidated Rolling Diffusion Models (ERDM)框架,通过将滚动预测结构与精细化扩散模型相结合,有效解决了混沌系统中不确定性递增的建模难题,在流体模拟和全球天气预报任务中均优于现有基线方法。
English: The authors introduce Elucidated Rolling Diffusion Models (ERDM), a novel framework that integrates rolling forecast mechanisms with advanced diffusion techniques to better model escalating uncertainty in chaotic systems, outperforming existing methods in both fluid dynamics simulations and global weather forecasting.
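A sketch of the rolling idea in isolation: EDM-style noise levels interpolated in sigma^(1/rho) space and assigned across the window so that noise grows with lead time. The constants and the linear spacing are illustrative, not the paper's schedule.

import torch

def rolling_sigmas(window, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # nearest frame is almost clean, farthest is almost pure noise
    t = torch.linspace(0.0, 1.0, window)
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return (lo + t * (hi - lo)) ** rho

# rolling_sigmas(8) yields monotonically increasing noise levels for the
# eight frames currently inside the rolling window.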
Authors:Haochen Zhang, Tianyi Zhang, Junze Yin, Oren Gal, Anshumali Shrivastava, Vladimir Braverman
Abstract:
Recommender systems play a pivotal role in providing relevant content to users. With the rapid development of large language models (LLMs), researchers have begun utilizing LLMs to build more powerful recommender systems. However, existing approaches that focus on aligning LLMs with recommendation tasks do not fully leverage their sequential information processing capabilities, leading to suboptimal performance.
In this paper, we propose a novel system called compressed vocabulary expansion (CoVE). In CoVE, each item is assigned a unique ID within the expanded vocabulary. Our framework effectively capitalizes on sequence understanding abilities of LLMs, significantly enhancing their performance on recommendation tasks. Additionally, we compress the embedding layer, making CoVE practical for large-scale industrial applications. The effectiveness and performance of CoVE are demonstrated through comprehensive experiments on multiple recommendation datasets and comparisons with prior works. Our code can be found at https://github.com/HaochenZhang717/CoVE-official-Repo.
中文: 本文提出CoVE系统,通过压缩词汇扩展和嵌入层压缩,充分利用大语言模型的序列理解能力来增强推荐系统性能,在多数据集实验中展现出卓越效果。
English: This paper introduces CoVE, a novel system that enhances recommender systems by leveraging large language models' sequential processing capabilities through compressed vocabulary expansion and embedding layer compression, demonstrating superior performance in comprehensive experiments.
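A sketch of the vocabulary-expansion idea, with a low-rank item table standing in for CoVE's embedding-layer compression; the class name and rank are assumptions.

import torch
import torch.nn as nn

class ItemExpandedEmbedding(nn.Module):
    # Text tokens keep a dense table; item IDs live in an appended ID range
    # backed by a factorized (low-rank) table to keep the layer compact.
    def __init__(self, vocab_size, n_items, d_model, rank=64):
        super().__init__()
        self.vocab_size = vocab_size
        self.text = nn.Embedding(vocab_size, d_model)
        self.item_low = nn.Embedding(n_items, rank)
        self.item_up = nn.Linear(rank, d_model, bias=False)

    def forward(self, ids):  # ids may mix text tokens and expanded item IDs
        is_item = ids >= self.vocab_size
        text_emb = self.text(ids.clamp(max=self.vocab_size - 1))
        item_emb = self.item_up(self.item_low((ids - self.vocab_size).clamp(min=0)))
        return torch.where(is_item.unsqueeze(-1), item_emb, text_emb)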
Authors:Hang Zhang, Yuxi Zhang, Jiazheng Wang, Xiang Chen, Renjiu Hu, Xin Tian, Gaolei Li, Min Liu
Abstract:
Recent developments in neural networks have improved deformable image registration (DIR) by amortizing iterative optimization, enabling fast and accurate DIR results. However, learning-based methods often face challenges with limited training data and large deformations, and tend to underperform compared to iterative approaches when label supervision is unavailable. While iterative methods can achieve higher accuracy in such scenarios, they are considerably slower than learning-based methods. To address these limitations, we propose VoxelOpt, a discrete optimization-based DIR framework that combines the strengths of learning-based and iterative methods to achieve a better balance between registration accuracy and runtime. VoxelOpt uses displacement entropy from local cost volumes to measure displacement signal strength at each voxel, which differs from earlier approaches in three key aspects. First, it introduces voxel-wise adaptive message passing, where voxels with lower entropy receive less influence from their neighbors. Second, it employs a multi-level image pyramid with 27-neighbor cost volumes at each level, avoiding exponential complexity growth. Third, it replaces hand-crafted features or contrastive learning with a pretrained foundational segmentation model for feature extraction. In abdominal CT registration, these changes allow VoxelOpt to outperform leading iterative methods in both efficiency and accuracy, while matching state-of-the-art learning-based methods trained with label supervision. The source code will be available at https://github.com/tinymilky/VoxelOpt
中文:VoxelOpt是一种新型可变形图像配准框架,融合了基于学习和迭代方法的优势,通过位移熵和预训练分割模型,在腹部CT配准中实现了更高的精度和效率。
English: VoxelOpt is a novel deformable image registration framework that integrates learning-based and iterative methods, utilizing displacement entropy and a pretrained segmentation model to achieve superior accuracy and efficiency in abdominal CT registration.
Authors:Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, Zhi-Ming Ma
Abstract:
Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultaneously makes direct comparisons unfair, as it's hard to distinguish whether observed differences stem from the paradigm itself or the architectural shift. This research evaluates MDMs within a decoder-only framework to: (1) equitably compare MDM (as Any-Order AR, or AO-AR) and standard AR paradigms. Our investigation suggests that the standard AO-AR objective, which averages over all token permutations, may benefit from refinement, as many permutations appear less informative compared to the language's inherent left-to-right structure. (2) Investigate architectural influences (decoder-only vs. encoder-only) within MDMs. We demonstrate that while encoder-only MDMs model a simpler conditional probability space, decoder-only MDMs can achieve dramatic generation speedups ($\sim25\times$) and comparable perplexity with temperature annealing despite modeling a vastly larger space, highlighting key trade-offs. This work thus decouples core paradigm differences from architectural influences, offering insights for future model design. Code is available at https://github.com/scxue/AO-GPT-MDM.
中文摘要:本研究在仅解码器架构下比较掩码扩散模型与自回归模型,发现掩码扩散模型能实现显著生成加速且保持性能,同时强调了区分核心范式差异与架构影响的重要性。
English Summary: This study compares masked diffusion models (MDMs) with autoregressive models using a decoder-only architecture, revealing that MDMs can achieve significant generation speedups while maintaining performance, and highlights the importance of separating paradigm differences from architectural influences.
Authors:Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han
Abstract:
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increase, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.
Chinese: 本文提出径向注意力机制,通过利用时空能量衰减现象,在保持视频质量的同时显著降低了视频扩散模型的计算复杂度,实现了高效的训练和推理。
English: This paper introduces Radial Attention, a scalable sparse attention mechanism that reduces computational complexity in video diffusion models by leveraging spatiotemporal energy decay, achieving significant efficiency gains while maintaining video quality.
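The static mask is easy to sketch. Below, space is flattened to one dimension for clarity, and the window halves per frame of temporal distance as one simple instance of exponentially decaying compute density; the paper's exact mask differs.

import torch

def radial_mask(n_frames, tokens_per_frame, base_window):
    # boolean (n, n) mask: attend to spatial neighbors, with the window
    # shrinking as temporal distance between tokens grows
    n = n_frames * tokens_per_frame
    idx = torch.arange(n)
    frame, pos = idx // tokens_per_frame, idx % tokens_per_frame
    dt = (frame[:, None] - frame[None, :]).abs()
    dx = (pos[:, None] - pos[None, :]).abs()
    window = (base_window / 2.0 ** dt).clamp(min=1.0)  # halves per frame of distance
    return dx <= window

# e.g. radial_mask(4, 64, 16) gives a static sparse pattern that can be
# applied before softmax in place of dense attention.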
Authors:Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin
Abstract:
This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias resulting in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
中文: ScaleCap是一种推理时可扩展的图像描述策略,通过启发式问答和对比语句评分解决多模态和语言偏见,随着推理成本增加逐步生成更准确、平衡且信息丰富的图像描述。
English: ScaleCap is an inference-time scalable image captioning strategy that addresses multimodal and linguistic biases in LVLMs through heuristic question answering and contrastive sentence rating, progressively generating more accurate and detailed captions with increased inference cost.
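A sketch of the two components wired together; `llm_ask`, `vlm_answer`, and `llm_logp` are hypothetical callables (a question generator, a visual answerer, and a scorer returning sentence log-probability with or without the image), and the loop and sentence split are illustrative.

def scalecap_style_caption(image, caption, llm_ask, vlm_answer, llm_logp, budget):
    # heuristic question answering: each extra unit of inference budget asks
    # about one more underdescribed element and folds the answer in
    for _ in range(budget):
        q = llm_ask("Ask one question about a visual detail missing from:\n" + caption)
        caption += " " + vlm_answer(image, q)
    # contrastive sentence rating: keep sentences better explained with the
    # image than without it, dropping language-prior hallucinations
    kept = [s for s in caption.split(". ")
            if llm_logp(s, image=image) > llm_logp(s, image=None)]
    return ". ".join(kept)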
Authors:Tengbo Yu, Guanxing Lu, Zaijia Yang, Haoyuan Deng, Season Si Chen, Jiwen Lu, Wenbo Ding, Guoqiang Hu, Yansong Tang, Ziwei Wang
Abstract:
Multi-task robotic bimanual manipulation is becoming increasingly popular as it enables sophisticated tasks that require diverse dual-arm collaboration patterns. Compared to unimanual manipulation, bimanual tasks pose challenges to understanding the multi-body spatiotemporal dynamics. The existing method ManiGaussian pioneered encoding spatiotemporal dynamics into the visual representation via a Gaussian world model for single-arm settings, but it ignores the interaction of multiple embodiments, leading to a significant performance drop on dual-arm systems. In this paper, we propose ManiGaussian++, an extension of the ManiGaussian framework that improves multi-task bimanual manipulation by digesting multi-body scene dynamics through a hierarchical Gaussian world model. To be specific, we first generate task-oriented Gaussian Splatting from intermediate visual features, which aims to differentiate acting and stabilizing arms for multi-body spatiotemporal dynamics modeling. We then build a hierarchical Gaussian world model with a leader-follower architecture, where the multi-body spatiotemporal dynamics is mined for intermediate visual representation via future scene prediction. The leader predicts the Gaussian Splatting deformation caused by motions of the stabilizing arm, through which the follower generates the physical consequences resulting from the movement of the acting arm. As a result, our method significantly outperforms the current state-of-the-art bimanual manipulation techniques by an improvement of 20.2% in 10 simulated tasks, and achieves a 60% success rate on average in 9 challenging real-world tasks. Our code is available at https://github.com/April-Yz/ManiGaussian_Bimanual.
中文摘要:ManiGaussian++通过分层高斯世界模型改进了多体场景动态理解,在双手操作任务中相比现有技术实现了20.2%的性能提升,并在真实世界任务中达到60%的平均成功率。
English Summary: ManiGaussian++ extends the original framework with a hierarchical Gaussian world model to better capture multi-body dynamics in bimanual manipulation, achieving significant performance improvements in both simulated and real-world tasks.
Authors:Yucheng Zhou, Lingran Song, Jianbing Shen
Abstract:
Recent advancements in medical Large Language Models (LLMs) have showcased their powerful reasoning and diagnostic capabilities. Despite their success, current unified multimodal medical LLMs face limitations in knowledge update costs, comprehensiveness, and flexibility. To address these challenges, we introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM). Inspired by our empirical findings highlighting the benefits of role assignment and diagnostic discernment in LLMs, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director, each embodied by an LLM-based agent. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases. Extensive experimental evaluations conducted on a wide range of publicly accessible multimodal medical datasets, incorporating text, image, audio, and video modalities, demonstrate that MAM consistently surpasses the performance of modality-specific LLMs. Notably, MAM achieves significant performance improvements ranging from 18% to 365% compared to baseline models. Our code is released at https://github.com/yczhou001/MAM.
中文: 模块化多智能体框架(MAM)通过分配专业角色优化医疗诊断流程,在多种模态医学数据集上实现了18%至365%的性能提升。
English: The Modular Multi-Agent Framework (MAM) introduces specialized LLM-based roles to enhance medical diagnosis, achieving performance improvements of 18% to 365% over baseline models across diverse multimodal datasets.
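A sketch of the role decomposition, collapsing the per-role agents into prompts over one generic `llm` callable; the prompts are illustrative, and the paper instantiates each role as a separate LLM-based agent.

def mam_diagnose(case, llm):
    gp = llm("As a general practitioner, triage this case:\n" + case)
    spec = llm("As the relevant specialist, analyze the case:\n" + case + "\nTriage:\n" + gp)
    rad = llm("As a radiologist, interpret any imaging findings in:\n" + case)
    notes = llm("As a medical assistant, consolidate:\n" + gp + "\n" + spec + "\n" + rad)
    # the director issues the final call from the consolidated notes
    return llm("As the director, give the final diagnosis from these notes:\n" + notes)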
Authors:Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang
Abstract:
Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire $\gamma$-token guess is correct falls exponentially as $\gamma$ grows. This means allocating more compute for longer token drafts faces an algorithmic ceiling -- making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step-by-step, and each step needs only to be semantically correct, not exact token matching. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
Chinese: 前瞻推理通过引入步骤级并行性改进了推测解码,允许草稿模型提出多个未来推理步骤并验证其语义正确性,与令牌级并行性相乘,从而在不影响答案质量的前提下显著提升了解码速度。
English: Lookahead Reasoning enhances speculative decoding by introducing step-level parallelism, allowing a draft model to propose multiple future reasoning steps and verifying their semantic correctness, which multiplies with token-level parallelism to significantly boost decoding speed without compromising answer quality.
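A sketch of the acceptance logic, collapsed to one step at a time for readability; in the actual method the draft proposes several future steps that the target expands in a single batched pass, which is where the speedup comes from. All callables here are hypothetical.

def lookahead_reasoning(prompt, draft_step, target_step, judge, max_steps=64):
    steps = []
    for _ in range(max_steps):
        proposal = draft_step(prompt, steps)    # cheap draft of the next step
        reference = target_step(prompt, steps)  # target's own next step
        # accept the draft when it is semantically equivalent, not token-equal
        step = proposal if judge(proposal, reference) else reference
        steps.append(step)
        if "FINAL ANSWER" in step:  # illustrative stop condition
            break
    return steps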
Authors:Yitao Peng, Lianghua He, Hongzhou Chen
Abstract:
Although interpretable prototype networks have improved the transparency of deep learning image classification, the need for multiple prototypes in collaborative decision-making increases cognitive complexity and hinders user understanding. To solve this problem, this paper proposes a novel interpretable deep architecture for image classification, called ProtoSolo. Unlike existing prototypical networks, ProtoSolo requires activation of only a single prototype to complete the classification. This design significantly simplifies interpretation, as the explanation for each class requires displaying only the prototype with the highest similarity score and its corresponding feature map. Additionally, the traditional full-channel feature vector is replaced with a feature map for similarity comparison and prototype learning, enabling the use of richer global information within a single-prototype activation decision. A non-projection prototype learning strategy is also introduced to preserve the association between the prototype and image patch while avoiding abrupt structural changes in the network caused by projection, which can affect classification performance. Experiments on the CUB-200-2011 and Stanford Cars datasets demonstrate that ProtoSolo matches state-of-the-art interpretable methods in classification accuracy while achieving the lowest cognitive complexity. The code is available at https://github.com/pyt19/ProtoSolo.
中文摘要:本文提出ProtoSolo这一新型可解释深度架构,通过仅激活单个原型完成图像分类决策,在保持竞争力的分类准确率同时显著降低了认知复杂度。
English Summary: This paper introduces ProtoSolo, a novel interpretable deep architecture that simplifies image classification by activating only a single prototype per decision, thereby reducing cognitive complexity while maintaining competitive accuracy.
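A toy version of the single-prototype decision rule, scoring each class by its best prototype's best match anywhere in the feature map; ProtoSolo compares full feature maps rather than pooled patch vectors, so this is a simplification.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleProtoHead(nn.Module):
    def __init__(self, n_classes, protos_per_class, d):
        super().__init__()
        self.protos = nn.Parameter(torch.randn(n_classes, protos_per_class, d))

    def forward(self, fmap):  # fmap: (d, H, W) backbone feature map
        patches = F.normalize(fmap.flatten(1).t(), dim=-1)  # (HW, d)
        protos = F.normalize(self.protos, dim=-1)           # (C, P, d)
        sims = torch.einsum("nd,cpd->cpn", patches, protos)
        # one similarity per class: the single best prototype-patch match,
        # so the explanation is just that prototype and its location
        return sims.amax(dim=(1, 2))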
Authors:Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Abstract:
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
Chinese: 针对慢思考大语言模型的严重幻觉问题,我们提出KnowRL方法,通过知识增强的强化学习引入事实性奖励机制,引导模型进行基于事实的慢思考,在保持推理能力的同时有效减少错误输出。
English: To address severe hallucinations in slow-thinking Large Language Models, we propose KnowRL, a knowledge-enhanced reinforcement learning method that integrates factuality rewards to guide fact-based reasoning and reduce incorrect outputs while preserving reasoning capabilities.
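A sketch of the reward shape this describes, assuming a claim extractor has already produced atomic claims from the reasoning trace; `verify` and the mixing weight alpha are illustrative.

def knowrl_style_reward(outcome_reward, claims, verify, alpha=0.5):
    # blend the usual outcome reward with a factuality term obtained by
    # verifying each claim against a knowledge source; verify: claim -> bool
    factuality = sum(map(verify, claims)) / len(claims) if claims else 0.0
    return (1 - alpha) * outcome_reward + alpha * factuality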
Authors:Baochang Ren, Shuofei Qiao, Da Zheng, Huajun Chen, Ningyu Zhang
Abstract:
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
Chinese: 针对慢思考大语言模型的严重幻觉问题,我们提出KnowRL方法,通过知识增强的强化学习引入事实性奖励机制,引导模型进行基于事实的慢思考,在保持推理能力的同时有效减少错误输出。
English: To address severe hallucinations in slow-thinking Large Language Models, we propose KnowRL, a knowledge-enhanced reinforcement learning method that integrates factuality rewards to guide fact-based reasoning and reduce incorrect outputs while preserving reasoning capabilities.
Authors:Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen
Abstract:
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities. Code is available at https://github.com/zjunlp/DataMind.
中文摘要:本研究通过揭示战略规划是性能关键驱动因素,开发了一种数据合成方法,显著提升了开源大语言模型的分析推理能力。
English Summary: This study enhances open-source LLMs' data analysis capabilities by identifying strategic planning as the key performance driver and developing a data synthesis method that significantly improves analytical reasoning.
Authors:Boyi Liu, Qianyi Zhang, Qiang Yang, Jianhao Jiao, Jagmohan Chauhan, Dimitrios Kanoulas
Abstract:
The integration of satellite communication into mobile devices represents a paradigm shift in connectivity, yet the performance characteristics under motion and environmental occlusion remain poorly understood. We present the Starlink Robot, the first mobile robotic platform equipped with Starlink satellite internet and a comprehensive sensor suite (upward-facing camera, LiDAR, and IMU), designed to systematically study satellite communication performance during movement. Our multi-modal dataset captures synchronized communication metrics, motion dynamics, sky visibility, and 3D environmental context across diverse scenarios, including steady-state motion, variable speeds, and different occlusion conditions. This platform and dataset enable researchers to develop motion-aware communication protocols, predict connectivity disruptions, and optimize satellite communication for emerging mobile applications from smartphones to autonomous vehicles. In this work, we use LEOViz for real-time satellite tracking and data collection. The Starlink Robot project is available at https://github.com/StarlinkRobot.
Chinese: Starlink机器人是首个配备星链卫星互联网和全面传感器的移动平台,旨在系统研究运动和环境遮挡下的卫星通信性能,其多模态数据集为开发运动感知协议和优化移动应用的连接性提供了支持。
English: The Starlink Robot is a pioneering mobile platform that uses Starlink satellite internet and integrated sensors to systematically analyze satellite communication performance during movement and under environmental obstructions, providing a dataset for developing motion-aware protocols and optimizing connectivity for mobile applications.
Authors:Yuhui Sun, Xiyao Wang, Zixi Li, Zhenlong Yuan, Jinman Zhao
Abstract:
Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences. Reinforcement Learning from Human Feedback (RLHF) addresses this by optimizing models toward human preferences using a learned reward function and reinforcement learning, yielding improved alignment but suffering from high computational cost and instability. Direct Preference Optimization (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs, reducing training overhead while achieving competitive performance. However, it assumes fixed, single-dimensional preferences and only supports pairwise supervision.
To address these limitations, we propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback and flexibly balance multiple goals such as helpfulness, honesty, and fluency. Our method models full-ranked preference distributions rather than binary comparisons, enabling more informative learning signals. The lambda vector controls the relative importance of different alignment goals, allowing the model to generalize across diverse human objectives. During inference, lambda can be adjusted without retraining, providing controllable alignment behavior for downstream use. We also introduce a learned scheduler that dynamically samples performant lambda configurations to improve robustness.
Notably, our method requires only 20GB of GPU memory for training, making it suitable for compute-constrained settings such as academic labs, educational tools, or on-device assistants. Experiments on 1B-2B scale models show that our method consistently outperforms standard DPO on alignment benchmarks while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.
中文: 提出的多偏好Lambda加权列表化DPO方法通过列表化偏好建模和可调节的lambda向量,使语言模型能从更细致的人类反馈中学习并灵活平衡多个对齐目标,在低计算需求下实现更优性能。
English: The proposed Multi-Preference Lambda-weighted Listwise DPO method enables language models to learn from detailed human feedback and flexibly balance multiple alignment goals through listwise preference modeling and adjustable lambda vectors, achieving superior performance with low computational requirements.
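A sketch of one way to realize a lambda-weighted listwise objective, using a Plackett-Luce negative log-likelihood over DPO-style implicit rewards; the shapes and the Plackett-Luce choice are assumptions, not the paper's exact loss.

import torch

def lambda_listwise_dpo_loss(logp_policy, logp_ref, rankings, lam, beta=0.1):
    # logp_*: (n,) sequence log-probs for n candidate responses;
    # rankings: (d, n) best-to-worst permutations, one per preference
    # dimension (helpfulness, honesty, ...); lam: (d,) mixing weights
    scores = beta * (logp_policy - logp_ref)   # implicit per-response rewards
    losses = []
    for order in rankings:
        s = scores[order]
        # Plackett-Luce listwise NLL: each rank position is a softmax choice
        nll = -sum(torch.log_softmax(s[k:], dim=0)[0] for k in range(len(s) - 1))
        losses.append(nll)
    # lambda can be re-weighted at inference time without retraining
    return (lam * torch.stack(losses)).sum() / lam.sum()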
Authors:Yihong Luo, Shuchen Xue, Tianyang Hu, Jing Tang
Abstract:
The pursuit of efficient and controllable high-quality content generation remains a central challenge in artificial intelligence-generated content (AIGC). While one-step generators, enabled by diffusion distillation techniques, offer excellent generation quality and computational efficiency, adapting them to new control conditions--such as structural constraints, semantic guidelines, or external inputs--poses a significant challenge. Conventional approaches often necessitate computationally expensive modifications to the base model and subsequent diffusion distillation. This paper introduces Noise Consistency Training (NCT), a novel and lightweight approach to directly integrate new control signals into pre-trained one-step generators without requiring access to original training images or retraining the base diffusion model. NCT operates by introducing an adapter module and employs a noise consistency loss in the noise space of the generator. This loss aligns the adapted model's generation behavior across noises that are conditionally dependent to varying degrees, implicitly guiding it to adhere to the new control. Theoretically, this training objective can be understood as minimizing the distributional distance between the adapted generator and the conditional distribution induced by the new conditions. NCT is modular, data-efficient, and easily deployable, relying only on the pre-trained one-step generator and a control signal model. Extensive experiments demonstrate that NCT achieves state-of-the-art controllable generation in a single forward pass, surpassing existing multi-step and distillation-based methods in both generation quality and computational efficiency. Code is available at https://github.com/Luo-Yihong/NCT
中文: 本文提出的噪声一致性训练(NCT)是一种轻量级方法,可在无需重新训练的情况下将新控制信号集成到预训练单步生成器中,实现了在生成质量和计算效率方面均超越现有方法的最优可控生成效果。
English: This paper introduces Noise Consistency Training (NCT), a lightweight method that enables pre-trained one-step generators to incorporate new control signals without retraining, achieving state-of-the-art controllable generation with superior quality and efficiency.
Authors:Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, Daniel Fried
Abstract:
Recent progress in autonomous code generation has fueled excitement around AI agents capable of accelerating scientific discovery by running experiments. However, there is currently no benchmark that evaluates whether such agents can implement scientific ideas when given varied amounts of code as a starting point, interpolating between reproduction (running code) and from-scratch replication (fully re-implementing and running code). We introduce AutoExperiment, a benchmark that evaluates AI agents' ability to implement and run machine learning experiments based on natural language descriptions in research papers. In each task, agents are given a research paper, a codebase with key functions masked out, and a command to run the experiment. The goal is to generate the missing code, execute the experiment in a sandboxed environment, and reproduce the results. AutoExperiment scales in difficulty by varying the number of missing functions $n$, ranging from partial reproduction to full replication. We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases. Agents that can dynamically interact with the environment (e.g. to debug their code) can outperform agents in fixed "agentless" harnesses, and there exists a significant gap between single-shot and multi-trial success rates (Pass@1 vs. Pass@5), motivating verifier approaches to our benchmark. Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution, establishing AutoExperiment as a new benchmark for evaluating progress in AI-driven scientific experimentation. Our data and code are open-sourced at https://github.com/j1mk1m/AutoExperiment.
中文: AutoExperiment是一个新基准,用于评估AI代理根据自然语言描述实现和运行机器学习实验的能力,其性能随着需生成代码量的增加而下降,且交互式代理优于静态代理。
English: AutoExperiment is a new benchmark that assesses AI agents' ability to implement and run machine learning experiments from natural language descriptions, with performance declining as more code must be generated from scratch and interactive agents outperforming static ones.
Authors:Lei Kang, Xuanshuo Fu, Oriol Ramos Terrades, Javier Vazquez-Corral, Ernest Valveny, Dimosthenis Karatzas
Abstract:
Medical document analysis plays a crucial role in extracting essential clinical insights from unstructured healthcare records, supporting critical tasks such as differential diagnosis. Determining the most probable condition among overlapping symptoms requires precise evaluation and deep medical expertise. While recent advancements in large language models (LLMs) have significantly enhanced performance in medical document analysis, privacy concerns related to sensitive patient data limit the use of online LLM services in clinical settings. To address these challenges, we propose a trustworthy medical document analysis platform that fine-tunes LLaMA-v3 using low-rank adaptation, specifically optimized for differential diagnosis tasks. Our approach utilizes DDXPlus, the largest benchmark dataset for differential diagnosis, and demonstrates superior performance in pathology prediction and variable-length differential diagnosis compared to existing methods. The developed web-based platform allows users to submit their own unstructured medical documents and receive accurate, explainable diagnostic results. By incorporating advanced explainability techniques, the system ensures transparent and reliable predictions, fostering user trust and confidence. Extensive evaluations confirm that the proposed method surpasses current state-of-the-art models in predictive accuracy while offering practical utility in clinical settings. This work addresses the urgent need for reliable, explainable, and privacy-preserving artificial intelligence solutions, representing a significant advancement in intelligent medical document analysis for real-world healthcare applications. The code can be found at https://github.com/leitro/Differential-Diagnosis-LoRA.
中文: 本研究开发了一个基于低秩自适应微调LLaMA-v3的可信医疗文档分析平台,通过可解释人工智能在保护数据隐私的前提下,实现了卓越的鉴别诊断性能。
English: This study introduces a privacy-preserving medical document analysis platform that fine-tunes LLaMA-v3 using low-rank adaptation, achieving superior differential diagnosis performance through explainable AI while ensuring data security.
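For reference, a minimal low-rank adaptation setup of the kind described, using the Hugging Face peft library; the checkpoint name, rank, and target modules are illustrative choices, not the paper's reported configuration.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# illustrative base checkpoint and LoRA hyperparameters
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the low-rank adapters are trained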
Authors:Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang
Abstract:
Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
中文: 离群值安全预训练(OSP)通过三项关键创新主动预防大语言模型中的极端激活离群值,在激进4位量化下实现卓越性能且仅增加2%训练开销,从根本上改变了模型量化行为。
English: Outlier-Safe Pre-Training (OSP) is a novel training strategy that proactively prevents extreme activation outliers in LLMs through three key innovations, enabling superior 4-bit quantization performance with minimal training overhead and fundamentally changing LLM deployment efficiency.
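A sketch of the Single-Scale RMSNorm component as described, under the assumption that "single-scale" means one learnable scalar gain in place of the usual per-channel gain vector:

import torch
import torch.nn as nn

class SingleScaleRMSNorm(nn.Module):
    # one scalar gain for all channels, so normalization cannot selectively
    # amplify individual channels (a plausible reading; details may differ)
    def __init__(self, eps=1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(()))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.gain * x / rms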
Authors:Mihnea Ghitu, Vihari Piratla, Matthew Wicker
Abstract:
Controlling the patterns a model learns is essential to preventing reliance on irrelevant or misleading features. Such reliance on irrelevant features, often called shortcut features, has been observed across domains, including medical imaging and natural language processing, where it may lead to real-world harms. A common mitigation strategy leverages annotations (provided by humans or machines) indicating which features are relevant or irrelevant. These annotations are compared to model explanations, typically in the form of feature salience, and used to guide the loss function during training. Unfortunately, recent works have demonstrated that feature salience methods are unreliable and therefore offer a poor signal to optimize. In this work, we propose a simplified objective that simultaneously optimizes for explanation robustness and mitigation of shortcut learning. Unlike prior objectives with similar aims, we demonstrate theoretically why our approach ought to be more effective. Across a comprehensive series of experiments, we show that our approach consistently reduces test-time misclassifications by 20% compared to state-of-the-art methods. We also extend prior experimental settings to include natural language processing tasks. Additionally, we conduct novel ablations that yield practical insights, including the relative importance of annotation quality over quantity. Code for our method and experiments is available at: https://github.com/Mihneaghitu/ModelGuidanceViaRobustFeatureAttribution.
Chinese: 本研究提出了一种简化目标,通过增强解释鲁棒性并缓解捷径学习,理论上证明了其有效性,实验表明测试时误分类减少20%,同时扩展至自然语言处理任务,并提供了关于标注质量优于数量的实用见解。
English: This study introduces a streamlined objective that enhances explanation robustness and mitigates shortcut learning, theoretically justifying its effectiveness and demonstrating a 20% reduction in test-time misclassifications across experiments, while also extending evaluations to NLP tasks and providing practical insights on annotation quality.
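For contrast with the paper's objective, here is the classic annotation-guided penalty it builds on (often called "right for the right reasons"): input gradients on features annotated as irrelevant are suppressed. The paper's contribution is to additionally make the attribution term robust, which this sketch does not include.

import torch

def guided_loss(model, x, y, irrelevant_mask, criterion, lam=1.0):
    # irrelevant_mask: 1 where annotations mark features as irrelevant
    x = x.clone().requires_grad_(True)
    task_loss = criterion(model(x), y)
    (grad,) = torch.autograd.grad(task_loss, x, create_graph=True)
    # penalize salience (input gradient) placed on irrelevant features
    return task_loss + lam * (grad * irrelevant_mask).pow(2).sum()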
Authors:Yang Xing, Jiong Wu, Yuheng Bu, Kuang Gong
Abstract:
Although new vision foundation models such as Segment Anything Model 2 (SAM2) have significantly enhanced zero-shot image segmentation capabilities, reliance on human-provided prompts poses significant challenges in adapting SAM2 to medical image segmentation tasks. Moreover, SAM2's performance in medical image segmentation was limited by the domain shift issue, since it was originally trained on natural images and videos. To address these challenges, we proposed SAM2 with support-set guided prompting (SAM2-SGP), a framework that eliminated the need for manual prompts. The proposed model leveraged the memory mechanism of SAM2 to generate pseudo-masks using image-mask pairs from a support set via a Pseudo-mask Generation (PMG) module. We further introduced a novel Pseudo-mask Attention (PMA) module, which used these pseudo-masks to automatically generate bounding boxes and enhance localized feature extraction by guiding attention to relevant areas. Furthermore, a low-rank adaptation (LoRA) strategy was adopted to mitigate the domain shift issue. The proposed framework was evaluated on both 2D and 3D datasets across multiple medical imaging modalities, including fundus photography, X-ray, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and ultrasound. The results demonstrated a significant performance improvement over state-of-the-art models, such as nnUNet and SwinUNet, as well as foundation models, such as SAM2 and MedSAM2, underscoring the effectiveness of the proposed approach. Our code is publicly available at https://github.com/astlian9/SAM_Support.
中文: 提出的SAM2-SGP框架通过支持集生成伪掩码并利用注意力机制,无需人工提示即可完成医学图像分割,同时采用LoRA策略解决领域偏移问题,在多种影像模态中均实现了优越性能。
English: The proposed SAM2-SGP framework eliminates the need for manual prompts in medical image segmentation by generating pseudo-masks from support sets and using attention mechanisms, while employing LoRA to address domain shift, achieving superior performance across multiple imaging modalities.
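A sketch of the pseudo-mask-to-box step that the PMA module relies on; the function name and threshold are illustrative.

import torch

def box_prompt_from_pseudo_mask(mask, threshold=0.5):
    # reduce a support-set-derived pseudo-mask to an (x0, y0, x1, y1) box,
    # the kind of automatic prompt used to focus attention on relevant areas
    ys, xs = torch.nonzero(mask > threshold, as_tuple=True)
    if ys.numel() == 0:
        return None  # no confident foreground: fall back to no box prompt
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()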
Authors:Oscar J. Pellicer-Valero, Cesar Aybar, Gustau Camps Valls
Abstract:
Large-scale Earth system datasets, from high-resolution remote sensing imagery to spatiotemporal climate model outputs, exhibit characteristics analogous to those of standard videos. Their inherent spatial, temporal, and spectral redundancies can thus be readily exploited by established video compression techniques. Here, we present xarrayvideo, a Python library for compressing multichannel spatiotemporal datasets by encoding them as videos. Our approach achieves compression ratios of up to 250x while maintaining high fidelity by leveraging standard, well-optimized video codecs through ffmpeg. We demonstrate the library's effectiveness on four real-world multichannel spatiotemporal datasets: DynamicEarthNet (very high resolution Planet images), DeepExtremeCubes (high resolution Sentinel-2 images), ERA5 (weather reanalysis data), and the SimpleS2 dataset (high resolution multichannel Sentinel-2 images), achieving Peak Signal-to-Noise Ratios (PSNRs) of 55.86, 40.60, 46.58, and 43.23 dB at 0.1 bits per pixel per band (bpppb) and 65.91, 54.28, 62.90, and 55.04 dB at 1 bpppb. We are redistributing two of these datasets, DeepExtremeCubes (2.3 TB) and DynamicEarthNet (525 GB), in the machine-learning-ready and cloud-ready TACO format through HuggingFace at significantly reduced sizes (270 GB and 8.5 GB, respectively) without compromising quality (PSNR 55.77-56.65 and 60.15). No performance loss is observed when the compressed versions of these datasets are used in their respective deep learning-based downstream tasks (next step reflectance prediction and landcover segmentation). In conclusion, xarrayvideo presents an efficient solution for handling the rapidly growing size of Earth observation datasets, making advanced compression techniques accessible and practical to the Earth science community. The library is available for use at https://github.com/IPL-UV/xarrayvideo
中文: xarrayvideo Python库通过将地球系统数据集转换为视频格式,利用标准编解码器实现了高达250倍的压缩率,同时保持数据质量并确保其在机器学习任务中的可用性。
English: The xarrayvideo Python library compresses large Earth system datasets by converting them into videos using standard codecs, achieving up to 250x compression while preserving data quality and usability for machine learning tasks.
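The core trick is small enough to sketch: pack band triplets of a (time, y, x, band) cube into 3-channel frames and hand them to a standard codec. The imageio-based writer below is an illustrative stand-in; xarrayvideo itself drives ffmpeg directly, with higher bit depths and real xarray inputs.

import numpy as np
import imageio.v2 as imageio

# toy datacube: 24 timesteps, 128x128 pixels, 6 spectral bands
cube = np.random.rand(24, 128, 128, 6).astype(np.float32)
for c in range(0, cube.shape[-1], 3):
    # each group of 3 bands becomes one ordinary RGB video
    with imageio.get_writer(f"bands_{c}-{c + 2}.mp4", fps=8) as writer:
        for t in range(cube.shape[0]):
            writer.append_data((cube[t, :, :, c:c + 3] * 255).astype(np.uint8))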
Authors:Gaurav Sharma, Ravi Kothari, Josef Schmid
Abstract:
In this paper, we propose a Neural Radiance Fields (NeRF) based framework, referred to as the Novel View Synthesis Framework (NVSF). It jointly learns an implicit neural representation of a space- and time-varying scene for both LiDAR and camera. We test this on a real-world autonomous driving scenario containing both static and dynamic scenes. Compared to existing multimodal dynamic NeRFs, our framework is self-supervised, thus eliminating the need for 3D labels. For efficient training and faster convergence, we introduce heuristic-based image pixel sampling to focus on pixels with rich information. To preserve the local features of LiDAR points, a Double Gradient based mask is employed. Extensive experiments on the KITTI-360 dataset show that, compared to the baseline models, our framework achieves the best performance in both the LiDAR and camera domains. Code of the model is available at https://github.com/gaurav00700/Selfsupervised-NVSF
Chinese: 本文提出了一种名为NVSF的自监督神经辐射场框架,通过启发式采样和双梯度掩码技术联合建模激光雷达与相机的动态场景,在KITTI-360数据集上实现了最佳性能。
English: This paper introduces a self-supervised Neural Radiance Fields framework called NVSF that jointly models dynamic scenes for LiDAR and camera data, achieving state-of-the-art performance on the KITTI-360 dataset through heuristic sampling and gradient-based feature preservation.
Authors:Zhenke Duan, Jiqun Pan, Jiani Tu, Xiaoyi Wang, Yanqing Wang
Abstract:
In the era of large-scale artificial intelligence, Large Language Models (LLMs) have made significant strides in natural language processing. However, they often lack transparency and generate unreliable outputs, raising concerns about their interpretability. To address this, the Chain of Thought (CoT) prompting method structures reasoning into step-by-step deductions. Yet, not all reasoning chains are valid, and errors can lead to unreliable conclusions. We propose ECCoT, an End-to-End Cognitive Chain of Thought Validation Framework, to evaluate and refine reasoning chains in LLMs. ECCoT integrates the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware CoT generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. By filtering ineffective chains using structured ordering statistics, ECCoT improves interpretability, reduces biases, and enhances the trustworthiness of LLM-based decision-making. Key contributions include the introduction of ECCoT, MRF-ETM for topic-driven CoT generation, and CSBert for causal reasoning enhancement. Code is released at: https://github.com/erwinmsmith/ECCoT.git.
中文摘要:ECCoT框架通过MRF-ETM实现主题感知的思维链生成,并利用CSBert进行因果对齐验证,有效提升大语言模型推理链的可靠性,增强可解释性并减少偏见。
English Summary: The ECCoT framework enhances the reliability of Large Language Models by validating reasoning chains using MRF-ETM for topic-aware generation and CSBert for causal alignment, thereby improving interpretability and reducing biases.
Authors:Alan N. Amin, Andres Potapczynski, Andrew Gordon Wilson
Abstract:
To understand how genetic variants in human genomes manifest in phenotypes -- traits like height or diseases like asthma -- geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems because variants in the genome are correlated with nearby variants, requiring inversion of large matrices. Previous methods have therefore been restricted to fitting small models, and fitting simplified summary statistics, rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Notably, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
中文摘要:遗传学家利用大规模基因组数据构建预测模型以关联基因变异与表型,但面临矩阵求逆的计算瓶颈;DeepWAS采用快速线性代数方法训练更大的神经网络,通过全似然优化提升预测性能,有望改进疾病预测和治疗靶点识别。
English Summary: Geneticists use large-scale genomic data to build predictive models linking genetic variants to phenotypes, but face computational bottlenecks from matrix inversions, which DeepWAS overcomes using fast linear algebra to train larger neural networks that improve predictions with full likelihood optimization.
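The bottleneck above is concrete enough to sketch: the likelihood involves solves against a large matrix R of correlations between nearby variants, and forming R's inverse explicitly is what restricted earlier methods to small models. A minimal sketch of the matrix-free alternative follows, using conjugate gradients so that only matrix-vector products are ever needed; this illustrates the class of fast linear algebra techniques, not DeepWAS's exact solver, and the banded LD matrix here is synthetic.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n = 2000

# Synthetic LD matrix: correlation decays with distance along the genome.
idx = np.arange(n)
R = 0.9 ** np.abs(idx[:, None] - idx[None, :])
b = rng.normal(size=n)

# Matrix-free solve: cg only ever asks for R @ v, never for the inverse of R.
op = LinearOperator((n, n), matvec=lambda v: R @ v)
x, info = cg(op, b)
print(info, np.linalg.norm(R @ x - b))  # info == 0 means converged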
Authors:Tao Huang, Zhekun Liu, Rui Wang, Yang Zhang, Liping Jing
Abstract:
Despite the remarkable multimodal capabilities of Large Vision-Language Models (LVLMs), discrepancies often occur between visual inputs and textual outputs--a phenomenon we term visual hallucination. This critical reliability gap poses substantial risks in safety-critical Artificial Intelligence (AI) applications, necessitating a comprehensive evaluation benchmark and effective detection methods. Firstly, we observe that existing visual-centric hallucination benchmarks mainly assess LVLMs from a perception perspective, overlooking hallucinations arising from advanced reasoning capabilities. We develop the Perception-Reasoning Evaluation Hallucination (PRE-HAL) dataset, which enables the systematic evaluation of both perception and reasoning capabilities of LVLMs across multiple visual semantics, such as instances, scenes, and relations. Comprehensive evaluation with this new benchmark exposed more visual vulnerabilities, particularly in the more challenging task of relation reasoning. To address this issue, we propose, to the best of our knowledge, the first Dempster-Shafer theory (DST)-based visual hallucination detection method for LVLMs through uncertainty estimation. This method aims to efficiently capture the degree of conflict in high-level features at the model inference phase. Specifically, our approach employs simple mass functions to mitigate the computational complexity of evidence combination on power sets. We conduct an extensive evaluation of state-of-the-art LVLMs, LLaVA-v1.5, mPLUG-Owl2 and mPLUG-Owl3, with the new PRE-HAL benchmark. Experimental results indicate that our method outperforms five baseline uncertainty metrics, achieving average AUROC improvements of 4%, 10%, and 7% across three LVLMs. Our code is available at https://github.com/HT86159/Evidential-Conflict.
中文摘要:大型视觉语言模型存在视觉幻觉导致可靠性风险,为此开发了PRE-HAL评估基准和基于Dempster-Shafer理论的检测方法,该方法在多项测试中优于现有基准指标。
English Summary: Large Vision-Language Models suffer from visual hallucinations causing reliability risks, prompting the creation of the PRE-HAL benchmark for systematic evaluation and a novel Dempster-Shafer theory-based detection method that outperforms existing metrics.
Authors:Aleksandr Algazinov, Matt Laing, Paul Laban
Abstract:
Accessibility remains a critical concern in today's society, as many technologies are not developed to support the full range of user needs. Existing multi-agent systems (MAS) often cannot provide comprehensive assistance for users in need due to the lack of customization stemming from closed-source designs. Consequently, individuals with disabilities frequently encounter significant barriers when attempting to interact with digital environments. We introduce MATE, a multimodal accessibility MAS, which performs the modality conversions based on the user's needs. The system is useful for assisting people with disabilities by ensuring that data will be converted to an understandable format. For instance, if the user cannot see well and receives an image, the system converts this image to its audio description. MATE can be applied to a wide range of domains, industries, and areas, such as healthcare, and can become a useful assistant for various groups of users. The system supports multiple types of models, ranging from LLM API calling to using custom machine learning (ML) classifiers. This flexibility ensures that the system can be adapted to various needs and is compatible with a wide variety of hardware. Since the system is expected to run locally, it ensures the privacy and security of sensitive information. In addition, the framework can be effectively integrated with institutional technologies (e.g., digital healthcare service) for real-time user assistance. Furthermore, we introduce ModCon-Task-Identifier, a model that is capable of extracting the precise modality conversion task from the user input. Numerous experiments show that ModCon-Task-Identifier consistently outperforms other LLMs and statistical models on our custom data. Our code and data are publicly available at https://github.com/AlgazinovAleksandr/Multi-Agent-MATE.
中文: MATE是一种多模态无障碍多智能体系统,能根据用户需求进行数据格式转换,通过可定制的模态转换帮助残障人士克服数字障碍,同时本地运行确保隐私安全。
English: MATE is a multimodal accessibility multi-agent system that converts data formats based on user needs, enabling people with disabilities to overcome digital barriers through customizable modality conversions while ensuring local operation for privacy protection.
Authors:Yuelin Zhang, Jiacheng Cen, Jiaqi Han, Wenbing Huang
Abstract:
Equivariant Graph Neural Networks (GNNs) have achieved remarkable success across diverse scientific applications. However, existing approaches face critical efficiency challenges when scaling to large geometric graphs and suffer significant performance degradation when the input graphs are sparsified for computational tractability. To address these limitations, we introduce FastEGNN and DistEGNN, two novel enhancements to equivariant GNNs for large-scale geometric graphs. FastEGNN employs a key innovation: a small ordered set of virtual nodes that effectively approximates the large unordered graph of real nodes. Specifically, we implement distinct message passing and aggregation mechanisms for different virtual nodes to ensure mutual distinctiveness, and minimize Maximum Mean Discrepancy (MMD) between virtual and real coordinates to achieve global distributedness. This design enables FastEGNN to maintain high accuracy while efficiently processing large-scale sparse graphs. For extremely large-scale geometric graphs, we present DistEGNN, a distributed extension where virtual nodes act as global bridges between subgraphs in different devices, maintaining consistency while dramatically reducing memory and computational overhead. We comprehensively evaluate our models across four challenging domains: N-body systems (100 nodes), protein dynamics (800 nodes), Water-3D (8,000 nodes), and our new Fluid113K benchmark (113,000 nodes). Results demonstrate superior efficiency and performance, establishing new capabilities in large-scale equivariant graph learning. Code is available at https://github.com/GLAD-RUC/DistEGNN.
Chinese: FastEGNN和DistEGNN通过虚拟节点有效近似大型几何图,在保持高精度的同时显著提升处理效率,成功应用于多种大规模科学计算场景。
English: FastEGNN and DistEGNN enhance equivariant GNNs by using virtual nodes to efficiently approximate large geometric graphs, maintaining high accuracy and scalability across diverse large-scale applications.
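The "global distributedness" term above has a standard form worth sketching: a kernel MMD between virtual-node and real-node coordinates, minimized so that a handful of trainable virtual nodes spreads out to summarize the whole graph. A minimal PyTorch sketch with a single fixed-bandwidth RBF kernel; the paper's exact kernel and weighting are assumptions here.

import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate between point sets x (m, d) and y (n, d)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

real = torch.randn(1000, 3)                      # real node coordinates
virtual = torch.randn(8, 3, requires_grad=True)  # trainable virtual nodes

loss = mmd_rbf(virtual, real)
loss.backward()  # gradients pull the virtual nodes toward global coverage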
Authors:Pengfei Hao, Shuaibo Li, Hongqiu Wang, Zhizhuo Kou, Junhang Zhang, Guang Yang, Lei Zhu
Abstract:
In recent years, significant progress has been made in the field of surgical scene understanding, particularly in the task of Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). However, existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and potential for development in clinical applications. To address this issue, inspired by the development of Reasoning Multimodal Large Language Models (MLLMs), we first build the Surgery-R1-54k dataset, including paired data for Visual-QA, Grounding-QA, and Chain-of-Thought (CoT). Then, we propose the first Reasoning MLLM for Surgical-VQLA (Surgery-R1). In our Surgery-R1, we design a two-stage fine-tuning mechanism that equips the base MLLM with complex reasoning abilities through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Furthermore, for an efficient and high-quality rule-based reward system in our RFT, we design a Multimodal Coherence reward mechanism to mitigate positional illusions that may arise in surgical scenarios. Experiment results demonstrate that Surgery-R1 outperforms other existing state-of-the-art (SOTA) models in the Surgical-VQLA task and widely-used MLLMs, while also validating its reasoning capabilities and the effectiveness of our approach. The code and dataset will be made available at https://github.com/FiFi-HAO467/Surgery-R1.
中文: 本研究提出了首个用于手术视觉问答定位的推理多模态大语言模型Surgery-R1,采用两阶段微调方法和多模态一致性奖励机制,显著提升了推理能力并超越了现有最优模型。
English: This study introduces Surgery-R1, the first reasoning multimodal large language model for surgical visual question localized-answering, which employs a two-stage fine-tuning approach and a multimodal coherence reward mechanism to enhance reasoning capabilities and outperform existing models.
Authors:Ankita Raj, Harsh Swaika, Deepankar Varma, Chetan Arora
Abstract:
The success of deep learning in medical imaging applications has led several companies to deploy proprietary models in diagnostic workflows, offering monetized services. Even though model weights are hidden to protect the intellectual property of the service provider, these models are exposed to model stealing (MS) attacks, where adversaries can clone the model's functionality by querying it with a proxy dataset and training a thief model on the acquired predictions. While extensively studied on general vision tasks, the susceptibility of medical imaging models to MS attacks remains inadequately explored. This paper investigates the vulnerability of black-box medical imaging models to MS attacks under realistic conditions where the adversary lacks access to the victim model's training data and operates with limited query budgets. We demonstrate that adversaries can effectively execute MS attacks by using publicly available datasets. To further enhance MS capabilities with limited query budgets, we propose a two-step model stealing approach termed QueryWise. This method capitalizes on unlabeled data obtained from a proxy distribution to train the thief model without incurring additional queries. Evaluation on two medical imaging models for Gallbladder Cancer and COVID-19 classification substantiates the effectiveness of the proposed attack. The source code is available at https://github.com/rajankita/QueryWise.
中文摘要:医疗影像中的深度学习模型易受模型窃取攻击,攻击者可通过代理数据集和有限查询克隆专有模型,为此提出的QueryWise方法能在无需额外查询下提升窃取效率。
English Summary: Deep learning models in medical imaging are vulnerable to model stealing attacks, where adversaries clone proprietary models using proxy datasets and limited queries, prompting the development of the QueryWise method to enhance theft efficiency without extra queries.
Authors:Zhifeng Wang, Renjiao Yi, Xin Wen, Chenyang Zhu, Kai Xu, Kunlun He
Abstract:
Vascular diseases pose a significant threat to human health, with X-ray angiography established as the gold standard for diagnosis, allowing for detailed observation of blood vessels. However, angiographic X-rays expose personnel and patients to higher radiation levels than non-angiographic X-rays, which is undesirable. Thus, modality translation from non-angiographic to angiographic X-rays is desirable, but data-driven deep approaches are hindered by the lack of paired large-scale X-ray angiography datasets. This makes high-quality vascular angiography synthesis crucial, yet it remains challenging. We find that current medical image synthesis primarily operates at the pixel level and struggles to adapt to the complex geometric structure of blood vessels, resulting in unsatisfactory quality of blood vessel image synthesis, such as disconnections or unnatural curvatures. To overcome this issue, we propose a self-supervised method via diffusion models to transform non-angiographic X-rays into angiographic X-rays, mitigating data shortages for data-driven approaches. Our model comprises a diffusion model that learns the distribution of vascular data from the diffusion latent, a generator for vessel synthesis, and a mask-based adversarial module. To enhance geometric accuracy, we propose a parametric vascular model to fit the shape and distribution of blood vessels. The proposed method contributes a pipeline and a synthetic dataset for X-ray angiography. We conducted extensive comparative and ablation experiments to evaluate Angio-Diff. The results demonstrate that our method achieves state-of-the-art performance in synthetic angiography image quality and more accurately synthesizes the geometric structure of blood vessels. The code is available at https://github.com/zfw-cv/AngioDiff.
Chinese: 本研究提出了一种名为Angio-Diff的自监督扩散模型,能将非血管造影X光转换为血管造影图像,通过参数化血管模型和对抗训练有效解决了数据稀缺问题,显著提升了血管几何结构的合成精度。
English: This study introduces a self-supervised diffusion model called Angio-Diff that converts non-angiographic X-rays into angiographic images, effectively addressing data scarcity and improving geometric accuracy in blood vessel synthesis through a parametric vascular model and adversarial training.
Authors:Sajal Halder, Muhammad Ejaz Ahmed, Seyit Camtepe
Abstract:
Software supply chain vulnerabilities arise when attackers exploit weaknesses by injecting vulnerable code into widely used packages or libraries within software repositories. While most existing approaches focus on identifying vulnerable packages or libraries, they often overlook the specific functions responsible for these vulnerabilities. Pinpointing vulnerable functions within packages or libraries is critical, as it can significantly reduce the risks associated with using open-source software. Identifying vulnerable patches is challenging because developers often submit code changes that are unrelated to vulnerability fixes. To address this issue, this paper introduces FuncVul, an innovative code chunk-based model for function-level vulnerability detection in C/C++ and Python, designed to identify multiple vulnerabilities within a function by focusing on smaller, critical code segments. To assess the model's effectiveness, we construct six code and generic code-chunk-based datasets using two approaches: (1) integrating patch information with large language models to label vulnerable samples and (2) leveraging large language models alone to detect vulnerabilities in function-level code. To design the FuncVul model, we fine-tune GraphCodeBERT, which captures both the syntactic and semantic aspects of code. Experimental results show that FuncVul outperforms existing state-of-the-art models, achieving an average accuracy of 87-92% and an F1 score of 86-92% across all datasets. Furthermore, we demonstrate that our code-chunk-based FuncVul model improves accuracy by 53.9% and F1-score by 42.0% compared with full-function-based vulnerability prediction. The FuncVul code and datasets are publicly available on GitHub at https://github.com/sajalhalder/FuncVul.
Chinese: 本文提出FuncVul模型,通过聚焦关键代码段实现C/C++和Python语言的函数级漏洞检测,其准确率和F1分数均显著优于现有方法。
English: This paper introduces FuncVul, a code chunk-based model that detects function-level vulnerabilities in C/C++ and Python by focusing on critical code segments, achieving superior accuracy and F1 scores compared to existing methods.
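The preprocessing step at the heart of the method, splitting a function into small overlapping code chunks before classification, is easy to sketch. Chunk size and stride below are illustrative choices, not the paper's settings.

def code_chunks(source: str, chunk_size: int = 3, stride: int = 2):
    """Yield overlapping line-based chunks of a function's source."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    for start in range(0, max(len(lines) - chunk_size, 0) + 1, stride):
        yield "\n".join(lines[start:start + chunk_size])

c_function = """int copy(char *dst, const char *src, int n) {
    for (int i = 0; i <= n; i++)   // off-by-one: writes n+1 bytes
        dst[i] = src[i];
    return 0;
}"""

for chunk in code_chunks(c_function):
    print("--- chunk ---")
    print(chunk)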
Authors:Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li
Abstract:
Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our codes are open-sourced via https://github.com/tsinghua-fib-lab/Mem4Nav.
中文: Mem4Nav提出了一种层次化空间认知记忆系统,通过融合细粒度体素索引与语义地标连通性来增强视觉语言导航智能体,在多个基准测试中实现了显著的性能提升。
English: Mem4Nav introduces a hierarchical spatial-cognition memory system that enhances Vision-and-Language Navigation agents by combining fine-grained voxel indexing with semantic landmark connectivity, achieving significant performance improvements across multiple benchmarks.
Authors:Robert Hanson, Jesus Martinez-Garcia
Abstract:
We describe CompGIT, a SageMath package for computing Geometric Invariant Theory (GIT) quotients of projective space by simple groups. The implementation is based on algorithms described by Gallardo--Martinez-Garcia--Moon--Swinarski. In principle the package is sufficient to describe any GIT quotient of a projective variety by a simple group; in practice it requires that the user can construct an equivariant embedding of the polarised variety into projective space. The package describes the non-stable and unstable loci up to conjugation by the group, as well as the strictly polystable loci. We discuss potential applications of the outputs of CompGIT to problems in algebraic geometry, as well as suggesting directions for future developments.
中文: CompGIT是一个SageMath软件包,用于计算简单群对射影空间的几何不变量理论商,能识别不稳定和多稳态轨迹,并支持代数几何中的应用。
English: CompGIT is a SageMath package that computes Geometric Invariant Theory quotients of projective space by simple groups, identifying unstable and polystable loci while supporting applications in algebraic geometry.
Authors:Tiankai Yang, Kaixin Chai, Jialin Ji, Yuze Wu, Chao Xu, Fei Gao
Abstract:
The ground effect on multicopters introduces several challenges, such as control errors caused by additional lift, oscillations that may occur during near-ground flight due to external torques, and the influence of ground airflow on models such as the rotor drag and the mixing matrix. This article collects and analyzes the dynamics data of near-ground multicopter flight through various methods, including force measurement platforms and real-world flights. For the first time, we summarize the mathematical model of the external torque of multicopters under ground effect. The influence of ground airflow on rotor drag and the mixing matrix is also verified through adequate experimentation and analysis. Through simplification and derivation, the differential flatness of the multicopter's dynamic model under ground effect is confirmed. To mitigate the influence of these disturbance models on control, we propose a control method that combines dynamic inverse and disturbance models, ensuring consistent control effectiveness at both high and low altitudes. In this method, the additional thrust and variations in rotor drag under ground effect are both considered and compensated through feedforward models. The leveling torque of ground effect can be equivalently represented as variations in the center of gravity and the moment of inertia. In this way, the leveling torque does not explicitly appear in the dynamic model. The final experimental results show that the method proposed in this paper reduces the control error (RMSE) by 45.3%. Please check the supplementary material at: https://github.com/ZJU-FAST-Lab/Ground-effect-controller.
中文: 本文分析了多旋翼飞行器的地面效应,建立了外部扭矩和气流影响的数学模型,并提出一种控制方法,通过前馈补偿和动态建模将控制误差降低了45.3%。
English: This article analyzes the ground effect on multicopters, develops a mathematical model for external torque and airflow impacts, and proposes a control method that reduces control error by 45.3% through feedforward compensation and dynamic modeling.
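For intuition about the feedforward lift compensation, the classical Cheeseman-Bennett model predicts the thrust gain near the ground as T_IGE / T_OGE = 1 / (1 - (R / 4z)^2), and dividing the commanded thrust by this ratio cancels the extra lift. The paper's model is richer (external torque, rotor drag, and mixing-matrix effects are also compensated), so the sketch below is an assumption-laden illustration of the lift term only, not the authors' controller.

def ige_thrust_ratio(z: float, rotor_radius: float) -> float:
    """Cheeseman-Bennett in-ground-effect thrust ratio T_IGE / T_OGE.

    The model diverges as z approaches R/4, so the height is clamped below.
    """
    z = max(z, 0.3 * rotor_radius)
    r_over_4z = rotor_radius / (4.0 * z)
    return 1.0 / (1.0 - r_over_4z ** 2)

def feedforward_thrust(t_cmd: float, z: float, rotor_radius: float) -> float:
    """Scale the commanded thrust down to cancel the extra lift near ground."""
    return t_cmd / ige_thrust_ratio(z, rotor_radius)

# 10 N commanded at 0.1 m altitude with 0.12 m rotors -> about 9.1 N sent.
print(feedforward_thrust(10.0, z=0.1, rotor_radius=0.12))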
Authors:Yin Zhang, Zian Ning, Xiaoyu Zhang, Shiliang Guo, Peidong Liu, Shiyu Zhao
Abstract:
Existing micro aerial vehicle (MAV) detection methods mainly rely on the target's appearance features in RGB images, whose diversity makes it difficult to achieve generalized MAV detection. We notice that different types of MAVs share the same distinctive features in event streams due to their high-speed rotating propellers, which are hard to see in RGB images. This paper studies how to detect different types of MAVs from an event camera by fully exploiting the features of propellers in the original event stream. The proposed method consists of three modules to extract the salient and spatio-temporal features of the propellers while filtering out noise from background objects and camera motion. Since there are no existing event-based MAV datasets, we introduce a novel MAV dataset for the community. This is the first event-based MAV dataset comprising multiple scenarios and different types of MAVs. Without training, our method significantly outperforms state-of-the-art methods and can deal with challenging scenarios, achieving a precision rate of 83.0% (+30.3%) and a recall rate of 81.5% (+36.4%) on the proposed testing dataset. The dataset and code are available at: https://github.com/WindyLab/EvDetMAV.
Chinese: 本研究提出了一种基于事件的新型微飞行器检测方法,通过利用事件流中独特的螺旋桨特征,在无需训练的情况下实现了卓越性能,并提供了首个基于事件的微飞行器数据集。
English: This study introduces a novel event-based method for detecting micro aerial vehicles by leveraging the distinctive propeller features in event streams, achieving superior performance without training and providing the first event-based MAV dataset.
Authors:Shengkui Zhao, Zexu Pan, Bin Ma
Abstract:
This paper introduces ClearerVoice-Studio, an open-source, AI-powered speech processing toolkit designed to bridge cutting-edge research and practical application. Unlike broad platforms like SpeechBrain and ESPnet, ClearerVoice-Studio focuses on the interconnected speech tasks of speech enhancement, separation, super-resolution, and multimodal target speaker extraction. A key advantage is its state-of-the-art pretrained models, including FRCRN with 3 million uses and MossFormer with 2.5 million uses, optimized for real-world scenarios. It also offers model optimization tools, multi-format audio support, the SpeechScore evaluation toolkit, and user-friendly interfaces, catering to researchers, developers, and end-users. Its rapid adoption, attracting 3,000 GitHub stars and 239 forks, highlights its academic and industrial impact. This paper details ClearerVoice-Studio's capabilities, architectures, training strategies, benchmarks, community impact, and future plans. Source code is available at https://github.com/modelscope/ClearerVoice-Studio.
中文: ClearerVoice-Studio 是一款专注于语音增强、分离等互联任务的开源AI工具包,具备先进的预训练模型和实用工具,已在GitHub上获得3000星标,展现了广泛的学术和工业影响力。
English: ClearerVoice-Studio is an open-source AI toolkit specializing in speech enhancement, separation, and related tasks, featuring state-of-the-art pretrained models and tools that have gained significant traction with 3000 GitHub stars.
Authors:Declan J. Curran, Sanaa Hobeichi, Hira Saleem, Hao Xue, Flora D. Salim
Abstract:
Downscaling is essential for generating the high-resolution climate data needed for local planning, but traditional methods remain computationally demanding. Recent years have seen impressive results from AI downscaling models, particularly diffusion models, which have attracted attention due to their ability to generate ensembles and overcome the smoothing problem common in other AI methods. However, these models typically remain computationally intensive. We introduce a Hierarchical Diffusion Downscaling (HDD) model, which adds an easily-extensible hierarchical sampling process to the diffusion framework. A coarse-to-fine hierarchy is imposed via a simple downsampling scheme. HDD achieves competitive accuracy on ERA5 reanalysis datasets and CMIP6 models, significantly reducing computational load by running on up to half as many pixels. Additionally, a single model trained at 0.25° resolution transfers seamlessly across multiple CMIP6 models with much coarser resolution. HDD thus offers a lightweight alternative for probabilistic climate downscaling, facilitating affordable large-ensemble high-resolution climate projections. See a full code implementation at: https://github.com/HDD-Hierarchical-Diffusion-Downscaling/HDD-Hierarchical-Diffusion-Downscaling.
Chinese: 分层扩散降尺度(HDD)模型通过引入可扩展的分层采样过程,在保持竞争力的气候数据降尺度精度的同时显著降低计算负荷,为高分辨率气候预测提供了一种轻量级解决方案。
English: The Hierarchical Diffusion Downscaling (HDD) model introduces a computationally efficient hierarchical sampling process that achieves competitive accuracy in climate data downscaling while significantly reducing computational load, offering a lightweight solution for high-resolution climate projections.
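The hierarchical sampling process reduces to a simple control flow: start from noise at the coarsest resolution, run some reverse-diffusion steps, upsample, re-inject noise, and repeat at the finer level, so most denoising steps touch far fewer pixels. The skeleton below uses a placeholder denoiser and an arbitrary noise re-injection weight; it shows only the loop structure implied by the abstract, not HDD's actual schedule.

import torch
import torch.nn.functional as F

def hierarchical_sample(denoise, shape, levels=3, steps_per_level=10):
    """Coarse-to-fine sampling sketch over a simple downsampling pyramid."""
    h, w = shape
    x = torch.randn(1, 1, h // 2 ** (levels - 1), w // 2 ** (levels - 1))
    for level in range(levels):
        for t in reversed(range(steps_per_level)):
            x = denoise(x, t, level)               # placeholder reverse step
        if level < levels - 1:
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)
            x = x + 0.1 * torch.randn_like(x)      # re-inject noise at finer level
    return x

# Identity "denoiser" keeps the skeleton runnable end to end.
print(hierarchical_sample(lambda x, t, lvl: x, (64, 64)).shape)  # (1, 1, 64, 64)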
Authors:Solveig Thrun, Stine Hansen, Zijun Sun, Nele Blum, Suaiba A. Salahuddin, Kristoffer Wickstrøm, Elisabeth Wetzer, Robert Jenssen, Maik Stille, Michael Kampffmeyer
Abstract:
Regular mammography screening is essential for early breast cancer detection. Deep learning-based risk prediction methods have sparked interest to adjust screening intervals for high-risk groups. While early methods focused only on current mammograms, recent approaches leverage the temporal aspect of screenings to track breast tissue changes over time, requiring spatial alignment across different time points. Two main strategies for this have emerged: explicit feature alignment through deformable registration and implicit learned alignment using techniques like transformers, with the former providing more control. However, the optimal approach for explicit alignment in mammography remains underexplored. In this study, we provide insights into where explicit alignment should occur (input space vs. representation space) and if alignment and risk prediction should be jointly optimized. We demonstrate that jointly learning explicit alignment in representation space while optimizing risk estimation performance, as done in the current state-of-the-art approach, results in a trade-off between alignment quality and predictive performance and show that image-level alignment is superior to representation-level alignment, leading to better deformation field quality and enhanced risk prediction accuracy. The code is available at https://github.com/sot176/Longitudinal_Mammogram_Alignment.git.
中文: 本研究探索纵向乳腺摄影的显式对齐策略,发现图像级对齐优于表征级对齐,既能提升形变场质量又能增强风险预测精度,避免了联合优化带来的性能权衡。
English: This study investigates explicit alignment strategies in longitudinal mammogram analysis, finding that image-level alignment outperforms representation-level alignment by improving deformation field quality and risk prediction accuracy without the trade-offs of joint optimization.
Authors:Jisu Shin, Juhyun Oh, Eunsu Kim, Hoyun Song, Alice Oh
Abstract:
Ensuring persona fidelity in large language models (LLMs) is essential for maintaining coherent and engaging human-AI interactions. However, LLMs often exhibit Out-of-Character (OOC) behavior, where generated responses deviate from an assigned persona, leading to inconsistencies that affect model reliability. Existing evaluation methods typically assign single scores to entire responses, struggling to capture subtle persona misalignment, particularly in long-form text generation. To address this limitation, we propose an atomic-level evaluation framework that quantifies persona fidelity at a finer granularity. Our three key metrics measure the degree of persona alignment and consistency within and across generations. Our approach enables a more precise and realistic assessment of persona fidelity by identifying subtle deviations that real users would encounter. Through our experiments, we demonstrate that our framework effectively detects persona inconsistencies that prior methods overlook. By analyzing persona fidelity across diverse tasks and personality types, we reveal how task structure and persona desirability influence model adaptability, highlighting challenges in maintaining consistent persona expression.
中文: 提出的原子级评估框架以更精细的粒度衡量大语言模型的人物忠实度,有效检测出现有方法忽略的细微不一致性,并揭示任务结构和角色期望如何影响模型的适应性。
English: The proposed atomic-level evaluation framework measures persona fidelity in LLMs with finer granularity, effectively detecting subtle inconsistencies overlooked by existing methods and revealing how task structure and persona desirability impact model adaptability.
Authors:Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Weigang Lu
Abstract:
Masked Graph Auto-Encoder, a powerful graph self-supervised training paradigm, has recently shown superior performance in graph representation learning. Existing works typically rely on node contextual information to recover the masked information. However, they fail to generalize well to heterophilic graphs where connected nodes may not be similar, because they focus only on capturing the neighborhood information and ignore the discrepancy information between different nodes, resulting in indistinguishable node representations. In this paper, to address this issue, we propose a Discrepancy-Aware Graph Mask Auto-Encoder (DGMAE). It obtains more distinguishable node representations by reconstructing the discrepancy information of neighboring nodes during the masking process. We conduct extensive experiments on 17 widely-used benchmark datasets. The results show that our DGMAE can effectively preserve the discrepancies of nodes in low-dimensional space. Moreover, DGMAE significantly outperforms state-of-the-art graph self-supervised learning methods on three graph analytic tasks, including node classification, node clustering, and graph classification, demonstrating its remarkable superiority. The code of DGMAE is available at https://github.com/zhengziyu77/DGMAE.
中文摘要:本文提出的差异感知图掩码自编码器(DGMAE)通过在掩码过程中重构节点差异信息,解决了现有方法在异配图上的局限性,在多项图分析任务中展现出卓越性能。
English Summary: The proposed Discrepancy-Aware Graph Mask Auto-Encoder (DGMAE) addresses limitations in existing graph auto-encoders by reconstructing node discrepancy information during masking, achieving superior performance on heterophilic graphs across multiple graph analytic tasks.
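One plausible instantiation of the "discrepancy information" being reconstructed is the difference between each node's features and the mean of its neighbors' features, a quantity that is large precisely on heterophilic edges. The sketch below computes such per-node discrepancy targets with plain PyTorch scatter ops; DGMAE's actual reconstruction target may differ, so treat this as illustrative.

import torch

def discrepancy_targets(x, edge_index):
    """Per-node discrepancy: features minus the mean of neighbor features."""
    src, dst = edge_index
    n = x.size(0)
    neigh_sum = torch.zeros_like(x).index_add_(0, dst, x[src])
    deg = torch.zeros(n).index_add_(0, dst, torch.ones(src.size(0)))
    return x - neigh_sum / deg.clamp(min=1).unsqueeze(-1)

x = torch.randn(5, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])  # directed edges
print(discrepancy_targets(x, edge_index).shape)  # torch.Size([5, 8])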
Authors:Mingcheng Qu, Guang Yang, Donglin Di, Yue Gao, Tonghua Su, Yang Song, Lei Fan
Abstract:
Multimodal pathology-genomic analysis is critical for cancer survival prediction. However, existing approaches predominantly integrate formalin-fixed paraffin-embedded (FFPE) slides with genomic data, while neglecting the availability of other preservation slides, such as Fresh Frozen (FF) slides. Moreover, as the high-resolution spatial nature of pathology data tends to dominate the cross-modality fusion process, it hinders effective multimodal fusion and leads to modality imbalance challenges between pathology and genomics. These methods also typically require complete data modalities, limiting their clinical applicability with incomplete modalities, such as missing either pathology or genomic data. In this paper, we propose a multimodal survival prediction framework that leverages hypergraph learning to effectively integrate multi-WSI information and cross-modality interactions between pathology slides and genomics data while addressing modality imbalance. In addition, we introduce a memory mechanism that stores previously learned paired pathology-genomic features and dynamically compensates for incomplete modalities. Experiments on five TCGA datasets demonstrate that our model outperforms advanced methods by over 2.3% in C-Index. Under incomplete modality scenarios, our approach surpasses pathology-only (3.3%) and gene-only models (7.9%). Code: https://github.com/MCPathology/M2Surv
中文: 本研究提出了一种多模态生存预测框架,通过超图学习有效整合病理切片与基因组数据,利用记忆机制解决模态不平衡和缺失问题,在五种TCGA数据集上验证了其优越性能。
English: This study introduces a multimodal survival prediction framework using hypergraph learning to effectively integrate pathology slides and genomic data while addressing modality imbalance and compensating for incomplete data through a memory mechanism, achieving superior performance over existing methods.
Authors:Yuang Yao, Ruiqi Wu, Yi Zhou, Tao Zhou
Abstract:
Traditional fundus image analysis models focus on single-modal tasks, ignoring fundus modality complementarity, which limits their versatility. Recently, retinal foundation models have emerged, but most still remain modality-specific. Integrating multiple fundus imaging modalities into a single foundation model is valuable. However, in dynamic environments, data from different modalities often arrive incrementally, necessitating continual pre-training. To address this, we propose RetCoP, the first continual vision-language pre-training framework in the fundus domain, which incrementally integrates image and text features from different imaging modalities into a single unified foundation model. To mitigate catastrophic forgetting in continual pre-training, we introduce a rehearsal strategy utilizing representative image-text pairs and an off-diagonal information distillation approach. The former allows the model to revisit knowledge from previous stages, while the latter explicitly preserves the alignment between image and text representations. Experiments show that RetCoP outperforms all the compared methods, achieving the best generalization and lowest forgetting rate. The code can be found at https://github.com/Yuang-Yao/RetCoP.
中文: RetCoP是首个眼底领域的持续视觉语言预训练框架,通过代表性图文对复现和离对角线信息蒸馏策略,逐步整合多模态图像与文本数据,在实现最佳泛化性能的同时显著降低了遗忘率。
English: RetCoP is a pioneering continual vision-language pre-training framework for fundus analysis that incrementally integrates multimodal image and text data while mitigating catastrophic forgetting through rehearsal strategies and off-diagonal distillation, achieving superior generalization with minimal forgetting.
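The off-diagonal information distillation component admits a compact sketch: for a batch of paired image-text embeddings, push the new model's distribution over non-matching (off-diagonal) similarities toward the previous-stage model's, preserving the cross-modal alignment structure while the matching diagonal stays free to adapt. The temperature and the KL form below are assumptions, not the paper's exact loss.

import torch
import torch.nn.functional as F

def off_diagonal_distill(sim_new, sim_old, tau=0.5):
    """KL between old and new off-diagonal image-text similarity structure."""
    n = sim_new.size(0)
    mask = ~torch.eye(n, dtype=torch.bool)
    p_old = F.softmax(sim_old[mask].view(n, n - 1) / tau, dim=-1)
    log_p_new = F.log_softmax(sim_new[mask].view(n, n - 1) / tau, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean")

img = F.normalize(torch.randn(8, 64), dim=-1)   # current image embeddings
txt = F.normalize(torch.randn(8, 64), dim=-1)   # current text embeddings
sim_new = img @ txt.T
sim_old = sim_new.detach() + 0.05 * torch.randn_like(sim_new)  # stand-in
print(off_diagonal_distill(sim_new, sim_old))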
Authors:Xiangbo Gao, Yuheng Wu, Fengze Yang, Xuewen Luo, Keshu Wu, Xinghao Chen, Yuping Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu
Abstract:
While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird's-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at https://github.com/taco-group/AirV2X-Perception.
中文摘要:AirV2X-Perception提出了一种创新的无人机感知数据集,通过提供灵活的空中感知能力,解决了传统V2X系统在自动驾驶应用中的局限性。
English Summary: AirV2X-Perception introduces a novel drone-based dataset that overcomes the limitations of traditional V2X systems by providing flexible aerial perception capabilities for autonomous driving applications.
Authors:Rui Huang, Jincheng Zeng, Sen Gao, Yan Xing
Abstract:
Existing Mamba-based approaches in remote sensing change detection have enhanced scanning models, yet remain limited by their inability to capture long-range dependencies between image channels effectively, which restricts their feature representation capabilities. To address this limitation, we propose a 3D selective scan module (3D-SSM) that captures global information from both the spatial plane and channel perspectives, enabling a more comprehensive understanding of the data. Based on the 3D-SSM, we present two key components: a spatiotemporal interaction module (SIM) and a multi-branch feature extraction module (MBFEM). The SIM facilitates bi-temporal feature integration by enabling interactions between global and local features across images from different time points, thereby enhancing the detection of subtle changes. Meanwhile, the MBFEM combines features from the frequency domain, spatial domain, and 3D-SSM to provide a rich representation of contextual information within the image. Extensive experiments on five benchmark datasets show that our proposed method performs favourably against state-of-the-art change detection methods. Code is available at https://github.com/VerdantMist/3D-SSM
中文: 现有基于Mamba的遥感变化检测方法因无法有效捕捉图像通道间的长程依赖而受限,为此提出3D选择性扫描模块(3D-SSM),通过时空交互模块和多分支特征提取模块整合空间与通道的全局信息,在多个基准数据集上实现了优越的性能。
English: Current Mamba-based remote sensing change detection methods are constrained by inadequate long-range dependency capture across image channels, prompting the development of a 3D selective scan module (3D-SSM) that integrates spatial and channel perspectives to enhance feature representation through spatiotemporal interaction and multi-branch feature extraction, achieving superior performance on benchmark datasets.
Authors:Sunggu Kyung, Hyungbin Park, Jinyoung Seo, Jimin Sung, Jihyun Kim, Dongyeong Kim, Wooyoung Jo, Yoojin Nam, Sangah Park, Taehee Kwon, Sang Min Lee, Namkug Kim
Abstract:
Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs' ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo) - and is organized into three task levels: classification, detection, and correction. Using this benchmark, we quantitatively assess the performance of state-of-the-art 3D medical MLLMs, revealing substantial variation in their capabilities across different error types. Our benchmark contributes to the development of more reliable and clinically applicable MLLMs, ultimately helping reduce diagnostic errors and improve accuracy in clinical practice. The code and datasets are available at https://github.com/babbu3682/MedErr-CT.
中文:MedErr-CT基准通过视觉问答评估医学多模态大模型在CT报告中识别和纠正错误的能力,揭示了不同错误类型间的显著性能差异,旨在提升临床应用的可靠性。
English: The MedErr-CT benchmark evaluates medical Multimodal Large Language Models' ability to identify and correct errors in CT reports through visual question answering, revealing significant performance variations across error types to enhance clinical reliability.
Authors:Barry Wang, Avi Schwarzschild, Alexander Robey, Ali Payani, Charles Fleming, Mingjie Sun, Daphne Ippolito
Abstract:
Retrofitting large language models (LLMs) with new behaviors typically requires full finetuning or distillation, costly steps that must be repeated for every architecture. In this work, we introduce Command-V, a backpropagation-free behavior transfer method that copies an existing residual activation adapter from a donor model and pastes its effect into a recipient model. Command-V profiles layer activations on a small prompt set, derives linear converters between corresponding layers, and applies the donor intervention in the recipient's activation space. This process does not require access to the original training data and needs minimal compute. In three case studies (safety-refusal enhancement, jailbreak facilitation, and automatic chain-of-thought reasoning), Command-V matches or exceeds the performance of direct finetuning while using orders of magnitude less compute. Our code and data are accessible at https://github.com/GithuBarry/Command-V/.
中文: Command-V是一种无需反向传播的高效方法,通过复制残差激活适配器并应用线性转换器在大型语言模型间迁移行为,以极低计算成本达到或超越微调效果。
English: Command-V is a computationally efficient, backpropagation-free method that transfers behaviors between large language models by copying residual activation adapters and applying linear converters, matching or surpassing fine-tuning performance with minimal resources.
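The converter-fitting step lends itself to a small numpy sketch: profile hidden states from both models on a shared prompt set, fit least-squares linear maps between corresponding layers, then route the recipient's activations through the donor's adapter via those maps. The shapes, the stand-in adapter, and the in/out converter pair below are illustrative assumptions; the repository documents the actual procedure.

import numpy as np

def fit_converter(acts_from, acts_to):
    """Least-squares linear map W with acts_from @ W ~= acts_to.

    Rows are profiled tokens, columns are hidden dimensions.
    """
    W, *_ = np.linalg.lstsq(acts_from, acts_to, rcond=None)
    return W

# Toy shapes: 256 profiled tokens; hidden sizes 512 (recipient), 768 (donor).
rng = np.random.default_rng(0)
H_r = rng.normal(size=(256, 512))   # recipient activations on the prompt set
H_d = rng.normal(size=(256, 768))   # donor activations on the same prompts
W_in = fit_converter(H_r, H_d)      # recipient space -> donor space
W_out = fit_converter(H_d, H_r)     # donor space -> recipient space

def apply_donor_adapter(h_r, donor_adapter):
    """Paste the donor's residual intervention into the recipient stream."""
    return h_r + donor_adapter(h_r @ W_in) @ W_out

out = apply_donor_adapter(H_r, lambda h: 0.01 * h)  # stand-in adapter
print(out.shape)  # (256, 512)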
Authors:Ramaravind K. Mothilal, Joanna Roy, Syed Ishtiaque Ahmed, Shion Guha
Abstract:
The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity, based on the explanations they give to justify a stance, to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs' free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate the HAF of LLMs' toxicity explanations with no human involvement, and highlight how "non-ideal" the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and nonsensical responses. We open-source our code and LLM-generated explanations at https://github.com/uofthcdslab/HAF.
Chinese Summary: 本研究提出人类对齐忠实度(HAF)这一新标准,用于评估大语言模型毒性解释与人类推理的对齐程度,发现尽管模型能生成合理的基础解释,但在处理原因与毒性立场间的微妙关系时,其推理能力会出现崩溃。
English Summary: This research introduces Human-Aligned Faithfulness (HAF), a novel criterion to evaluate how well LLMs' toxicity explanations align with human reasoning, revealing that while models produce plausible basic explanations, their reasoning collapses when addressing nuanced relationships between reasons and toxicity stances.
Authors:Ilia Beletskii, Andrey Kuznetsov, Aibek Alanov
Abstract:
Recent advances in image editing with diffusion models have achieved impressive results, offering fine-grained control over the generation process. However, these methods are computationally intensive because of their iterative nature. While distilled diffusion models enable faster inference, their editing capabilities remain limited, primarily because of poor inversion quality. High-fidelity inversion and reconstruction are essential for precise image editing, as they preserve the structural and semantic integrity of the source image. In this work, we propose a novel framework that enhances image inversion using consistency models, enabling high-quality editing in just four steps. Our method introduces a cycle-consistency optimization strategy that significantly improves reconstruction accuracy and enables a controllable trade-off between editability and content preservation. We achieve state-of-the-art performance across various image editing tasks and datasets, demonstrating that our method matches or surpasses full-step diffusion models while being substantially more efficient. The code of our method is available on GitHub at https://github.com/ControlGenAI/Inverse-and-Edit.
中文摘要:本文提出了一种利用一致性模型增强图像反演的新框架,仅需四步即可实现高质量图像编辑,并在多种任务和数据集上达到最优性能。
English Summary: This paper introduces a novel framework that enhances image inversion using consistency models, enabling high-quality image editing in just four steps while achieving state-of-the-art performance across various tasks and datasets.
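The cycle-consistency optimization reduces to a very small objective: push an image through the inversion model to noise, regenerate it with the few-step consistency model, and penalize the reconstruction error; its weight against the editing objective sets the editability/preservation trade-off mentioned above. The toy invert/generate pair below is a placeholder so the sketch runs; the real models are the paper's consistency networks.

import torch

def cycle_consistency_loss(x, invert, generate):
    """Invert an image to latent noise, regenerate it, and compare."""
    z = invert(x)        # image -> latent noise (few-step inversion)
    x_hat = generate(z)  # latent noise -> image (few-step generation)
    return torch.mean((x - x_hat) ** 2)

x = torch.rand(1, 3, 32, 32)
loss = cycle_consistency_loss(x, invert=lambda t: t * 2 - 1,
                              generate=lambda z: (z + 1) / 2)
print(loss.item())  # ~0 for this toy pair; nonzero for real models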
Authors:Georgii Bychkov, Khaled Abud, Egor Kovalev, Alexander Gushchin, Dmitriy Vatolin, Anastasia Antsiferova
Abstract:
Adversarial robustness of neural networks is an increasingly important area of research, combining studies on computer vision models, large language models (LLMs), and others. With the release of JPEG AI, the first standard for end-to-end neural image compression (NIC) methods, the question of evaluating NIC robustness has become critically significant. However, previous research has been limited to a narrow range of codecs and attacks. To address this, we present NIC-RobustBench, the first open-source framework to evaluate NIC robustness and adversarial defenses' efficiency, in addition to comparing Rate-Distortion (RD) performance. The framework includes the largest number of codecs among all known NIC libraries and is easily scalable. The paper provides a comprehensive overview of the NIC-RobustBench framework and employs it to analyze NIC robustness. Our code is available online at https://github.com/msu-video-group/NIC-RobustBench.
中文: NIC-RobustBench是首个开源框架,用于全面评估神经图像压缩方法的鲁棒性和对抗防御效率,通过整合多种编解码器和可扩展性,弥补了以往研究的不足。
English: NIC-RobustBench is the first open-source framework designed to comprehensively evaluate the robustness and adversarial defense efficiency of neural image compression methods, addressing limitations in prior research by incorporating a wide range of codecs and scalability.
Authors:Sahil Kale, Vijaykant Nadadur
Abstract:
When artificial intelligence mistakes memorization for intelligence, it creates a dangerous mirage of reasoning. Existing studies treat memorization and self-knowledge deficits in LLMs as separate issues and do not recognize an intertwining link that degrades the trustworthiness of LLM responses. In our study, we utilize a novel framework to ascertain whether LLMs genuinely learn reasoning patterns from training data or merely memorize them to assume competence across problems of similar complexity, with a focus on STEM domains. Our analysis shows a noteworthy problem in generalization: LLMs draw confidence from memorized solutions to infer a higher self-knowledge about their reasoning ability, which manifests as an over 45% inconsistency in feasibility assessments when faced with self-validated, logically coherent task perturbations. This effect is most pronounced in science and medicine domains, which tend to have maximal standardized jargon and problems, further confirming our approach. Significant wavering within the self-knowledge of LLMs also shows flaws in current architectures and training patterns, highlighting the need for techniques that ensure a balanced, consistent stance on models' perceptions of their own knowledge for maximum AI explainability and trustworthiness. Our code and results are available publicly at https://github.com/knowledge-verse-ai/LLM-Memorization_SK_Eval-.
中文摘要:人工智能将记忆误认为推理,导致其自我认知不可靠,在面对逻辑一致的任务变化时可行性评估出现超45%的不一致性,尤其在STEM领域最为显著。
English Summary: AI's confusion between memorization and genuine reasoning leads to unreliable self-assessment, with over 45% inconsistency in handling modified tasks, especially in STEM fields.
Authors:Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
Abstract:
Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.
中文摘要:具备推理能力的大语言模型正在开创“智能深度研究”新范式,通过自主推理与迭代检索的深度融合,显著超越了传统搜索方法,有望成为未来信息获取的主导模式。
English Summary: Large Language Models with reasoning capabilities are pioneering Agentic Deep Research, a new paradigm that integrates autonomous reasoning and iterative retrieval to significantly outperform traditional search methods and redefine future information seeking.
Authors:Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
Abstract:
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
中文: Chain-of-Experts (CoE) 提出在层内实现专家顺序通信的新架构,通过动态路由和迭代处理增强模型表达能力,在数学推理任务上将验证损失从1.20降至1.12,相比传统混合专家模型内存使用减少17.6-42%。
English: Chain-of-Experts (CoE) introduces sequential expert communication within layers, enabling dynamic routing and iterative processing that enhances model capacity and reduces validation loss from 1.20 to 1.12 on math tasks while cutting memory usage by 17.6-42% compared to traditional MoE architectures.
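A minimal PyTorch sketch of the layer structure just described: one shared pool of experts, a dedicated router per iteration, top-k selection on every pass, and a residual update so later iterations see earlier experts' outputs. Hidden sizes, the softmax gating, and the dense selection loop are illustrative simplifications of the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainOfExpertsLayer(nn.Module):
    """Sequential expert communication with per-iteration re-routing."""

    def __init__(self, d, n_experts=8, n_iters=2, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))
        # A dedicated router per iteration enables dynamic re-selection.
        self.routers = nn.ModuleList(
            nn.Linear(d, n_experts) for _ in range(n_iters))
        self.k = k

    def forward(self, x):                        # x: (tokens, d)
        for router in self.routers:
            gates = F.softmax(router(x), dim=-1)
            topk, idx = gates.topk(self.k, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    sel = idx[:, slot] == e
                    if sel.any():
                        out[sel] += topk[sel, slot, None] * expert(x[sel])
            x = x + out                          # iterative residual structure
        return x

layer = ChainOfExpertsLayer(d=64)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])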
Authors:Zhenke Liu, Jien Li, Ziqi Zhang
Abstract:
Extrachromosomal circular DNA (eccDNA) plays key regulatory roles and contributes to oncogene overexpression in cancer through high-copy amplification and long-range interactions. Despite advances in modeling, no pre-trained models currently support full-length circular eccDNA for downstream analysis. Existing genomic models are either limited to single-nucleotide resolution or hindered by the inefficiency of the quadratic attention mechanism. Here, we introduce eccDNAMamba, the first bidirectional state-space encoder tailored for circular DNA sequences. It combines forward and reverse passes for full-context representation learning with linear-time complexity, and preserves circular structure through a novel augmentation strategy. Tested on two real-world datasets, eccDNAMamba achieves strong classification performance and scales to sequences up to 200 Kbp, offering a robust and efficient framework for modeling circular genomes. Our codes are available at https://github.com/zzq1zh/GenAI-Lab.
中文摘要:本研究提出了eccDNAMamba,一种专为环形DNA序列设计的双向状态空间编码器,通过结合正反向传递和新型增强策略,在线性时间复杂度下实现了对全长环形DNA的高效建模,并在真实数据集上展现出优异的分类性能和可扩展性。
English Summary: The study introduces eccDNAMamba, a novel bidirectional state-space encoder designed for efficient full-length circular DNA modeling, achieving strong classification performance and scalability up to 200 Kbp with linear-time complexity.
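The circular structure suggests a natural augmentation worth sketching: a circle has no canonical start point, so every rotation of its linearized sequence represents the same molecule. A tiny sketch follows, with the caveat that the paper describes its augmentation strategy only as novel and structure-preserving, so the rotation view here is an assumption.

import random

def circular_rotations(seq: str, n_views: int = 4):
    """Augment a circular DNA sequence by re-choosing its start point."""
    starts = random.sample(range(len(seq)), k=min(n_views, len(seq)))
    return [seq[s:] + seq[:s] for s in starts]

ecc = "ATGCGTACCGGA"
for view in circular_rotations(ecc):
    print(view)  # every rotation encodes the same circular molecule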
Authors:Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn
Abstract:
Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. However, fine-tuning can compromise safety alignment, even with benign data, increasing susceptibility to harmful outputs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. To address this issue, we propose Safe Pruning LoRA (SPLoRA), a novel pruning-based approach that selectively removes LoRA layers that weaken safety alignment, improving safety while preserving performance. At its core, we introduce Empirical-DIEM (E-DIEM), a dimension-insensitive similarity metric that effectively detects safety misalignment in LoRA-adapted models. We conduct extensive experiments on LLMs fine-tuned with a mix of benign and malicious data, as well as on purely benign datasets, evaluating SPLoRA across utility, safety, and reliability metrics. Results demonstrate that SPLoRA outperforms state-of-the-art safety alignment techniques, significantly reducing safety risks while maintaining or improving model performance and reliability. Additionally, SPLoRA reduces inference overhead, making it a scalable and efficient solution for deploying safer and more reliable LLMs. The code is available at https://github.com/AoShuang92/SPLoRA.
中文摘要:SPLoRA是一种基于剪枝的新方法,通过E-DIEM度量选择性移除不安全的LoRA层,在提升微调后大语言模型安全性的同时保持性能优势,并降低推理开销。
English Summary: SPLoRA is a novel pruning-based method that enhances the safety of fine-tuned LLMs by selectively removing unsafe LoRA layers using the E-DIEM metric, achieving superior safety-utility balance with reduced inference overhead.
Authors:Lingyu Yang
Abstract:
Strategic randomization is a key principle in game theory, yet it remains underexplored in large language models (LLMs). Prior work often conflates the cognitive decision to randomize with the mechanical generation of randomness, leading to incomplete evaluations. To address this, we propose a novel zero-sum game inspired by the Tian Ji Horse Race, where the Nash equilibrium corresponds to a maximal entropy strategy. The game's complexity masks this property from untrained humans and underdeveloped LLMs. We evaluate five LLMs across prompt styles -- framed, neutral, and hinted -- using competitive multi-tournament gameplay with system-provided random choices, isolating the decision to randomize. Results show that weaker models remain deterministic regardless of prompts, while stronger models exhibit increased randomization under explicit hints. When facing weaker models, strong LLMs adopt deterministic strategies to exploit biases, but converge toward equilibrium play when facing peers. Through win/loss outcomes and Bayes factor analysis, we demonstrate meaningful variation in LLMs' strategic reasoning capabilities, highlighting opportunities for improvement in abstract reasoning and adaptive learning. We make our implementation publicly available at https://github.com/ocelopus/llm-when-to-throw-coin to ensure full reproducibility.
中文: 本研究基于田忌赛马设计了一个零和博弈来评估大语言模型的策略随机化能力,发现强模型能自适应逼近纳什均衡策略而弱模型保持确定性,结果揭示了抽象推理能力的差异。
English: This study introduces a zero-sum game based on the Tian Ji Horse Race to evaluate strategic randomization in LLMs, revealing that stronger models adaptively approach Nash equilibrium strategies while weaker ones remain deterministic, with findings highlighting gaps in abstract reasoning.
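For intuition, the classic Tian Ji payoff structure can be enumerated directly; the sketch below builds the 6x6 payoff matrix over horse orderings using illustrative strengths and a majority-of-races, zero-sum payoff. The paper's game variant is deliberately more complex, so treat this purely as a toy.

```python
from itertools import permutations

KING, TIANJI = (10, 6, 2), (9, 5, 1)    # illustrative horse strengths

def payoff(tj_order, king_order):
    wins = sum(t > k for t, k in zip(tj_order, king_order))
    return 1 if wins >= 2 else -1       # zero-sum: win the majority of races

orders = list(permutations(range(3)))
matrix = [[payoff([TIANJI[i] for i in a], [KING[j] for j in b])
           for b in orders] for a in orders]
# Expected payoff of each Tian Ji ordering against a uniformly random king:
for a, row in zip(orders, matrix):
    print(a, sum(row) / len(row))
```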
Authors:Di Zhang, Ligang Liu
Abstract:
We present an asymptotic analysis of shell lattice metamaterials based on Ciarlet's shell theory, introducing a new metric, asymptotic directional stiffness (ADS), to quantify how the geometry of the middle surface governs the effective stiffness. We prove a convergence theorem that rigorously characterizes ADS and establishes its upper bound, along with a necessary and sufficient condition for achieving it. As a key result, our theory provides the first rigorous explanation for the high bulk modulus observed in Triply Periodic Minimal Surfaces (TPMS)-based shell lattices. To optimize ADS on general periodic surfaces, we propose a triangular-mesh-based discretization and shape optimization framework. Numerical experiments validate the theoretical findings and demonstrate the effectiveness of the optimization under various design objectives. Our implementation is available at https://github.com/lavenklau/minisurf.
中文摘要:本研究引入渐近方向刚度(ADS)分析壳格点超材料,证明了严格解释TPMS结构高体积模量的收敛定理,并开发了经数值实验验证的形状优化框架。
English Summary: This study introduces asymptotic directional stiffness (ADS) to analyze shell lattice metamaterials, proving a convergence theorem that rigorously explains the high bulk modulus in TPMS-based structures and developing an optimization framework validated through numerical experiments.
Authors:Yang Liu, Chuanchen Luo, Zimo Tang, Yingyan Li, Yuran Yang, Yuanyong Ning, Lue Fan, Zhaoxiang Zhang, Junran Peng
Abstract:
Illumination and texture editing are critical dimensions for world-to-world transfer, which is valuable for applications such as sim2real and real2real scaling up of visual data for embodied AI. Existing techniques generatively re-render the input video to realize the transfer, such as video relighting models and conditioned world generation models. Nevertheless, these models are predominantly limited to the domain of training data (e.g., portrait) or run into bottlenecks of temporal consistency and computational efficiency, especially when the input video involves complex dynamics and long durations. In this paper, we propose TC-Light, a novel generative renderer to overcome these problems. Starting from the video preliminarily relighted by an inflated video relighting model, it optimizes appearance embedding in the first stage to align global illumination. Then it optimizes the proposed canonical video representation, i.e., Unique Video Tensor (UVT), to align fine-grained texture and lighting in the second stage. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method enables physically plausible re-rendering results with superior temporal coherence and low computation cost. The code and video demos are available at https://dekuliutesla.github.io/tclight/.
中文摘要:TC-Light是一种新型生成式渲染器,通过两阶段优化全局光照与细粒度纹理,实现了物理可信的视频重渲染效果,具有卓越的时间一致性和低计算成本优势。
English Summary: TC-Light is a novel generative renderer that achieves physically plausible video re-rendering with superior temporal coherence and low computation cost through two-stage optimization of global illumination and fine-grained texture.
Authors:Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang
Abstract:
Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Project: https://github.com/Gen-Verse/ReasonFlux
中文: ReasonFlux-PRM是一种新型轨迹感知过程奖励模型,专门用于评估推理轨迹和响应,在多个基准测试中展现出优于传统模型的数据选择能力和性能提升。
English: ReasonFlux-PRM is a novel trajectory-aware process reward model designed to evaluate intermediate reasoning steps and responses, demonstrating superior data selection and performance gains in fine-tuning, reinforcement learning, and test-time scaling across multiple benchmarks.
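Among the three uses, reward-guided Best-of-N test-time scaling is the simplest to sketch: sample N responses and keep the one the PRM scores highest. The `sample` and `prm_score` callables below are hypothetical stand-ins for a generator and for ReasonFlux-PRM's combined step/trajectory score.

```python
from typing import Callable
import random

def best_of_n(prompt: str,
              sample: Callable[[str], str],
              prm_score: Callable[[str, str], float],
              n: int = 8) -> str:
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: prm_score(prompt, resp))

# Toy stand-ins (hypothetical): a random generator and a length-based scorer.
responses = ["short answer", "a longer, more detailed trajectory and answer"]
pick = best_of_n("Solve: 2+2",
                 sample=lambda p: random.choice(responses),
                 prm_score=lambda p, r: float(len(r)))
print(pick)
```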
Authors:Hong Li, Houyuan Chen, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao
Abstract:
Universal photometric stereo (PS) is defined by two factors: it must (i) operate under arbitrary, unknown lighting conditions and (ii) avoid reliance on specific illumination models. Despite progress (e.g., SDM UniPS), two challenges remain. First, current encoders cannot guarantee that illumination and normal information are decoupled. To enforce decoupling, we introduce LINO UniPS with two key components: (i) Light Register Tokens with light alignment supervision to aggregate point, direction, and environment lights; (ii) Interleaved Attention Block featuring global cross-image attention that takes all lighting conditions together so the encoder can factor out lighting while retaining normal-related evidence. Second, high-frequency geometric details are easily lost. We address this with (i) a Wavelet-based Dual-branch Architecture and (ii) a Normal-gradient Perception Loss. These techniques yield a unified feature space in which lighting is explicitly represented by register tokens, while normal details are preserved via wavelet branch. We further introduce PS-Verse, a large-scale synthetic dataset graded by geometric complexity and lighting diversity, and adopt curriculum training from simple to complex scenes. Extensive experiments show new state-of-the-art results on public benchmarks (e.g., DiLiGenT, Luces), stronger generalization to real materials, and improved efficiency; ablations confirm that Light Register Tokens + Interleaved Attention Block drive better feature decoupling, while Wavelet-based Dual-branch Architecture + Normal-gradient Perception Loss recover finer details.
中文:LINO UniPS方法通过光寄存器令牌和交错注意力块实现光照与法向信息的解耦,结合小波双分支架构和法向梯度感知损失保留高频细节,在PS-Verse数据集上采用课程训练后,在多个基准测试中取得了最优性能。
English: The proposed LINO UniPS method introduces Light Register Tokens with interleaved attention to decouple illumination from surface normals, alongside a wavelet-based architecture and gradient loss to preserve geometric details, achieving state-of-the-art performance on benchmarks through curriculum training on the new PS-Verse dataset.
Authors:Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
Abstract:
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
中文: 本文提出交换向量量化(CommVQ)方法,通过加法量化和与旋转位置编码兼容的码本压缩键值缓存,在保持精度的同时将GPU内存使用降低高达87.5%,使LLaMA-3.1 8B模型能在单张RTX 4090显卡上处理128K长文本。
English: This paper introduces Commutative Vector Quantization (CommVQ), a method that reduces GPU memory usage for long-context LLM inference by compressing the key-value cache with additive quantization and a RoPE-commutative codebook, achieving up to 87.5% size reduction with minimal accuracy loss.
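The core of additive quantization is representing a vector as a sum of codewords, one per codebook, so decoding reduces to a sum (equivalently, a matmul with one-hot codes). The NumPy sketch below uses greedy residual encoding as a simple stand-in; CommVQ's learned encoder, EM-trained codebook, and RoPE-commutative structure are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, d = 4, 16, 32                     # codebooks, codewords per book, dim
codebooks = rng.normal(size=(M, K, d))

def encode(v):
    codes, residual = [], v.copy()
    for m in range(M):                  # greedy: nearest codeword per book
        idx = int(np.argmin(((codebooks[m] - residual) ** 2).sum(-1)))
        codes.append(idx)
        residual = residual - codebooks[m][idx]
    return codes

def decode(codes):                      # decoding is just a sum of codewords
    return sum(codebooks[m][c] for m, c in enumerate(codes))

v = rng.normal(size=d)
print(np.linalg.norm(v - decode(encode(v))))  # residual reconstruction error
```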
Authors:Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
Abstract:
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
中文: OmniGen2 是一种开源生成模型,通过双解码路径和反射机制统一处理文本到图像、图像编辑及上下文生成任务,在保持文本生成能力的同时实现了领先性能。
English: OmniGen2 is an open-source generative model that unifies text-to-image, image editing, and in-context generation tasks through dual decoding pathways and a reflection mechanism, achieving state-of-the-art performance while preserving text generation capabilities.
Authors:Olivier Gamache, Jean-Michel Fortin, Matěj Boxan, François Pomerleau, Philippe Giguère
Abstract:
Standard datasets often present limitations, particularly due to the fixed nature of input data sensors, which makes it difficult to compare methods that actively adjust sensor parameters to suit environmental conditions. This is the case with Automatic-Exposure (AE) methods, which rely on environmental factors to influence the image acquisition process. As a result, AE methods have traditionally been benchmarked in an online manner, rendering experiments non-reproducible. Building on our prior work, we propose a methodology that utilizes an emulator capable of generating images at any exposure time. This approach leverages BorealHDR, a unique multi-exposure stereo dataset, along with its new extension, in which data was acquired along a repeated trajectory at different times of the day to assess the impact of changing illumination. In total, BorealHDR covers 13.4 km over 59 trajectories in challenging lighting conditions. The dataset also includes lidar-inertial-odometry-based maps with pose estimation for each image frame, as well as Global Navigation Satellite System (GNSS) data for comparison. We demonstrate that by using images acquired at various exposure times, we can emulate realistic images with a Root-Mean-Square Error (RMSE) below 1.78% compared to ground truth images. Using this offline approach, we benchmarked eight AE methods, concluding that the classical AE method remains the field's best performer. To further support reproducibility, we provide in-depth details on the development of our backpack acquisition platform, including hardware, electrical components, and performance specifications. Additionally, we share valuable lessons learned from deploying the backpack over more than 25 km across various environments. Our code and dataset are available online at this link: https://github.com/norlab-ulaval/TFR24 BorealHDR
中文: 标准数据集难以比较自适应传感器方法,因此本研究利用BorealHDR数据集开发模拟器进行离线基准测试,发现传统自动曝光方法最优,并共享采集平台细节以支持重现性研究。
English: Standard datasets hinder reproducible comparisons of adaptive sensor methods like Automatic Exposure, so this study introduces an emulator using the BorealHDR dataset to benchmark AE methods offline, finding classical AE performs best and sharing detailed platform insights for reproducibility.
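The emulator's premise, re-rendering a scene at an arbitrary exposure time, follows from the linearity of irradiance with respect to exposure. The toy below re-exposes a single frame under an assumed linear sensor response; the actual BorealHDR emulator selects and interpolates among real bracketed captures and is validated against ground truth.

```python
import numpy as np

def emulate_exposure(img, t_src, t_target, max_val=255.0):
    irradiance = img.astype(np.float64) / t_src   # linear radiance estimate
    simulated = irradiance * t_target             # re-expose at the new time
    return np.clip(simulated, 0.0, max_val).astype(np.uint8)

frame = (np.random.rand(4, 4) * 255).astype(np.uint8)  # toy capture at 10 ms
print(emulate_exposure(frame, t_src=0.010, t_target=0.020))
```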
Authors:Siao Tang, Xinyin Ma, Gongfan Fang, Xinchao Wang
Abstract:
Recent advancements in large reasoning models (LRMs) like DeepSeek-R1 and OpenAI o1 series have achieved notable performance enhancements on complex reasoning tasks by scaling up the generation length by Chain-of-Thought (CoT). However, a critical issue is their tendency to produce excessively verbose reasoning processes, leading to inefficiency. Existing literature on improving efficiency mainly adheres to before-reasoning paradigms such as prompting-then-reasoning or fine-tuning-then-reasoning, but ignores the promising direction of directly encouraging the model to speak concisely by intervening during the generation of reasoning. To fill this gap, we propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely by injecting learnable hints (manually designed or learned on concise data) during the generation of the reasoning. Besides, ConciseHint is adaptive to the complexity of the query by adaptively adjusting the hint intensity, which ensures it will not undermine model performance. Experiments on state-of-the-art LRMs, including the DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning while maintaining performance. Moreover, we show that ConciseHint is flexible and can be seamlessly integrated with existing methods to further push the upper bound of efficiency.
中文:提出的ConciseHint框架通过在推理生成过程中注入自适应提示,有效解决大型推理模型冗长问题,能在保持性能的同时生成简洁输出,并可无缝兼容现有方法。
English: The proposed ConciseHint framework addresses the verbosity of large reasoning models by injecting adaptive hints during reasoning generation, effectively producing concise outputs while maintaining performance and compatibility with existing methods.
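The intervene-during-generation idea can be sketched as periodically splicing a hint into the decoding context. The toy `next_token` below is a hypothetical stand-in for an LLM decode step, and the fixed injection interval ignores the paper's adaptive hint intensity.

```python
def next_token(context: list[str]) -> str:   # hypothetical LLM decode step
    return "..." if len(context) < 12 else "<eos>"

def generate_with_hints(prompt: str, hint: str = "(be concise)",
                        every: int = 4) -> str:
    tokens = prompt.split()
    step = 0
    while True:
        step += 1
        if step % every == 0:
            tokens += hint.split()           # inject the hint mid-generation
        tok = next_token(tokens)
        if tok == "<eos>":
            return " ".join(tokens)
        tokens.append(tok)

print(generate_with_hints("Solve 2+2 step by step:"))
```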
Authors:Suyash Gaurav, Muhammad Farhan Humayun, Jukka Heikkonen, Jatin Chaudhary
Abstract:
The evolution of Vision Transformers has led to their widespread adaptation to different domains. Despite large-scale success, significant challenges remain, including their reliance on extensive computational and memory resources for pre-training on huge datasets as well as difficulties in task-specific transfer learning. These limitations, coupled with energy inefficiencies, arise mainly from the computation-intensive self-attention mechanism. To address these issues, we propose a novel Super-Pixel Based Patch Pooling (SPPP) technique that generates context-aware, semantically rich patch embeddings to effectively reduce the architectural complexity and improve efficiency. Additionally, we introduce the Light Latent Attention (LLA) module in our pipeline by integrating latent tokens into the attention mechanism, allowing cross-attention operations to significantly reduce the time and space complexity of the attention module. By leveraging the data-intuitive patch embeddings coupled with dynamic positional encodings, our approach adaptively modulates the cross-attention process to focus on informative regions while maintaining the global semantic structure. This targeted attention improves training efficiency and accelerates convergence. Notably, the SPPP module is lightweight and can be easily integrated into existing transformer architectures. Extensive experiments demonstrate that our proposed architecture provides significant improvements in terms of computational efficiency while achieving comparable results with the state-of-the-art approaches, highlighting its potential for energy-efficient transformers suitable for edge deployment. (The code is available on our GitHub repository: https://github.com/zser092/Focused-Attention-ViT).
中文摘要:本文提出基于超像素的补丁池化技术和轻量潜在注意力模块,有效降低视觉Transformer的复杂度,在保持与先进方法相当性能的同时显著提升计算效率。
English Summary: This paper introduces a Super-Pixel Based Patch Pooling technique and Light Latent Attention module to reduce Vision Transformer complexity, improving computational efficiency while maintaining competitive performance with state-of-the-art methods.
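As a rough illustration of super-pixel based patch pooling, the sketch below segments an image with SLIC and mean-pools pixels per superpixel into token embeddings. The actual SPPP module produces learnable, context-aware embeddings rather than raw channel means, so this is only a structural sketch.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_pool(image: np.ndarray, n_segments: int = 64) -> np.ndarray:
    """image: (H, W, 3) floats in [0, 1] -> (n_superpixels, 3) embeddings."""
    labels = slic(image, n_segments=n_segments, start_label=0)
    return np.stack([image[labels == s].mean(axis=0)
                     for s in np.unique(labels)])

img = np.random.rand(32, 32, 3)
tokens = superpixel_pool(img)
print(tokens.shape)   # about (n_segments, 3); SLIC may merge small regions
```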
Authors:Yitong Zhu, Guanxuan Jiang, Zhuowen Liang, Yuyang Wang
Abstract:
Cybersickness remains a critical barrier to the widespread adoption of Virtual Reality (VR), particularly in scenarios involving intense or artificial motion cues. Among the key contributors is excessive optical flow: perceived visual motion that, when unmatched by vestibular input, leads to sensory conflict and discomfort. While previous efforts have explored geometric or hardware-based mitigation strategies, such methods often rely on predefined scene structures, manual tuning, or intrusive equipment. In this work, we propose U-MAD, a lightweight, real-time, AI-based solution that suppresses perceptually disruptive optical flow directly at the image level. Unlike prior handcrafted approaches, this method learns to attenuate high-intensity motion patterns from rendered frames without requiring mesh-level editing or scene-specific adaptation. Designed as a plug-and-play module, U-MAD integrates seamlessly into existing VR pipelines and generalizes well to procedurally generated environments. The experiments show that U-MAD consistently reduces average optical flow and enhances temporal stability across diverse scenes. A user study further confirms that reducing visual motion leads to improved perceptual comfort and alleviated cybersickness symptoms. These findings demonstrate that perceptually guided modulation of optical flow provides an effective and scalable approach to creating more user-friendly immersive experiences. The code will be released at https://github.com/XXXXX (upon publication).
中文: U-MAD是一种轻量级AI解决方案,通过实时抑制图像中的干扰性光流来减轻VR晕动症,无需场景特定调整即可提升用户舒适度。
English: U-MAD is a lightweight AI solution that reduces cybersickness in VR by suppressing disruptive optical flow in real-time, improving user comfort without requiring scene-specific adjustments.
Authors:Zhenru Lin, Jiawen Tao, Yang Yuan, Andrew Chi-Chih Yao
Abstract:
Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency -- no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods -- a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
中文摘要:大型语言模型即使在简单任务中也存在显著的自洽性问题,本研究提出的度量方法和缓解方案虽取得部分改进,但凸显了实现可靠人工智能推理的复杂性。
English Summary: Large Language Models exhibit significant self-consistency issues even in simple tasks, prompting the development of metrics and mitigation methods that partially address but underscore the complexity of achieving reliable AI reasoning.
Authors:Chong Zhang, Xiang Li, Jia Wang, Shan Liang, Haochen Xue, Xiaobo Jin
Abstract:
Large Language Models (LLMs) increasingly rely on automatic prompt engineering in graphical user interfaces (GUIs) to refine user inputs and enhance response accuracy. However, the diversity of user requirements often leads to unintended misinterpretations, where automated optimizations distort original intentions and produce erroneous outputs. To address this challenge, we propose the Adaptive Greedy Binary Search (AGBS) method, which simulates common prompt optimization mechanisms while preserving semantic stability. Our approach dynamically evaluates the impact of such strategies on LLM performance, enabling robust adversarial sample generation. Through extensive experiments on open and closed-source LLMs, we demonstrate AGBS's effectiveness in balancing semantic consistency and attack efficacy. Our findings offer actionable insights for designing more reliable prompt optimization systems. Code is available at: https://github.com/franz-chang/DOBS
中文摘要:自适应贪婪二分搜索(AGBS)方法通过保持语义稳定性并评估优化策略对大型语言模型性能的影响,有效解决了自动提示工程中的误解问题,实验证明其在开源和闭源模型上均能平衡语义一致性与攻击效果。
English Summary: The Adaptive Greedy Binary Search (AGBS) method is introduced to mitigate misinterpretations in automatic prompt engineering by preserving semantic stability while evaluating optimization impacts on LLM performance, demonstrating effectiveness through experiments on various models.
Authors:Jie Li, Shifei Ding, Lili Guo, Xuan Li
Abstract:
Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
Chinese: 提出的MAGTKD模型通过提示学习和知识蒸馏增强模态表示,并利用多模态锚点门控变换器有效整合各模态信息,在基准数据集上实现了对话中情感识别的最新性能。
English: The proposed MAGTKD model enhances emotion recognition in conversations by using prompt learning and knowledge distillation to improve modality representations and integrates them effectively with a multi-modal anchor gated transformer, achieving state-of-the-art results on benchmark datasets.
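The distillation component can be illustrated with the standard temperature-scaled KL objective, here from a stronger modality's logits (e.g., text) to a weaker one's; MAGTKD's exact objective may differ, so treat this as a generic sketch.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # T^2 rescales gradients to the usual magnitude (Hinton et al., 2015).
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

student, teacher = torch.randn(8, 6), torch.randn(8, 6)
print(kd_loss(student, teacher).item())
```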
Authors:Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou
Abstract:
We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.
中文:Matrix-Game 是一种可控游戏世界生成模型,采用两阶段训练流程和超过170亿参数,能生成高质量、交互式视频并实现精确动作控制,在所有评估指标上均优于先前模型。
English: Matrix-Game is a controllable game world generation model that uses a two-stage training pipeline and over 17 billion parameters to produce high-quality, interactive videos with precise action control, outperforming previous models across all evaluation metrics.
Authors:Yuchang Zhu, Jintang Li, Huizhe Zhang, Liang Chen, Zibin Zheng
Abstract:
Individual fairness (IF) in graph neural networks (GNNs), which requires that similar individuals receive similar outcomes from GNNs, has been a critical issue. Despite its importance, this area remains largely unexplored in terms of (1) a clear understanding of what induces individual unfairness in GNNs and (2) a comprehensive consideration of identifying similar individuals. To bridge these gaps, we conduct a preliminary analysis to explore the underlying reason for individual unfairness and observe correlations between IF and similarity consistency, a concept introduced to evaluate the discrepancy in identifying similar individuals based on graph structure versus node features. Inspired by our observations, we introduce two metrics to assess individual similarity from two distinct perspectives: topology fusion and feature fusion. Building upon these metrics, we propose Similarity-aware GNNs for Individual Fairness, named SaGIF. The key insight behind SaGIF is the integration of individual similarities by independently learning similarity representations, leading to an improvement of IF in GNNs. Our experiments on several real-world datasets validate the effectiveness of our proposed metrics and SaGIF. Specifically, SaGIF consistently outperforms state-of-the-art IF methods while maintaining utility performance. Code is available at: https://github.com/ZzoomD/SaGIF.
中文: 本研究针对图神经网络中的个体公平性问题,提出了两种相似性度量指标和SaGIF方法,通过独立学习相似性表示来提升公平性,同时保持模型性能。
English: This study addresses individual fairness in graph neural networks by introducing two similarity metrics and proposing SaGIF, a method that improves fairness through independent similarity representation learning while maintaining performance.
Authors:Tianchen Deng, Guole Shen, Xun Chen, Shenghai Yuan, Hongming Shen, Guohao Peng, Zhenyu Wu, Jingchuan Wang, Lihua Xie, Danwei Wang, Hesheng Wang, Weidong Chen
Abstract:
Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios and face difficulties in large-scale scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, intra-to-inter loop closure, and online distillation for multiple submap fusion. A novel triplane-grid joint scene representation method is proposed to improve scene reconstruction. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. We also design a novel online distillation method to fuse the information of different submaps to achieve global consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectory ground truth and high-accuracy 3D mesh ground truth. To this end, we propose the first real-world Dense SLAM (DES) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale outdoor scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance research in SLAM, 3D reconstruction, and visual foundation models. Experiments on various datasets demonstrate the superiority of the proposed method in mapping, tracking, and communication. The dataset and code will be open-sourced at https://github.com/dtc111111/mcnslam.
中文摘要:本文提出了首个采用混合场景表示的分布式多智能体协同神经SLAM框架,并创建了新型真实世界数据集,在建模、跟踪和通信效率方面均展现出优越性能。
English Summary: This paper introduces the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation and a novel real-world dataset, demonstrating superior performance in mapping, tracking, and communication efficiency.
Authors:Jingming Liu, Yumeng Li, Wei Shi, Yao-Xiang Ding, Hui Su, Kun Zhou
Abstract:
Recent studies have proposed leveraging Large Language Models (LLMs) as information retrievers through query rewriting. However, for challenging corpora, we argue that enhancing queries alone is insufficient for robust semantic matching; the LLM should also have sufficient understanding of the corpus by directly handling and augmenting the documents themselves. To this end, we present an LLM-based retriever empowered to augment both user queries and corpus documents, with its policy fully explored via reinforcement learning (RL) and minimal human inductive bias. Notably, we find that simply allowing the LLM to modify documents yields little benefit unless paired with our carefully designed bidirectional RL framework, which enables the LLM to simultaneously learn and collaborate on both query and document augmentation policies. A key technical challenge in realizing such a framework lies in jointly updating both policies during training, where the rewards for the two directions depend on each other, making their entangled reward intractable. Our approach addresses this by introducing a reward sampling strategy and a specifically designed RL algorithm that enables effective training with these sampled rewards. Experimental results demonstrate that our approach significantly enhances LLM-based retrieval performance in both sparse and dense settings, particularly in difficult retrieval domains, and achieves strong cross-benchmark generalization. Our code is released at https://github.com/liujm2001/CoAugRetriever.
中文摘要:本研究提出了一种双向强化学习框架,使大语言模型能够同时增强查询和文档以提升检索性能,并通过创新的奖励采样策略和算法解决了双向奖励相互依赖的难题。
English summary: This study introduces a bidirectional reinforcement learning framework that enables large language models to enhance both queries and documents for improved retrieval performance, addressing the challenge of entangled rewards through a novel sampling strategy and algorithm.
Authors:Ling Zhang, Boxiang Yun, Qingli Li, Yan Wang
Abstract:
Automated pathology report generation from Whole Slide Images (WSIs) faces two key challenges: (1) lack of semantic content in visual features and (2) inherent information redundancy in WSIs. To address these issues, we propose a novel Historical Report Guided Bi-modal Concurrent Learning Framework for Pathology Report Generation (BiGen) emulating pathologists' diagnostic reasoning, consisting of: (1) a knowledge retrieval mechanism to provide rich semantic content, which retrieves WSI-relevant knowledge from a pre-built medical knowledge bank by matching high-attention patches, and (2) a bi-modal concurrent learning strategy instantiated via a learnable visual token and a learnable textual token to dynamically extract key visual features and retrieved knowledge, where weight-shared layers enable cross-modal alignment between visual features and knowledge features. Our multi-modal decoder integrates both modalities for comprehensive diagnostic report generation. Experiments on the PathText (BRCA) dataset demonstrate our framework's superiority, achieving state-of-the-art performance with a 7.4% relative improvement in NLP metrics and a 19.1% enhancement in classification metrics for Her-2 prediction versus existing methods. Ablation studies validate the necessity of our proposed modules, highlighting our method's ability to provide WSI-relevant rich semantic content and suppress information redundancy in WSIs. Code is publicly available at https://github.com/DeepMed-Lab-ECNU/BiGen.
Chinese: BiGen框架通过知识检索机制和双模态并行学习,解决了全切片图像自动生成病理报告中语义内容不足和信息冗余的问题,在自然语言处理和分类指标上均实现了显著提升,达到最优性能。
English: The BiGen framework addresses semantic deficiency and information redundancy in automated pathology report generation by incorporating a knowledge retrieval mechanism and bi-modal concurrent learning, achieving state-of-the-art performance with significant improvements in NLP and classification metrics.
Authors:Kurt Butler, Guanchao Feng, Tong Chen, Petar Djuric
Abstract:
Probabilistic models are often used to make predictions in regions of the data space where no observations are available, but it is not always clear whether such predictions are well-informed by previously seen data. In this paper, we propose a knowledge score for predictions from Gaussian process regression (GPR) models that quantifies the extent to which the observed data have reduced our uncertainty about a prediction. The knowledge score is interpretable and naturally bounded between 0 and 1. We demonstrate in several experiments that the knowledge score can anticipate when predictions from a GPR model are accurate, and that this anticipation improves performance in tasks such as anomaly detection, extrapolation, and missing data imputation. Source code for this project is available online at https://github.com/KurtButler/GP-knowledge.
Chinese: 本文为高斯过程回归模型提出了一种知识评分,用于量化预测中不确定性的减少程度,并证明该评分在异常检测和数据插补等任务中能有效提升性能。
English: This paper introduces a knowledge score for Gaussian process regression models that measures uncertainty reduction in predictions, demonstrating its effectiveness in improving tasks like anomaly detection and data imputation.
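The abstract does not give the score's formula; one natural instantiation consistent with "bounded between 0 and 1" is the fractional reduction from prior to posterior predictive variance, sketched below with scikit-learn. This is an assumption for illustration, not necessarily the paper's definition.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X).ravel()
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)

X_test = np.array([[1.5], [10.0]])   # one point near the data, one far away
_, post_std = gp.predict(X_test, return_std=True)
prior_var = 1.0                      # the RBF kernel has unit prior variance
knowledge = 1.0 - post_std**2 / prior_var
print(knowledge)   # close to 1 near observations, close to 0 far away
```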
Authors:Mauricio Byrd Victorica, György Dán, Henrik Sandberg
Abstract:
State-of-the-art convolutional neural network models for object detection and image classification are vulnerable to physically realizable adversarial perturbations, such as patch attacks. Existing defenses have focused, implicitly or explicitly, on single-patch attacks, leaving their sensitivity to the number of patches as an open question or rendering them computationally infeasible or inefficient against attacks consisting of multiple patches in the worst cases. In this work, we propose SpaNN, an attack detector whose computational complexity is independent of the expected number of adversarial patches. The key novelty of the proposed detector is that it builds an ensemble of binarized feature maps by applying a set of saliency thresholds to the neural activations of the first convolutional layer of the victim model. It then performs clustering on the ensemble and uses the cluster features as the input to a classifier for attack detection. Contrary to existing detectors, SpaNN does not rely on a fixed saliency threshold for identifying adversarial regions, which makes it robust against white box adversarial attacks. We evaluate SpaNN on four widely used data sets for object detection and classification, and our results show that SpaNN outperforms state-of-the-art defenses by up to 11 and 27 percentage points for object detection and image classification, respectively. Our code is available at https://github.com/gerkbyrd/SpaNN.
中文: 提出的SpaNN检测器通过使用二值化特征图集合和聚类技术,有效识别对抗性补丁攻击,其计算效率与补丁数量无关,性能显著优于现有防御方法。
English: The proposed SpaNN detector effectively identifies adversarial patch attacks by using an ensemble of binarized feature maps and clustering, outperforming existing defenses with computational efficiency independent of patch count.
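A minimal version of the detector's feature extraction: binarize one saliency map at several thresholds and count clusters in each binarized map, yielding a threshold-indexed feature vector for a downstream classifier. Connected-component labeling stands in for the paper's clustering step, and the random activations are a placeholder for real first-layer features.

```python
import numpy as np
from scipy.ndimage import label

activations = np.random.rand(32, 32)      # stand-in first-layer saliency map
thresholds = np.linspace(0.5, 0.9, 5)     # an ensemble, not one fixed cutoff

features = []
for t in thresholds:
    binary = activations > t              # binarized feature map
    _, n_clusters = label(binary)         # connected components as clusters
    features.append(n_clusters)
print(features)   # cluster counts per threshold -> input to the classifier
```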
Authors:Nikhil Khedekar, Kostas Alexis
Abstract:
LiDAR-Inertial Odometry (LIO) is widely used for accurate state estimation and mapping, which is an essential requirement for autonomous robots. Conventional LIO methods typically rely on formulating constraints from the geometric structure sampled by the LiDAR. Hence, in the absence of geometric structure, these methods tend to become ill-conditioned (degenerate) and fail. Robustness of LIO to such conditions is a necessity for its broader deployment. To address this, we propose PG-LIO, a real-time LIO method that fuses photometric and geometric information sampled by the LiDAR along with inertial constraints from an Inertial Measurement Unit (IMU). This multi-modal information is integrated into a factor graph optimized over a sliding window for real-time operation. We evaluate PG-LIO on multiple datasets that include both geometrically well-conditioned as well as self-similar scenarios. Our method achieves accuracy on par with state-of-the-art LIO in geometrically well-structured settings while significantly improving accuracy in degenerate cases, including against methods that also fuse intensity. Notably, we demonstrate only 1 m drift over a 1 km manually piloted aerial trajectory through a geometrically self-similar tunnel at an average speed of 7.5 m/s (max speed 10.8 m/s). For the benefit of the community, we also release our source code at https://github.com/ntnu-arl/mimosa
中文: PG-LIO通过融合激光雷达的光度与几何信息及惯性测量单元数据,在结构化环境中保持高精度,并在几何特征缺失的退化场景中显著提升定位鲁棒性。
English: PG-LIO enhances LiDAR-Inertial Odometry by integrating photometric and geometric data with IMU constraints, achieving high accuracy in structured environments and significantly reducing drift in geometrically degenerate scenarios.
Authors:Haoyi Wu, Zhihao Teng, Kewei Tu
Abstract:
Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. By reasoning with continuous latent thought tokens, continuous CoT is able to perform implicit reasoning in a compact manner. However, the sequential dependencies between latent thought tokens spoil parallel training, leading to long training time. In this paper, we propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially and thus improving both training and inference efficiency of continuous CoT. Experiments demonstrate that by choosing the proper number of iterations, we are able to achieve comparable or even better performance while saving nearly 50% of the training and inference time. Moreover, PCCoT shows better stability and robustness in the training process. Our code is available at https://github.com/whyNLP/PCCoT.
Chinese: 并行连续思维链(PCCoT)通过雅可比迭代并行更新潜在思维标记,在保持相当或更优性能的同时,将训练和推理时间减少近50%,并提升了稳定性和鲁棒性。
English: Parallel Continuous Chain-of-Thought (PCCoT) enhances efficiency by using Jacobi iteration to update latent thought tokens in parallel, achieving comparable or better performance while cutting training and inference time by nearly 50% and improving stability.
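The Jacobi scheme replaces sequential decoding of the T latent thought tokens with a fixed number of parallel full-vector updates. In the sketch below, `f` is a toy stand-in for the model's parallel update of all latent tokens given the previous iterate.

```python
import torch

def jacobi_latent_cot(f, z0: torch.Tensor, n_iters: int = 4) -> torch.Tensor:
    z = z0
    for _ in range(n_iters):
        z = f(z)            # all T latent tokens are updated in parallel
    return z

T, d = 6, 16
W = torch.randn(d, d) * 0.1
f = lambda z: torch.tanh(z @ W)  # toy update; a transformer pass in practice
print(jacobi_latent_cot(f, torch.randn(T, d)).shape)  # torch.Size([6, 16])
```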
Authors:Jan Michalczyk, Stephan Weiss, Jan Steinbrener
Abstract:
Using 3D point clouds in odometry estimation in robotics often requires finding a set of correspondences between points in subsequent scans. While there are established methods for point clouds of sufficient quality, state-of-the-art still struggles when this quality drops. Thus, this paper presents a novel learning-based framework for predicting robust point correspondences between pairs of noisy, sparse and unstructured 3D point clouds from a light-weight, low-power, inexpensive, consumer-grade System-on-Chip (SoC) Frequency Modulated Continuous Wave (FMCW) radar sensor. Our network is based on the transformer architecture which allows leveraging the attention mechanism to discover pairs of points in consecutive scans with the greatest mutual affinity. The proposed network is trained in a self-supervised way using set-based multi-label classification cross-entropy loss, where the ground-truth set of matches is found by solving the Linear Sum Assignment (LSA) optimization problem, which avoids tedious hand annotation of the training data. Additionally, posing the loss calculation as multi-label classification permits supervising on point correspondences directly instead of on odometry error, which is not feasible for sparse and noisy data from the SoC radar we use. We evaluate our method with an open-source state-of-the-art Radar-Inertial Odometry (RIO) framework in real-world Unmanned Aerial Vehicle (UAV) flights and with the widely used public Coloradar dataset. Evaluation shows that the proposed method improves the position estimation accuracy by over 14 % and 19 % on average, respectively. The open source code and datasets can be found here: https://github.com/aau-cns/radar_transformer.
Chinese: 本文提出了一种基于Transformer的学习框架,通过消费级雷达传感器生成的噪声稀疏3D点云预测稳健的点对应关系,在真实无人机测试中将定位精度平均提升了14%以上。
English: This paper introduces a learning-based transformer framework that predicts robust point correspondences for odometry estimation using noisy, sparse 3D point clouds from consumer-grade radar sensors, achieving over 14% accuracy improvement in real-world UAV tests.
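The self-supervised labels come from solving a Linear Sum Assignment between consecutive scans; with SciPy this is a few lines, sketched below under the assumption of a Euclidean cost (the paper's exact cost may differ).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
scan_a = rng.random((10, 3))                           # points from scan t
scan_b = scan_a + 0.01 * rng.standard_normal((10, 3))  # points from scan t+1

cost = cdist(scan_a, scan_b)                 # Euclidean assignment cost
rows, cols = linear_sum_assignment(cost)     # optimal one-to-one matching
matches = list(zip(rows.tolist(), cols.tolist()))  # ground-truth match set
print(matches)
```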
Authors:Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim
Abstract:
Audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle with accurately localizing sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. Extensive experimental results on MUSIC and VGGSound datasets demonstrate the effectiveness of our approach, significantly outperforming existing methods in both single-source and multi-source localization scenarios. Code and generated detailed contextual information are available at: https://github.com/VisualAIKHU/OA-SSL.
中文: 本研究提出了一种新颖的视听声源定位框架,利用多模态大语言模型生成区分发声物体与静默物体的详细上下文信息,并通过两种新型损失函数增强定位效果,在基准数据集上显著优于现有方法。
English: This study introduces a novel audio-visual sound source localization framework that utilizes Multimodal Large Language Models to generate detailed contextual information distinguishing sound-making objects from silent ones, enhanced by two new loss functions which significantly outperform existing methods on benchmark datasets.
Authors:Muhao Xu, Xueying Zhou, Xizhan Gao, Weiye Song, Guang Feng, Sijie Niu
Abstract:
Detecting logical anomalies has recently become a more challenging task than detecting structural ones. Existing encoder-decoder based methods typically compress inputs into low-dimensional bottlenecks on the assumption that the compression process can effectively suppress the transmission of logical anomalies to the decoder. However, logical anomalies present a particular difficulty because, while their local features often resemble normal semantics, their global semantics deviate significantly from normal patterns. Thanks to the generalisation capabilities inherent in neural networks, these abnormal semantic features can propagate through low-dimensional bottlenecks. This ultimately allows the decoder to reconstruct anomalous images with misleading fidelity. To tackle the above challenge, we propose a novel normality prior guided multi-semantic fusion network for unsupervised anomaly detection. Instead of feeding the compressed bottlenecks to the decoder directly, we introduce the multi-semantic features of normal samples into the reconstruction process. To this end, we first extract abstract global semantics of normal cases by a pre-trained vision-language network, then learnable semantic codebooks are constructed to store representative feature vectors of normal samples by vector quantisation. Finally, the above multi-semantic features are fused and employed as input to the decoder to guide the reconstruction of anomalies to approximate normality. Extensive experiments are conducted to validate the effectiveness of our proposed method, and it achieves SOTA performance on the MVTec LOCO AD dataset with improvements of 5.7% in pixel-sPRO and 2.6% in image-AUROC. The source code is available at https://github.com/Xmh-L/NPGMF.
Chinese: 本文提出了一种新颖的无监督异常检测方法,通过将正常样本的多语义特征融入重建过程来解决逻辑异常检测难题,在MVTec LOCO AD数据集上实现了最先进的性能表现。
English: This paper introduces a novel unsupervised anomaly detection method that addresses the challenge of logical anomalies by integrating multi-semantic features from normal samples into the reconstruction process, achieving state-of-the-art performance on the MVTec LOCO AD dataset.
Authors:JiaKui Hu, Yuxiao Yang, Jialun Liu, Jinbo Wu, Chen Zhao, Yanye Lu
Abstract:
Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts via architecture design and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, significantly expanding the training data by several orders of magnitude. Experiments demonstrate the performance and versatility of our MV-AR, which generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. The code and models are released at https://github.com/MILab-PKU/MVAR.
中文: 本文提出的多视图自回归(MV-AR)方法通过自回归模型、条件注入模块和数据增强技术,能够根据多种输入提示逐步生成一致的多视图图像,其性能与领先的基于扩散的模型相当。
English: This paper introduces the Multi-View Auto-Regressive (MV-AR) method, which progressively generates consistent multi-view images from various prompts using an auto-regressive model, condition injection modules, and data augmentation, achieving performance comparable to leading diffusion-based models.
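The ``Shuffle View'' augmentation is described only at a high level; a plausible minimal reading is to jointly permute the views and their camera poses, so a single captured scene yields many distinct training sequences. The sketch below assumes list-of-views inputs and is not the authors' implementation.

```python
import random

def shuffle_view(views, poses, seed=None):
    """Jointly permute a multi-view training sample.

    views: list of image tensors/arrays, one per viewpoint.
    poses: list of camera poses aligned with `views`.
    Returns a new (views, poses) pair in a random but consistent order,
    so each epoch can present the same scene as a different sequence.
    """
    rng = random.Random(seed)
    order = list(range(len(views)))
    rng.shuffle(order)
    return [views[i] for i in order], [poses[i] for i in order]

# A scene with 4 views yields up to 4! = 24 distinct orderings.
views = ["v0", "v1", "v2", "v3"]   # placeholders for image tensors
poses = ["p0", "p1", "p2", "p3"]
shuffled_views, shuffled_poses = shuffle_view(views, poses, seed=0)
```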
Authors:Yuting Zhang, Kaishen Yuan, Hao Lu, Yutao Yue, Jintai Chen, Kaishun Wu
Abstract:
Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at https://github.com/keke-nice/MedTVT-R1.
中文:MedTVT-R1是一种新型多模态大语言模型,通过整合临床数据和采用强化微调技术,实现了精准的多疾病诊断与可解释推理,在医疗应用中展现出卓越性能。
English: MedTVT-R1 is a novel multimodal large language model that integrates clinical data for accurate multi-disease diagnosis and interpretable reasoning, demonstrating superior performance through advanced modality fusion and reinforcement fine-tuning.
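The Jaccard Reward plausibly scores the overlap between the predicted and reference disease sets; here is a minimal sketch under that assumption (the paper may shape or weight the reward differently).

```python
def jaccard_reward(predicted: set, reference: set) -> float:
    """Jaccard similarity between predicted and reference disease sets.

    Returns 1.0 when the sets match exactly, 0.0 when they are disjoint,
    and partial credit in between, which suits multi-disease diagnosis
    where answers are label sets rather than a single class.
    """
    if not predicted and not reference:
        return 1.0
    return len(predicted & reference) / len(predicted | reference)

# Example: two of three predicted diseases are correct, one is spurious.
print(jaccard_reward({"diabetes", "anemia", "sepsis"},
                     {"diabetes", "anemia", "ckd"}))   # 2/4 = 0.5
```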
Authors:Aniss Bessalah, Hatem Mohamed Abdelmoumen, Karima Benatchba, Hadjer Benmeziane
Abstract:
Analog In-memory Computing (AIMC) has emerged as a highly efficient paradigm for accelerating Deep Neural Networks (DNNs), offering significant energy and latency benefits over conventional digital hardware. However, state-of-the-art neural networks are not inherently designed for AIMC, as they fail to account for its unique non-idealities. Neural Architecture Search (NAS) is thus needed to systematically discover neural architectures optimized explicitly for AIMC constraints. Yet comparing NAS methodologies and extracting insights about robust architectures for AIMC requires a dedicated NAS benchmark that explicitly accounts for AIMC-specific hardware non-idealities. To address this, we introduce AnalogNAS-Bench, the first NAS benchmark tailored specifically for AIMC. Our study reveals three key insights: (1) standard quantization techniques fail to capture AIMC-specific noise, (2) robust architectures tend to feature wider and branched blocks, and (3) skip connections improve resilience to temporal drift noise. These insights highlight the limitations of current NAS benchmarks for AIMC and pave the way for future analog-aware NAS. All the implementations used in this paper can be found at https://github.com/IBM/analog-nas/tree/main/analognasbench.
Chinese: 模拟内存计算(AIMC)能高效加速深度神经网络,但现有网络和架构搜索基准未考虑其特殊非理想特性,因此推出AnalogNAS-Bench基准来发现稳健架构并揭示关键设计规律。
English: Analog In-memory Computing (AIMC) accelerates deep neural networks efficiently, but current neural networks and NAS benchmarks overlook AIMC-specific non-idealities, prompting the introduction of AnalogNAS-Bench to identify robust architectures and reveal key design insights.
Authors:Markus Frohmann, Elena V. Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin
Abstract:
The recent rise in capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating the creation of accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. However, in practice, such perfect lyrics are not available (only the audio is); this leaves a substantial gap in applicability in real-life use cases. In this work, we instead propose solving this gap by transcribing songs using general automatic speech recognition (ASR) models. We do this using several detectors. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
中文: 本研究通过使用自动语音识别转录歌曲并采用多种检测器,提出了一种强大的AI生成音乐检测方法,在多语言和多流派场景下表现优异,且在处理失真音频和不同音乐生成器时优于现有音频检测技术。
English: This study addresses the limitations of existing AI-generated music detection methods by proposing a robust approach that transcribes songs using automatic speech recognition and employs multiple detectors, demonstrating strong performance across languages and genres while outperforming audio-based methods when handling perturbed audio and diverse music generators.
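A minimal end-to-end sketch of the described pipeline, assuming the openai-whisper package for transcription and, as a stand-in for the paper's LLM2Vec embeddings, a generic sentence embedder with a logistic-regression detector; the paths and labels are placeholders.

```python
# pip install openai-whisper sentence-transformers scikit-learn
import whisper
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

asr = whisper.load_model("large-v2")                 # ASR model named in the paper
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in embedder, not LLM2Vec

def lyrics_features(audio_paths):
    """Transcribe each song, then embed the resulting lyrics text."""
    texts = [asr.transcribe(p)["text"] for p in audio_paths]
    return embedder.encode(texts)

# train_paths / train_labels: human- vs AI-generated songs (assumed available)
# X = lyrics_features(train_paths)
# clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
# is_ai = clf.predict(lyrics_features(["new_song.mp3"]))
```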
Authors:Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome
Abstract:
We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP
中文: DIP是一种无监督的后训练方法,通过结合预训练扩散模型生成伪任务来增强视觉编码器中的密集图像表示,在各种上下文场景理解任务中高效地实现了卓越性能。
English: DIP is an unsupervised post-training method that enhances dense image representations in vision encoders by using pseudo-tasks generated with a diffusion model, achieving superior performance in various in-context scene understanding tasks efficiently.
Authors:Yang Lyu, Zhenghao Zou, Yanfeng Li, Chunhui Zhao, Quan Pan
Abstract:
Achieving reliable ego-motion estimation for agile robots, e.g., aerobatic aircraft, remains challenging because most robot sensors fail to respond timely and clearly to highly dynamic motions, often resulting in measurement blurring, distortion, and delays. In this paper, we propose an IMU-free and feature-association-free framework for aggressive ego-motion velocity estimation of a robot platform in highly dynamic scenarios by combining two types of exteroceptive sensors: an event camera and a millimeter-wave radar. First, we use instantaneous raw events and Doppler measurements to derive rotational and translational velocities directly. Without a sophisticated association process between measurement frames, the proposed method is more robust in texture-less and structureless environments and more computationally efficient for edge computing devices. Then, in the back-end, we propose a continuous-time state-space model that fuses the hybrid time-based and event-based measurements to estimate the ego-motion velocity in a fixed-lag smoother fashion. Finally, we validate our velometer framework extensively on self-collected experimental datasets. The results indicate that our IMU-free and association-free ego-motion estimation framework can achieve reliable and efficient velocity output in challenging environments. The source code, illustrative video, and dataset are available at https://github.com/ZzhYgwh/TwistEstimator.
中文摘要:本文提出了一种无需IMU和特征关联的框架,通过结合事件相机与毫米波雷达,在高度动态场景中实现鲁棒的自身运动速度估计,实验验证表明该方法在挑战性环境中具有可靠性能。
English Summary: This paper introduces an IMU-free and feature-association-free framework that combines event cameras and millimeter wave radar to achieve robust ego-motion velocity estimation in highly dynamic scenarios, demonstrating reliable performance in challenging environments through experimental validation.
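The translational part of such a front-end can be illustrated with the standard radar ego-velocity estimate: for a static scene, each return's Doppler reading equals the negative projection of the platform velocity onto its bearing, so the velocity follows from linear least squares. The sketch shows only that textbook step, not the paper's event-based rotational estimate or continuous-time smoother.

```python
import numpy as np

def ego_velocity_from_doppler(directions: np.ndarray, doppler: np.ndarray):
    """Least-squares ego translational velocity from one radar scan.

    directions: (N, 3) unit bearing vectors to static radar targets.
    doppler:    (N,)   measured radial velocities of those targets.
    For a static scene, doppler_i = -d_i . v_ego, so v_ego follows from
    a linear least-squares fit over all returns in the scan.
    """
    v, *_ = np.linalg.lstsq(-directions, doppler, rcond=None)
    return v

# Synthetic check: recover a known velocity from noisy Doppler returns.
rng = np.random.default_rng(0)
d = rng.normal(size=(50, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)
v_true = np.array([2.0, -0.5, 0.1])
meas = -d @ v_true + 0.01 * rng.normal(size=50)
print(ego_velocity_from_doppler(d, meas))   # ~ [2.0, -0.5, 0.1]
```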
Authors:Yiyao Wang, Bo Pan, Ke Wang, Han Liu, Jinyuan Mao, Yuxin Liu, Minfeng Zhu, Xiuqi Huang, Weifeng Chen, Bo Zhang, Wei Chen
Abstract:
Direct volume rendering (DVR) is a fundamental technique for visualizing volumetric data, where transfer functions (TFs) play a crucial role in extracting meaningful structures. However, designing effective TFs remains unintuitive due to the semantic gap between user intent and TF parameter space. Although numerous TF optimization methods have been proposed to mitigate this issue, existing approaches still face two major challenges: the vast exploration space and limited generalizability. To address these issues, we propose IntuiTF, a novel framework that leverages Multimodal Large Language Models (MLLMs) to guide TF optimization in alignment with user intent. Specifically, our method consists of two key components: (1) an evolution-driven explorer for effective exploration of the TF space, and (2) an MLLM-guided human-aligned evaluator that provides generalizable visual feedback on rendering quality. The explorer and the evaluator together establish an efficient Trial-Insight-Replanning paradigm for TF space exploration. We further extend our framework with an interactive TF design system. We demonstrate the broad applicability of our framework through three case studies and validate the effectiveness of each component through extensive experiments. We strongly recommend readers check our cases, demo video, and source code at: https://github.com/wyysteelhead/IntuiTF
中文摘要:IntuiTF是一种创新框架,利用多模态大语言模型通过进化驱动探索器和人类对齐评估器来指导传递函数优化,建立了高效的“试验-洞察-重规划”范式,有效解决了直接体绘制中的探索空间广阔和泛化性有限两大挑战。
English Summary: IntuiTF is a novel framework that uses Multimodal Large Language Models to guide transfer function optimization through an evolution-driven explorer and human-aligned evaluator, addressing challenges in direct volume rendering by creating an efficient Trial-Insight-Replanning paradigm.
Authors:Haoneng Lin, Cheng Xu, Jing Qin
Abstract:
Modern Vision-Language Models (VLMs) exhibit unprecedented capabilities in cross-modal semantic understanding between visual and textual modalities. Given the intrinsic need for multi-modal integration in clinical applications, VLMs have emerged as a promising solution for a wide range of medical image analysis tasks. However, adapting general-purpose VLMs to the medical domain poses numerous challenges, such as large domain gaps, complicated pathological variations, and the diversity and uniqueness of different tasks. The central purpose of this review is to systematically summarize recent advances in adapting VLMs for medical image analysis, analyze current challenges, and recommend promising yet urgent directions for further investigation. We begin by introducing the core learning strategies for medical VLMs, including pretraining, fine-tuning, and prompt learning. We then categorize five major VLM adaptation strategies for medical image analysis and analyze them across eleven medical imaging tasks to illustrate their current practical implementations. Furthermore, we analyze key challenges that impede the effective adaptation of VLMs to clinical applications and discuss potential directions for future research. We also provide an open-access repository of related literature to facilitate further research, available at https://github.com/haonenglin/Awesome-VLM-for-MIA. It is anticipated that this article can help researchers interested in harnessing VLMs for medical image analysis tasks better understand their capabilities and limitations, as well as current technical barriers, and thereby promote their innovative, robust, and safe application in clinical practice.
中文: 现代视觉语言模型在医学图像分析中潜力巨大,但面临领域差异和任务多样性等挑战,本综述旨在总结其适应策略、分析当前障碍并建议未来研究方向,以促进其在临床中的创新和安全应用。
English: Modern Vision-Language Models show great potential for medical image analysis but face challenges like domain gaps and task diversity, prompting this review to summarize adaptation strategies, analyze obstacles, and suggest future research directions to enhance their clinical application.
Authors:Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, Sungroh Yoon
Abstract:
Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC
中文: 本文提出了一种基于强化学习的后训练框架,显著提升了多模态大语言模型生成个性化图像描述的能力,尤其在现有方法表现不佳的复杂多概念场景中表现突出。
English: This paper introduces a reinforcement learning-based post-training framework that significantly improves multi-modal large language models' ability to generate personalized and accurate image captions, particularly excelling in complex multi-concept scenarios where existing methods falter.
Authors:Tongshun Zhang, Pingping Liu, Mengen Cai, Zijian Zhang, Yubing Lu, Qiuzhan Zhou
Abstract:
Current low-light image enhancement (LLIE) methods face significant limitations in simultaneously improving brightness while preserving semantic consistency, fine details, and computational efficiency. With the emergence of state-space models, particularly Mamba, image restoration has achieved remarkable performance, yet existing visual Mamba approaches flatten 2D images into 1D token sequences using fixed scanning rules, critically limiting interactions between distant tokens with causal relationships and constraining their ability to capture meaningful long-range dependencies. To address these fundamental limitations, we propose BSMamba, a novel visual Mamba architecture comprising two specially designed components: Brightness Mamba and Semantic Mamba. The Brightness Mamba revolutionizes token interaction patterns by prioritizing connections between distant tokens with similar brightness levels, effectively addressing the challenge of brightness restoration in LLIE tasks through brightness-guided selective attention. Complementing this, the Semantic Mamba establishes priority interactions between tokens sharing similar semantic meanings, allowing the model to maintain contextual consistency by connecting semantically related regions across the image, thus preserving the hierarchical nature of image semantics during enhancement. By intelligently modeling tokens based on brightness and semantic similarity rather than arbitrary scanning patterns, BSMamba transcends the constraints of conventional token sequencing while adhering to the principles of causal modeling. Extensive experiments demonstrate that BSMamba achieves state-of-the-art performance in LLIE while preserving semantic consistency. Code is available at https://github.com/bywlzts/BSMamba.
中文: BSMamba提出了一种新颖的视觉Mamba架构,包含亮度Mamba和语义Mamba组件,通过基于亮度和语义相似性的令牌交互优先机制,在增强低光图像的同时保持语义一致性,性能优于现有方法。
English: BSMamba introduces a novel visual Mamba architecture with Brightness Mamba and Semantic Mamba components, which prioritize interactions between tokens based on brightness and semantic similarity to enhance low-light images while maintaining semantic consistency and outperforming existing methods.
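The core re-sequencing idea, ordering tokens by brightness rather than a fixed raster scan so that distant but similarly lit regions become neighbors in the 1D sequence, can be sketched as a sort/gather pair with an inverse permutation that restores spatial order afterwards. The actual Brightness Mamba block is more involved; this is only the token-ordering skeleton.

```python
import torch

def brightness_ordered_scan(tokens: torch.Tensor, brightness: torch.Tensor):
    """Reorder tokens by brightness before a 1D sequence model.

    tokens:     (B, N, D) flattened image tokens.
    brightness: (B, N)    per-token mean luminance.
    Sorting places tokens with similar brightness adjacently, so a causal
    sequence model can relate distant-but-similar regions; the inverse
    permutation restores the original spatial order afterwards.
    """
    order = brightness.argsort(dim=1)                          # (B, N)
    sorted_tokens = torch.gather(
        tokens, 1, order.unsqueeze(-1).expand_as(tokens))
    inverse = order.argsort(dim=1)                             # undoes the permutation
    return sorted_tokens, inverse

tokens = torch.randn(2, 16, 32)
lum = tokens.mean(dim=-1)
seq, inv = brightness_ordered_scan(tokens, lum)
restored = torch.gather(seq, 1, inv.unsqueeze(-1).expand_as(seq))
assert torch.allclose(restored, tokens)
```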
Authors:Kawser Ahmed, Mir Shahriar Fardin, Md Arif Faysal Nayem, Fahim Hafiz, Swakkhar Shatabda
Abstract:
The increasing demand for underwater exploration and rescue operations drives the development of advanced wireless or semi-wireless underwater vessels equipped with manipulator arms. This paper presents the implementation of a semi-wireless underwater vehicle, "TritonZ", equipped with a manipulator arm and tailored for effective underwater exploration and rescue operations. The vehicle's compact design enables deployment in different submarine surroundings, addressing the need for wireless systems capable of navigating challenging underwater terrains. The manipulator arm can interact with the environment, allowing the robot to perform sophisticated tasks during exploration and rescue missions in emergency situations. TritonZ is equipped with various sensors, such as a Pi-Camera and humidity and temperature sensors, to send real-time environmental data. The vehicle, controlled with a customized remote controller, navigates efficiently in the water, while the Pi-Camera enables live streaming of the surroundings; motion control and video capture are performed simultaneously using this camera. The manipulator arm is designed to perform various tasks, such as grasping, manipulating, and collecting underwater objects. Experimental results show the efficacy of the proposed remotely operated vehicle in performing a variety of underwater exploration and rescue tasks. Additionally, the results show that TritonZ can maintain an average speed of 13.5 cm/s with a minimal delay of 2-3 seconds, and that the vehicle can withstand underwater waves while maintaining its position and average velocity. The full project details and source code can be accessed at this link: https://github.com/kawser-ahmed-byte/TritonZ
Chinese: 本文介绍了半无线水下航行器"TritonZ",它配备机械臂和实时环境数据传感器,能够在复杂水下环境中高效执行勘探与救援任务。
English: This paper introduces "TritonZ," a semi-wireless underwater vehicle equipped with a manipulator arm and sensors for real-time data transmission, designed to perform exploration and rescue tasks efficiently in challenging underwater environments.
Authors:Saad Wazir, Daeyoung Kim
Abstract:
Segmenting biomarkers in medical images is crucial for various biotech applications. Despite advances, Transformer and CNN based methods often struggle with variations in staining and morphology, limiting feature extraction. In medical image segmentation, where datasets often have limited sample availability, recent state-of-the-art (SOTA) methods achieve higher accuracy by leveraging pre-trained encoders, whereas end-to-end methods tend to underperform. This is due to challenges in effectively transferring rich multiscale features from encoders to decoders, as well as limitations in decoder efficiency. To address these issues, we propose an architecture that captures multi-scale local and global contextual information and a novel decoder design, which effectively integrates features from the encoder, emphasizes important channels and regions, and reconstructs spatial dimensions to enhance segmentation accuracy. Our method, compatible with various encoders, outperforms SOTA methods, as demonstrated by experiments on four datasets and ablation studies. Specifically, our method achieves absolute performance gains of 2.76% on MoNuSeg, 3.12% on DSB, 2.87% on Electron Microscopy, and 4.03% on TNBC datasets compared to existing SOTA methods. Code: https://github.com/saadwazir/MCADS-Decoder
中文: 我们提出的架构采用新型解码器,能有效捕捉多尺度上下文信息并整合编码器特征,从而在四个数据集上显著提升了医学图像分割的准确性,超越了现有最优方法。
English: Our proposed architecture with a novel decoder effectively captures multi-scale contextual information and integrates encoder features to enhance medical image segmentation, outperforming state-of-the-art methods across four datasets.
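The decoder's emphasis on important channels and regions is reminiscent of squeeze-and-excitation gating; the sketch below uses a generic SE-style channel gate inside a skip-fusing decoder block as an illustration, not the paper's actual MCADS decoder.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gate that reweights decoder channels."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # squeeze: global context
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                 # per-channel importance
        )

    def forward(self, x):
        return x * self.gate(x)

class DecoderBlock(nn.Module):
    """Fuse an upsampled decoder map with the matching encoder skip."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            ChannelAttention(out_ch),
        )

    def forward(self, x, skip):
        x = self.up(x)
        return self.fuse(torch.cat([x, skip], dim=1))

block = DecoderBlock(128, 64, 64)
out = block(torch.randn(1, 128, 32, 32), torch.randn(1, 64, 64, 64))
```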
Authors:Lixin Wu, Na Cai, Qiao Cheng, Jiachen Wang, Yitao Duan
Abstract:
We introduce Confucius3-Math, an open-source large language model with 14B parameters that (1) runs efficiently on a single consumer-grade GPU; (2) achieves SOTA performance on a range of mathematical reasoning tasks, outperforming many models with significantly larger sizes. In particular, as part of our mission to enhance education and knowledge dissemination with AI, Confucius3-Math is specifically committed to mathematics learning for Chinese K-12 students and educators. Built via post-training with large-scale reinforcement learning (RL), Confucius3-Math aligns with the national curriculum and excels at solving mainstream Chinese K-12 mathematical problems at low cost. In this report we share our development recipe, the challenges we encountered, and the techniques we developed to overcome them. In particular, we introduce three technical innovations: Targeted Entropy Regularization, Recent Sample Recovery, and Policy-Specific Hardness Weighting. These innovations encompass a new entropy regularization, a novel data scheduling policy, and an improved group-relative advantage estimator. Collectively, they significantly stabilize RL training, improve data efficiency, and boost performance. Our work demonstrates the feasibility of building strong reasoning models for a particular domain at low cost. We open-source our model and code at https://github.com/netease-youdao/Confucius3-Math.
Chinese: Confucius3-Math是一个开源140亿参数模型,可在消费级GPU上高效运行,在数学推理任务中达到顶尖水平,并针对中国K-12数学教育通过强化学习技术创新实现了低成本高性能。
English: Confucius3-Math is an open-source 14B-parameter model that efficiently runs on consumer GPUs and achieves state-of-the-art performance in mathematical reasoning, specifically tailored for Chinese K-12 education with innovations in reinforcement learning training.
Authors:Zhifeng Deng, P. -A. Absil, Kyle A. Gallivan, Wen Huang
Abstract:
The matrix exponential restricted to skew-symmetric matrices has numerous applications, notably in view of its interpretation as the Lie group exponential and Riemannian exponential for the special orthogonal group. We characterize the invertibility of the derivative of the skew-restricted exponential, thereby providing a simple expression of the tangent conjugate locus of the orthogonal group. In view of the skew restriction, this characterization differs from the classic result on the invertibility of the derivative of the exponential of real matrices. Based on this characterization, for every skew-symmetric matrix $A$ outside the (zero-measure) tangent conjugate locus, we explicitly construct the domain and image of a smooth inverse -- which we term \emph{nearby logarithm} -- of the skew-restricted exponential around $A$. This nearby logarithm reduces to the classic principal logarithm of special orthogonal matrices when $A=\mathbf{0}$. The symbolic formulae for the differentiation and its inverse are derived and implemented efficiently. The extensive numerical experiments show that the proposed formulae are up to $3.9$-times and $3.6$-times faster than the current state-of-the-art robust formulae for the differentiation and its inversion, respectively.
中文: 本研究刻画了斜对称矩阵指数导数可逆性的特征,提出了一种计算高效的“邻近对数”,在微分和逆运算速度上均优于现有方法。
English: This study characterizes the invertibility of the derivative of the skew-symmetric matrix exponential, introducing a computationally efficient "nearby logarithm" that outperforms existing methods in speed for differentiation and inversion.
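A small worked example of the objects involved: the exponential of a skew-symmetric matrix is a special orthogonal matrix, and near $A=\mathbf{0}$ the principal matrix logarithm inverts the map, which is the classic special case the nearby logarithm generalizes. The snippet uses SciPy's generic expm/logm, not the paper's faster formulae.

```python
import numpy as np
from scipy.linalg import expm, logm

# A random skew-symmetric matrix A (A.T == -A).
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M - M.T

Q = expm(A)                                # exp of skew-symmetric is orthogonal
print(np.allclose(Q @ Q.T, np.eye(4)))     # True: Q lies in SO(n)
print(np.isclose(np.linalg.det(Q), 1.0))   # True

# Near A = 0 the principal matrix logarithm inverts the map,
# matching the classic special case the paper generalizes.
A_small = 0.1 * (M - M.T)
print(np.allclose(logm(expm(A_small)), A_small))   # True
```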
Authors:Muhammad Usama, Hee-Deok Jang, Soham Shanbhag, Yoo-Chang Sung, Seung-Jun Bae, Dong Eui Chang
Abstract:
This paper addresses the dual challenge of improving anomaly detection and signal integrity in high-speed dynamic random access memory signals. To achieve this, we propose a joint training framework that integrates an autoencoder with a classifier to learn more distinctive latent representations by focusing on valid data features. Our approach is evaluated across three anomaly detection algorithms and consistently outperforms two baseline methods. Detailed ablation studies further support these findings. Furthermore, we introduce a signal integrity enhancement algorithm that improves signal integrity by an average of 11.3%. The source code and data used in this study are available at https://github.com/Usama1002/learning-latent-representations.
Chinese: 本文提出了一种结合自动编码器和分类器的联合训练框架,用于提升高速动态随机存取存储器信号的异常检测与信号完整性,其性能优于基线方法并实现了11.3%的平均信号改善。
English: This paper introduces a joint training framework combining an autoencoder and classifier to enhance anomaly detection and signal integrity in high-speed DRAM signals, demonstrating superior performance over baselines and achieving an 11.3% average signal improvement.
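A minimal sketch of the joint training idea: an autoencoder whose latent code also feeds a classifier, so reconstruction and classification shape the same representation. The dimensions and the loss weight alpha are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class JointAE(nn.Module):
    """Autoencoder whose latent code is also trained by a classifier head."""
    def __init__(self, in_dim=256, latent=32, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))
        self.classifier = nn.Linear(latent, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = JointAE()
x = torch.randn(8, 256)                  # e.g. windows of DRAM signal samples
y = torch.randint(0, 2, (8,))            # valid vs anomalous labels
recon, logits = model(x)
# Joint objective: reconstruction plus classification, weighted by alpha.
alpha = 0.5   # assumed trade-off weight
loss = nn.functional.mse_loss(recon, x) \
     + alpha * nn.functional.cross_entropy(logits, y)
loss.backward()
```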
Authors:Min Yin, Haoyu Liu, Boyi Lian, Chunlei Chai
Abstract:
This study introduces Co-Persona, a methodological framework bridging large-scale social media analysis with authentic user understanding through systematic integration of Large Language Models and expert validation. Through a case study of B.Co, a Chinese manufacturer, we investigated Co-Persona application in bedside lamp development. Our methodology analyzed over 38 million posts from Xiao Hongshu, employing multi-stage data processing combining advanced NLP with expert validation. Analysis revealed five user personas derived from bedtime behaviors: Health Aficionados, Night Owls, Interior Decorators, Child-care Workers, and Workaholics-each showing unique pre-sleep activities and product preferences. Findings demonstrate Co-Persona enhances manufacturers' ability to process large datasets while maintaining user understanding. The methodology provides structured approaches for targeted marketing and product strategies. Research contributes to theoretical understanding of data-driven persona development and practical applications in consumer-driven innovation. Code and data available at https://github.com/INFPa/LLMwithPersona.
中文: 本研究提出Co-Persona框架,通过整合大规模社交媒体分析与大语言模型及专家验证,识别出五类床头灯用户画像,为企业提供了数据驱动的产品开发和精准营销策略。
English: This study presents Co-Persona, a framework combining large-scale social media analysis with LLMs and expert validation to identify user personas for targeted product development, as demonstrated through a bedside lamp case study revealing five distinct user types.
Authors:Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address the challenge, our key observation is that LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to make it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments in four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. Notably, RLPR outperforms concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses strong verifier-model-dependent approaches General-Reasoner by 1.6 average points across seven benchmarks.
Chinese: RLPR是一种无需验证器的强化学习框架,它利用大语言模型自身对参考答案的标记概率作为奖励信号,有效提升了在通用领域和数学领域的推理能力,并超越了现有方法的性能。
English: RLPR is a verifier-free reinforcement learning framework that uses an LLM's own token probability scores for reference answers as the reward signal, effectively enhancing reasoning capabilities across both general and mathematical domains while outperforming existing methods.
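The core reward can be sketched directly: score a rollout by the probability the policy itself assigns to the reference-answer tokens. RLPR additionally applies prob-to-reward shaping and variance-stabilizing methods that this minimal version omits.

```python
import torch
import torch.nn.functional as F

def sequence_prob_reward(logits: torch.Tensor, answer_ids: torch.Tensor):
    """Reward = mean token probability the model assigns to the reference answer.

    logits:     (T, V) model logits at the positions that generate the answer.
    answer_ids: (T,)   token ids of the reference answer.
    Averaging probabilities (rather than summing log-probs) keeps the
    reward in [0, 1]; the paper further stabilizes this noisy signal.
    """
    probs = F.softmax(logits, dim=-1)                       # (T, V)
    token_probs = probs.gather(1, answer_ids.unsqueeze(1))  # (T, 1)
    return token_probs.mean()

# Toy check: logits that strongly prefer the reference tokens give reward ~1.
ids = torch.tensor([3, 7, 1])
logits = torch.full((3, 10), -5.0)
logits[torch.arange(3), ids] = 5.0
print(sequence_prob_reward(logits, ids))   # close to 1.0
```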
Authors:Chao Li, Jiawei Fan, Anbang Yao
Abstract:
In this paper, we present Morse, a simple dual-sampling framework for accelerating diffusion models losslessly. The key insight of Morse is to reformulate the iterative generation (from noise to data) process by taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models called Dash and Dot that interact with each other. The Dash model is just the pre-trained diffusion model of any type, but operates in a jump sampling regime, creating sufficient space for sampling efficiency improvement. The Dot model is significantly faster than the Dash model and learns to generate residual feedback conditioned on the observations at the current jump sampling point on the trajectory of the Dash model, lifting the noise estimate to easily match the next-step estimate of the Dash model without jump sampling. By chaining the outputs of the Dash and Dot models run in a time-interleaved fashion, Morse flexibly attains the desired image generation performance while improving overall runtime efficiency. With our proposed weight-sharing strategy between the Dash and Dot models, Morse is efficient for training and inference. Our method shows a lossless speedup of 1.78X to 3.31X on average over a wide range of sampling step budgets relative to 9 baseline diffusion models on 6 image generation tasks. Furthermore, we show that our method can also be generalized to improve the Latent Consistency Model (LCM-SDXL, which is already accelerated with the consistency distillation technique) tailored for few-step text-to-image synthesis. The code and models are available at https://github.com/deep-optimization/Morse.
中文: Morse是一种无损加速扩散模型的双采样框架,通过跳跃采样与自适应残差反馈相结合,在保持图像质量的同时显著提升了生成效率。
English: Morse is a dual-sampling framework that accelerates diffusion models losslessly by combining fast jump sampling with adaptive residual feedback, achieving significant speedups while maintaining image quality.
Authors:Ankit Sanjyal
Abstract:
Neural Radiance Fields (NeRF) have revolutionized 3D scene reconstruction from sparse image collections. Recent work has explored integrating pre-trained vision features, particularly from DINO, to enhance few-shot reconstruction capabilities. However, the effectiveness of such approaches remains unclear, especially in extreme few-shot scenarios. In this paper, we present a systematic evaluation of DINO-enhanced NeRF models, comparing baseline NeRF, frozen DINO features, LoRA fine-tuned features, and multi-scale feature fusion. Surprisingly, our experiments reveal that all DINO variants perform worse than the baseline NeRF, achieving PSNR values around 12.9 to 13.0 compared to the baseline's 14.71. This counterintuitive result suggests that pre-trained vision features may not be beneficial for few-shot 3D reconstruction and may even introduce harmful biases. We analyze potential causes including feature-task mismatch, overfitting to limited data, and integration challenges. Our findings challenge common assumptions in the field and suggest that simpler architectures focusing on geometric consistency may be more effective for few-shot scenarios.
中文: DINO增强的NeRF模型在少样本三维重建中意外表现不如基线NeRF,表明预训练视觉特征可能带来有害偏差而非提升效果。
English: DINO-enhanced NeRF models surprisingly underperform baseline NeRF in few-shot 3D reconstruction, suggesting pre-trained vision features may introduce harmful biases rather than improvements.
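For reference, here is the PSNR metric behind these numbers; the comment works out how large the reported gap is in MSE terms.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# A gap of 14.71 vs ~13.0 dB corresponds to roughly 1.5x higher MSE:
# mse ratio = 10 ** ((14.71 - 13.0) / 10) ~= 1.48.
a, b = np.zeros((8, 8)), np.full((8, 8), 0.1)
print(psnr(b, a))   # 20 dB: mse = 0.01 with max_val = 1
```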
Authors:Youjie Zhou, Guofeng Mei, Yiming Wang, Yi Wan, Fabio Poiesi
Abstract:
Visual SLAM is particularly challenging in environments affected by noise, varying lighting conditions, and darkness. Learning-based optical flow algorithms can leverage multiple modalities to address these challenges, but traditional optical-flow-based visual SLAM approaches often require significant computational resources. To overcome this limitation, we propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes the fast Fourier transform (FFT) to enhance algorithmic efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. We further enhance the interaction of multimodal features by incorporating multi-scale knowledge distillation across modalities. We also demonstrate the practical feasibility of FMF-SLAM in real-world scenarios with real-time performance by integrating it into a security robot, fusing it with a GNSS-RTK global positioning module and global bundle adjustment. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying lighting, and dark conditions. Our code and datasets are available at https://github.com/youjie-zhou/FMF-SLAM.git.
中文: FMF-SLAM是一种高效的多模态融合SLAM方法,利用快速傅里叶变换和新型注意力机制,在噪声、变化光照和黑暗条件下提升性能,实现了最先进的结果并具备实时可行性。
English: FMF-SLAM is an efficient multimodal fusion SLAM method that uses fast Fourier transform and novel attention mechanisms to enhance performance in noisy, varying lighting, and dark conditions, achieving state-of-the-art results with real-time feasibility.
Authors:Zih-Hao Huang, You-Teng Lin, Hung-Hsuan Chen
Abstract:
This paper introduces Decoupled Supervised Learning with Information Regularization (DeInfoReg), a novel approach that transforms a long gradient flow into multiple shorter ones, thereby mitigating the vanishing gradient problem. Integrating a pipeline strategy, DeInfoReg enables model parallelization across multiple GPUs, significantly improving training throughput. We compare our proposed method with standard backpropagation and other gradient flow decomposition techniques. Extensive experiments on diverse tasks and datasets demonstrate that DeInfoReg achieves superior performance and better noise resistance than traditional BP models and efficiently utilizes parallel computing resources. The code for reproducibility is available at: https://github.com/ianzih/Decoupled-Supervised-Learning-for-Information-Regularization/.
中文: 本文提出的DeInfoReg方法通过将长梯度流分解为短片段解决梯度消失问题,并利用多GPU流水线并行策略显著提升训练效率和模型性能。
English: This paper presents DeInfoReg, a method that breaks long gradient flows into shorter segments to combat vanishing gradients and leverages pipeline parallelism across GPUs for enhanced training efficiency and performance.
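The gradient-flow decomposition can be sketched with detach(): each block gets a local loss, and detaching the activation passed forward keeps every gradient path short. The information-regularization objective and the pipeline parallelism of DeInfoReg are omitted from this skeleton.

```python
import torch
import torch.nn as nn

# Two blocks trained with local losses; detach() cuts the gradient flow
# between them, so each block's gradient path stays short.
block1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
head1 = nn.Linear(64, 10)        # local auxiliary classifier for block1
block2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
head2 = nn.Linear(64, 10)        # final classifier

opt = torch.optim.SGD([*block1.parameters(), *head1.parameters(),
                       *block2.parameters(), *head2.parameters()], lr=0.1)
ce = nn.CrossEntropyLoss()

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
h1 = block1(x)
loss1 = ce(head1(h1), y)          # gradients reach block1 only
h2 = block2(h1.detach())          # detach: block2's loss never backprops into block1
loss2 = ce(head2(h2), y)
(loss1 + loss2).backward()
opt.step()
```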
Authors:Donghyun Lee, Yuhang Li, Ruokai Yin, Shiting Xiao, Priyadarshini Panda
Abstract:
State Space Models (SSMs) have emerged as powerful alternatives to attention-based Transformers, with Mamba demonstrating impressive efficiency and scalability. As these models grow increasingly larger, the need for Parameter-Efficient Fine-Tuning (PEFT) methods becomes critical to adapt pre-trained Mamba to downstream tasks without prohibitive computational costs. However, previous approaches simply apply traditional Transformer-tailored PEFT methods without addressing the unique temporal processing dynamics of SSMs. To address this limitation, we propose Memba, a membrane-driven PEFT approach specifically designed for Mamba. Memba introduces Leaky Integrate Membrane (LIM) neurons as bio-inspired gating mechanisms that naturally accumulate membrane potentials over time, enhancing selective information retention. By strategically combining LIM neurons with Low-Rank Adaptations (LoRA) and cross-layer membrane transfer, our approach significantly improves Mamba's temporal modeling capabilities. Extensive experiments across language and vision tasks demonstrate that Memba achieves substantial improvements over existing PEFT methods. The code is available at https://github.com/Intelligent-Computing-Lab-Yale/Memba.
中文: 状态空间模型如Mamba是Transformer的高效替代方案,而提出的Memba方法通过引入具有膜电位的仿生门控机制,显著提升了参数高效微调中的时序建模能力。
English: State Space Models like Mamba offer efficient alternatives to Transformers, and the proposed Memba method introduces bio-inspired gating mechanisms with membrane potentials to enhance temporal modeling through Parameter-Efficient Fine-Tuning.
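A minimal reading of the Leaky Integrate Membrane idea: a membrane potential accumulates inputs over time with a leak and is squashed into a gate. The recurrence below (with an assumed leak factor beta) illustrates the accumulation only, not Memba's exact module or its coupling with LoRA and cross-layer membrane transfer.

```python
import torch

def lim_gate(inputs: torch.Tensor, beta: float = 0.9) -> torch.Tensor:
    """Leaky-integrate membrane over a sequence, used as a soft gate.

    inputs: (B, T, D). The membrane potential accumulates each step and
    leaks with factor beta, so the gate reflects recent history rather
    than the current token alone.
    """
    B, T, D = inputs.shape
    membrane = torch.zeros(B, D)
    gates = []
    for t in range(T):
        membrane = beta * membrane + (1 - beta) * inputs[:, t]
        gates.append(torch.sigmoid(membrane))
    return torch.stack(gates, dim=1)       # (B, T, D) gating values in (0, 1)

x = torch.randn(2, 5, 8)
g = lim_gate(x)
gated = x * g    # the gate modulates the sequence features
```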
Authors:Quan Zhou, Gan Luo, Qiang Hu, Qingyong Zhang, Jinhua Zhang, Yinjiao Tian, Qiang Li, Zhiwei Wang
Abstract:
Polyp detection is crucial for colorectal cancer screening, yet existing models are limited by the scale and diversity of available data. While generative models show promise for data augmentation, current methods mainly focus on enhancing polyp diversity, often overlooking the critical issue of false positives. In this paper, we address this gap by proposing an adversarial diffusion framework to synthesize high-value false positives. The extensive variability of negative backgrounds presents a significant challenge in false positive synthesis. To overcome this, we introduce two key innovations: First, we design a regional noise matching strategy to construct a negative synthesis space using polyp detection datasets. This strategy trains a negative-centric diffusion model by masking polyp regions, ensuring the model focuses exclusively on learning diverse background patterns. Second, we introduce the Detector-guided Adversarial Diffusion Attacker (DADA) module, which perturbs the negative synthesis process to disrupt a pre-trained detector's decision, guiding the negative-centric diffusion model to generate high-value, detector-confusing false positives instead of low-value, ordinary backgrounds. Our approach is the first to apply adversarial diffusion to lesion detection, establishing a new paradigm for targeted false positive synthesis and paving the way for more reliable clinical applications in colorectal cancer screening. Extensive results on public and in-house datasets verify the superiority of our method over current state-of-the-art methods, with our synthesized data improving the detectors by at least 2.6% and 2.7% in F1-score, respectively, over the baselines. Code is available at https://github.com/Huster-Hq/DADA.
Chinese: 本文提出了一种对抗性扩散框架,通过合成高价值假阳性样本以增强息肉检测能力,该方法专注于多样背景模式并干扰检测器决策,显著超越了现有技术的检测性能。
English: This paper introduces an adversarial diffusion framework that synthesizes high-value false positives to enhance polyp detection by focusing on diverse background patterns and confusing detectors, significantly improving detection performance over existing methods.
Authors:Yicheng Fu, Zhemin Huang, Liuxin Yang, Yumeng Lu, Zhongdongming Dai
Abstract:
Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks: multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy on Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and source code are available at: https://github.com/sofyc/ChengyuBench.
Chinese Summary: Chengyu-Bench是一个全面评估语言模型对中文成语理解能力的基准测试,包含三项任务,结果表明模型虽能准确判断成语情感倾向,但在语境运用和语义理解方面仍有明显不足。
English Summary: Chengyu-Bench is a comprehensive benchmark designed to evaluate language models' understanding of Chinese idioms through three tasks, revealing that while models excel at identifying sentiment, they struggle with contextual usage and meaning comprehension.
Authors:Fuyu Wang, Jiangtong Li, Kun Zhu, Changjun Jiang
Abstract:
With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions, including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement, thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) $\textbf{InspireScore}$, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance) alongside two objective metrics (fact authenticity and logical validity); and (2) $\textbf{InspireDebate}$, an optimized debating framework employing a phased optimization approach through CoT reasoning enhancement, multi-dimensional Direct Preference Optimization (DPO), and real-time knowledge grounding via web-based Retrieval Augmented Generation (Web-RAG). Empirical evaluations demonstrate that $\textbf{InspireScore}$ achieves 44$\%$ higher correlation with expert judgments compared to existing methods, while $\textbf{InspireDebate}$ shows significant improvements, outperforming baseline models by 57$\%$. Source code is available at https://github.com/fywang12/InspireDebate.
中文摘要:本文提出包含InspireScore评估系统和InspireDebate辩论框架的双组件方案,通过建立多维度评估架构和分阶段优化方法,有效解决了现有大语言模型辩论系统在客观评估和结构化优化方面的不足,实验表明其与专家判断相关性提升44%,性能较基线模型提高57%。
English Summary: This paper introduces a dual-component framework, InspireScore and InspireDebate, to address limitations in current LLM-based debating systems by implementing multi-dimensional evaluation metrics and phased optimization techniques, achieving significantly higher correlation with expert judgments and performance improvements over baseline models.
Authors:Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, Jun Wang
Abstract:
The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of structured analytical reports. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute Deep Research agents. We begin by reviewing information acquisition strategies, contrasting API-based retrieval methods with browser-based exploration. We then examine modular tool-use frameworks, including code execution, multimodal input processing, and the integration of Model Context Protocols (MCPs) to support extensibility and ecosystem development. To systematize existing approaches, we propose a taxonomy that differentiates between static and dynamic workflows, and we classify agent architectures based on planning strategies and agent composition, including single-agent and multi-agent configurations. We also provide a critical evaluation of current benchmarks, highlighting key limitations such as restricted access to external knowledge, sequential execution inefficiencies, and misalignment between evaluation metrics and the practical objectives of DR agents. Finally, we outline open challenges and promising directions for future research. A curated and continuously updated repository of DR agent research is available at: {https://github.com/ai-agents-2030/awesome-deep-research-agent}.
中文摘要:本文深入分析了深度研究智能体的核心技术架构,这种自主AI系统通过动态推理和多步骤规划执行复杂研究任务,同时评估了现有基准测试的局限性并指明了未来研究方向。
English Summary: This paper analyzes the core technologies and architectures of Deep Research agents, autonomous AI systems that perform complex research tasks through dynamic reasoning and multi-step planning, while also evaluating current benchmarks and outlining future research directions.
Authors:Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, Yao Mu
Abstract:
Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language, enhancing data diversity and policy robustness. The framework is instantiated across 50 dual-arm tasks and five robot embodiments. Empirically, it yields a 10.9% gain in code generation success rate. For downstream policy learning, a VLA model trained with synthetic data plus only 10 real demonstrations achieves a 367% relative improvement over the 10-demo baseline, while zero-shot models trained solely on synthetic data obtain a 228% gain. These results highlight the effectiveness of RoboTwin 2.0 in strengthening sim-to-real transfer and robustness to environmental variations. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation. Project Page: https://robotwin-platform.github.io/, Code: https://github.com/robotwin-Platform/robotwin/.
中文:RoboTwin 2.0通过自动化任务合成和领域随机化提出了可扩展的双臂操作数据生成框架,在仿真到实物的迁移和政策鲁棒性方面实现了显著提升。
English: RoboTwin 2.0 introduces a scalable framework for generating diverse bimanual manipulation data through automated task synthesis and domain randomization, achieving significant improvements in sim-to-real transfer and policy robustness.
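The five randomization axes listed in the abstract map naturally onto a per-episode configuration sampler; the value ranges and options below are illustrative placeholders, not RoboTwin 2.0's actual distributions.

```python
import random

def sample_scene_config(rng: random.Random) -> dict:
    """Draw one randomized scene along the five axes named in the paper.

    All ranges and choices here are assumed for illustration.
    """
    return {
        "clutter_objects": rng.randint(0, 12),           # distractors on the table
        "lighting_intensity": rng.uniform(0.3, 1.5),     # relative brightness
        "background": rng.choice(["plain", "kitchen", "workshop", "office"]),
        "tabletop_height_m": rng.uniform(0.70, 0.90),
        "instruction_template": rng.choice([
            "pick up the {obj} with the left arm",
            "hand the {obj} from right to left",
            "place the {obj} into the container",
        ]),
    }

rng = random.Random(42)
configs = [sample_scene_config(rng) for _ in range(3)]   # one per episode
```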
Authors:Wenzhuo Liu, Yicheng Qiao, Zhen Wang, Qiannan Guo, Zilong Chen, Meihua Zhou, Xinran Li, Letian Wang, Zhiwei Li, Huaping Liu, Wenshuo Wang
Abstract:
Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the most relevant modality features for each task, effectively alleviating the negative transfer problem in MTL. Evaluated on the AIDE dataset, our proposed model achieves state-of-the-art accuracy across all four tasks, maintaining a lightweight architecture with fewer than 6 million parameters and delivering an impressive 142.32 FPS inference speed. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available at https://github.com/Wenzhuo-Liu/TEM3-Learning.
Chinese: 本文提出TEM^3-Learning多模态多任务学习框架,通过高效时空特征提取和自适应模态融合,在四个辅助驾驶任务中实现最优性能,同时保持高速度和低计算成本。
English: This paper introduces TEM^3-Learning, a multimodal multi-task framework that efficiently integrates temporal-spatial feature extraction and adaptive modality fusion to achieve state-of-the-art performance in four assistive driving tasks while maintaining high speed and low computational cost.
Authors:Jisheng Dang, Huilin Song, Junbin Xiao, Bimei Wang, Han Peng, Haoxuan Li, Xun Yang, Meng Wang, Tat-Seng Chua
Abstract:
Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection, and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths on the interplay of grounding and QA agents in different chronological orders, along with a dedicated reflection agent that judges and aggregates the multi-path results to accomplish consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA's effectiveness towards trustworthy video-language understanding. Our code is available at https://github.com/longmalongma/MUPA.
中文摘要:MUPA框架通过多路径推理和反思代理协同工作,在保持答案准确性的同时显著提升了视频问答的视觉依据可靠性,以更少参数实现卓越性能。
English Summary: The proposed MUPA framework enhances Grounded VideoQA by integrating multi-path reasoning and reflection agents, achieving superior grounding fidelity and answer accuracy with fewer parameters.
Authors:Hangzhou He, Jiachen Tang, Lei Zhu, Kaiwen Li, Yanye Lu
Abstract:
Deep learning-based medical image classification techniques are rapidly advancing in medical image analysis, making it crucial to develop accurate and trustworthy models that can be efficiently deployed across diverse clinical scenarios. Concept Bottleneck Models (CBMs), which first predict a set of explainable concepts from images and then perform classification based on these concepts, are increasingly being adopted for explainable medical image classification. However, the inherent explainability of CBMs introduces new challenges when deploying trained models to new environments. Variations in imaging protocols and staining methods may induce concept-level shifts, such as alterations in color distribution and scale. Furthermore, since CBM training requires explicit concept annotations, fine-tuning models solely with image-level labels could compromise concept prediction accuracy and faithfulness, a critical limitation given the high cost of acquiring expert-annotated concept labels in medical domains. To address these challenges, we propose a training-free confusion concept identification strategy. By leveraging minimal new data (e.g., 4 images per class) with only image-level labels, our approach enhances out-of-domain performance without sacrificing source domain accuracy through two key operations: masking misactivated confounding concepts and amplifying under-activated discriminative concepts. The efficacy of our method is validated on both skin and white blood cell images. Our code is available at: https://github.com/riverback/TF-TTI-XMed.
中文: 提出的免训练混淆概念识别策略通过屏蔽混淆概念和增强判别性概念,仅使用少量带图像级标签的新数据,即可提升概念瓶颈模型在医学图像分类中的跨域性能。
English: The proposed training-free confusion concept identification strategy enhances out-of-domain performance for Concept Bottleneck Models in medical image classification by masking confounding concepts and amplifying discriminative ones, using minimal new data with only image-level labels.
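The two key operations, masking misactivated confounding concepts and amplifying under-activated discriminative ones, can be sketched as simple edits to a CBM's concept-activation vector. Which indices fall in each group would be identified from the few labeled target-domain images; the gain value here is an assumption.

```python
import numpy as np

def adjust_concepts(activations, confusing_idx, discriminative_idx, gain=1.5):
    """Training-free correction of a CBM's concept activations.

    activations:        (N_concepts,) predicted concept scores for one image.
    confusing_idx:      concepts found to misfire under the domain shift -> masked.
    discriminative_idx: under-activated but class-relevant concepts -> amplified.
    The gain value is an illustrative choice, not taken from the paper.
    """
    adjusted = activations.copy()
    adjusted[confusing_idx] = 0.0          # mask misactivated confounders
    adjusted[discriminative_idx] *= gain   # boost useful concepts
    return adjusted

scores = np.array([0.9, 0.1, 0.7, 0.4])
fixed = adjust_concepts(scores, confusing_idx=[0], discriminative_idx=[3])
# The downstream linear classifier then runs on `fixed` instead of `scores`.
```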
Authors:Xiangfei Qiu, Zhe Li, Wanghui Qiu, Shiyan Hu, Lekui Zhou, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Aoying Zhou, Zhenli Sheng, Jilin Hu, Christian S. Jensen, Bin Yang
Abstract:
Time series anomaly detection (TSAD) plays an important role in many domains such as finance, transportation, and healthcare. With the ongoing instrumentation of reality, more time series data will be available, leading also to growing demands for TSAD. While many TSAD methods already exist, new and better methods are still desirable. However, effective progress hinges on the availability of reliable means of evaluating new methods and comparing them with existing methods. We address deficiencies in current evaluation procedures related to datasets and experimental settings and protocols. Specifically, we propose a new time series anomaly detection benchmark, called TAB. First, TAB encompasses 29 public multivariate datasets and 1,635 univariate time series from different domains to facilitate more comprehensive evaluations on diverse datasets. Second, TAB covers a variety of TSAD methods, including Non-learning, Machine learning, Deep learning, LLM-based, and Time-series pre-trained methods. Third, TAB features a unified and automated evaluation pipeline that enables fair and easy evaluation of TSAD methods. Finally, we employ TAB to evaluate existing TSAD methods and report on the outcomes, thereby offering a deeper insight into the performance of these methods. Besides, all datasets and code are available at https://github.com/decisionintelligence/TAB.
中文: 该摘要介绍了TAB这一新型时间序列异常检测基准,它整合了多样化数据集、多种检测方法类型和统一评估流程,旨在解决现有评估缺陷并提供公平的性能比较。
English: The abstract introduces TAB, a new benchmark for time series anomaly detection that includes diverse datasets, multiple method types, and a unified evaluation pipeline to address current evaluation deficiencies and provide fair performance comparisons.
Authors:Fenghe Tang, Wenxin Ma, Zhiyang He, Xiaodong Tao, Zihang Jiang, S. Kevin Zhou
Abstract:
With the advancement of Large Language Models (LLMs) for natural language processing, this paper presents an intriguing finding: a frozen pre-trained LLM layer can process visual tokens for medical image segmentation tasks. Specifically, we propose a simple hybrid structure that integrates a pre-trained, frozen LLM layer within the CNN encoder-decoder segmentation framework (LLM4Seg). Surprisingly, this design improves segmentation performance with a minimal increase in trainable parameters across various modalities, including ultrasound, dermoscopy, polypscopy, and CT scans. Our in-depth analysis reveals the potential of transferring the LLM's semantic awareness to enhance segmentation tasks, offering both improved global understanding and better local modeling capabilities. The improvement proves robust across different LLMs, validated using LLaMA and DeepSeek.
中文: 本文发现将预训练的大型语言模型冻结层整合到CNN分割框架中,能在仅少量增加参数的情况下提升多种医学影像模态的分割性能,通过迁移语言模型的语义感知能力增强全局理解和局部建模。
English: This paper demonstrates that integrating a frozen pre-trained Large Language Model layer into a CNN-based medical image segmentation framework enhances performance across multiple imaging modalities with minimal parameter increase, leveraging the LLM's semantic awareness for improved global and local modeling.
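A hedged sketch of the hybrid design described above: a frozen transformer block inserted between a CNN encoder and decoder. Here `nn.TransformerEncoderLayer` stands in for an actual pre-trained LLM layer, and all dimensions are illustrative:

```python
# Minimal sketch of the LLM4Seg idea, not the paper's exact architecture.
import torch
import torch.nn as nn

class FrozenLLMBridge(nn.Module):
    def __init__(self, cnn_channels=256, llm_dim=768):
        super().__init__()
        self.proj_in = nn.Linear(cnn_channels, llm_dim)
        self.llm_layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=8, batch_first=True)
        for p in self.llm_layer.parameters():    # freeze the "LLM" block
            p.requires_grad = False
        self.proj_out = nn.Linear(llm_dim, cnn_channels)

    def forward(self, feat):                     # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)      # (B, H*W, C) as tokens
        tokens = self.proj_out(self.llm_layer(self.proj_in(tokens)))
        return tokens.transpose(1, 2).view(b, c, h, w) + feat  # residual

x = torch.randn(2, 256, 16, 16)
print(FrozenLLMBridge()(x).shape)  # torch.Size([2, 256, 16, 16])
```

Only the two projection layers are trainable, which matches the abstract's claim of a minimal increase in trainable parameters.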
Authors:Junjian Li, Hulin Kuang, Jin Liu, Hailin Yue, Mengshen He, Jianxin Wang
Abstract:
Multiple instance learning (MIL) has shown significant promise in histopathology whole slide image (WSI) analysis for cancer diagnosis and prognosis. However, the inherent spatial heterogeneity of WSIs presents critical challenges, as morphologically similar tissue types are often dispersed across distant anatomical regions. Conventional MIL methods struggle to model these scattered tissue distributions and capture cross-regional spatial interactions effectively. To address these limitations, we propose a novel Multiple instance learning framework with Context-Aware Clustering (MiCo), designed to enhance cross-regional intra-tissue correlations and strengthen inter-tissue semantic associations in WSIs. MiCo begins by clustering instances to distill discriminative morphological patterns, with cluster centroids serving as semantic anchors. To enhance cross-regional intra-tissue correlations, MiCo employs a Cluster Route module, which dynamically links instances of the same tissue type across distant regions via feature similarity. These semantic anchors act as contextual hubs, propagating semantic relationships to refine instance-level representations. To eliminate semantic fragmentation and strengthen inter-tissue semantic associations, MiCo integrates a Cluster Reducer module, which consolidates redundant anchors while enhancing information exchange between distinct semantic groups. Extensive experiments on two challenging tasks across nine large-scale public cancer datasets demonstrate the effectiveness of MiCo, showcasing its superiority over state-of-the-art methods. The code is available at https://github.com/junjianli106/MiCo.
中文摘要:提出的MiCo框架通过聚类实例捕捉跨区域组织相关性并整合语义锚点,增强了组织病理学图像的多示例学习能力,在九个癌症数据集上展现出优越性能。
English Summary: The proposed MiCo framework enhances multiple instance learning for histopathology image analysis by clustering instances to capture cross-regional tissue correlations and consolidating semantic anchors, demonstrating superior performance across nine cancer datasets.
Authors:Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang, Yi Liu
Abstract:
This report introduces PP-DocBee2, an advanced version of PP-DocBee designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an $11.4\%$ performance boost on internal benchmarks for Chinese business documents, and reduce inference latency by $73.0\%$ compared to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT representational capacity by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at \href{https://github.com/PaddlePaddle/PaddleMIX}{https://github.com/PaddlePaddle/PaddleMIX}.
Chinese: PP-DocBee2通过优化数据质量和视觉特征融合策略,在中文商务文档理解任务中性能提升11.4%,推理延迟降低73.0%,显著提升了多模态文档理解能力。
English: PP-DocBee2 significantly improves multimodal document understanding with enhanced data quality and feature fusion, boosting performance by 11.4% and reducing latency by 73.0% compared to its predecessor.
Authors:Chenyue Song, Chen Hui, Qing Lin, Wei Zhang, Siqiao Li, Haiqi Zhu, Zhixuan Li, Shengping Zhang, Shaohui Liu, Feng Jiang, Xiang Li
Abstract:
Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of latent variables. To deal with these issues, we propose a prediction-based end-to-end lossless medical image compression method named LVPNet, leveraging global latent variables to predict pixel values and encoding predicted probabilities for lossless compression. Specifically, we introduce the Global Multi-scale Sensing Module (GMSM), which extracts compact and informative latent representations from the entire image, effectively capturing spatial dependencies within the latent space. Furthermore, to mitigate the information loss introduced during quantization, we propose the Quantization Compensation Module (QCM), which learns the distribution of quantization errors and refines the quantized features to compensate for quantization loss. Extensive experiments on challenging benchmarks demonstrate that our method achieves superior compression efficiency compared to state-of-the-art lossless image compression approaches, while maintaining competitive inference speed. The code is at https://github.com/scy-Jackel/LVPNet.
中文: 提出的LVPNet方法通过全局潜在变量预测像素值,并采用多尺度感知模块和量化补偿模块来捕捉空间依赖性和弥补量化损失,从而在无损医学图像压缩中实现了更优的压缩效率。
English: The proposed LVPNet method enhances lossless medical image compression by utilizing global latent variables for pixel prediction and incorporating modules to capture spatial dependencies and compensate for quantization errors, achieving superior compression efficiency.
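A rough sketch of the quantization-compensation idea, under assumptions: uniform quantization with a straight-through estimator, and a small convolutional head (`QuantCompensation`, a hypothetical name) that learns to refine the quantized features. This is not the paper's exact GMSM/QCM design:

```python
# Hedged illustration of compensating quantization error on latent features.
import torch
import torch.nn as nn

class QuantCompensation(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, latent, levels=16):
        q = torch.round(latent * levels) / levels    # uniform quantization
        # straight-through estimator keeps gradients flowing to the encoder
        q = latent + (q - latent).detach()
        return q + self.refine(q)                    # learn to undo the error

z = torch.randn(1, 64, 8, 8)
print(QuantCompensation()(z).shape)  # torch.Size([1, 64, 8, 8])
```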
Authors:Mischa Dombrowski, Bernhard Kainz
Abstract:
Synthetic data has recently reached a level of visual fidelity that makes it nearly indistinguishable from real data, offering great promise for privacy-preserving data sharing in medical imaging. However, fully synthetic datasets still suffer from significant limitations: First and foremost, the legal aspect of sharing synthetic data is often neglected, and data regulations, such as the GDPR, are largely ignored. Secondly, synthetic models fall short of matching the performance of real data, even for in-domain downstream applications. Recent methods for image generation have focused on maximising image diversity rather than fidelity, solely to improve mode coverage and therefore the downstream performance of synthetic data. In this work, we shift perspective and highlight how maximizing diversity can also be interpreted as protecting natural persons from being singled out, which leads to predicate singling-out (PSO) secure synthetic datasets. Specifically, we propose a generalisable framework for training diffusion models on personal data which leads to unpersonal synthetic datasets achieving performance within one percentage point of real-data models while significantly outperforming state-of-the-art methods that do not ensure privacy. Our code is available at https://github.com/MischaD/Trichotomy.
中文: 本研究提出了一种基于隐私保护的扩散模型训练框架,能够生成既接近真实数据性能又防止个人被识别的合成数据集,其表现显著优于未保障隐私的现有方法。
English: This study introduces a privacy-focused framework for training diffusion models on personal data, producing synthetic datasets that achieve near-real data performance while ensuring protection against individual identification and significantly outperforming existing non-private methods.
Authors:Bolin Shen, Eren Erman Ozguven, Yue Zhao, Guang Wang, Yiqun Xie, Yushun Dong
Abstract:
Florida is particularly vulnerable to hurricanes, which frequently cause substantial economic losses. While prior studies have explored specific contributors to hurricane-induced damage, few have developed a unified framework capable of integrating a broader range of influencing factors to comprehensively assess the sources of economic loss. In this study, we propose a comprehensive modeling framework that categorizes contributing factors into three key components: (1) hurricane characteristics, (2) water-related environmental factors, and (3) socioeconomic factors of affected areas. By integrating multi-source data and aggregating all variables at the finer spatial granularity of the ZIP Code Tabulation Area (ZCTA) level, we employ machine learning models to predict economic loss, using insurance claims as indicators of incurred damage. Beyond accurate loss prediction, our approach facilitates a systematic assessment of the relative importance of each component, providing practical guidance for disaster mitigation, risk assessment, and the development of adaptive urban strategies in coastal and storm-exposed areas. Our code is now available at: https://github.com/LabRAI/Hurricane-Induced-Economic-Loss-Prediction
中文: 本研究提出一个综合建模框架,在邮政编码级别整合飓风特征、环境因素和社会经济变量,以预测经济损失并评估影响因素,为佛罗里达等飓风频发地区的灾害管理提供改进方案。
English: This study introduces a unified modeling framework integrating hurricane characteristics, environmental factors, and socioeconomic variables at the ZIP code level to predict economic losses and assess contributing factors for improved disaster management in hurricane-prone regions like Florida.
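The modeling recipe lends itself to a short illustration. The sketch below is not the authors' pipeline: it fits a gradient-boosted regressor on synthetic ZCTA-level features and ranks the three factor groups by permutation importance, with all feature names invented placeholders:

```python
# Illustrative sketch of grouped-factor importance for loss prediction.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Invented placeholder features, grouped as in the paper's three components.
groups = {"hurricane": [0, 1], "water": [2, 3], "socioeconomic": [4, 5]}
X = rng.normal(size=(500, 6))
y = 2 * X[:, 0] + X[:, 2] + rng.normal(size=500)   # toy loss signal

model = GradientBoostingRegressor().fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, idx in groups.items():
    print(name, float(imp.importances_mean[idx].sum()))
```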
Authors:Quanwei Tang, Sophia Yat Mei Lee, Junshuang Wu, Dong Zhang, Shoushan Li, Erik Cambria, Guodong Zhou
Abstract:
Recent advancements in retrieval-augmented generation (RAG) have enhanced large language models in question answering by integrating external knowledge. However, challenges persist in achieving global understanding and aligning responses with human ethical and quality preferences. To address these issues, we propose GraphMPA, a comprehensive graph-based framework with mode-seeking preference alignment. Our approach constructs a hierarchical document graph using a general similarity measurement, mimicking human cognitive processes for information understanding and synthesis. Additionally, we introduce mode-seeking preference optimization to better align model outputs with human preferences through probability-matching constraints. Extensive experiments on six datasets demonstrate the effectiveness of our \href{https://github.com/tangquanwei/GraphMPA}{GraphMPA}.
中文:GraphMPA框架通过构建分层文档图和采用模式寻求偏好优化,有效提升了检索增强生成的性能,使模型输出更符合人类认知与伦理标准。
English: The proposed GraphMPA framework enhances retrieval-augmented generation by constructing hierarchical document graphs and employing mode-seeking preference optimization to better align model outputs with human cognitive and ethical standards.
Authors:Fei Zhou
Abstract:
Remote sensing change detection is used in urban planning, terrain analysis, and environmental monitoring by analyzing feature changes in the same area over time. In this paper, we propose a large language model (LLM) augmented inference approach (SegChange-R1), which enhances the detection capability by integrating textual descriptive information and guides the model to focus on relevant change regions, accelerating convergence. We design a linear attention-based spatial transformation module that addresses modal misalignment by unifying features from different times into a bird's-eye-view (BEV) space. Furthermore, we introduce DVCD, a novel dataset for building change detection from UAV viewpoints. Experiments on four widely-used datasets demonstrate significant improvements over existing methods. The code and pre-trained models are available at https://github.com/Yu-Zhouz/SegChange-R1.
中文摘要:本文提出SegChange-R1方法,通过集成文本描述信息和设计BEV空间对齐模块来增强遥感变化检测性能,并在包括新型无人机建筑变化数据集在内的多个数据集上验证了其优越性。
English Summary: This paper introduces SegChange-R1, a large language model-augmented method that improves remote sensing change detection by incorporating textual descriptions and aligning temporal features in a unified BEV space, validated on multiple datasets including a new UAV-based building change dataset.
Authors:Jianyu Wang, Zhiqiang Hu, Lidong Bing
Abstract:
We propose a novel prompt design paradigm that challenges conventional wisdom in large language model (LLM) prompting. While conventional wisdom prioritizes well-crafted instructions and demonstrations for in-context learning (ICL), we show that pruning random demonstrations into seemingly incoherent "gibberish" can remarkably improve performance across diverse tasks. Notably, the "gibberish" always matches or surpasses state-of-the-art automatic prompt optimization techniques, achieving substantial gains regardless of LLM alignment. Nevertheless, discovering an effective pruning strategy is non-trivial, as existing attribution methods and prompt compression algorithms fail to deliver robust results, let alone human intuition. To this end, we propose PromptQuine, a self-discovering prompt optimization framework: an evolutionary search that automatically discovers effective pruning strategies on its own using only low-data regimes. Much like the emergent complexity in nature--such as symbiosis and self-organization--arising in response to resource constraints, our framework evolves and refines unconventional yet highly effective prompts by leveraging only the tokens present within the context. We demonstrate its effectiveness across classification, multi-choice question answering, generation and math reasoning tasks across LLMs, while achieving decent runtime efficiency. We hope our findings can guide mechanistic studies on in-context learning, and provide a call to action, to pave the way for more open-ended search algorithms for more effective LLM prompting.
中文: 本研究提出了一种新颖的提示设计范式,证明将随机示例修剪为看似不连贯的"乱码"能在多种任务中超越传统提示方法和最优自动优化技术,同时开发了PromptQuine进化框架,仅需少量数据即可自主发现有效的修剪策略。
English: This study introduces a novel prompt design paradigm that demonstrates pruning random demonstrations into seemingly incoherent "gibberish" can outperform conventional prompting methods and state-of-the-art optimization techniques across various tasks, while also proposing PromptQuine, an evolutionary framework that autonomously discovers effective pruning strategies with minimal data.
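A toy evolutionary-search sketch in the spirit of the pruning idea: evolve a token-level keep/drop mask over an in-context prompt under a black-box score. `score_fn` is an assumed stand-in for a small dev-set metric, not the paper's objective:

```python
# Simplified evolutionary token pruning; not PromptQuine's exact algorithm.
import random

def evolve_prompt(tokens, score_fn, generations=50, pop=8, flip_p=0.1):
    masks = [[1] * len(tokens) for _ in range(pop)]
    def render(m):
        return " ".join(t for t, keep in zip(tokens, m) if keep)
    for _ in range(generations):
        scored = sorted(masks, key=lambda m: score_fn(render(m)), reverse=True)
        parents = scored[: pop // 2]                  # truncation selection
        children = [[b ^ (random.random() < flip_p) for b in p]  # bit flips
                    for p in parents]
        masks = parents + children
    best = max(masks, key=lambda m: score_fn(render(m)))
    return render(best)

# Toy demo: the "metric" rewards prompts that drop the word "very".
demo = "Review : very fun . Label : positive".split()
print(evolve_prompt(demo, lambda s: -s.count("very")))
```

In practice the score function would be task accuracy of the LLM under the pruned prompt, which is what makes the search feedback-driven rather than heuristic.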
Authors:Jianghong Huang, Luping Ji, Xin Ma, Mao Ye
Abstract:
Conveyor belts are important equipment in modern industry, widely applied in production and manufacturing. Their health is critical to operational efficiency and safety, and cracks are a major threat to it. Accordingly, how to intelligently detect belt cracks is attracting increasing attention. To implement intelligent detection with machine learning, real crack samples are believed to be necessary. However, existing crack datasets primarily focus on pavement scenarios or synthetic data; there are no real-world industrial belt crack datasets at all. To fill this gap, we build two whole-new real-world belt crack datasets. Furthermore, to validate their usability and effectiveness, we propose a special baseline method with triple-domain ($i.e.$, time-space-frequency) feature hierarchical fusion learning for the two whole-new datasets. Experimental results demonstrate the usability and effectiveness of our datasets. They also show that our baseline is clearly superior to other similar detection methods. Our datasets and source codes are available at https://github.com/UESTC-nnLab/BeltCrack.
中文: 本研究提出了两个全新的工业传送带真实裂纹数据集及基于时-空-频三域特征分层融合的基线方法,实验证明该方法显著优于同类检测方案。
English: This study introduces two novel real-world industrial conveyor belt crack datasets and a baseline method using triple-domain feature fusion learning, which proves more effective than existing detection approaches.
Authors:Jiahao Lu, Jiacheng Deng
Abstract:
3D instance segmentation aims to predict a set of object instances in a scene, representing them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines and superior predictions. However, these methods primarily focus on modeling the external relationships between scene features and query features through mask attention. They lack effective modeling of the internal relationships among scene features as well as between query features. In light of these disadvantages, we propose \textbf{Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation}. Specifically, we introduce an adaptive superpoint aggregation module and a contrastive learning-guided superpoint refinement module to better represent superpoint features (scene features) and leverage contrastive learning to guide the updates of these features. Furthermore, our relation-aware self-attention mechanism enhances the capabilities of modeling relationships between queries by incorporating positional and geometric relationships into the self-attention mechanism. Extensive experiments on the ScanNetV2, ScanNet++, ScanNet200 and S3DIS datasets demonstrate the superior performance of Relation3D.
中文: 本文提出了Relation3D方法,通过自适应超点聚合、对比学习引导的超点优化以及关系感知的自注意力机制来增强三维实例分割中的关系建模,在多个基准数据集上实现了优越性能。
English: This paper introduces Relation3D, a novel method for 3D instance segmentation that enhances relation modeling through adaptive superpoint aggregation, contrastive learning-guided refinement, and a relation-aware self-attention mechanism, achieving superior performance on multiple benchmark datasets.
Authors:Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks. Nevertheless, they still pose notable safety risks due to potential misuse for malicious purposes. Jailbreaking, which seeks to induce models to generate harmful content through single-turn or multi-turn attacks, plays a crucial role in uncovering underlying security vulnerabilities. However, prior methods, including sophisticated multi-turn approaches, often struggle to adapt to the evolving dynamics of dialogue as interactions progress. To address this challenge, we propose GRAF (Jailbreaking via \textbf{G}lobally \textbf{R}efining and \textbf{A}daptively \textbf{F}abricating), a novel multi-turn jailbreaking method that globally refines the attack trajectory at each interaction. In addition, we actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs in subsequent queries. Extensive experiments across six state-of-the-art LLMs demonstrate the superior effectiveness of our approach compared to existing single-turn and multi-turn jailbreaking methods. Our code will be released at https://github.com/Ytang520/Multi-Turn_jailbreaking_Global-Refinment_and_Active-Fabrication.
Chinese: 提出的GRAF方法通过全局优化攻击路径和自适应伪造模型响应来规避安全机制,在六种先进大语言模型上的实验表明其多轮越狱效果显著优于现有方法。
English: The proposed GRAF method enhances multi-turn jailbreaking by globally refining attack strategies and adaptively fabricating responses to bypass safety measures, demonstrating superior effectiveness across six advanced LLMs compared to existing techniques.
Authors:Tam Trinh, Manh Nguyen, Truong-Son Hy
Abstract:
The rapid spread of misinformation in the digital era poses significant challenges to public discourse, necessitating robust and scalable fact-checking solutions. Traditional human-led fact-checking methods, while credible, struggle with the volume and velocity of online content, prompting the integration of automated systems powered by Large Language Models (LLMs). However, existing automated approaches often face limitations, such as handling complex claims, ensuring source credibility, and maintaining transparency. This paper proposes a novel multi-agent system for automated fact-checking that enhances accuracy, efficiency, and explainability. The system comprises four specialized agents: an Input Ingestion Agent for claim decomposition, a Query Generation Agent for formulating targeted subqueries, an Evidence Retrieval Agent for sourcing credible evidence, and a Verdict Prediction Agent for synthesizing veracity judgments with human-interpretable explanations. Evaluated on benchmark datasets (FEVEROUS, HOVER, SciFact), the proposed system achieves a 12.3% improvement in Macro F1-score over baseline methods. The system effectively decomposes complex claims, retrieves reliable evidence from trusted sources, and generates transparent explanations for verification decisions. Our approach contributes to the growing field of automated fact-checking by providing a more accurate, efficient, and transparent verification methodology that aligns with human fact-checking practices while maintaining scalability for real-world applications. Our source code is available at https://github.com/HySonLab/FactAgent
中文: 本文提出了一种新颖的多智能体自动事实核查系统,通过分解声明、检索可信证据并生成透明解释,提高了准确性、效率和可解释性,相比基准方法在宏观F1分数上提升了12.3%。
English: This paper introduces a novel multi-agent system for automated fact-checking that enhances accuracy, efficiency, and explainability by decomposing claims, retrieving credible evidence, and generating transparent explanations, achieving a 12.3% improvement in Macro F1-score over baseline methods.
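The four-agent pipeline can be sketched structurally as below; `call_llm` and `search_web` are assumed stubs, and the real system's prompts, retrieval backends, and agent protocols are not reproduced:

```python
# Structural sketch of a decompose -> query -> retrieve -> verdict pipeline.
def call_llm(prompt: str) -> str:          # stub for any chat-completion API
    return "stubbed response for: " + prompt[:40]

def search_web(query: str) -> list[str]:   # stub for an evidence retriever
    return [f"evidence snippet for '{query}'"]

def fact_check(claim: str) -> dict:
    subclaims = call_llm(f"Decompose into atomic sub-claims: {claim}").split("\n")
    queries = [call_llm(f"Write a search query for: {s}") for s in subclaims]
    evidence = [doc for q in queries for doc in search_web(q)]
    verdict = call_llm(
        "Given the evidence, label the claim SUPPORTED/REFUTED and explain.\n"
        f"Claim: {claim}\nEvidence: {evidence}")
    return {"claim": claim, "evidence": evidence, "verdict": verdict}

print(fact_check("The Eiffel Tower is taller than 400 meters."))
```

Keeping each stage as a separate agent is what allows the verdict to cite which sub-claim each piece of evidence supports, which underpins the explainability claim.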
Authors:Chenghao Yang, Ari Holtzman
Abstract:
Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in the generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the Branching Factor (BF) -- a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this stability has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
Chinese: 对齐后的大型语言模型输出多样性降低源于概率集中现象,通过分支因子量化发现生成过程中分支因子递减且对齐调优使其大幅降低,这解释了模型稳定性及对解码策略不敏感的原因,并揭示了思维链模型通过延长推理进入确定性阶段实现稳定输出的机制。
English: Aligned large language models exhibit reduced output diversity due to probability concentration, quantified by the Branching Factor which decreases during generation and is substantially lowered by alignment tuning, explaining their stability and insensitivity to decoding strategies.
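The Branching Factor lends itself to a compact illustration. One plausible instantiation, offered as an assumption rather than the paper's exact definition, treats the effective number of next steps as the exponentiated next-token entropy averaged over a generation:

```python
# Sketch of an entropy-based "effective branching" measure (an assumption).
import torch

def branching_factor(logits):            # logits: (seq_len, vocab)
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)
    return torch.exp(entropy.mean()).item()   # exp(H) ~ plausible branches

flat = torch.zeros(10, 100)                        # uniform: BF ~ vocab size
sharp = torch.zeros(10, 100); sharp[:, 0] = 20.0   # peaked: BF ~ 1
print(branching_factor(flat), branching_factor(sharp))
```

Under this reading, the reported drop from roughly 12 to 1.2 after alignment corresponds to the next-token distribution collapsing to nearly a single plausible continuation.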
Authors:Jianhang Xie, Chuntao Ding, Xiaqing Li, Shenyuan Ren, Yidong Li, Zhichao Lu
Abstract:
Deploying quantized deep neural network (DNN) models with resource adaptation capabilities on ubiquitous Internet of Things (IoT) devices to provide high-quality AI services can leverage the benefits of compression and meet multi-scenario resource requirements. However, existing dynamic/mixed precision quantization requires retraining or special hardware, whereas post-training quantization (PTQ) has two limitations for resource adaptation: (i) The state-of-the-art PTQ methods only provide one fixed bitwidth model, which makes it challenging to adapt to the dynamic resources of IoT devices; (ii) Deploying multiple PTQ models with diverse bitwidths consumes large storage resources and switching overheads. To this end, this paper introduces a resource-friendly post-training integer-nesting quantization, i.e., NestQuant, for on-device quantized model switching on IoT devices. The proposed NestQuant incorporates the integer weight decomposition, which bit-wise splits quantized weights into higher-bit and lower-bit weights of integer data types. It also contains a decomposed weights nesting mechanism to optimize the higher-bit weights by adaptive rounding and nest them into the original quantized weights. In deployment, we can send and store only one NestQuant model and switch between the full-bit/part-bit model by paging in/out lower-bit weights to adapt to resource changes and reduce consumption. Experimental results on ImageNet-1K pretrained DNNs demonstrate that the NestQuant model achieves high top-1 accuracy while reducing data transmission, storage consumption, and switching overheads. In particular, the ResNet-101 with INT8 nesting INT6 can achieve 78.1% and 77.9% accuracy for full-bit and part-bit models, respectively, and reduce switching overheads by approximately 78.1% compared with PTQ models of diverse bitwidths.
中文: 本文提出NestQuant这一资源友好的训练后量化方法,通过整数权重分解与嵌套机制实现物联网设备上的动态模型切换,在保持高精度的同时显著降低了存储与传输开销。
English: This paper introduces NestQuant, a resource-friendly post-training quantization method that enables dynamic model switching on IoT devices by decomposing weights into integer-nested formats, reducing storage and transmission overhead while maintaining high accuracy.
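The integer-nesting idea can be illustrated with plain bit arithmetic: split INT8 weights into a higher-bit part and a lower-bit remainder so the part-bit model needs only the high bits. The split below is a simplified sketch, not NestQuant's full adaptive-rounding mechanism:

```python
# Bit-wise splitting of INT8 weights into INT6-equivalent high bits + 2 low bits.
import numpy as np

def split_bits(w_int8, low_bits=2):
    w = w_int8.astype(np.int16)            # widen to avoid overflow
    low = w & ((1 << low_bits) - 1)        # lower-bit remainder
    high = w - low                         # higher-bit part (multiples of 4)
    return high.astype(np.int8), low.astype(np.int8)

w = np.array([57, -23, 100, -128], dtype=np.int8)
high, low = split_bits(w)
print(high, low, (high.astype(np.int16) + low == w).all())  # exact nesting
```

Because high + low reconstructs the original weights exactly, a device can run the high-bit model alone and page the low bits in only when resources allow.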
Authors:Xiaodong Guo, Zi'ang Lin, Luwen Hu, Zhihong Deng, Tong Liu, Wujie Zhou
Abstract:
The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. Nevertheless, multi-source data processing (e.g. Transformer-based approaches) imposes significant computational overhead, presenting challenges for resource-constrained systems. To resolve this critical limitation, we introduce CM-SSM, an efficient RGB-thermal semantic segmentation architecture leveraging a cross-modal state space modeling (SSM) approach. Our framework comprises two key components. First, we introduce a cross-modal 2D-selective-scan (CM-SS2D) module to establish SSM between RGB and thermal modalities, which constructs cross-modal visual sequences and derives hidden state representations of one modality from the other. Second, we develop a cross-modal state space association (CM-SSA) module that effectively integrates global associations from CM-SS2D with local spatial features extracted through convolutional operations. In contrast with Transformer-based approaches, CM-SSM achieves linear computational complexity with respect to image resolution. Experimental results show that CM-SSM achieves state-of-the-art performance on the CART dataset with fewer parameters and lower computational cost. Further experiments on the PST900 dataset demonstrate its generalizability. Codes are available at https://github.com/xiaodonguo/CMSSM.
中文摘要:提出的CM-SSM架构通过跨模态状态空间建模有效融合RGB与热成像数据,在实现线性计算复杂度的同时,以更少参数和更低计算成本获得了最先进的语义分割性能。
English Summary: The proposed CM-SSM architecture efficiently integrates RGB and thermal data for semantic segmentation using cross-modal state space modeling, achieving superior performance with linear computational complexity compared to Transformer-based methods.
Authors:Yingcheng Liu, Peiqi Wang, Sebastian Diaz, Esra Abaci Turk, Benjamin Billot, P. Ellen Grant, Polina Golland
Abstract:
Analyzing fetal body motion and shape is paramount in prenatal diagnostics and monitoring. Existing methods for fetal MRI analysis mainly rely on anatomical keypoints or volumetric body segmentations. Keypoints simplify body structure to facilitate motion analysis, but may ignore important details of full-body shape. Body segmentations capture complete shape information but complicate temporal analysis due to large non-local fetal movements. To address these limitations, we construct a 3D articulated statistical fetal body model based on the Skinned Multi-Person Linear Model (SMPL). Our algorithm iteratively estimates body pose in the image space and body shape in the canonical pose space. This approach improves robustness to MRI motion artifacts and intensity distortions, and reduces the impact of incomplete surface observations due to challenging fetal poses. We train our model on segmentations and keypoints derived from $19,816$ MRI volumes across $53$ subjects. Our model captures body shape and motion across time series and provides intuitive visualization. Furthermore, it enables automated anthropometric measurements traditionally difficult to obtain from segmentations and keypoints. When tested on unseen fetal body shapes, our method yields a surface alignment error of $3.2$ mm for $3$ mm MRI voxel size. To our knowledge, this represents the first 3D articulated statistical fetal body model, paving the way for enhanced fetal motion and shape analysis in prenatal diagnostics. The code is available at https://github.com/MedicalVisionGroup/fetal-smpl .
中文摘要:本研究首次开发了基于SMPL的三维关节式统计胎儿身体模型,通过迭代估计图像空间中的身体姿态和标准姿态空间中的身体形状,显著提升了对MRI运动伪影的鲁棒性,并实现了传统方法难以获得的自动化人体测量功能。
English Summary: This study introduces the first 3D articulated statistical fetal body model that simultaneously captures body shape and motion from MRI data, improving robustness to artifacts and enabling automated anthropometric measurements with 3.2 mm surface alignment accuracy.
Authors:Suyash Gaurav, Jukka Heikkonen, Jatin Chaudhary
Abstract:
Continual learning systems face the dual challenge of preventing catastrophic forgetting while maintaining energy efficiency, particularly in resource-constrained environments. This paper introduces Pathway-based Progressive Inference (PaPI), a novel theoretical framework that addresses these challenges through a mathematically rigorous approach to pathway selection and adaptation. We formulate continual learning as an energy-constrained optimization problem and provide formal convergence guarantees for our pathway routing mechanisms. Our theoretical analysis demonstrates that PaPI achieves an $\mathcal{O}(K)$ improvement in the stability-plasticity trade-off compared to monolithic architectures, where $K$ is the number of pathways. We derive tight bounds on forgetting rates using Fisher Information Matrix analysis and prove that PaPI's energy consumption scales with the number of active parameters rather than the total model size. Comparative theoretical analysis shows that PaPI provides stronger guarantees against catastrophic forgetting than Elastic Weight Consolidation (EWC) while maintaining better energy efficiency than both EWC and Gradient Episodic Memory (GEM). Our experimental validation confirms these theoretical advantages across multiple benchmarks, demonstrating PaPI's effectiveness for continual learning in energy-constrained settings. Our codes are available at https://github.com/zser092/PAPI_FILES.
中文: 本文提出基于路径的渐进推理(PaPI)理论框架,通过路径选择和适应的数学方法,在持续学习中优化稳定性与可塑性的平衡,并显著提高能源效率,同时降低遗忘率。
English: This paper introduces Pathway-based Progressive Inference (PaPI), a theoretical framework that enhances continual learning by improving the stability-plasticity trade-off and energy efficiency through pathway selection and adaptation, with proven convergence and reduced forgetting rates.
Authors:Anton Melnychuk, Bryan SebaRaj
Abstract:
We present the first open-source implementation and evaluation of Fast Raft, a hierarchical consensus protocol designed for dynamic, distributed environments. Fast Raft reduces the number of message rounds needed to commit log entries compared to standard Raft by introducing a fast-track mechanism and reducing leader dependence. Our implementation uses gRPC and Kubernetes-based deployment across AWS availability zones. Experimental results demonstrate a throughput improvement and reduced commit latency under low packet loss conditions, while maintaining Raft's safety and liveness guarantees.
中文: 本研究首次开源实现了Fast Raft分层共识协议,通过减少消息轮次和领导者依赖,在分布式系统中显著提升了吞吐量并降低了延迟。
English: This study introduces the first open-source implementation of Fast Raft, a hierarchical consensus protocol that enhances throughput and reduces latency in distributed systems by minimizing message rounds and leader reliance.
Authors:Keigo Nishida, Eren Mehmet Kıral, Kenichi Bannai, Mohammad Emtiyaz Khan, Thomas Möllenhoff
Abstract:
Studies in neuroscience have shown that biological synapses follow a log-normal distribution whose transitioning can be explained by noisy multiplicative dynamics. Biological networks can function stably even under dynamically fluctuating conditions arising due to unreliable synaptic transmissions. Here we ask: Is it possible to design similar multiplicative training in artificial neural networks? To answer this question, we derive a Bayesian learning rule that assumes log-normal posterior distributions over weights which gives rise to a new Log-Normal Multiplicative Dynamics (LMD) algorithm. The algorithm uses multiplicative updates with both noise and regularization applied multiplicatively. The method is as easy to implement as Adam and only requires one additional vector to store. Our results show that LMD achieves stable and accurate training-from-scratch under low-precision forward operations for Vision Transformer and GPT-2. These results suggest that multiplicative dynamics, a biological feature, may enable stable low-precision inference and learning on future energy-efficient hardware.
Chinese: 该研究受生物突触分布启发,提出了一种对数正态乘法动力学(LMD)算法,能在低精度条件下稳定准确地训练视觉Transformer和GPT-2等神经网络,显示出其在未来节能硬件中的应用潜力。
English: The study introduces a Log-Normal Multiplicative Dynamics (LMD) algorithm, inspired by biological synaptic distributions, which enables stable and accurate training of neural networks like Vision Transformer and GPT-2 under low-precision conditions, suggesting its potential for energy-efficient hardware.
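A hedged sketch of a log-normal multiplicative update: take an SGD-like step in the log-magnitude of each weight, which turns weight decay and noise into multiplicative factors and keeps each weight's sign fixed. This illustrates the idea, not the exact LMD rule:

```python
# Multiplicative weight update in log-space (an illustration, not LMD itself).
import torch

def multiplicative_step(w, grad, lr=1e-2, wd=1e-4, temp=1e-3):
    noise = torch.randn_like(w) * temp
    # chain rule: d/d(log|w|) of the loss is grad * w, so the log-space
    # step multiplies w by exp(-lr * (grad * w + wd) + noise)
    return w * torch.exp(-lr * (grad * w + wd) + noise)

w = torch.randn(5).abs() + 0.1
g = torch.randn(5)
print(multiplicative_step(w, g))
```

Since updates are products of positive factors, each weight's magnitude follows noisy multiplicative dynamics, mirroring the log-normal synaptic statistics the abstract cites.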
Authors:Fadi Abdeladhim Zidi, Djamel Eddine Boukhari, Abdellah Zakaria Sellam, Abdelkrim Ouafi, Cosimo Distante, Salah Eddine Bekhouche, Abdelmalik Taleb-Ahmed
Abstract:
Hyperspectral image classification remains a challenging task due to the high dimensionality of spectral data, significant inter-band redundancy, and the limited availability of annotated samples. While recent transformer-based models have improved the global modeling of spectral-spatial dependencies, their scalability and adaptability under label-scarce conditions remain limited. In this work, we propose \textbf{LoLA-SpecViT} (Low-rank adaptation Local Attention Spectral Vision Transformer), a lightweight spectral vision transformer that addresses these limitations through a parameter-efficient architecture tailored to the unique characteristics of hyperspectral imagery. Our model combines a 3D convolutional spectral front-end with local window-based self-attention, enhancing both spectral feature extraction and spatial consistency while reducing computational complexity. To further improve adaptability, we integrate low-rank adaptation (LoRA) into attention and projection layers, enabling fine-tuning with over 80\% fewer trainable parameters. A novel cyclical learning rate scheduler modulates LoRA adaptation strength during training, improving convergence and generalisation. Extensive experiments on three benchmark datasets (WHU-Hi LongKou, WHU-Hi HongHu, and Salinas) demonstrate that LoLA-SpecViT consistently outperforms state-of-the-art baselines, achieving up to 99.91\% accuracy with substantially fewer parameters and enhanced robustness under low-label regimes. The proposed framework provides a scalable and generalizable solution for real-world HSI applications in agriculture, environmental monitoring, and remote sensing analytics. Our code is available in the following \href{https://github.com/FadiZidiDz/LoLA-SpecViT}{GitHub Repository}.
中文: 本文提出的LoLA-SpecViT轻量化光谱视觉Transformer,通过结合三维卷积光谱前端与局部自注意力机制,并引入低秩自适应技术,在高光谱图像分类任务中以更少参数实现了最优精度。
English: This paper introduces LoLA-SpecViT, a lightweight spectral vision transformer that enhances hyperspectral image classification by combining 3D convolutional spectral processing with local self-attention and low-rank adaptation, achieving state-of-the-art accuracy with significantly reduced parameters.
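The LoRA component is standard enough to sketch generically; the rank, scaling, and initialization below follow common practice and are not LoLA-SpecViT's specific settings:

```python
# Generic LoRA wrapper around a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # frozen pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank              # B starts at zero: no-op init

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)         # only A and B are trainable
```

With rank much smaller than the layer width, the trainable parameter count drops sharply, which is how the abstract's "over 80% fewer trainable parameters" figure becomes attainable.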
Authors:Mengqi Lei, Siqi Li, Yihong Wu, Han Hu, You Zhou, Xinhu Zheng, Guiguang Ding, Shaoyi Du, Zongze Wu, Yue Gao
Abstract:
The YOLO series models reign supreme in real-time object detection due to their superior accuracy and computational efficiency. However, both the convolutional architectures of YOLO11 and earlier versions and the area-based self-attention mechanism introduced in YOLOv12 are limited to local information aggregation and pairwise correlation modeling, lacking the capability to capture global multi-to-multi high-order correlations, which limits detection performance in complex scenarios. In this paper, we propose YOLOv13, an accurate and lightweight object detector. To address the above-mentioned challenges, we propose a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism that adaptively exploits latent high-order correlations and overcomes the limitation of previous methods that are restricted to pairwise correlation modeling based on hypergraph computation, achieving efficient global cross-location and cross-scale feature fusion and enhancement. Subsequently, we propose a Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm based on HyperACE, which effectively achieves fine-grained information flow and representation synergy within the entire network by distributing correlation-enhanced features to the full pipeline. Finally, we propose to leverage depthwise separable convolutions to replace vanilla large-kernel convolutions, and design a series of blocks that significantly reduce parameters and computational complexity without sacrificing performance. We conduct extensive experiments on the widely used MS COCO benchmark, and the experimental results demonstrate that our method achieves state-of-the-art performance with fewer parameters and FLOPs. Specifically, our YOLOv13-N improves mAP by 3.0\% over YOLO11-N and by 1.5\% over YOLOv12-N. The code and models of our YOLOv13 model are available at: https://github.com/iMoonLab/yolov13.
Chinese: YOLOv13通过基于超图的自适应相关性增强机制和全流程聚合-分配范式,解决了先前模型在捕捉全局高阶相关性方面的局限,在MS COCO基准测试中以更少的计算量实现了最优性能。
English: YOLOv13 introduces a Hypergraph-based Adaptive Correlation Enhancement mechanism and a Full-Pipeline Aggregation-and-Distribution paradigm to overcome previous models' limitations in capturing global high-order correlations, achieving state-of-the-art performance with reduced computational complexity on the MS COCO benchmark.
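The depthwise-separable substitution mentioned above is easy to quantify; the sketch below compares parameter counts for a dense 7x7 convolution versus a depthwise + pointwise pair at illustrative sizes:

```python
# Parameter comparison: dense large-kernel conv vs depthwise separable pair.
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

dense = nn.Conv2d(128, 128, kernel_size=7, padding=3)
separable = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=7, padding=3, groups=128),  # depthwise
    nn.Conv2d(128, 128, kernel_size=1))                          # pointwise
print(n_params(dense), n_params(separable))  # ~803k vs ~23k parameters
```

The roughly 35x reduction at this layer size illustrates why the swap cuts parameters and FLOPs without changing the receptive field.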
Authors:Piyush Pradhan, Pierre Gentine, Shaina Kelly
Abstract:
We present JAX-LaB, a differentiable, Python-based Lattice Boltzmann library for simulating multiphase and multiphysics flows in hydrologic, geologic, and engineered porous media. Built as an extension of the XLB library, JAX-LaB utilizes JAX for computations and offers a performant, hardware-agnostic implementation that integrates seamlessly with machine learning workflows and scales efficiently across CPUs, GPUs, and distributed systems. Multiphase interactions are modeled using the Shan-Chen pseudopotential method, which is coupled with an equation of state and an improved forcing scheme to obtain liquid-vapor densities that are consistent with Maxwell's construction, enabling simulations of systems with very large density ratios while maintaining minimal spurious currents. Wetting is handled using the "improved" virtual density scheme, which allows precise control of contact angles and eliminates non-physical films seen in other Shan-Chen wetting methods. We validate the library through several analytical benchmarks, such as Laplace's law, capillary rise, and cocurrent multicomponent flow, and demonstrate some exemplary use cases for the library. We also report single- and multi-GPU performance scaling of the library. The library is open-source under the Apache license and available at https://github.com/piyush-ppradhan/JAX-LaB.
中文:JAX-LaB是一个基于Python的可微分格子玻尔兹曼库,用于模拟多孔介质中的多相流动,具有硬件无关的高性能特性,并能与机器学习工作流无缝集成。
English: JAX-LaB is a differentiable Python library for simulating multiphase flows in porous media using the Lattice Boltzmann method, featuring hardware-agnostic performance and seamless integration with machine learning workflows.
Authors:Amirshayan Nasirimajd, Chiara Plizzari, Simone Alberto Peirone, Marco Ciccone, Giuseppe Averta, Barbara Caputo
Abstract:
Recognizing human activities from visual inputs, particularly through a first-person viewpoint, is essential for enabling robots to replicate human behavior. Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. This variability leads to a notable drop in the performance of Egocentric Action Recognition models when tested in environments not seen during training. In this paper, we tackle these challenges by proposing a domain generalization approach for Egocentric Action Recognition. Our insight is that action sequences often reflect consistent user intent across visual domains. By leveraging action sequences, we aim to enhance the model's generalization ability across unseen environments. Our proposed method, named SeqDG, introduces a visual-text sequence reconstruction objective (SeqRec) that uses contextual cues from both text and visual inputs to reconstruct the central action of the sequence. Additionally, we enhance the model's robustness by training it on mixed sequences of actions from different domains (SeqMix). We validate SeqDG on the EGTEA and EPIC-KITCHENS-100 datasets. Results on EPIC-KITCHENS-100 show that SeqDG leads to a +2.4% relative average improvement in cross-domain action recognition in unseen environments, and on EGTEA the model achieved +0.6% Top-1 accuracy over SOTA in intra-domain action recognition.
中文: 本文提出SeqDG方法,通过序列重构和混合训练提升第一人称动作识别模型在未知环境中的泛化能力,在EPIC-KITCHENS-100和EGTEA数据集上实现了显著准确率提升。
English: This paper introduces SeqDG, a domain generalization method for Egocentric Action Recognition that enhances model performance in unseen environments through sequence reconstruction and mixed training, achieving notable accuracy improvements on benchmark datasets.
Authors:Fabien Furfaro
Abstract:
Transformer-based large language models (LLMs) have achieved strong performance across many natural language processing tasks. Nonetheless, their quadratic computational and memory requirements, particularly in self-attention layers, pose challenges for efficient inference on long contexts and for deployment in resource-limited environments. We present TPTT (Transforming Pretrained Transformers into Titans), a framework designed to augment pretrained Transformers with linearized attention (LiZA) and internal memory gating via Memory as Gate (MaG), applied without full retraining. TPTT supports parameter-efficient fine-tuning (LoRA) and integrates with standard toolkits such as Hugging Face Transformers. We evaluated TPTT on several pretrained models, including Llama-1B, OlMoE-1B-7B, Qwen2.5-1.5B, Gemma3-270m, OpenELM-1.3B, and Mistral-7B, in order to assess applicability across architectures of different scales. Experiments on models with approximately 1 billion parameters, evaluated primarily on the MMLU benchmark, suggest potential improvements in both efficiency and accuracy compared to baseline models. For example, Titans-Llama-1B exhibited up to a 20\% relative increase in Exact Match scores in one-shot evaluation. An additional finding is that it is possible to convert a quadratic-attention model into a purely linear-attention model using the DeltaProduct mechanism. All training runs were carried out with modest computational resources. These preliminary findings indicate that TPTT may help adapt pretrained LLMs for long-context tasks with limited overhead. Further studies on larger models and a broader set of benchmarks will be necessary to evaluate the generality and robustness of the framework. Code is available at https://github.com/fabienfrfr/tptt . Python package at https://pypi.org/project/tptt/ .
中文:TPTT框架通过线性化注意力和内存门控增强预训练Transformer,可在有限资源下高效适应长文本任务,同时保持或提升模型性能。
English: The TPTT framework enhances pretrained Transformers with linearized attention and memory gating, enabling efficient adaptation for long-context tasks while maintaining or improving performance with minimal computational resources.
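Linearized attention, the general mechanism behind such O(N) designs, can be sketched with a positive kernel feature map (here elu + 1); this is the textbook construction, not TPTT's specific LiZA or MaG operators:

```python
# Non-causal linear attention with a positive feature map; O(N) in length.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):             # each tensor: (B, N, d)
    q, k = F.elu(q) + 1, F.elu(k) + 1      # positive feature maps
    kv = torch.einsum("bnd,bne->bde", k, v)          # sum_n phi(k_n) v_n^T
    z = 1 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = torch.randn(2, 128, 32)
k = torch.randn(2, 128, 32)
v = torch.randn(2, 128, 32)
print(linear_attention(q, k, v).shape)     # torch.Size([2, 128, 32])
```

Because the key-value summary `kv` has fixed size regardless of sequence length, cost grows linearly with N instead of quadratically, which is the property that makes long-context inference cheaper.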
Authors:Shaoyu Yang, Chunrong Fang, Haifeng Lin, Xiang Chen, Zhenyu Chen
Abstract:
Deep Learning (DL) frameworks have served as fundamental components in DL systems over the last decade. However, bugs in DL frameworks could lead to catastrophic consequences in critical scenarios. A simple yet effective way to find bugs in DL frameworks is fuzz testing (Fuzzing). Existing approaches focus on test generation, leaving execution results with high semantic value (e.g., coverage information, bug reports, and exception logs) in the wild, which can serve as multiple types of feedback. To fill this gap, we propose FUEL to effectively utilize the feedback information, which comprises two Large Language Models (LLMs): analysis LLM and generation LLM. Specifically, analysis LLM infers analysis summaries from feedback information, while the generation LLM creates tests guided by these summaries. Furthermore, based on multiple feedback guidance, we design two additional components: (i) a feedback-aware simulated annealing algorithm to select operators for test generation, enriching test diversity. (ii) a program self-repair strategy to automatically repair invalid tests, enhancing test validity. We evaluate FUEL on the two most popular DL frameworks, and experiment results show that FUEL can improve line code coverage of PyTorch and TensorFlow by 9.15% and 14.70% over state-of-the-art baselines (e.g., TitanFuzz and WhiteFox). By the time of submission, FUEL has detected 104 previously unknown bugs for PyTorch and TensorFlow, with 93 confirmed as new bugs, 49 already fixed, and 14 assigned CVE IDs. Our artifact is available at https://github.com/NJU-iSE/FUEL
中文: FUEL是一种创新的模糊测试方法,通过两个大型语言模型分析反馈并生成测试,显著提升了PyTorch和TensorFlow等主流深度学习框架的代码覆盖率,并检测出大量未知漏洞。
English: FUEL is a novel fuzz testing approach that utilizes two large language models to analyze feedback and generate tests, significantly improving code coverage and detecting numerous bugs in popular deep learning frameworks like PyTorch and TensorFlow.
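The feedback-aware operator selection can be sketched as a generic simulated-annealing loop; the per-operator feedback signal (for example, recent coverage gain) is an assumed stub, not FUEL's actual scoring:

```python
# Generic simulated annealing over which API operator to fuzz next.
import math
import random

def pick_operator(ops, feedback, temp):
    """Higher feedback => more likely; higher temp => more exploration."""
    current = random.choice(ops)
    for _ in range(20):                        # short annealing chain
        cand = random.choice(ops)
        delta = feedback[cand] - feedback[current]
        if delta > 0 or random.random() < math.exp(delta / max(temp, 1e-9)):
            current = cand
    return current

ops = ["conv2d", "matmul", "pad", "scatter"]
coverage_gain = {"conv2d": 0.9, "matmul": 0.4, "pad": 0.1, "scatter": 0.7}
for t in (1.0, 0.05):                          # cooling sharpens the choice
    print(t, pick_operator(ops, coverage_gain, t))
```

At high temperature the sampler still visits low-feedback operators, preserving test diversity; as it cools, it concentrates on operators the feedback says are productive.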
Authors:Yang Wu, Yifan Zhang, Yurong Wu, Yuran Wang, Junkai Zhang, Jian Cheng
Abstract:
Large Language Models (LLMs) have revolutionized various domains but encounter substantial challenges in tackling optimization modeling tasks for Operations Research (OR), particularly when dealing with complex problems. In this work, we propose Step-Opt-Instruct, a framework that augments existing datasets and generates high-quality fine-tuning data tailored to optimization modeling. Step-Opt-Instruct employs iterative problem generation to systematically increase problem complexity and stepwise validation to rigorously verify data, preventing error propagation and ensuring the quality of the generated dataset. Leveraging this framework, we fine-tune open-source LLMs, including LLaMA-3-8B and Mistral-7B, to develop Step-Opt--a model that achieves state-of-the-art performance on benchmarks such as NL4OPT, MAMO, and IndustryOR. Extensive experiments demonstrate the superior performance of Step-Opt, especially in addressing complex OR tasks, with a notable 17.01\% improvement in micro average accuracy on difficult problems. These findings highlight the effectiveness of combining structured validation with gradual problem refinement to advance the automation of decision-making processes using LLMs. The code and dataset are available at https://github.com/samwu-learn/Step.
中文摘要:本文提出Step-Opt-Instruct框架,通过迭代式问题生成和逐步验证机制为运筹学优化建模生成高质量训练数据,基于此开发的Step-Opt模型在多项基准测试中实现最优性能,尤其在复杂问题上取得17.01%的显著准确率提升。
English Summary: This paper introduces Step-Opt-Instruct, a framework that enhances optimization modeling for Operations Research by generating high-quality training data through iterative complexity escalation and stepwise validation, resulting in the Step-Opt model which achieves state-of-the-art performance with significant accuracy improvements on complex tasks.
Authors:Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang
Abstract:
Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.
中文: CLiViS是一种无需训练的框架,它结合了大型语言模型的任务规划能力和视觉语言模型的感知能力,通过动态认知地图连接感知与推理,以在复杂环境中实现高效的具身视觉推理。
English: CLiViS is a training-free framework that synergizes LLMs for task planning and VLMs for visual perception, using a dynamic Cognitive Map to bridge perception and reasoning for effective embodied visual reasoning in complex environments.
Authors:Mihir Godbole, Xiangbo Gao, Zhengzhong Tu
Abstract:
Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates multi-class intent prediction in safety-critical situations. To address this gap, we introduce DRAMA-X, a fine-grained benchmark constructed from the DRAMA dataset via an automated annotation pipeline. DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions for the ego vehicle, and descriptive motion summaries. These annotations enable a structured evaluation of four interrelated tasks central to autonomous decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a reference baseline, we propose SGG-Intent, a lightweight, training-free framework that mirrors the ego vehicle's reasoning pipeline. It sequentially generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends an action using a compositional reasoning stage powered by a large language model. We evaluate a range of recent VLMs, comparing performance across all four DRAMA-X tasks. Our experiments demonstrate that scene-graph-based reasoning enhances intent prediction and risk assessment, especially when contextual cues are explicitly modeled.
中文摘要:本研究提出DRAMA-X基准,用于评估自动驾驶车辆在安全关键场景中对行人及骑行者意图的预测能力,并通过结合场景图与语言模型的新框架,显著提升了意图推理和风险评估的准确性。
English Summary: This research introduces DRAMA-X, a benchmark for evaluating autonomous vehicles' ability to predict pedestrian and cyclist intentions in safety-critical scenarios, proposing a novel framework that uses scene graphs and language models to improve intent reasoning and risk assessment.
Authors:Furong Peng, Jinzhen Gao, Xuan Lu, Kang Liu, Yifan Huo, Sheng Wang
Abstract:
Graph Convolutional Networks (GCNs) suffer from severe performance degradation in deep architectures due to over-smoothing. While existing studies primarily attribute the over-smoothing to repeated applications of graph Laplacian operators, our empirical analysis reveals a critical yet overlooked factor: trainable linear transformations in GCNs significantly exacerbate feature collapse, even at moderate depths (e.g., 8 layers). In contrast, Simplified Graph Convolution (SGC), which removes these transformations, maintains stable feature diversity up to 32 layers, highlighting linear transformations' dual role in facilitating expressive power and inducing over-smoothing. However, completely removing linear transformations weakens the model's expressive capacity. To address this trade-off, we propose Layer-wise Gradual Training (LGT), a novel training strategy that progressively builds deep GCNs while preserving their expressiveness. LGT integrates three complementary components: (1) layer-wise training to stabilize optimization from shallow to deep layers, (2) low-rank adaptation to fine-tune shallow layers and accelerate training, and (3) identity initialization to ensure smooth integration of new layers and accelerate convergence. Extensive experiments on benchmark datasets demonstrate that LGT achieves state-of-the-art performance on vanilla GCN, significantly improving accuracy even in 32-layer settings. Moreover, as a training method, LGT can be seamlessly combined with existing methods such as PairNorm and ContraNorm, further enhancing their performance in deeper networks. LGT offers a general, architecture-agnostic training framework for scalable deep GCNs. The code is available at https://github.com/jfklasdfj/LGT_GCN.
中文: 深度图卷积网络因可训练线性变换加剧过平滑问题,而提出的逐层渐进训练方法通过渐进构建网络、低秩适应和恒等初始化,在保持表达能力的同时显著提升了深层网络的性能。
English: Deep Graph Convolutional Networks (GCNs) face performance degradation from over-smoothing, which is exacerbated by trainable linear transformations, but the proposed Layer-wise Gradual Training (LGT) method overcomes this by progressively building deep networks while maintaining expressiveness and achieving state-of-the-art results.
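The growth step with identity initialization is simple to illustrate: a newly appended layer starts as an exact no-op, so deepening never perturbs the current function. The linear-layer sketch below is a simplification of LGT (no graph convolution or low-rank adaptation):

```python
# Growing a stack layer by layer with identity-initialized additions.
import torch
import torch.nn as nn

def add_identity_layer(layers: nn.ModuleList, dim: int):
    new = nn.Linear(dim, dim)
    with torch.no_grad():
        new.weight.copy_(torch.eye(dim))    # identity weight matrix
        new.bias.zero_()
    layers.append(new)

layers = nn.ModuleList([nn.Linear(16, 16)])
x = torch.randn(4, 16)
before = layers[0](x)
add_identity_layer(layers, 16)
after = layers[1](layers[0](x))
print(torch.allclose(before, after))        # True: growth changes nothing
```

Training then only needs to move the new layer away from identity where that helps, which is what stabilizes optimization as depth grows.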
Authors:Yile Gu, Rohan Kadekodi, Hoang Nguyen, Keisuke Kamahori, Yiyu Liu, Baris Kasikci
Abstract:
The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.
中文: 本文提出ConsumerBench基准测试框架,通过在终端设备上模拟多应用并发场景来评估生成式AI性能,揭示了资源分配低效问题,并为开发人员提供了优化建议。
English: This paper introduces ConsumerBench, a benchmarking framework that evaluates GenAI performance on end-user devices under realistic multi-application scenarios, revealing resource inefficiencies and providing optimization insights for developers.
Authors:Julio Silva-Rodríguez, Ismail Ben Ayed, Jose Dolz
Abstract:
Medical vision-language models (VLMs) have demonstrated unprecedented transfer capabilities and are being increasingly adopted for data-efficient image classification. Despite their growing popularity, their reliability remains largely unexplored. This work explores the split conformal prediction (SCP) framework to provide trustworthiness guarantees when transferring such models based on a small labeled calibration set. Despite its potential, the generalist nature of the VLMs' pre-training could negatively affect the properties of the predicted conformal sets for specific tasks. While common practice in transfer learning for discriminative purposes involves an adaptation stage, we observe that deploying such a solution for conformal purposes is suboptimal, since adapting the model using the available calibration data breaks the rigid exchangeability assumptions for test data in SCP. To address this issue, we propose transductive split conformal adaptation (SCA-T), a novel pipeline for transfer learning on conformal scenarios, which performs an unsupervised transductive adaptation jointly on calibration and test data. We present comprehensive experiments utilizing medical VLMs across various image modalities, transfer tasks, and non-conformity scores. Our framework offers consistent gains in efficiency and conditional coverage compared to SCP, maintaining the same empirical guarantees.
中文: 本研究提出了一种转导式分割共形适应(SCA-T)框架,通过在校准和测试数据上进行无监督转导适应,提升了医学视觉语言模型在迁移学习中的可靠性和预测效率,优于传统共形方法。
English: This study introduces a transductive split conformal adaptation (SCA-T) framework to enhance the reliability of medical vision-language models by ensuring efficient and conditionally accurate predictions during transfer learning, outperforming standard conformal methods.
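For readers unfamiliar with the SCP baseline that SCA-T builds on, the following is a minimal sketch of standard split conformal calibration: compute non-conformity scores on a held-out labeled set, take the finite-sample-corrected quantile, and include in each prediction set every class whose score clears the threshold. The Dirichlet-sampled probabilities stand in for a model's softmax outputs.

```python
import numpy as np

def conformal_quantile(scores: np.ndarray, alpha: float) -> float:
    """(1 - alpha)-quantile of calibration scores with the standard
    finite-sample correction used in split conformal prediction."""
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(level, 1.0), method="higher")

def prediction_sets(probs: np.ndarray, qhat: float):
    """Include every class whose non-conformity score (1 - probability)
    falls at or below the calibrated threshold."""
    return [np.where(1.0 - p <= qhat)[0] for p in probs]

rng = np.random.default_rng(0)
# Calibration: score each example by 1 - probability of its true class.
cal_probs = rng.dirichlet(np.ones(5), size=200)
cal_labels = rng.integers(0, 5, size=200)
cal_scores = 1.0 - cal_probs[np.arange(200), cal_labels]
qhat = conformal_quantile(cal_scores, alpha=0.1)
# Test-time: sets are guaranteed ~90% coverage under exchangeability.
sets = prediction_sets(rng.dirichlet(np.ones(5), size=3), qhat)
```

SCA-T's contribution is to adapt the model transductively on calibration and test data jointly, which preserves the exchangeability this quantile argument relies on.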
Authors:Julio Silva-Rodríguez, Fereshteh Shakeri, Houda Bahig, Jose Dolz, Ismail Ben Ayed
Abstract:
Vision-language models (VLMs) are gaining attention in medical image analysis. These are pre-trained on large, heterogeneous data sources, yielding rich and transferable representations. Notably, the combination of modality-specialized VLMs with few-shot adaptation has provided fruitful results, enabling the efficient deployment of high-performing solutions. However, previous works on this topic make strong assumptions about the distribution of adaptation data, which are unrealistic in the medical domain. First, prior art assumes access to a balanced support set, a condition that breaks the natural imbalance in disease prevalence found in real-world scenarios. Second, these works typically assume the presence of an additional validation set to fix critical hyper-parameters, which is highly data-inefficient. This work challenges these favorable deployment scenarios and introduces a realistic, imbalanced, validation-free adaptation setting. Our extensive benchmark across various modalities and downstream tasks demonstrates that current methods systematically compromise their performance when operating under realistic conditions, occasionally even performing worse than zero-shot inference. Also, we introduce a training-free linear probe that adaptively blends visual and textual supervision. Detailed studies demonstrate that the proposed solver is a strong, efficient baseline, enabling robust adaptation in challenging scenarios.
中文摘要:视觉语言模型在医学影像分析中潜力显著,但现有方法依赖平衡数据和验证集等不切实际的假设;本研究提出无需训练的自适应融合方法,在更贴近现实的不平衡场景下实现了鲁棒性能。
English Summary: Vision-language models show promise in medical imaging but rely on unrealistic assumptions about balanced data and validation sets, prompting this study to propose a training-free adaptive method that performs robustly under more practical, imbalanced conditions.
Authors:Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, Kaidi Xu
Abstract:
As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios, e.g., LLM agentic systems, underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) internal uncertainty intrinsic to the current decision, which is the focus of existing UQ methods, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions. We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated over extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Codes will be available at https://github.com/jinhaoduan/UProp.
中文: 本文提出UProp框架,将大语言模型序列决策的不确定性分解为内在和外在两部分,在多步决策基准测试中显著优于现有方法。
English: This paper introduces UProp, a novel framework that decomposes LLM sequential decision uncertainty into intrinsic and extrinsic components, significantly outperforming existing methods in multi-step decision-making benchmarks.
Authors:Zijun Sun, Solveig Thrun, Michael Kampffmeyer
Abstract:
Breast cancer remains a leading cause of mortality worldwide and is typically detected via screening programs where healthy people are invited at regular intervals. Automated risk prediction approaches have the potential to improve this process by facilitating dynamic screening of high-risk groups. While most models focus solely on the most recent screening, there is growing interest in exploiting temporal information to capture evolving trends in breast tissue, as inspired by clinical practice. Early methods typically relied on two time steps, and although recent efforts have extended this to multiple time steps using Transformer architectures, challenges remain in fully harnessing the rich temporal dynamics inherent in longitudinal imaging data. In this work, we propose to instead leverage Vision Mamba RNN (VMRNN) with a state-space model (SSM) and LSTM-like memory mechanisms to effectively capture nuanced trends in breast tissue evolution. To further enhance our approach, we incorporate an asymmetry module that utilizes a Spatial Asymmetry Detector (SAD) and Longitudinal Asymmetry Tracker (LAT) to identify clinically relevant bilateral differences. This integrated framework demonstrates notable improvements in predicting cancer onset, especially for the more challenging high-density breast cases, and achieves superior performance at extended time points (years four and five), highlighting its potential to advance early breast cancer recognition and enable more personalized screening strategies. Our code is available at https://github.com/Mortal-Suen/VMRA-MaR.git.
中文: 本研究提出了一种新颖的Vision Mamba RNN框架,通过整合不对称性检测模块来捕捉乳腺组织的细微演化趋势和双侧差异,显著提升了高风险病例和长期筛查中的乳腺癌预测性能。
English: This study introduces a novel Vision Mamba RNN framework enhanced with asymmetry detection to improve breast cancer risk prediction by capturing nuanced tissue evolution trends and bilateral differences, demonstrating superior performance in challenging cases and long-term screenings.
Authors:Sunjun Kweon, Sooyohn Nam, Hyunseung Lim, Hwajung Hong, Edward Choi
Abstract:
Virtual Teaching Assistants (VTAs) powered by Large Language Models (LLMs) have the potential to enhance student learning by providing instant feedback and facilitating multi-turn interactions. However, empirical studies on their effectiveness and acceptance in real-world classrooms are limited, leaving their practical impact uncertain. In this study, we develop an LLM-based VTA and deploy it in an introductory AI programming course with 477 graduate students. To assess how student perceptions of the VTA's performance evolve over time, we conduct three rounds of comprehensive surveys at different stages of the course. Additionally, we analyze 3,869 student--VTA interaction pairs to identify common question types and engagement patterns. We then compare these interactions with traditional student--human instructor interactions to evaluate the VTA's role in the learning process. Through a large-scale empirical study and interaction analysis, we assess the feasibility of deploying VTAs in real-world classrooms and identify key challenges for broader adoption. Finally, we release the source code of our VTA system, fostering future advancements in AI-driven education: https://github.com/sean0042/VTA.
中文: 本研究开发了一个基于大语言模型的虚拟教学助手,在477名学生的真实课堂中部署,通过调查和互动分析评估其有效性,并指出了广泛采用面临的主要挑战。
English: This study develops a large language model-based virtual teaching assistant, deploys it in a real classroom with 477 students, and through surveys and interaction analysis, evaluates its effectiveness and identifies key challenges for broader adoption.
Authors:Haitian Wang, Yiren Wang, Xinyu Wang, Yumeng Miao, Yuliang Zhang, Yu Zhang, Atif Mansoor
Abstract:
By 2050, people aged 65 and over are projected to make up 16 percent of the global population. Aging is closely associated with increased fall risk, particularly in wet and confined environments such as bathrooms, where over 80 percent of falls occur. Although recent research has increasingly focused on non-intrusive, privacy-preserving approaches that do not rely on wearable devices or video-based monitoring, these efforts have not fully overcome the limitations of existing unimodal systems (e.g., WiFi-, infrared-, or mmWave-based), which are prone to reduced accuracy in complex environments. These limitations stem from fundamental constraints in unimodal sensing, including system bias and environmental interference, such as multipath fading in WiFi-based systems and drastic temperature changes in infrared-based methods. To address these challenges, we propose a Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments. First, we develop a sensor evaluation framework to select and fuse millimeter-wave radar with 3D vibration sensing, and use it to construct and preprocess a large-scale, privacy-preserving multimodal dataset in real bathroom settings, which will be released upon publication. Second, we introduce P2MFDS, a dual-stream network combining a CNN-BiLSTM-Attention branch for radar motion dynamics with a multi-scale CNN-SEBlock-Self-Attention branch for vibration impact detection. By uniting macro- and micro-scale features, P2MFDS delivers significant gains in accuracy and recall over state-of-the-art approaches. Code and pretrained models will be made available at: https://github.com/HaitianWang/P2MFDS-A-Privacy-Preserving-Multimodal-Fall-Detection-Network-for-Elderly-Individuals-in-Bathroom.
中文: 针对2050年全球65岁以上人口将占16%且浴室跌倒高发的现状,本研究提出P2MFDS多模态监测系统,通过融合毫米波雷达与振动传感技术,在保护隐私的前提下实现了较现有方法更优的检测精度与召回率。
English: By 2050, people aged 65 and over will make up 16% of the global population, and bathrooms are high-risk fall areas, prompting the development of P2MFDS, a privacy-preserving multimodal system using radar and vibration sensors that significantly outperforms existing methods in accuracy and recall.
Authors:Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky, Nils Gruschka, Bertalan Borsos, Mohamed Amine Ferrag, Attila Kovacs, Vasileios Mavroeidis, Norbert Tihanyi
Abstract:
Detecting AI-generated code, deepfakes, and other synthetic content is an emerging research challenge. As code generated by Large Language Models (LLMs) becomes more common, identifying the specific model behind each sample is increasingly important. This paper presents the first systematic study of LLM authorship attribution for C programs. We release CodeT5-Authorship, a novel model that uses only the encoder layers from the original CodeT5 encoder-decoder architecture, discarding the decoder to focus on classification. Our model's encoder output (first token) is passed through a two-layer classification head with GELU activation and dropout, producing a probability distribution over possible authors. To evaluate our approach, we introduce LLM-AuthorBench, a benchmark of 32,000 compilable C programs generated by eight state-of-the-art LLMs across diverse tasks. We compare our model to seven traditional ML classifiers and eight fine-tuned transformer models, including BERT, RoBERTa, CodeBERT, ModernBERT, DistilBERT, DeBERTa-V3, Longformer, and LoRA-fine-tuned Qwen2-1.5B. In binary classification, our model achieves 97.56% accuracy in distinguishing C programs generated by closely related models such as GPT-4.1 and GPT-4o, and 95.40% accuracy for multi-class attribution among five leading LLMs (Gemini 2.5 Flash, Claude 3.5 Haiku, GPT-4.1, Llama 3.3, and DeepSeek-V3). To support open science, we release the CodeT5-Authorship architecture, the LLM-AuthorBench benchmark, and all relevant Google Colab scripts on GitHub: https://github.com/LLMauthorbench/.
中文: 本文提出了CodeT5-Authorship模型,用于识别C程序的具体大语言模型作者,并发布了包含32,000个可编译C程序的LLM-AuthorBench基准,在二元和多元分类任务中均实现了高准确率。
English: This paper introduces CodeT5-Authorship, a novel model for identifying the specific LLM authors of C programs, and presents LLM-AuthorBench, a benchmark of 32,000 compilable C programs, achieving high accuracy in both binary and multi-class attribution tasks.
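The classification head described in the abstract (first-token encoder output, two linear layers with GELU and dropout) is simple enough to sketch directly. The hidden size, number of authors, and dropout rate below are illustrative assumptions; the encoder itself is stubbed with random activations rather than actual CodeT5 states.

```python
import torch
import torch.nn as nn

class AuthorshipHead(nn.Module):
    """Two-layer classification head with GELU activation and dropout,
    applied to the first token of the encoder output, following the
    description of CodeT5-Authorship (dimensions assumed)."""

    def __init__(self, hidden: int = 768, n_authors: int = 8, p: float = 0.1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Dropout(p),
            nn.Linear(hidden, n_authors),
        )

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        first_token = encoder_states[:, 0, :]  # [batch, hidden]
        return self.head(first_token)          # logits over candidate LLMs

# Stand-in for CodeT5 encoder states: batch of 4, sequence length 128.
logits = AuthorshipHead()(torch.randn(4, 128, 768))
probs = logits.softmax(dim=-1)  # probability distribution over authors
```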
Authors:Zhixiang Chi, Li Gu, Huan Liu, Ziqiang Wang, Yanan Wu, Yang Wang, Konstantinos N Plataniotis
Abstract:
Few-shot Test-Time Domain Adaptation focuses on adapting a model at test time to a specific domain using only a few unlabeled examples, addressing domain shift. Prior methods leverage CLIP's strong out-of-distribution (OOD) abilities by generating domain-specific prompts to guide its generalized, frozen features. However, since downstream datasets are not explicitly seen by CLIP, solely depending on the feature space knowledge is constrained by CLIP's prior knowledge. Notably, when using a less robust backbone like ViT-B/16, performance significantly drops on challenging real-world benchmarks. Departing from the state-of-the-art of inheriting the intrinsic OOD capability of CLIP, this work introduces learning directly on the input space to complement the dataset-specific knowledge for frozen CLIP. Specifically, an independent side branch is attached in parallel with CLIP and enforced to learn exclusive knowledge via revert attention. To better capture the dataset-specific label semantics for downstream adaptation, we propose to enhance the inter-dispersion among text features via greedy text ensemble and refinement. The text and visual features are then progressively fused in a domain-aware manner by a generated domain prompt to adapt toward a specific domain. Extensive experiments show our method's superiority on 5 large-scale benchmarks (WILDS and DomainNet), notably improving over smaller networks like ViT-B/16 with gains of +5.1 in F1 for iWildCam and +3.1% in WC Acc for FMoW.
中文摘要:本文提出了一种新方法,通过在CLIP冻结特征基础上增加输入空间学习分支,结合逆向注意力和改进的文本特征分散技术,显著提升了小样本测试时域自适应性能,在多个基准测试中取得突破性进展。
English Summary: This paper introduces a novel method that enhances few-shot test-time domain adaptation by learning directly in the input space alongside CLIP's frozen features, using a side branch with revert attention and improved text feature dispersion, achieving significant performance gains on challenging benchmarks.
Authors:Yijun Lin, Theresa Chen, Colby Brungard, Grunwald Sabine, Sue Ives, Matt Macander, Timm Nawrocki, Yao-Yi Chiang, Nic Jelinski
Abstract:
Fine-scale soil mapping in Alaska, traditionally relying on fieldwork and localized simulations, remains a critical yet underdeveloped task, despite the region's ecological importance and extensive permafrost coverage. As permafrost thaw accelerates due to climate change, it threatens infrastructure stability and key ecosystem services, such as soil carbon storage. High-resolution soil maps are essential for characterizing permafrost distribution, identifying vulnerable areas, and informing adaptation strategies. We present MISO, a vision-based machine learning (ML) model to produce statewide fine-scale soil maps for near-surface permafrost and soil taxonomy. The model integrates a geospatial foundation model for visual feature extraction, implicit neural representations for continuous spatial prediction, and contrastive learning for multimodal alignment and geo-location awareness. We compare MISO with Random Forest (RF), a traditional ML model that has been widely used in soil mapping applications. Spatial cross-validation and regional analysis across Permafrost Zones and Major Land Resource Areas (MLRAs) show that MISO generalizes better to remote, unseen locations and achieves higher recall than RF, which is critical for monitoring permafrost thaw and related environmental processes. These findings demonstrate the potential of advanced ML approaches for fine-scale soil mapping and provide practical guidance for future soil sampling and infrastructure planning in permafrost-affected landscapes. The project will be released at https://github.com/knowledge-computing/Peatland-permafrost.
中文: 本研究提出的MISO视觉机器学习模型,在阿拉斯加高分辨率土壤制图中优于传统方法,为气候变化下的冻土监测和基础设施规划提供了更有效的解决方案。
English: This study introduces MISO, a vision-based machine learning model that outperforms traditional methods in generating high-resolution soil maps for Alaska, enabling better permafrost monitoring and infrastructure planning amid climate change.
Authors:Satyam Mishra, Phung Thao Vi, Shivam Mishra, Vishwanath Bijalwan, Vijay Bhaskar Semwal, Abdul Manan Khan
Abstract:
We introduce SafeRL-Lite, an open-source Python library for building reinforcement learning (RL) agents that are both constrained and explainable. Existing RL toolkits often lack native mechanisms for enforcing hard safety constraints or producing human-interpretable rationales for decisions. SafeRL-Lite provides modular wrappers around standard Gym environments and deep Q-learning agents to enable: (i) safety-aware training via constraint enforcement, and (ii) real-time post-hoc explanation via SHAP values and saliency maps. The library is lightweight, extensible, and installable via pip, and includes built-in metrics for constraint violations. We demonstrate its effectiveness on constrained variants of CartPole and provide visualizations that reveal both policy logic and safety adherence. The full codebase is available at: https://github.com/satyamcser/saferl-lite.
中文: SafeRL-Lite 是一个开源 Python 库,通过模块化封装和实时解释工具,能够开发具有内置安全约束和可解释性功能的强化学习智能体。
English: SafeRL-Lite is an open-source Python library that enables the development of reinforcement learning agents with built-in safety constraints and explainability features through modular wrappers and real-time explanation tools.
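To illustrate the kind of constraint-enforcing wrapper the library describes, here is a minimal sketch built on the standard Gymnasium Wrapper API. This is a hypothetical interface written for illustration, not SafeRL-Lite's actual API: the class name, constraint_fn signature, and penalty scheme are all assumptions.

```python
import gymnasium as gym

class ConstraintWrapper(gym.Wrapper):
    """Illustrative safety wrapper in the spirit of SafeRL-Lite: penalize
    and terminate on constraint violations, and expose a violation count
    as a built-in metric. (Hypothetical sketch, not the library's API.)"""

    def __init__(self, env, constraint_fn, penalty: float = -10.0):
        super().__init__(env)
        self.constraint_fn = constraint_fn  # obs -> bool (True = safe)
        self.penalty = penalty
        self.violations = 0

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if not self.constraint_fn(obs):
            self.violations += 1      # built-in violation metric
            reward += self.penalty    # hard penalty on unsafe states
            terminated = True
        info["constraint_violations"] = self.violations
        return obs, reward, terminated, truncated, info

# Example: constrain CartPole's cart position to a narrow safe region.
env = ConstraintWrapper(gym.make("CartPole-v1"), lambda obs: abs(obs[0]) < 1.0)
```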
Authors:Yuqi Li, Junhao Dong, Zeyu Dong, Chuanguang Yang, Zhulin An, Yongjun Xu
Abstract:
3D point cloud segmentation faces practical challenges due to the computational complexity and deployment limitations of large-scale transformer-based models. To address this, we propose a novel Structure- and Relation-aware Knowledge Distillation framework, named SRKD, that transfers rich geometric and semantic knowledge from a large frozen teacher model (>100M parameters) to a lightweight student model (<15M parameters). Specifically, we propose an affinity matrix-based relation alignment module, which distills structural dependencies from the teacher to the student through point-wise similarity matching, enhancing the student's capability to learn contextual interactions. Meanwhile, we introduce a cross-sample mini-batch construction strategy that enables the student to perceive stable and generalized geometric structure by aligning across diverse point cloud instances of the teacher, rather than within a single sample. Additionally, KL divergence is applied to align semantic distributions, and ground-truth supervision further reinforces accurate segmentation. Our method achieves state-of-the-art performance with significantly reduced model complexity, demonstrating its effectiveness and efficiency in real-world deployment scenarios. Our code is available at https://github.com/itsnotacie/SRKD.
中文: 提出的SRKD框架通过关系对齐和跨样本训练,将大型教师模型的几何与语义知识高效迁移至轻量学生模型,以显著降低的复杂度实现了最先进的3D点云分割性能。
English: The proposed SRKD framework efficiently transfers geometric and semantic knowledge from a large teacher model to a lightweight student model through relation alignment and cross-sample training, achieving state-of-the-art 3D segmentation with significantly reduced complexity.
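The affinity-based relation alignment idea can be sketched compactly: rather than matching features directly (which have different dimensions for teacher and student), match each model's pairwise point-to-point similarity matrix. The cosine-normalized affinities and MSE objective below are a common instantiation, assumed for illustration.

```python
import torch
import torch.nn.functional as F

def relation_alignment_loss(teacher_feats: torch.Tensor,
                            student_feats: torch.Tensor) -> torch.Tensor:
    """Sketch of affinity-matrix relation distillation: align the N x N
    point-wise similarity structure of teacher and student features,
    which is dimension-agnostic across the two models."""
    t = F.normalize(teacher_feats, dim=-1)  # [N, D_teacher]
    s = F.normalize(student_feats, dim=-1)  # [N, D_student]
    affinity_t = t @ t.T                    # [N, N] teacher affinities
    affinity_s = s @ s.T                    # [N, N] student affinities
    return F.mse_loss(affinity_s, affinity_t)

# Teacher and student feature dimensions can differ; affinities cannot.
loss = relation_alignment_loss(torch.randn(1024, 256), torch.randn(1024, 64))
```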
Authors:Jiale Zhang, Jiaxiang Chen, Zhucong Li, Jie Ding, Kui Zhao, Zenglin Xu, Xin Pang, Yinghui Xu
Abstract:
Retrieval-Augmented Generation (RAG) enhances language models by incorporating external knowledge at inference time. However, graph-based RAG systems often suffer from structural overhead and imprecise retrieval: they require costly pipelines for entity linking and relation extraction, yet frequently return subgraphs filled with loosely related or tangential content. This stems from a fundamental flaw -- semantic similarity does not imply semantic relevance. We introduce SlimRAG, a lightweight framework for retrieval without graphs. SlimRAG replaces structure-heavy components with a simple yet effective entity-aware mechanism. At indexing time, it constructs a compact entity-to-chunk table based on semantic embeddings. At query time, it identifies salient entities, retrieves and scores associated chunks, and assembles a concise, contextually relevant input -- without graph traversal or edge construction. To quantify retrieval efficiency, we propose Relative Index Token Utilization (RITU), a metric measuring the compactness of retrieved content. Experiments across multiple QA benchmarks show that SlimRAG outperforms strong flat and graph-based baselines in accuracy while reducing index size and RITU (e.g., 16.31 vs. 56+), highlighting the value of structure-free, entity-centric context selection. The code will be released soon. https://github.com/continue-ai-company/SlimRAG
中文:SlimRAG提出了一种轻量级、以实体为中心的框架,通过摒弃图结构提升了检索精度与效率,在多项基准测试中以更小的索引规模和更简洁的检索机制实现了优于现有方法的准确率。
English: SlimRAG introduces a lightweight, entity-centric framework that eliminates graph structures to enhance retrieval precision and efficiency, outperforming existing methods in accuracy while significantly reducing index size and retrieval complexity.
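The entity-to-chunk table at the heart of SlimRAG is essentially an inverted index keyed by entities, with no graph traversal at query time. Here is a minimal sketch of that mechanism; the extract_entities and score callables are placeholders for the paper's embedding-based entity identification and chunk scoring, and the function names are assumptions.

```python
from collections import defaultdict

# Structure-free index: entity -> set of chunk ids (no edges, no subgraphs).
index: dict = defaultdict(set)

def add_chunk(chunk_id: str, text: str, extract_entities) -> None:
    """Indexing time: register each chunk under every entity it mentions."""
    for entity in extract_entities(text):
        index[entity].add(chunk_id)

def retrieve(query: str, extract_entities, score, top_k: int = 5) -> list:
    """Query time: collect chunks linked to salient query entities,
    then rank them by a relevance score; no graph traversal involved."""
    candidates: set = set()
    for entity in extract_entities(query):
        candidates |= index.get(entity, set())
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_k]
```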
Authors:Youzheng Liu, Jiyan Liu, Xiaoman Xu, Taihang Wang, Yimin Wang, Ye Jiang
Abstract:
This paper describes the participation of QUST_NLP in the SemEval-2025 Task 7. We propose a three-stage retrieval framework specifically designed for fact-checked claim retrieval. Initially, we evaluate the performance of several retrieval models and select the one that yields the best results for candidate retrieval. Next, we employ multiple re-ranking models to enhance the candidate results, with each model selecting the Top-10 outcomes. In the final stage, we utilize weighted voting to determine the final retrieval outcomes. Our approach achieved 5th place in the monolingual track and 7th place in the crosslingual track. We release our system code at: https://github.com/warmth27/SemEval2025_Task7
中文: 本文介绍了QUST_NLP团队为SemEval-2025任务7设计的三阶段检索框架,通过候选检索、多模型重排序和加权投票的融合方法,在单语和跨语言赛道分别获得第五名和第七名的成绩。
English: This paper presents QUST_NLP's three-stage retrieval framework for SemEval-2025 Task 7, which combines candidate retrieval, multi-model re-ranking, and weighted voting to achieve 5th and 7th place in monolingual and crosslingual tracks respectively.
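The third-stage fusion can be sketched in a few lines. The abstract specifies weighted voting over each re-ranker's Top-10 list but not the exact vote formula; reciprocal-rank weighting below is an assumed, common choice.

```python
from collections import Counter

def weighted_vote(rankings: list, weights: list) -> list:
    """Fuse Top-10 lists from several re-rankers by weighted voting: each
    candidate earns weight / rank from every list containing it (the
    reciprocal-rank scheme here is an illustrative assumption)."""
    scores = Counter()
    for ranked, w in zip(rankings, weights):
        for rank, cand in enumerate(ranked[:10], start=1):
            scores[cand] += w / rank
    return [c for c, _ in scores.most_common()]

final = weighted_vote(
    rankings=[["a", "b", "c"], ["b", "a", "d"]],
    weights=[0.6, 0.4],
)  # -> candidates ordered by fused score
```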
Authors:Chenghan Li, Mingchen Li, Yipu Liao, Ruisheng Diao
Abstract:
Long-term time series prediction has predominantly relied on Transformer and MLP models, while the potential of convolutional networks in this domain remains underexplored. To address this gap, we introduce a novel multi-scale time series reshape module, which effectively captures the relationships among multi-period patches and variable dependencies. Building upon this module, we propose MS-TVNet, a multi-scale 3D dynamic convolutional neural network. Through comprehensive evaluations on diverse datasets, MS-TVNet demonstrates superior performance compared to baseline models, achieving state-of-the-art (SOTA) results in long-term time series prediction. Our findings highlight the effectiveness of leveraging convolutional networks for capturing complex temporal patterns, suggesting a promising direction for future research in this field. The code is released at https://github.com/Curyyfaust/TVNet.
Chinese: 本研究提出了MS-TVNet,一种多尺度三维动态卷积神经网络,通过有效捕捉多周期模式和变量依赖关系,在长期时间序列预测中实现了最先进的性能。
English: This study introduces MS-TVNet, a multi-scale 3D dynamic convolutional neural network that achieves state-of-the-art performance in long-term time series prediction by effectively capturing multi-period patterns and variable dependencies.
Authors:Fudong Lin, Jiadong Lou, Hao Wang, Brian Jalaian, Xu Yuan
Abstract:
Sparse attacks optimize the magnitude of adversarial perturbations for fooling deep neural networks (DNNs) while involving only a few perturbed pixels (i.e., under the l0 constraint), making them suitable for interpreting the vulnerability of DNNs. However, existing solutions fail to yield interpretable adversarial examples due to their poor sparsity. Worse still, they often struggle with heavy computational overhead, poor transferability, and weak attack strength. In this paper, we aim to develop a sparse attack for understanding the vulnerability of CNNs by minimizing the magnitude of initial perturbations under the l0 constraint, overcoming the existing drawbacks while achieving a fast, transferable, and strong attack on DNNs. In particular, a novel and theoretically sound parameterization technique is introduced to approximate the NP-hard l0 optimization problem, making it computationally feasible to directly optimize sparse perturbations. Besides, a novel loss function is designed to augment initial perturbations by maximizing the adversarial property and minimizing the number of perturbed pixels simultaneously. Extensive experiments are conducted to demonstrate that our approach, with theoretical performance guarantees, outperforms state-of-the-art sparse attacks in terms of computational overhead, transferability, and attack strength, expecting to serve as a benchmark for evaluating the robustness of DNNs. In addition, theoretical and empirical results validate that our approach yields sparser adversarial examples, empowering us to discover two categories of noises, i.e., "obscuring noise" and "leading noise", which will help interpret how adversarial perturbation misleads the classifiers into incorrect predictions. Our code is available at https://github.com/fudong03/SparseAttack.
中文: 本文提出一种新型稀疏攻击方法,通过在l0约束下最小化初始扰动来高效生成高可解释性的对抗样本,在计算效率、可迁移性和攻击强度方面表现优异,同时揭示了两类噪声以解释深度神经网络的脆弱性。
English: This paper introduces a novel sparse attack method that efficiently generates highly interpretable adversarial examples by minimizing initial perturbations under l0 constraints, achieving superior computational efficiency, transferability, and attack strength while revealing two noise categories to explain DNN vulnerabilities.
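To give a sense of how an NP-hard l0 objective can be made differentiable, here is a common sigmoid-gate relaxation: a per-pixel gate selects which pixels carry perturbation, and the gate's mass serves as a sparsity proxy. This is a generic sketch of the technique family, not the paper's specific parameterization; the hyperparameters and variable names are assumptions.

```python
import torch
import torch.nn.functional as F

# Gate logits (theta) and perturbation values (delta), optimized jointly.
theta = torch.zeros(1, 3, 32, 32, requires_grad=True)
delta = torch.zeros(1, 3, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([theta, delta], lr=0.05)

def attack_step(model, x, y, sparsity_weight: float = 0.1) -> float:
    """One optimization step of a relaxed l0 sparse attack (illustrative)."""
    mask = torch.sigmoid(theta)               # soft pixel-selection gate
    x_adv = (x + mask * delta).clamp(0, 1)    # keep a valid image
    adv_loss = -F.cross_entropy(model(x_adv), y)  # push toward misclassification
    l0_proxy = mask.sum()                     # differentiable sparsity proxy
    loss = adv_loss + sparsity_weight * l0_proxy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After optimization, thresholding the mask (e.g., mask > 0.5) recovers a hard pixel set that approximately satisfies the l0 constraint.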
Authors:Jianing He, Qi Zhang, Duoqian Miao, Yi Kun, Shufeng Hao, Hongyun Zhang, Zhihua Wei
Abstract:
Early exiting has demonstrated great potential in accelerating the inference of pre-trained language models (PLMs) by enabling easy samples to exit at shallow layers, eliminating the need for executing deeper layers. However, existing early exiting methods primarily rely on class-relevant logits to formulate their exiting signals for estimating prediction certainty, neglecting the detrimental influence of class-irrelevant information in the features on prediction certainty. This leads to an overestimation of prediction certainty, causing premature exiting of samples with incorrect early predictions. To remedy this, we define an NSP score to estimate prediction certainty by considering the proportion of class-irrelevant information in the features. On this basis, we propose a novel early exiting method based on the Certainty-Aware Probability (CAP) score, which integrates insights from both logits and the NSP score to enhance prediction certainty estimation, thus enabling more reliable exiting decisions. The experimental results on the GLUE benchmark show that our method can achieve an average speed-up ratio of 2.19x across all tasks with negligible performance degradation, surpassing the state-of-the-art (SOTA) ConsistentEE by 28%, yielding a better trade-off between task performance and inference efficiency. The code is available at https://github.com/He-Jianing/NSP.git.
中文摘要:本研究提出的基于确定性感知概率(CAP)的早期退出方法,通过结合类别相关logits和衡量类别无关信息的NSP评分来改进预训练语言模型的早期退出机制,在GLUE基准测试中实现了2.19倍平均加速且性能损失可忽略,以28%优势超越现有最优方法。
English Summary: The proposed Certainty-Aware Probability (CAP) method enhances early exiting in pre-trained language models by integrating class-relevant logits with a novel NSP score that measures class-irrelevant information, achieving a 2.19x average speed-up on GLUE tasks with minimal performance loss and outperforming prior methods by 28%.
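The early-exit control flow itself is method-agnostic and worth seeing concretely. In the sketch below, max softmax probability stands in for the paper's CAP score (which additionally folds in the NSP measure of class-irrelevant information); the loop structure is the generic mechanism, processed per sample for simplicity.

```python
import torch

def early_exit_forward(layers, classifiers, x, threshold: float = 0.9):
    """Generic confidence-based early exiting for a single sample.
    `certainty` below is max softmax probability as a stand-in; the CAP
    score from the paper would replace it."""
    logits = None
    for depth, (layer, clf) in enumerate(zip(layers, classifiers)):
        x = layer(x)
        logits = clf(x[:, 0, :])              # classify from a [CLS]-like token
        certainty = logits.softmax(-1).max().item()
        if certainty >= threshold:            # easy sample: stop computing
            return logits, depth + 1
    return logits, len(layers)                # hard sample: run all layers
```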
Authors:Zequn Yang, Hongfa Wang, Di Hu
Abstract:
Interactions between modalities -- redundancy, uniqueness, and synergy -- collectively determine the composition of multimodal information. Understanding these interactions is crucial for analyzing information dynamics in multimodal systems, yet their accurate sample-level quantification presents significant theoretical and computational challenges. To address this, we introduce the Lightweight Sample-wise Multimodal Interaction (LSMI) estimator, rigorously grounded in pointwise information theory. We first develop a redundancy estimation framework, employing an appropriate pointwise information measure to quantify this most decomposable and measurable interaction. Building upon this, we propose a general interaction estimation method that employs efficient entropy estimation, specifically tailored for sample-wise estimation in continuous distributions. Extensive experiments on synthetic and real-world datasets validate LSMI's precision and efficiency. Crucially, our sample-wise approach reveals fine-grained sample- and category-level dynamics within multimodal data, enabling practical applications such as redundancy-informed sample partitioning, targeted knowledge distillation, and interaction-aware model ensembling. The code is available at https://github.com/GeWu-Lab/LSMI_Estimator.
中文: LSMI估计器基于逐点信息理论,可精确量化样本级别的多模态交互作用(冗余性、独特性和协同性),实现了细粒度数据分析,并支持知识蒸馏和模型集成等实际应用。
English: The LSMI estimator is introduced to accurately quantify sample-level multimodal interactions—redundancy, uniqueness, and synergy—using pointwise information theory, enabling fine-grained analysis and practical applications like knowledge distillation and model ensembling.
Authors:Yichen Luo, Jia Wang, Dapeng Lan, Yu Liu, Zhibo Pang
Abstract:
Partial Differential Equations (PDEs) are fundamental for modeling physical systems, yet solving them in a generic and efficient manner using machine learning-based approaches remains challenging due to limited multi-input and multi-scale generalization capabilities, as well as high computational costs. This paper proposes the Multi-input and Multi-scale Efficient Transformer (MMET), a novel framework designed to address the above challenges. MMET decouples mesh and query points as two sequences and feeds them into the encoder and decoder, respectively, and uses a Gated Condition Embedding (GCE) layer to embed input variables or functions with varying dimensions, enabling effective solutions for multi-scale and multi-input problems. Additionally, a Hilbert curve-based reserialization and patch embedding mechanism decreases the input length, significantly reducing the computational cost when dealing with large-scale geometric models. These innovations enable efficient representations and support multi-scale resolution queries for large-scale and multi-input PDE problems. Experimental evaluations on diverse benchmarks spanning different physical fields demonstrate that MMET outperforms SOTA methods in both accuracy and computational efficiency. This work highlights the potential of MMET as a robust and scalable solution for real-time PDE solving in engineering and physics-based applications, paving the way for future explorations into pre-trained large-scale models in specific domains. This work is open-sourced at https://github.com/YichenLuo-0/MMET.
中文: 本文提出多输入多尺度高效变换器(MMET),通过创新的序列解耦和嵌入技术克服了偏微分方程求解中多尺度泛化和计算效率的局限,在多个基准测试中展现出卓越的精度与效率优势。
English: This paper introduces the Multi-input and Multi-scale Efficient Transformer (MMET), a novel framework that overcomes limitations in multi-scale generalization and computational efficiency for solving Partial Differential Equations (PDEs) through innovative sequence decoupling and embedding techniques, demonstrating superior accuracy and efficiency across diverse benchmarks.
Authors:Xiuyu Yang, Shuhan Tan, Philipp Krähenbühl
Abstract:
An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for the initial agents in a scene, which is problematic for long-term simulation: agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation modes, enabling stable long-term rollout simulation. InfGen achieves state-of-the-art performance in short-term (9s) traffic simulation and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at https://orangesodahub.github.io/InfGen
中文: InfGen是一种统一的下一令牌预测模型,通过交替执行闭环运动模拟和场景生成来实现稳定的长期交通模拟,在短期和长期场景中均达到最优性能。
English: InfGen is a unified next-token prediction model that enables stable long-term traffic simulation by interleaving closed-loop motion simulation and scene generation, achieving state-of-the-art performance in both short-term and long-term scenarios.
Authors:Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, Xizhou Zhu, Yu Qiao, Jun Zhang, Wenqi Shao
Abstract:
Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from a progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension; in contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across the two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures, and achieves performance on par with or better than task-specific models.
Chinese: 统一图像理解与生成模型面临模态对齐模式的根本冲突,UniFork通过Y形架构在浅层共享表示学习,同时在深层采用任务特定分支来平衡共享学习与专业化,有效解决了这一问题。
English: Unified image understanding and generation models face a fundamental conflict in modality alignment patterns, which UniFork addresses through a Y-shaped architecture that shares shallow layers while employing task-specific deep branches to balance shared learning and specialization.
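The Y-shaped layout is easy to express structurally: one shared trunk, then two task-specific stacks. The sketch below uses vanilla Transformer encoder layers with illustrative depths and dimensions; UniFork's actual blocks, depths, and task interfaces are not specified here and these choices are assumptions.

```python
import torch
import torch.nn as nn

class YShapedModel(nn.Module):
    """Sketch of a UniFork-style Y-architecture: shared shallow layers for
    cross-task representation learning, then separate deep branches for
    understanding and generation (all sizes illustrative)."""

    def __init__(self, dim=512, shared_depth=6, branch_depth=6, heads=8):
        super().__init__()
        block = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.shared = nn.ModuleList(block() for _ in range(shared_depth))
        self.und_branch = nn.ModuleList(block() for _ in range(branch_depth))
        self.gen_branch = nn.ModuleList(block() for _ in range(branch_depth))

    def forward(self, tokens: torch.Tensor, task: str) -> torch.Tensor:
        for layer in self.shared:        # shared trunk: alignment builds up
            tokens = layer(tokens)
        branch = self.und_branch if task == "understanding" else self.gen_branch
        for layer in branch:             # deep layers diverge per task
            tokens = layer(tokens)
        return tokens

out = YShapedModel()(torch.randn(2, 16, 512), task="generation")
```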
Authors:Albert H. Li, Brandon Hung, Aaron D. Ames, Jiuguang Wang, Simon Le Cleac'h, Preston Culbertson
Abstract:
Recent advancements in parallel simulation and successful robotic applications are spurring a resurgence in sampling-based model predictive control. To build on this progress, however, the robotics community needs common tooling for prototyping, evaluating, and deploying sampling-based controllers. We introduce Judo, a software package designed to address this need. To facilitate rapid prototyping and evaluation, Judo provides robust implementations of common sampling-based MPC algorithms and standardized benchmark tasks. It further emphasizes usability with simple but extensible interfaces for controller and task definitions, asynchronous execution for straightforward simulation-to-hardware transfer, and a highly customizable interactive GUI for tuning controllers. While written in Python, the software leverages MuJoCo as its physics backend to achieve real-time performance, which we validate across both consumer and server-grade hardware. Code at https://github.com/bdaiinstitute/judo.
中文: 摘要介绍了Judo,这是一个基于Python的软件包,它提供了强大的采样模型预测控制算法实现和工具,用于快速原型设计、评估和部署,并利用MuJoCo在不同硬件上实现实时性能。
English: The abstract introduces Judo, a Python-based software package that provides robust implementations and tools for prototyping, evaluating, and deploying sampling-based model predictive control algorithms, leveraging MuJoCo for real-time performance across various hardware.
Authors:Qing Xu, Yuxiang Luo, Wenting Duan, Zhen Chen
Abstract:
Medical image analysis is critical yet challenged by the need to jointly segment organs or tissues and numerous instances for anatomical structure and tumor microenvironment analysis. Existing studies typically formulate different segmentation tasks in isolation, which overlooks the fundamental interdependencies between these tasks, leading to suboptimal segmentation performance and insufficient medical image understanding. To address this issue, we propose a Co-Seg++ framework for versatile medical segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing semantic and instance segmentation tasks to mutually enhance each other. We first devise a spatio-temporal prompt encoder (STP-Encoder) to capture long-range spatial and temporal relationships between segmentation regions and image embeddings as prior spatial constraints. Moreover, we devise a multi-task collaborative decoder (MTC-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, jointly computing semantic and instance segmentation masks. Extensive experiments on diverse CT and histopathology datasets demonstrate that the proposed Co-Seg++ outperforms state-of-the-art methods in the semantic, instance, and panoptic segmentation of dental anatomical structures, histopathology tissues, and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg-Plus.
中文摘要:Co-Seg++框架提出了一种新颖的协同分割范式,通过时空提示编码器和多任务协作解码器实现语义分割与实例分割任务的相互增强,在多种医学影像数据集上取得了最优的分割性能。
English Summary: The Co-Seg++ framework introduces a novel co-segmentation approach that enables mutual enhancement between semantic and instance segmentation tasks through a spatio-temporal prompt encoder and multi-task collaborative decoder, achieving state-of-the-art performance across multiple medical imaging datasets.
Authors:Adithya Bhaskar, Alexander Wettig, Tianyu Gao, Yihe Dong, Danqi Chen
Abstract:
Language models handle increasingly long contexts for tasks such as book summarization, but this leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings, obscuring caveats like high peak memory and performance degradation, and a fair comparison between methods is difficult. In this paper, we propose the KV footprint as a unified metric, which accounts for both the amount of KV entries stored and their lifespan in memory. We evaluate methods based on the smallest footprint they attain while preserving performance in both long-context understanding and generation, with context lengths of up to 128K tokens. This metric reveals the high peak memory of prior KV eviction methods. One class of methods, post-fill eviction, has a high footprint due to being incompatible with eviction during pre-filling. We adapt these methods to be able to evict KVs during pre-filling, achieving substantially lower KV footprints. We then turn to recency eviction methods, wherein we propose PruLong, an end-to-end optimization method for learning which attention heads need to retain the full KV cache and which do not. PruLong saves memory while preserving long-context performance, achieving a 12% smaller KV footprint than prior methods while retaining performance in challenging recall tasks. Our paper clarifies the complex tangle of long-context inference methods and paves the way for future development to minimize the KV footprint.
Chinese: 本文提出KV占用作为统一评估指标,揭示了现有键值缓存管理方法的局限性,并通过改进策略显著降低了内存使用,同时保持了长上下文任务中的性能表现。
English: This paper introduces the KV footprint as a unified metric to evaluate key-value cache management methods, revealing limitations in prior approaches and proposing adaptations that significantly reduce memory usage while maintaining performance in long-context tasks.
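One hedged reading of the metric, consistent with "amount of KV entries stored and their lifespan in memory," is to integrate the number of live entries over decoding steps and normalize by the never-evict baseline. The paper's exact normalization may differ; this sketch is an interpretation, not the official definition.

```python
def kv_footprint(events: dict) -> float:
    """Hedged sketch of a KV-footprint-style metric: sum each entry's
    lifespan (steps held in memory) and normalize by the cost of keeping
    every entry until the final step. `events` maps an entry id to its
    (created_step, evicted_step) interval."""
    total_steps = max(end for _, end in events.values())
    held = sum(end - start for start, end in events.values())
    never_evicted = sum(total_steps - start for start, _ in events.values())
    return held / never_evicted

# Three KV entries; "k1" is evicted early, e.g. during pre-filling.
print(kv_footprint({"k0": (0, 10), "k1": (1, 3), "k2": (5, 10)}))  # < 1.0
```

Under this reading, post-fill eviction methods score poorly because every entry's interval spans the entire pre-fill, inflating the numerator even if eviction is aggressive afterward.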
Authors:Teng Guo, Jingjin Yu
Abstract:
We introduce a robust framework, RGBTrack, for real-time 6D pose estimation and tracking that operates solely on RGB data, thereby eliminating the need for depth input for such dynamic and precise object pose tracking tasks. Building on the FoundationPose architecture, we devise a novel binary search strategy combined with a render-and-compare mechanism to efficiently infer depth and generate robust pose hypotheses from true-scale CAD models. To maintain stable tracking in dynamic scenarios, including rapid movements and occlusions, RGBTrack integrates state-of-the-art 2D object tracking (XMem) with a Kalman filter and a state machine for proactive object pose recovery. In addition, RGBTrack's scale recovery module dynamically adapts CAD models of unknown scale using an initial depth estimate, enabling seamless integration with modern generative reconstruction techniques. Extensive evaluations on benchmark datasets demonstrate that RGBTrack's novel depth-free approach achieves competitive accuracy and real-time performance, making it a promising practical solution for application areas including robotics, augmented reality, and computer vision.
The source code for our implementation will be made publicly available at https://github.com/GreatenAnoymous/RGBTrack.git.
中文:RGBTrack是一种仅使用RGB数据的实时6D姿态估计与跟踪框架,无需深度输入即可实现高精度和实时性能,适用于机器人和增强现实等领域。
English: RGBTrack is a robust, real-time 6D pose estimation and tracking framework that uses only RGB data, achieving competitive accuracy and efficiency without depth input, making it suitable for robotics and augmented reality applications.
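The binary-search-plus-render-and-compare idea admits a compact sketch: because a true-scale CAD model rendered too close appears too large on the image, the comparison signal is monotone in depth and can be bisected. The render_and_compare callable, depth bounds, and iteration count below are illustrative assumptions standing in for the actual rendering pipeline.

```python
def infer_depth(render_and_compare, lo: float = 0.1, hi: float = 5.0,
                iters: int = 20) -> float:
    """Sketch of RGBTrack-style depth inference: bisect the object depth
    (meters), rendering the true-scale CAD model at each candidate depth
    and comparing its on-image size with the observed 2D mask.
    `render_and_compare(depth)` returns > 0 if the rendering is too large
    (object must be farther) and < 0 if too small (hypothetical helper)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if render_and_compare(mid) > 0:
            lo = mid  # rendered too large -> true depth is farther
        else:
            hi = mid  # rendered too small -> true depth is closer
    return 0.5 * (lo + hi)
```

Twenty bisection steps shrink a 5 m search interval below half a millimeter, which is why a handful of renders suffices for a precise depth hypothesis.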
Authors:Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal
Abstract:
Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
中文:MEXA是一种无需训练的框架,能根据输入模态和任务需求动态选择和聚合专业专家模型,无需额外训练即可实现跨领域的透明高效多模态推理。
English: MEXA is a training-free framework that dynamically selects and aggregates specialized expert models based on input modality and task demands, enabling transparent and effective multimodal reasoning across diverse domains without additional training.
Authors:Ke Li, Chenyu Zhang, Yuxin Ding, Xianbiao Hu, Ruwen Qin
Abstract:
Driving scenes are inherently heterogeneous and dynamic. Multi-attribute scene identification, as a high-level visual perception capability, provides autonomous vehicles (AVs) with essential contextual awareness to understand, reason through, and interact with complex driving environments. Although scene identification is best modeled as a multi-label classification problem via multitask learning, it faces two major challenges: the difficulty of acquiring balanced, comprehensively annotated datasets and the need to re-annotate all training data when new attributes emerge. To address these challenges, this paper introduces a novel deep learning method that integrates Knowledge Acquisition and Accumulation (KAA) with Consistency-based Active Learning (CAL). KAA leverages monotask learning on heterogeneous single-label datasets to build a knowledge foundation, while CAL bridges the gap between single- and multi-label data, adapting the foundation model for multi-label scene classification. An ablation study on the newly developed Driving Scene Identification (DSI) dataset demonstrates a 56.1% improvement over an ImageNet-pretrained baseline. Moreover, KAA-CAL outperforms state-of-the-art multi-label classification methods on the BDD100K and HSD datasets, achieving this with 85% less data and even recognizing attributes unseen during foundation model training. The DSI dataset and KAA-CAL implementation code are publicly available at https://github.com/KELISBU/KAA-CAL .
中文: 本文提出新颖的KAA-CAL方法,通过整合单标签数据集的知识获取与主动学习,实现了高效的多标签驾驶场景分类,在显著减少数据需求的同时获得了卓越的性能提升。
English: This paper introduces a novel KAA-CAL method that integrates knowledge acquisition from single-label datasets with active learning to enable efficient multi-label driving scene classification, achieving significant performance improvements with substantially reduced data requirements.
Authors:Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li
Abstract:
Large Language Models (LLMs) often exhibit hallucinations, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM's internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://github.com/ECNU-Text-Computing/cot-hallu-detect.
中文:思维链提示方法能减少大语言模型的幻觉,但会掩盖检测所需的关键信号,从而削弱各种检测方法的有效性,揭示了推理与检测之间的权衡。
English: Chain-of-Thought prompting reduces hallucinations in Large Language Models but impairs detection methods by obscuring critical signals, revealing a trade-off between reasoning and detection effectiveness.
Authors:Marco Jiralerspong, Esther Derman, Danilo Vucetic, Nikolay Malkin, Bilun Sun, Tianyu Zhang, Pierre-Luc Bacon, Gauthier Gidel
Abstract:
A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.
Chinese: 本文针对科学发现中现有强化学习方法的不足,引入了一种处理代理奖励不确定性的稳健算子,提出了一种新算法,能在组合搜索空间中生成更高质量和更多样化的候选对象。
English: This paper addresses the limitations of existing reinforcement learning methods in scientific discovery by introducing a robust operator that handles uncertainty in proxy rewards, leading to a novel algorithm that produces higher-quality and more diverse candidates in combinatorial search spaces.
Authors:Sahil Kale, Vijaykant Nadadur
Abstract:
LaTeX's precision and flexibility in typesetting have made it the gold standard for the preparation of scientific documentation. Large Language Models (LLMs) present a promising opportunity for researchers to produce publication-ready material using LaTeX with natural language instructions, yet current benchmarks completely lack evaluation of this ability. By introducing TeXpert, our benchmark dataset with natural language prompts for generating LaTeX code focused on components of scientific documents across multiple difficulty levels, we conduct an in-depth analysis of LLM performance in this regard and identify frequent error types. Our evaluation across open and closed-source LLMs highlights multiple key findings: LLMs excelling on standard benchmarks perform poorly in LaTeX generation with a significant accuracy drop-off as the complexity of tasks increases; open-source models like DeepSeek v3 and DeepSeek Coder strongly rival closed-source counterparts in LaTeX tasks; and formatting and package errors are unexpectedly prevalent, suggesting a lack of diverse LaTeX examples in the training datasets of most LLMs. Our dataset, code, and model evaluations are available at https://github.com/knowledge-verse-ai/TeXpert.
中文: LaTeX的精确性使其成为科学文档的理想选择,而TeXpert基准测试表明,尽管大型语言模型在生成复杂LaTeX代码时准确性下降,但像DeepSeek这样的开源模型与闭源模型表现相当,同时揭示了因训练数据不足导致的常见格式错误。
English: LaTeX's precision makes it ideal for scientific documents, and the TeXpert benchmark reveals that while LLMs struggle with generating accurate LaTeX code as complexity increases, open-source models like DeepSeek compete closely with closed-source ones, exposing common formatting errors due to limited training data.
Authors:Annika Thomas, Robaire Galliath, Aleksander Garbuz, Luke Anger, Cormac O'Neill, Trevor Johst, Dami Thomas, George Lordos, Jonathan P. How
Abstract:
Global localization is necessary for autonomous operations on the lunar surface where traditional Earth-based navigation infrastructure, such as GPS, is unavailable. As NASA advances toward sustained lunar presence under the Artemis program, autonomous operations will be an essential component of tasks such as robotic exploration and infrastructure deployment. Tasks such as excavation and transport of regolith require precise pose estimation, but proposed approaches such as visual-inertial odometry (VIO) accumulate odometry drift over long traverses. Precise pose estimation is particularly important for upcoming missions such as the ISRU Pilot Excavator (IPEx) that rely on autonomous agents to operate over extended timescales and varied terrain. To help overcome odometry drift over long traverses, we propose LunarLoc, an approach to global localization that leverages instance segmentation for zero-shot extraction of boulder landmarks from onboard stereo imagery. Segment detections are used to construct a graph-based representation of the terrain, which is then aligned with a reference map of the environment captured during a previous session using graph-theoretic data association. This method enables accurate and drift-free global localization in visually ambiguous settings. LunarLoc achieves sub-cm level accuracy in multi-session global localization experiments, significantly outperforming the state of the art in lunar global localization. To encourage the development of further methods for global localization on the Moon, we release our datasets publicly with a playback module: https://github.com/mit-acl/lunarloc-data.
中文摘要:LunarLoc是一种月球表面全局定位方法,通过立体影像的零样本实例分割提取巨石地标,并采用基于图论的场景匹配技术,可在视觉模糊环境中实现厘米级精度的无漂移定位。
English Summary: LunarLoc is a global localization method for lunar surface operations that uses instance segmentation of boulder landmarks from stereo imagery and graph-based terrain matching to achieve sub-centimeter accuracy without odometry drift.
Authors:Bin Huang, Feihong Xu, Xinchong Shi, Shan Huang, Binxuan Li, Fei Li, Qiegen Liu
Abstract:
In clinical practice, single-radiotracer positron emission tomography (PET) is commonly used for imaging. Although multi-tracer PET imaging can provide supplementary information of radiotracers that are sensitive to physiological function changes, enabling a more comprehensive characterization of physiological and pathological states, the gamma-photon pairs generated by positron annihilation reactions of different tracers in PET imaging have the same energy, making it difficult to distinguish the tracer signals. In this study, a multi-latent space guided texture conditional diffusion transformer model (MS-CDT) is proposed for PET tracer separation. To the best of our knowledge, this is the first attempt to use texture condition and multi-latent space for tracer separation in PET imaging. The proposed model integrates diffusion and transformer architectures into a unified optimization framework, with the novel addition of texture masks as conditional inputs to enhance image details. By leveraging multi-latent space prior derived from different tracers, the model captures multi-level feature representations, aiming to balance computational efficiency and detail preservation. The texture masks, serving as conditional guidance, help the model focus on salient structural patterns, thereby improving the extraction and utilization of fine-grained image textures. When combined with the diffusion transformer backbone, this conditioning mechanism contributes to more accurate and robust tracer separation. To evaluate its effectiveness, the proposed MS-CDT is compared with several advanced methods on two types of 3D PET datasets: brain and chest scans. Experimental results indicate that MS-CDT achieved competitive performance in terms of image quality and preservation of clinically relevant information. Code is available at: https://github.com/yqx7150/MS-CDT.
中文: 本研究提出了一种多潜在空间引导的纹理条件扩散变换模型(MS-CDT),首次在PET多示踪剂成像中利用纹理条件和多潜在空间来区分难以分离的示踪剂信号,在保持图像质量和临床信息方面表现出优异性能。
English: This study introduces a multi-latent space guided texture conditional diffusion transformer model (MS-CDT), the first method to utilize texture conditions and multi-latent spaces for separating indistinguishable tracer signals in multi-tracer PET imaging, achieving competitive performance in preserving image quality and clinical information.
Authors:Jun Fu, Bin Tian, Haonan Chen, Shi Meng, Tingting Yao
Abstract:
Autonomous parking plays a vital role in intelligent vehicle systems, particularly in constrained urban environments where high-precision control is required. While traditional rule-based parking systems struggle with environmental uncertainties and lack adaptability in crowded or dynamic scenes, human drivers demonstrate the ability to park intuitively without explicit modeling. Inspired by this observation, we propose a Transformer-based end-to-end framework for autonomous parking that learns from expert demonstrations. The network takes as input surround-view camera images, goal-point representations, ego vehicle motion, and pedestrian trajectories. It outputs discrete control sequences including throttle, braking, steering, and gear selection. A novel cross-attention module integrates BEV features with target points, and a GRU-based pedestrian predictor enhances safety by modeling dynamic obstacles. We validate our method on the CARLA 0.9.14 simulator in both vertical and parallel parking scenarios. Experiments show our model achieves a high success rate of 96.57%, with average positional and orientation errors of 0.21 meters and 0.41 degrees, respectively. The ablation studies further demonstrate the effectiveness of key modules such as pedestrian prediction and goal-point attention fusion. The code and dataset will be released at: https://github.com/little-snail-f/ParkFormer.
中文摘要:本文提出了一种基于Transformer的端到端自动泊车框架,通过模仿专家驾驶行为,在集成鸟瞰图特征和行人预测模块后实现了96.57%的成功率,且位置与方向误差极小。
English Summary: This paper introduces a Transformer-based end-to-end autonomous parking framework that learns from expert demonstrations, achieving a 96.57% success rate with minimal positional and orientation errors through integrated BEV features and pedestrian prediction modules.
Authors:Semin Kim, Yeonwoo Cha, Jaehoon Yoo, Seunghoon Hong
Abstract:
We investigate a general approach for improving user prompts in text-to-image (T2I) diffusion models by finding prompts that maximize a reward function specified at test-time. Although diverse reward models are used for evaluating image generation, existing automated prompt engineering methods typically target specific reward configurations. Consequently, these specialized designs exhibit suboptimal performance when applied to new prompt engineering scenarios involving different reward models. To address this limitation, we introduce RATTPO (Reward-Agnostic Test-Time Prompt Optimization), a flexible test-time optimization method applicable across various reward scenarios without modification. RATTPO iteratively searches for optimized prompts by querying large language models (LLMs) without requiring reward-specific task descriptions. Instead, it uses the optimization trajectory and a novel reward-aware feedback signal (termed a "hint") as context. Empirical results demonstrate the versatility of RATTPO, effectively enhancing user prompts across diverse reward setups that assess various generation aspects, such as aesthetics, general human preference, or spatial relationships between objects. RATTPO surpasses other test-time search baselines in search efficiency, using up to 3.5 times less inference budget, and, given sufficient inference budget, achieves performance comparable to learning-based baselines that require reward-specific fine-tuning. The code is available at https://github.com/seminkim/RATTPO.
中文摘要:本文提出RATTPO方法,通过大语言模型迭代优化提示并结合奖励感知反馈,无需特定任务描述即可适应不同奖励场景,在文本到图像生成中实现了高效通用的提示优化,显著超越现有测试时搜索基线。
English Summary: This paper introduces RATTPO, a reward-agnostic test-time prompt optimization method that enhances text-to-image generation by iteratively refining prompts using LLMs and reward-aware feedback, achieving superior efficiency and adaptability across diverse reward scenarios without requiring task-specific modifications.
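The search loop itself is simple to sketch. In the snippet below, `llm`, `reward_fn`, and `hint_fn` are hypothetical callables standing in for the LLM query, the black-box reward (generate an image, then score it), and the reward-aware hint; the actual RATTPO prompt templates and hint construction are not reproduced.

```python
def test_time_prompt_search(user_prompt, llm, reward_fn, hint_fn,
                            n_rounds=10, k=4, memory=8):
    """Reward-agnostic prompt search sketched from the abstract."""
    trajectory = [(user_prompt, reward_fn(user_prompt))]
    for _ in range(n_rounds):
        # Context is only the trajectory plus a hint -- no reward-specific
        # task description is ever shown to the LLM.
        context = "\n".join(f"prompt: {p} | reward: {r:.3f}"
                            for p, r in trajectory)
        hint = hint_fn(trajectory)
        candidates = [llm(f"{context}\nhint: {hint}\nPropose a better prompt.")
                      for _ in range(k)]
        trajectory += [(c, reward_fn(c)) for c in candidates]
        trajectory = sorted(trajectory, key=lambda x: x[1],
                            reverse=True)[:memory]
    return trajectory[0]  # best (prompt, reward) found
```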
Authors:Chaehyeon Song, Dongjae Lee, Jongwoo Lim, Ayoung Kim
Abstract:
Camera calibration using planar targets has been widely favored, and two types of control points have been mainly considered as measurements: the corners of the checkerboard and the centroid of circles. Since a centroid is derived from numerous pixels, the circular pattern provides more precise measurements than the checkerboard. However, the existing projection model of circle centroids is biased under lens distortion, resulting in low performance. To surmount this limitation, we propose an unbiased projection model of the circular pattern and demonstrate its superior accuracy compared to the checkerboard. Complementing this, we introduce uncertainty into circular patterns to enhance calibration robustness and completeness. Defining centroid uncertainty improves the performance of calibration components, including pattern detection, optimization, and evaluation metrics. We also provide guidelines for performing good camera calibration based on the evaluation metric. The core concept of this approach is to model the boundary points of a two-dimensional shape as a Markov random field, considering its connectivity. The shape distribution is propagated to the centroid uncertainty through an appropriate shape representation based on Green's theorem. Consequently, the resulting framework achieves marked gains in calibration accuracy and robustness. The complete source code and demonstration video are available at https://github.com/chaehyeonsong/discocal.
中文: 本研究提出了一种无偏的圆形图案投影模型,以克服现有基于圆心的相机校准的局限性,通过引入圆心不确定性来提高精度和鲁棒性,结果显示其性能显著优于棋盘格方法。
English: The study introduces an unbiased projection model for circular patterns to overcome the limitations of existing centroid-based camera calibration, incorporating centroid uncertainty to enhance accuracy and robustness, with results showing significant improvements over checkerboard methods.
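The boundary-to-centroid step can be made concrete with a worked example. Green's theorem turns an area integral into a sum over ordered boundary points, which is why a distribution over boundary points can be pushed forward to the centroid; the sketch below computes only the centroid itself, not the uncertainty propagation or the Markov random field model.

```python
import numpy as np

def centroid_from_boundary(boundary):
    """Centroid of a closed 2D shape from ordered boundary points.

    boundary: (N, 2) array tracing the detected (distorted) circle edge.
    Uses the Green's-theorem identities
      A   = 1/2   * sum(x_i*y_{i+1} - x_{i+1}*y_i)
      C_x = 1/(6A) * sum((x_i + x_{i+1}) * (x_i*y_{i+1} - x_{i+1}*y_i)).
    """
    x, y = boundary[:, 0], boundary[:, 1]
    x1, y1 = np.roll(x, -1), np.roll(y, -1)
    cross = x * y1 - x1 * y
    area = 0.5 * cross.sum()
    cx = ((x + x1) * cross).sum() / (6.0 * area)
    cy = ((y + y1) * cross).sum() / (6.0 * area)
    return np.array([cx, cy])

# A unit circle sampled at 100 points has its centroid at the origin.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
print(centroid_from_boundary(np.stack([np.cos(theta), np.sin(theta)], axis=1)))
```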
Authors:Zeyneddin Oz, Shreyas Korde, Marius Bock, Kristof Van Laerhoven
Abstract:
The rapid evolution of sensors and resource-efficient machine learning models has spurred the widespread adoption of wearable fitness tracking devices. Equipped with inertial sensors, such devices can continuously capture physical movements for fitness technology (FitTech), enabling applications from sports optimization to preventive healthcare. Traditional Centralized Learning approaches to detect fitness activities struggle with data privacy concerns, regulatory restrictions, and communication inefficiencies. In contrast, Federated Learning (FL) enables decentralized model training by communicating model updates rather than potentially private wearable sensor data. Applying FL to FitTech presents unique challenges, such as data imbalance, lack of labeled data, heterogeneous user activities, and trade-offs between personalization and generalization. To simplify research on FitTech in FL, we present the FedFitTech baseline, built on the Flower framework, which is publicly available and widely used by both industry and academic researchers. Additionally, to illustrate its usage, this paper presents a case study that implements a system based on the FedFitTech baseline, incorporating a client-side early stopping strategy and comparing the results. For instance, this system allows wearable devices to optimize the trade-off between capturing common fitness activities and preserving individuals' nuances, thereby enhancing both the scalability and efficiency of privacy-aware fitness tracking applications. The results show that this strategy reduces overall redundant communication by 13%, while maintaining overall recognition performance with only a negligible 1% cost. Thus, the FedFitTech baseline creates a foundation for a wide range of new research and development opportunities in FitTech, and it is available as open source at: https://github.com/shreyaskorde16/FedFitTech
中文摘要:联邦学习通过去中心化模型训练解决可穿戴健身设备的数据隐私和通信效率问题,FedFitTech基准在保持识别性能基本不变的同时,将通信开销降低了13%。
English Summary: Federated Learning addresses privacy and efficiency issues in wearable fitness tracking by enabling decentralized model training, with the FedFitTech baseline reducing communication overhead by 13% while maintaining near-identical recognition performance.
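A client-side early-stopping strategy of the kind the case study describes can be sketched against Flower's NumPyClient interface. Everything below is illustrative rather than the FedFitTech code: the training step is a stub, and the convergence criterion and metric names are assumptions.

```python
import numpy as np
import flwr as fl

class EarlyStopClient(fl.client.NumPyClient):
    """Client that stops local training once its loss plateaus."""

    def __init__(self, weights, patience=3):
        self.weights = weights            # list of np.ndarray layers
        self.best_loss = np.inf
        self.stale_rounds = 0
        self.patience = patience
        self.stopped = False

    def get_parameters(self, config):
        return self.weights

    def fit(self, parameters, config):
        self.weights = parameters
        if not self.stopped:
            loss = self.local_train()     # stubbed HAR training round
            if loss < self.best_loss - 1e-4:
                self.best_loss, self.stale_rounds = loss, 0
            else:
                self.stale_rounds += 1
            self.stopped = self.stale_rounds >= self.patience
        # A stopped client simply echoes the global weights; skipping the
        # redundant update is where the communication savings come from.
        return self.weights, 1, {"stopped": self.stopped}

    def local_train(self):
        return float(np.random.rand())    # placeholder local loss
```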
Authors:Yuchu Jiang, Jiaming Chu, Jian Zhao, Xin Zhang, Xu Yang, Lei Jin, Chi Zhang, Xuelong Li
Abstract:
The proliferation of generative models has raised serious concerns about visual content forgery. Existing deepfake detection methods primarily target either image-level classification or pixel-wise localization. While some achieve high accuracy, they often suffer from limited generalization across manipulation types or rely on complex architectures. In this paper, we propose Loupe, a lightweight yet effective framework for joint deepfake detection and localization. Loupe integrates a patch-aware classifier and a segmentation module with conditional queries, allowing simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness against distribution shifts in the test set, Loupe introduces a pseudo-label-guided test-time adaptation mechanism by leveraging patch-level predictions to supervise the segmentation head. Extensive experiments on the DDL dataset demonstrate that Loupe achieves state-of-the-art performance, securing the first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846. Our results validate the effectiveness of the proposed patch-level fusion and conditional query design in improving both classification accuracy and spatial localization under diverse forgery patterns. The code is available at https://github.com/Kamichanw/Loupe.
Chinese: Loupe是一种轻量级框架,通过融合块感知分类器和条件查询分割模块,实现了深度伪造的联合检测与定位,借助伪标签引导的测试时自适应机制,在多种伪造模式中取得最优性能。
English: Loupe is a lightweight framework that combines patch-aware classification and conditional query-based segmentation to jointly detect and localize deepfakes, achieving state-of-the-art performance through test-time adaptation and robust generalization across manipulation types.
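The pseudo-label-guided adaptation step can be sketched as follows. This is an assumed reading of the mechanism, not the Loupe code: confident patch-level forgery decisions are upsampled to pixel resolution and used as targets for the segmentation head at test time (the confidence gate and patch size are invented parameters).

```python
import torch
import torch.nn.functional as F

def tta_step(patch_logits, seg_logits, optimizer, patch=16, margin=0.3):
    """One test-time adaptation step on a single batch.

    patch_logits: (B, H//patch, W//patch) patch-level forgery logits
    seg_logits:   (B, H, W) pixel-level logits from the current forward pass
    optimizer:    assumed to be built over the segmentation head only
    """
    with torch.no_grad():
        prob = patch_logits.sigmoid()
        pseudo = (prob > 0.5).float()
        keep = ((prob - 0.5).abs() > margin).float()  # confident patches only
        # Broadcast patch decisions to pixel resolution.
        pseudo = pseudo.repeat_interleave(patch, -2).repeat_interleave(patch, -1)
        keep = keep.repeat_interleave(patch, -2).repeat_interleave(patch, -1)
    loss = F.binary_cross_entropy_with_logits(seg_logits, pseudo, weight=keep)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```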
Authors:Xiaoyu Shi, Rahul Kumar Jain, Yinhao Li, Ruibo Hou, Jingliang Cheng, Jie Bai, Guohua Zhao, Lanfen Lin, Rui Xu, Yen-wei Chen
Abstract:
Deep learning has demonstrated remarkable success in medical image segmentation and computer-aided diagnosis. In particular, numerous advanced methods have achieved state-of-the-art performance in brain tumor segmentation from MRI scans. While recent studies in other medical imaging domains have revealed that integrating textual reports with visual data can enhance segmentation accuracy, the field of brain tumor analysis lacks a comprehensive dataset that combines radiological images with corresponding textual annotations. This limitation has hindered the exploration of multimodal approaches that leverage both imaging and textual data.
To bridge this critical gap, we introduce the TextBraTS dataset, the first publicly available volume-level multimodal dataset that contains paired MRI volumes and rich textual annotations, derived from the widely adopted BraTS2020 benchmark. Building upon this dataset, we propose a baseline framework with a sequential cross-attention method for text-guided volumetric medical image segmentation. Through extensive experiments with various text-image fusion strategies and templated text formulations, our approach demonstrates significant improvements in brain tumor segmentation accuracy, offering valuable insights into effective multimodal integration techniques.
Our dataset, implementation code, and pre-trained models are publicly available at https://github.com/Jupitern52/TextBraTS.
中文摘要:TextBraTS数据集作为首个结合MRI影像与文本标注的多模态数据集填补了脑肿瘤分析领域的空白,提出的新型框架和跨注意力方法通过图文融合显著提升了分割精度。
English Summary: The TextBraTS dataset is introduced as the first multimodal dataset pairing MRI volumes with textual annotations to advance brain tumor segmentation, and a novel framework with cross-attention mechanisms demonstrates improved accuracy through effective text-image fusion.
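A sequential cross-attention fusion of this kind can be sketched in a few lines. Dimensions and ordering below are assumptions for illustration, not the paper's architecture: text tokens first attend to volumetric tokens, then the enriched text conditions the volume features.

```python
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    """Two-step cross-attention between volume and report features."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.text_to_vox = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vox_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t, self.norm_v = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, vox, text):
        # vox:  (B, N_voxels, dim) flattened 3D feature map
        # text: (B, N_tokens, dim) encoded report
        t, _ = self.text_to_vox(text, vox, vox)   # text queries the volume
        text = self.norm_t(text + t)
        v, _ = self.vox_to_text(vox, text, text)  # volume queries the text
        return self.norm_v(vox + v)
```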
Authors:Kosuke Nakanishi, Akihiro Kubo, Yuji Yasui, Shin Ishii
Abstract:
Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent's cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.
Chinese: 本文提出了一种新颖的离线策略强化学习方法,通过将对抗学习重构为软约束优化问题,利用策略评估的对称性消除了额外环境交互的需求。
English: This paper introduces a novel off-policy reinforcement learning method that reformulates adversarial learning as a soft-constrained optimization problem, eliminating the need for extra environmental interactions by leveraging symmetric policy evaluation properties.
Authors:Mengyu Wang, Tiejun Ma, Shay B. Cohen
Abstract:
Stock selection, which aims to predict stock prices and identify the most profitable ones, is a crucial task in finance. While existing methods primarily focus on developing model structures and building graphs for improved selection, pre-training strategies remain underexplored in this domain. Current stock series pre-training follows methods from other areas without adapting to the unique characteristics of financial data, particularly overlooking stock-specific contextual information and the non-stationary nature of stock prices. Consequently, the latent statistical features inherent in stock data are underutilized. In this paper, we propose three novel pre-training tasks tailored to stock data characteristics: stock code classification, stock sector classification, and moving average prediction. We develop the Stock Specialized Pre-trained Transformer (SSPT) based on a two-layer transformer architecture. Extensive experimental results validate the effectiveness of our pre-training methods and provide detailed guidance on their application. Evaluations on five stock datasets, including four markets and two time periods, demonstrate that SSPT consistently outperforms the market and existing methods in terms of both cumulative investment return ratio and Sharpe ratio. Additionally, our experiments on simulated data investigate the underlying mechanisms of our methods, providing insights into understanding price series. Our code is publicly available at: https://github.com/astudentuser/Pre-training-Time-Series-Models-with-Stock-Data-Customization.
中文摘要:本文提出了一种专门针对股票数据特性的预训练框架SSPT,通过三项定制任务更有效地利用金融数据特征,在多个市场和时间段内持续超越现有方法。
English Summary: This paper introduces a specialized pre-training framework called SSPT, which uses three tailored tasks to better capture financial data characteristics and consistently outperforms existing methods across multiple markets and time periods.
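The three pre-training tasks map naturally onto a shared encoder with separate heads. The sketch below is a schematic under assumed sizes (the paper specifies a two-layer transformer, but the vocabulary, sector count, and feature dimensions here are invented):

```python
import torch
import torch.nn as nn

class SSPTHeads(nn.Module):
    """Shared encoder with the three stock-specific pre-training heads."""

    def __init__(self, dim=64, n_codes=500, n_sectors=11):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.code_head = nn.Linear(dim, n_codes)      # which stock is this?
        self.sector_head = nn.Linear(dim, n_sectors)  # which sector?
        self.ma_head = nn.Linear(dim, 1)              # next moving average

    def forward(self, x):             # x: (B, T, dim) price-series features
        pooled = self.encoder(x).mean(dim=1)
        return (self.code_head(pooled), self.sector_head(pooled),
                self.ma_head(pooled))

# Pre-training would sum cross-entropy losses on the two classification
# heads with a regression loss on the moving-average head.
```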
Authors:Weinan Guan, Wei Wang, Bo Peng, Ziwen He, Jing Dong, Haonan Cheng
Abstract:
With the rapid development of image generation technologies, especially the advancement of Diffusion Models, the quality of synthesized images has significantly improved, raising concerns among researchers about information security. To mitigate the malicious abuse of diffusion models, diffusion-generated image detection has proven to be an effective countermeasure. However, a key challenge for forgery detection is generalising to diffusion models not seen during training. In this paper, we address this problem by focusing on image noise. We observe that images from different diffusion models share similar noise patterns, distinct from genuine images. Building upon this insight, we introduce a novel Noise-Aware Self-Attention (NASA) module that focuses on noise regions to capture anomalous patterns. To implement a SOTA detection model, we incorporate NASA into Swin Transformer, forming a novel detection architecture NASA-Swin. Additionally, we employ a cross-modality fusion embedding to combine RGB and noise images, along with a channel mask strategy to enhance feature learning from both modalities. Extensive experiments demonstrate the effectiveness of our approach in enhancing detection capabilities for diffusion-generated images. When encountering unseen generation methods, our approach achieves state-of-the-art performance. Our code is available at https://github.com/WeinanGuan/NASA-Swin.
中文摘要:本文提出一种新颖的噪声感知自注意力模块,通过聚焦扩散生成图像特有的噪声模式,结合跨模态融合策略,在检测未知生成方法的伪造图像时实现了最先进的性能。
English Summary: This paper introduces a novel Noise-Aware Self-Attention (NASA) module integrated into Swin Transformer to detect diffusion-generated images by focusing on their distinct noise patterns, achieving state-of-the-art performance even with unseen generation methods.
Authors:Fang Chen, Weifeng Zhang, Xingyu Ai, BingXuan Li, An Li, Qiegen Liu
Abstract:
Positron emission tomography (PET) is widely used to assess metabolic activity, but its application is limited by the availability of radiotracers. 18F-labeled fluorodeoxyglucose (18F-FDG) is the most commonly used tracer but shows limited effectiveness for certain tumors. In contrast, 6-18F-fluoro-3,4-dihydroxy-L-phenylalanine (18F-DOPA) offers higher specificity for neuroendocrine tumors and neurological disorders. However, the complexity of its synthesis process and constraints on transportation time have limited its clinical application. Among the different forms of raw data acquired by the scanner, the sinogram is a commonly used representation in PET imaging. Therefore, modeling in the projection domain enables more direct utilization of the original information, potentially reducing the accumulation of errors during the image reconstruction process. Inspired by these factors, this study proposes a prior-guided joint diffusion model (PJDM) for transforming 18F-FDG PET sinograms into 18F-DOPA PET sinograms. During inference, an initial synthetic 18F-DOPA PET sinogram is first generated using a higher-order hybrid sampler. This sinogram is then degraded and serves as an additional condition to guide the iterative refinement process. Experimental results demonstrated that PJDM effectively improved both sinogram quality and the final synthetic outcomes. The code is available at: https://github.com/yqx7150/PJDM.
中文摘要:本研究提出了一种先验引导联合扩散模型(PJDM),通过在投影域建模将18F-FDG PET正弦图转换为18F-DOPA PET正弦图,有效提升了合成图像质量并减少了重建误差。
English Summary: This study introduces a prior-guided joint diffusion model (PJDM) that converts 18F-FDG PET sinograms into 18F-DOPA PET sinograms using projection domain modeling to enhance synthetic accuracy and reduce reconstruction errors.
Authors:Yunhan Ren, Feng Luo, Siyu Huang
Abstract:
While existing Generalized Category Discovery (GCD) models have achieved significant success, their performance with limited labeled samples and a small number of known categories remains largely unexplored. In this work, we introduce the task of Few-shot Generalized Category Discovery (FSGCD), aiming to achieve competitive performance in GCD tasks under conditions of known information scarcity. To tackle this challenge, we propose a decision boundary enhancement framework with affinity-based retrieval. Our framework is designed to learn the decision boundaries of known categories and transfer these boundaries to unknown categories. First, we use a decision boundary pre-training module to mitigate the overfitting of pre-trained information on known category boundaries and improve the learning of these decision boundaries using labeled samples. Second, we implement a two-stage retrieval-guided decision boundary optimization strategy. Specifically, this strategy further enhances the severely limited known boundaries by using affinity-retrieved pseudo-labeled samples. Then, these refined boundaries are applied to unknown clusters via guidance from affinity-based feature retrieval. Experimental results demonstrate that our proposed method outperforms existing methods on six public GCD benchmarks under the FSGCD setting. The codes are available at: https://github.com/Ryh1218/FSGCD
中文摘要:本文提出了少样本广义类别发现(FSGCD)任务,并设计了一种基于相似性检索的决策边界增强框架,通过将已知类别的边界知识迁移至未知类别,在六个公开基准上实现了最优性能。
English Summary: This paper introduces the Few-shot Generalized Category Discovery (FSGCD) task and proposes a decision boundary enhancement framework that leverages affinity-based retrieval to transfer learned boundaries from known to unknown categories, achieving state-of-the-art performance on six benchmarks.
Authors:Chenxu Wang, Yonggang Jin, Cheng Hu, Youpeng Zhao, Zipeng Dai, Jian Zhao, Shiyu Huang, Liuyu Xiang, Junge Zhang, Zhaofeng He
Abstract:
Adapting a single agent to a new multi-agent system brings challenges, necessitating adjustments across various tasks, environments, and interactions with unknown teammates and opponents. Addressing this challenge is highly complex, and researchers have proposed two simplified scenarios, multi-agent reinforcement learning for zero-shot learning and Ad-Hoc Teamwork. Building on these foundations, we propose a more comprehensive setting, Agent Collaborative-Competitive Adaptation (ACCA), which evaluates an agent's ability to generalize across diverse scenarios, tasks, and interactions with both unfamiliar opponents and teammates. In ACCA, agents adjust to task and environmental changes, collaborate with unseen teammates, and compete against unknown opponents. We introduce a new modeling approach, Multi-Retrieval and Dynamic Generation (MRDG), that effectively models both teammates and opponents using their behavioral trajectories. This method incorporates a positional encoder for varying team sizes and a hypernetwork module to boost agents' learning and adaptive capabilities. Additionally, a viewpoint alignment module harmonizes the observational perspectives of retrieved teammates and opponents with the learning agent. Extensive tests in benchmark scenarios like SMAC, Overcooked-AI, and Melting Pot show that MRDG significantly improves robust collaboration and competition with unseen teammates and opponents, surpassing established baselines. Our code is available at: https://github.com/vcis-wangchenxu/MRDG.git
中文: 本文提出了智能体协同竞争适应(ACCA)框架,用于评估智能体在不同任务、环境及与陌生队友和对手互动中的泛化能力,并引入了多检索动态生成(MRDG)方法,该方法利用行为轨迹有效建模交互关系,在基准测试中展现出卓越性能。
English: The paper introduces Agent Collaborative-Competitive Adaptation (ACCA), a comprehensive framework for evaluating agents' generalization across tasks, environments, and interactions with unfamiliar teammates and opponents, and proposes the Multi-Retrieval and Dynamic Generation (MRDG) method, which effectively models these interactions using behavioral trajectories and demonstrates superior performance in benchmark tests.
Authors:Matthew Ebisu, Hang Yu, Reuben Aronson, Elaine Short
Abstract:
Nonverbal visual symbols and displays play an important role in communication when humans and robots work collaboratively. However, few studies have investigated how different types of non-verbal cues affect objective task performance, especially in a dynamic environment that requires real time decision-making. In this work, we designed a collaborative navigation task where the user and the robot only had partial information about the map on each end and thus the users were forced to communicate with a robot to complete the task. We conducted our study in a public space and recruited 37 participants who randomly passed by our setup. Each participant collaborated with a robot utilizing either animated anthropomorphic eyes and animated icons, or static anthropomorphic eyes and static icons. We found that participants that interacted with a robot with animated displays reported the greatest level of trust and satisfaction; that participants interpreted static icons the best; and that participants with a robot with static eyes had the highest completion success. These results suggest that while animation can foster trust with robots, human-robot communication can be optimized by the addition of familiar static icons that may be easier for users to interpret. We published our code, designed symbols, and collected results online at: https://github.com/mattufts/huamn_Cozmo_interaction.
Chinese: 在人与机器人协作中,动态显示增强信任与满意度,而静态图标提高任务完成度与可理解性,表明结合两者可实现最优沟通效果。
English: Animated displays in human-robot collaboration enhance trust and satisfaction, but static icons improve task success and interpretability, suggesting a balanced approach for optimal communication.
Authors:Manno Versluis, Yizhuo Wu, Chang Gao
Abstract:
Digital predistortion (DPD) is crucial for linearizing radio frequency (RF) power amplifiers (PAs), improving signal integrity and efficiency in wireless systems. Neural network (NN)-based DPD methods surpass traditional polynomial models but face computational challenges limiting their practical deployment. This paper introduces SparseDPD, an FPGA accelerator employing a spatially sparse phase-normalized time-delay neural network (PNTDNN), optimized through unstructured pruning to reduce computational load without accuracy loss. Implemented on a Xilinx Zynq-7Z010 FPGA, SparseDPD operates at 170 MHz, achieving exceptional linearization performance (ACPR: -59.4 dBc, EVM: -54.0 dBc, NMSE: -48.2 dB) with only 241 mW dynamic power, using 64 parameters with 74% sparsity. This work demonstrates FPGA-based acceleration, making NN-based DPD practical and efficient for real-time wireless communication applications. Code is publicly available at https://github.com/MannoVersluis/SparseDPD.
中文:本文提出SparseDPD,一种采用剪枝神经网络的FPGA加速器,能以低功耗高效线性化无线系统中的功率放大器,并保持卓越性能。
English: This paper introduces SparseDPD, an FPGA accelerator using a pruned neural network to efficiently linearize power amplifiers in wireless systems with high performance and low power consumption.
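Unstructured magnitude pruning of the kind described is directly available in PyTorch. The toy network below is a stand-in, not the paper's PNTDNN, and 74% is simply the sparsity figure quoted in the abstract:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Illustrative stand-in for the DPD network; the paper's actual
# architecture and 64-parameter budget are not reproduced here.
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 2))

# Unstructured magnitude pruning: zero out the 74% smallest weights.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.74)
        prune.remove(module, "weight")  # bake the mask into the tensor

weights = [p for p in model.parameters() if p.dim() > 1]
sparsity = sum((p == 0).sum().item() for p in weights) \
         / sum(p.numel() for p in weights)
print(f"weight sparsity: {sparsity:.2%}")
```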
Authors:Changsheng Gao, Zijie Liu, Li Li, Dong Liu, Xiaoyan Sun, Weisi Lin
Abstract:
Like image coding in visual data transmission, feature coding is essential for the distributed deployment of large models by significantly reducing transmission and storage burden. However, prior studies have mostly targeted task- or model-specific scenarios, leaving the challenge of universal feature coding across diverse large models largely unexplored. In this paper, we present the first systematic study on universal feature coding for large models. The key challenge lies in the inherently diverse and distributionally incompatible nature of features extracted from different models. For example, features from DINOv2 exhibit highly peaky, concentrated distributions, while those from Stable Diffusion 3 (SD3) are more dispersed and uniform. This distributional heterogeneity severely hampers both compression efficiency and cross-model generalization. To address this, we propose a learned peaky-to-balanced distribution transformation, which reshapes highly skewed feature distributions into a common, balanced target space. This transformation is non-uniform, data-driven, and plug-and-play, enabling effective alignment of heterogeneous distributions without modifying downstream codecs. With this alignment, a universal codec trained on the balanced target distribution can effectively generalize to features from different models and tasks. We validate our approach on three representative large models (LLaMA3, DINOv2, and SD3) across multiple tasks and modalities. Extensive experiments show that our method achieves notable improvements in both compression efficiency and cross-model generalization over task-specific baselines. All source code has been made available at https://github.com/chansongoal/DT-UFC.
Chinese: 本文提出了一种针对大模型的通用特征编码方法,通过学习的分布变换将异构特征对齐到平衡空间,从而在多种任务和模态中显著提升了压缩效率和跨模型泛化能力。
English: This paper introduces a universal feature coding method for large models that employs a learned distribution transformation to align heterogeneous features into a balanced space, enhancing compression efficiency and cross-model generalization across diverse tasks and modalities.
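The idea of reshaping a peaky feature distribution into a balanced target space can be demonstrated with a fixed quantile transform. Note that the paper learns a non-uniform mapping; scikit-learn's `QuantileTransformer` is only a stand-in here to show that such a reshaping is invertible and codec-friendly:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
peaky = rng.laplace(scale=0.1, size=(10000, 1))  # DINOv2-like concentration

qt = QuantileTransformer(output_distribution="normal", random_state=0)
balanced = qt.fit_transform(peaky)      # common target space for the codec
restored = qt.inverse_transform(balanced)

# Reconstruction is exact up to quantile resolution at the tails.
print(float(np.abs(peaky - restored).max()))
```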
Authors:Tara Akhound-Sadegh, Jungyoon Lee, Avishek Joey Bose, Valentin De Bortoli, Arnaud Doucet, Michael M. Bronstein, Dominique Beaini, Siamak Ravanbakhsh, Kirill Neklyudov, Alexander Tong
Abstract:
Sampling efficiently from a target unnormalized probability density remains a core challenge, with relevance across countless high-impact scientific applications. A promising approach towards this challenge is the design of amortized samplers that borrow key ideas, such as probability path design, from state-of-the-art generative diffusion models. However, all existing diffusion-based samplers remain unable to draw samples from distributions at the scale of even simple molecular systems. In this paper, we propose Progressive Inference-Time Annealing (PITA), a novel framework to learn diffusion-based samplers that combines two complementary interpolation techniques: I.) Annealing of the Boltzmann distribution and II.) Diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperatures, training each model at a progressively lower temperature and leveraging engineered easy access to samples of the temperature-annealed target density. In the subsequent step, PITA enables simulating the trained diffusion model to procure training samples at a lower temperature for the next diffusion model through inference-time annealing using a novel Feynman-Kac PDE combined with Sequential Monte Carlo. Empirically, PITA enables, for the first time, equilibrium sampling of N-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with dramatically lower energy function evaluations. Code available at: https://github.com/taraak/pita
中文: 本文提出渐进推理时间退火(PITA)新框架,通过结合温度退火与扩散平滑技术,首次实现了对复杂分子系统的高效平衡采样,并大幅降低了计算成本。
English: The paper introduces Progressive Inference-Time Annealing (PITA), a novel framework that combines temperature annealing and diffusion smoothing to enable efficient equilibrium sampling of complex molecular systems with significantly reduced computational cost.
Authors:Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao
Abstract:
Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released at https://github.com/AI45Lab/IS-Bench.
中文:IS-Bench作为首个评估具身智能体交互安全性的多模态基准,通过对主流视觉语言模型的广泛测试,揭示了当前模型缺乏风险意识以及安全导向推理存在的任务完成度妥协问题。
English: IS-Bench is introduced as the first multimodal benchmark for evaluating interactive safety in embodied agents, revealing current models' lack of risk awareness and the trade-offs of safety-focused reasoning through extensive testing on leading VLMs.
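The process-oriented check reduces to an ordering constraint over the executed trajectory. The rule format below is an assumption sketched from the abstract, not the benchmark's actual schema:

```python
def mitigation_in_order(trajectory, risk_rules):
    """Verify that every mitigation precedes its risk-prone step.

    trajectory: ordered list of executed action names
    risk_rules: list of (mitigation_action, risk_prone_action) pairs
    """
    position = {a: i for i, a in enumerate(trajectory)}
    for mitigation, risky in risk_rules:
        if risky in position:
            if mitigation not in position or position[mitigation] > position[risky]:
                return False  # unsafe intermediate step detected
    return True

# e.g. turning off the stove must precede leaving the kitchen
print(mitigation_in_order(
    ["pick_pot", "turn_off_stove", "leave_kitchen"],
    [("turn_off_stove", "leave_kitchen")],
))  # True
```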
Authors:Chunhou Ji, Qiumeng Li
Abstract:
GPS trajectory data reveals valuable patterns of human mobility and urban dynamics, supporting a variety of spatial applications. However, traditional methods often struggle to extract deep semantic representations and incorporate contextual map information. We propose TrajSceneLLM, a multimodal perspective for enhancing semantic understanding of GPS trajectories. The framework integrates visualized map images (encoding spatial context) and textual descriptions generated through LLM reasoning (capturing temporal sequences and movement dynamics). Separate embeddings are generated for each modality and then concatenated to produce trajectory scene embeddings with rich semantic content which are further paired with a simple MLP classifier. We validate the proposed framework on Travel Mode Identification (TMI), a critical task for analyzing travel choices and understanding mobility behavior. Our experiments show that these embeddings achieve significant performance improvement, highlighting the advantage of our LLM-driven method in capturing deep spatio-temporal dependencies and reducing reliance on handcrafted features. This semantic enhancement promises significant potential for diverse downstream applications and future research in geospatial artificial intelligence. The source code and dataset are publicly available at: https://github.com/februarysea/TrajSceneLLM.
中文摘要:TrajSceneLLM提出了一种融合地图图像与大语言模型生成文本的多模态框架,通过增强轨迹语义嵌入显著提升了出行方式识别的性能,为GPS轨迹分析提供了新的解决方案。
English Summary: TrajSceneLLM is a novel multimodal framework that enhances GPS trajectory analysis by integrating map images and LLM-generated text descriptions, achieving superior performance in travel mode identification through enriched semantic embeddings.
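The late-fusion classifier is straightforward to sketch. Embedding dimensions and class count below are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TrajSceneClassifier(nn.Module):
    """Concatenate the two modality embeddings and classify with an MLP."""

    def __init__(self, img_dim=512, txt_dim=768, n_modes=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, n_modes),
        )

    def forward(self, img_emb, txt_emb):
        # img_emb: map-image embedding; txt_emb: LLM-generated description
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))
```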
Authors:Yao Lu, Zhaiyuan Ji, Jiawei Du, Yu Shanqing, Qi Xuan, Tianyi Zhou
Abstract:
Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its actual deployment still faces two core bottlenecks: first, calling commercial APIs for large-scale annotation is prohibitively expensive; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to this field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and, based on it, design AutoAnnotator, a fully automatic annotation framework. Specifically, AutoAnnotator consists of two layers. The upper-level meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code and verify difficult samples; the lower-level task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples flagged by the meta-controller layer's secondary review as a reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving the generalization of SLMs. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority voting settings. Notably, AutoAnnotator reduces the annotation cost by 74.15% compared to directly annotating with GPT-3.5-turbo, while still improving the accuracy by 6.21%. Project page: https://github.com/Zhaiyuan-Ji/AutoAnnotator.
中文: AutoAnnotator框架通过LLM元控制器选择专用SLM并进行投票标注的双层架构,解决了LLM在细粒度标注中成本高、精度低的问题,相比GPT-3.5-turbo实现标注成本降低74.15%且准确率提升6.21%。
English: The AutoAnnotator framework addresses the high cost and low accuracy of LLMs in fine-grained annotation by using a two-layer system where an LLM meta-controller selects and codes for specialized SLMs that perform voting-based annotation, achieving a 74.15% cost reduction and 6.21% accuracy improvement over GPT-3.5-turbo.
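The lower task-specialist layer amounts to majority voting with escalation. In this sketch, `slms` is a list of callables that return labels and `llm_review` stands in for the meta-controller's secondary review; the agreement threshold is an invented parameter:

```python
from collections import Counter

def annotate(sample, slms, llm_review, agreement=0.8):
    """Vote among specialist SLMs; escalate low-agreement samples."""
    votes = Counter(slm(sample) for slm in slms)
    label, count = votes.most_common(1)[0]
    if count / len(slms) >= agreement:
        return label, "slm_vote"
    # Difficult sample: the meta-controller verifies it (and such samples
    # later seed the continual-learning fine-tuning of the SLMs).
    return llm_review(sample, votes), "llm_review"
```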
Authors:Yunhao Hou, Bochao Zou, Min Zhang, Ran Chen, Shangdong Yang, Yanmei Zhang, Junbao Zhuo, Siheng Chen, Jiansheng Chen, Huimin Ma
Abstract:
By sharing information across multiple agents, collaborative perception helps autonomous vehicles mitigate occlusions and improve overall perception accuracy. However, most previous work focuses on vehicle-to-vehicle and vehicle-to-infrastructure collaboration, paying limited attention to the aerial perspectives provided by UAVs, which uniquely offer dynamic, top-down views to alleviate occlusions and monitor large-scale interactive environments. A major reason for this is the lack of high-quality datasets for aerial-ground collaborative scenarios. To bridge this gap, we present AGC-Drive, the first large-scale real-world dataset for Aerial-Ground Cooperative 3D perception. The data collection platform consists of two vehicles, each equipped with five cameras and one LiDAR sensor, and one UAV carrying a forward-facing camera and a LiDAR sensor, enabling comprehensive multi-view and multi-agent perception. Consisting of approximately 120K LiDAR frames and 440K images, the dataset covers 14 diverse real-world driving scenarios, including urban roundabouts, highway tunnels, and on/off ramps. Notably, 19.5% of the data comprises dynamic interaction events, including vehicle cut-ins, cut-outs, and frequent lane changes. AGC-Drive contains 400 scenes, each with approximately 100 frames and fully annotated 3D bounding boxes covering 13 object categories. We provide benchmarks for two 3D perception tasks: vehicle-to-vehicle collaborative perception and vehicle-to-UAV collaborative perception. Additionally, we release an open-source toolkit, including spatiotemporal alignment verification tools, multi-agent visualization systems, and collaborative annotation utilities. The dataset and code are available at https://github.com/PercepX/AGC-Drive.
中文: 协作感知能帮助自动驾驶车辆减少遮挡并提高感知精度,但以往研究多集中于车与车及车与基础设施的协作,对无人机提供的动态鸟瞰视角关注不足,为此我们推出了首个大规模真实世界空地协同3D感知数据集AGC-Drive,包含多视角数据和基准测试。
English: Collaborative perception enhances autonomous vehicles' ability to overcome occlusions and boost accuracy, yet aerial perspectives from UAVs have been underutilized due to a lack of datasets, leading to the introduction of AGC-Drive, the first large-scale real-world aerial-ground cooperative 3D perception dataset with comprehensive multi-agent data and benchmarks.
Authors:Chao He, Hongxi Wei
Abstract:
Deep image hashing aims to enable effective large-scale image retrieval by mapping the input images into simple binary hash codes through deep neural networks. More recently, Vision Mamba with linear time complexity has attracted extensive attention from researchers by achieving outstanding performance on various computer vision tasks. Nevertheless, the suitability of Mamba for large-scale image retrieval tasks still needs to be explored. Towards this end, we propose a visual state space hashing model, called MambaHash. Concretely, we propose a backbone network with stage-wise architecture, in which a grouped Mamba operation is introduced to model local and global information by utilizing Mamba to perform multi-directional scanning along different groups of the channel dimension. Subsequently, the proposed channel interaction attention module is used to enhance information communication across channels. Finally, we meticulously design an adaptive feature enhancement module to increase feature diversity and enhance the visual representation capability of the model. We have conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and IMAGENET. The experimental results demonstrate that, compared with state-of-the-art deep hashing methods, our proposed MambaHash achieves both high efficiency and superior performance on large-scale image retrieval tasks. Source code is available at https://github.com/shuaichaochao/MambaHash.git
中文: MambaHash是一种新颖的视觉状态空间哈希模型,通过分组Mamba操作和注意力模块,在大规模图像检索任务中实现了高效且卓越的性能。
English: MambaHash is a novel visual state space hashing model that employs grouped Mamba operations and attention modules to efficiently achieve superior performance in large-scale image retrieval tasks.
Authors:Hen Kas-Sharir, Gal Sela, Erez Petrank
Abstract:
The size of collections, maps, and data structures in general, constitutes a fundamental property. An implementation of the size method is required in most programming environments. Nevertheless, in a concurrent environment, integrating a linearizable concurrent size introduces a noticeable overhead on all operations of the data structure, even when the size method is not invoked during the execution. In this work we present a study of synchronization methods in an attempt to improve the performance of the data structure. In particular, we study a handshake technique that is commonly used with concurrent garbage collection, an optimistic technique, and a lock-based technique. Evaluation against the state-of-the-art size methodology demonstrates that the overhead can be significantly reduced by selecting the appropriate synchronization approach, but there is no one-size-fits-all method. Different scenarios call for different synchronization methods, as rigorously shown in this study. Nevertheless, our findings align with general trends in concurrent computing. In scenarios characterized by low contention, optimistic and lock-based approaches work best, whereas under high contention, the most effective solutions are the handshake approach and the wait-free approach.
中文摘要:本研究探讨了多种同步方法以减少在数据结构中实现线性化并发大小带来的开销,发现没有单一方法适用于所有场景,但通过根据竞争程度选择合适方法可显著提升性能。
English Summary: This study explores various synchronization methods to reduce the overhead of implementing linearizable concurrent size in data structures, finding that no single approach fits all scenarios but performance can be significantly improved by matching the method to contention levels.
Authors:Nikola Jovanović, Ismail Labiad, Tomáš Souček, Martin Vechev, Pierre Fernandez
Abstract:
Watermarking the outputs of generative models has emerged as a promising approach for tracking their provenance. Despite significant interest in autoregressive image generation models and their potential for misuse, no prior work has attempted to watermark their outputs at the token level. In this work, we present the first such approach by adapting language model watermarking techniques to this setting. We identify a key challenge: the lack of reverse cycle-consistency (RCC), wherein re-tokenizing generated image tokens significantly alters the token sequence, effectively erasing the watermark. To address this and to make our method robust to common image transformations, neural compression, and removal attacks, we introduce (i) a custom tokenizer-detokenizer finetuning procedure that improves RCC, and (ii) a complementary watermark synchronization layer. As our experiments demonstrate, our approach enables reliable and robust watermark detection with theoretically grounded p-values.
中文: 本研究首次提出了针对自回归图像生成模型的令牌级水印方法,通过定制化分词器微调和水印同步层解决了逆向循环一致性问题,有效提升了水印对图像变换与攻击的鲁棒性。
English: This study introduces the first token-level watermarking method for autoregressive image generation models, addressing the challenge of reverse cycle-consistency and enhancing robustness against image transformations and attacks through custom tokenizer finetuning and a watermark synchronization layer.
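The language-model technique being adapted is the green-list logit bias. The sketch below shows that base mechanism applied to a stream of image-token logits; the paper's contributions on top of it (the RCC-improving tokenizer finetuning and the synchronization layer) are not shown:

```python
import torch

def greenlist_bias(logits, prev_token, vocab_size, gamma=0.5, delta=2.0):
    """Softly bias sampling toward a context-keyed "green" token subset.

    logits: (vocab_size,) next-token logits for the image tokenizer vocab.
    gamma:  fraction of the vocabulary marked green; delta: bias strength.
    """
    g = torch.Generator().manual_seed(int(prev_token))  # keyed by context
    perm = torch.randperm(vocab_size, generator=g)
    green = perm[: int(gamma * vocab_size)]
    logits = logits.clone()
    logits[green] += delta  # detection later counts green-token frequency
    return logits
```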
Authors:Zhaoyi Wang, Jemil Avers Butt, Shengyu Huang, Tomislav Medic, Andreas Wieser
Abstract:
Landslide monitoring is essential for understanding geohazards and mitigating associated risks. However, existing point cloud-based methods typically rely on either geometric or radiometric information and often yield sparse or non-3D displacement estimates. In this paper, we propose a hierarchical partition-based coarse-to-fine approach that fuses 3D point clouds and co-registered RGB images to estimate dense 3D displacement vector fields. We construct patch-level matches using both 3D geometry and 2D image features. These matches are refined via geometric consistency checks, followed by rigid transformation estimation per match. Experimental results on two real-world landslide datasets demonstrate that our method produces 3D displacement estimates with high spatial coverage (79% and 97%) and high accuracy. Deviations in displacement magnitude with respect to external measurements (total station or GNSS observations) are 0.15 m and 0.25 m on the two datasets, respectively, and only 0.07 m and 0.20 m compared to manually derived references. These values are below the average scan resolutions (0.08 m and 0.30 m). Our method outperforms the state-of-the-art method F2S3 in spatial coverage while maintaining comparable accuracy. Our approach offers a practical and adaptable solution for TLS-based landslide monitoring and is extensible to other types of point clouds and monitoring tasks. Our example data and source code are publicly available at https://github.com/zhaoyiww/fusion4landslide.
中文摘要:本文提出一种融合三维点云与RGB图像的分层由粗到精方法,实现了滑坡监测中的密集三维位移估计,在真实数据集上展现出优越的空间覆盖率和精度表现。
English Summary: This paper introduces a hierarchical coarse-to-fine method that integrates 3D point clouds and RGB images to achieve dense 3D displacement estimation for landslide monitoring, demonstrating superior spatial coverage and accuracy on real-world datasets.
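The per-match rigid transformation step is the classic Kabsch/SVD alignment, sketched below for a pair of matched point patches:

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid alignment between matched 3D point sets.

    src, dst: (N, 3) corresponding points from the two epochs.
    Returns rotation R (3x3) and translation t (3,) with dst ~= R @ src + t.
    """
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)     # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t
```

Applied per refined patch match, the recovered transforms yield the dense displacement vectors the abstract describes.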
Authors:Weeyoung Kwon, Jeahun Sung, Minkyu Jeon, Chanho Eom, Jihyong Oh
Abstract:
Neural rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved significant progress in photorealistic 3D scene reconstruction and novel view synthesis. However, most existing models assume clean and high-resolution (HR) multi-view inputs, which limits their robustness under real-world degradations such as noise, blur, low-resolution (LR), and weather-induced artifacts. To address these limitations, the emerging field of 3D Low-Level Vision (3D LLV) extends classical 2D Low-Level Vision tasks including super-resolution (SR), deblurring, weather degradation removal, restoration, and enhancement into the 3D spatial domain. This survey, referred to as R\textsuperscript{3}eVision, provides a comprehensive overview of robust rendering, restoration, and enhancement for 3D LLV by formalizing the degradation-aware rendering problem and identifying key challenges related to spatio-temporal consistency and ill-posed optimization. Recent methods that integrate LLV into neural rendering frameworks are categorized to illustrate how they enable high-fidelity 3D reconstruction under adverse conditions. Application domains such as autonomous driving, AR/VR, and robotics are also discussed, where reliable 3D perception from degraded inputs is critical. By reviewing representative methods, datasets, and evaluation protocols, this work positions 3D LLV as a fundamental direction for robust 3D content generation and scene-level reconstruction in real-world environments.
中文: 神经渲染技术虽在三维场景重建中取得进展,但难以应对真实世界中的退化问题,因此三维低层视觉领域应运而生,通过超分辨率和去模糊等任务提升三维重建的鲁棒性。
English: Neural rendering techniques like NeRF and 3DGS have advanced 3D scene reconstruction but struggle with real-world degradations, prompting the emergence of 3D Low-Level Vision to enhance robustness through restoration and super-resolution in 3D space.
Authors:Abdulvahap Mutlu, Şengül Doğan, Türker Tuncer
Abstract:
Amyotrophic Lateral Sclerosis (ALS) is a rare neurodegenerative disease, and high-quality EEG data from ALS patients are scarce. This data scarcity, coupled with severe class imbalance between ALS and healthy control recordings, poses a challenge for training reliable machine learning classifiers. In this work, we address these issues by generating synthetic EEG signals for ALS patients using a Conditional Wasserstein Generative Adversarial Network (CWGAN). We train CWGAN on a private EEG dataset (ALS vs. non-ALS) to learn the distribution of ALS EEG signals and produce realistic synthetic samples. We preprocess and normalize EEG recordings, and train a CWGAN model to generate synthetic ALS signals. The CWGAN architecture and training routine are detailed, with key hyperparameters chosen for stable training. Qualitative evaluation of generated signals shows that they closely mimic real ALS EEG patterns. The CWGAN training converged with generator and discriminator loss curves stabilizing, indicating successful learning. The synthetic EEG signals appear realistic and have potential use as augmented data for training classifiers, helping to mitigate class imbalance and improve ALS detection accuracy. We discuss how this approach can facilitate data sharing and enhance diagnostic models.
中文: 本研究采用条件Wasserstein生成对抗网络为ALS患者生成逼真的合成脑电信号,通过缓解数据稀缺和类别不平衡问题来提升ALS检测分类器的训练效果。
English: This study uses a Conditional Wasserstein Generative Adversarial Network to generate realistic synthetic EEG signals for ALS patients, addressing data scarcity and class imbalance to improve machine learning classifier training for ALS detection.
Authors:Jiang Wang, Runwu Shi, Benjamin Yen, He Kong, Kazuhiro Nakadai
Abstract:
Accurately estimating sound source positions is crucial for robot audition. However, existing sound source localization methods typically rely on a microphone array with at least two spatially preconfigured microphones. This requirement hinders the applicability of microphone-based robot audition systems and technologies. To alleviate these challenges, we propose an online sound source localization method that uses a single microphone mounted on a mobile robot in reverberant environments. Specifically, we develop a lightweight neural network model with only 43k parameters to perform real-time distance estimation by extracting temporal information from reverberant signals. The estimated distances are then processed using an extended Kalman filter to achieve online sound source localization. To the best of our knowledge, this is the first work to achieve online sound source localization using a single microphone on a moving robot. Extensive experiments demonstrate the effectiveness and merits of our approach. To benefit the broader research community, we have open-sourced our code at https://github.com/JiangWAV/single-mic-SSL.
中文: 本研究提出了一种在移动机器人上使用单个麦克风的在线声源定位方法,通过轻量级神经网络和扩展卡尔曼滤波器在混响环境中实现实时定位。
English: This study introduces an online sound source localization method using a single microphone on a mobile robot, employing a lightweight neural network and extended Kalman filter to achieve real-time performance in reverberant environments.
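The EKF stage can be sketched for a static source observed through range-only measurements. The 2D state, the noise values, and the absence of a process model are simplifying assumptions, not the paper's exact filter:

```python
import numpy as np

class RangeOnlyEKF:
    """Track a static 2D source from distance estimates at known robot poses."""

    def __init__(self, x0, P0=np.eye(2) * 4.0, r=0.1):
        self.x = np.asarray(x0, dtype=float)  # source position estimate
        self.P = P0                           # state covariance
        self.r = r                            # measurement noise variance

    def update(self, robot_pos, measured_dist):
        diff = self.x - robot_pos
        pred = np.linalg.norm(diff)
        H = (diff / pred)[None, :]            # Jacobian of ||x - p_robot||
        S = H @ self.P @ H.T + self.r         # innovation covariance
        K = self.P @ H.T / S                  # Kalman gain
        self.x = self.x + (K * (measured_dist - pred)).ravel()
        self.P = (np.eye(2) - K @ H) @ self.P
        return self.x

# ekf = RangeOnlyEKF(x0=[1.0, 1.0])
# ekf.update(robot_pos=np.array([0.0, 0.0]), measured_dist=2.2)
```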
Authors:Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu
Abstract:
Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
中文: 本研究提出了用于评估多模态大语言模型后训练方法的基准SEED-Bench-R1,并开发了GRPO-CARE一致性感知强化学习框架,该框架在无需显式监督的情况下显著提升了答案准确性和推理连贯性,实现了明显的性能改进。
English: This study introduces SEED-Bench-R1, a benchmark for evaluating multimodal large language models' post-training methods, and proposes GRPO-CARE, a consistency-aware reinforcement learning framework that enhances both answer accuracy and reasoning coherence, achieving significant performance gains without explicit supervision.
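The two-tiered reward can be sketched per sampled group. The binary group-mean comparison below is a simplification of the adaptive bonus described in the abstract, and `alpha` is an invented weighting:

```python
import numpy as np

def care_rewards(correct, logp_ref, alpha=0.5):
    """Base correctness reward plus a group-relative consistency bonus.

    correct:  per-sample 0/1 flags for final-answer correctness
    logp_ref: per-sample log p_ref(answer | reasoning) under the
              slowly-evolving reference model
    """
    correct = np.asarray(correct, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    # Bonus only for correct samples whose reasoning "explains" the answer
    # better than the group average does.
    bonus = (logp_ref > logp_ref.mean()).astype(float) * correct
    return correct + alpha * bonus
```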
Authors:Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang
Abstract:
Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Codes and results are available at https://github.com/Carol-gutianle/LatentSafety.
中文摘要:现有AI模型的安全对齐方法过于浅层,易受潜在偏移影响而引发不安全响应,但提出的分层对抗补丁训练(LAPT)通过针对性优化内部表征,有效提升了安全鲁棒性。
English Summary: Current safety alignment methods for AI models are shallow, making them vulnerable to latent shifts that can trigger unsafe responses, but the proposed Layer-wise Adversarial Patch Training (LAPT) enhances robustness by addressing these internal weaknesses.
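The NLL probe lends itself to a compact illustration: perturb hidden activations along a candidate direction and measure how much the model's negative log-likelihood of its own original response rises. The toy embedding/head model below stands in for a real LLM layer; this is a sketch of the probing idea, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, D, T = 100, 32, 8                      # toy vocab, hidden size, length
embed, lm_head = torch.nn.Embedding(V, D), torch.nn.Linear(D, V)
response = torch.randint(0, V, (T,))      # the model's "original response"

def response_nll(shift):
    """NLL of the fixed original response under a latent shift."""
    h = embed(response) + shift           # perturb hidden activations
    logits = lm_head(h[:-1])              # next-token prediction per position
    return F.cross_entropy(logits, response[1:]).item()

base = response_nll(torch.zeros(D))
# probe random unit directions; directions with steep NLL increases mark
# locally sensitive (vulnerable) regions an ASA could steer along
for _ in range(3):
    d = F.normalize(torch.randn(D), dim=0)
    print(round(response_nll(0.5 * d) - base, 4))
```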
Authors:Byung Hoon Lee, Wooseok Shin, Sung Won Han
Abstract:
The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems. The code is available at our GitHub repository (https://github.com/Leebh-kor/TD3Net).
中文摘要:提出的TD3Net后端架构通过结合密集跳跃连接与多扩张卷积来消除感受野盲区,在词级唇读任务中以更少参数实现更高准确率,有效提升了时序特征建模能力。
English Summary: The proposed TD3Net backend architecture enhances word-level lipreading by combining dense skip connections with multi-dilated convolutions to eliminate receptive field blind spots, achieving superior accuracy with improved efficiency compared to existing temporal convolutional networks.
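The core backend idea, dense skip connections whose incoming features are convolved with different dilation factors so the combined receptive field has no blind spots, can be sketched in a few lines of PyTorch; the layer count, channel width, and power-of-two dilation choice below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiDilatedDenseBlock(nn.Module):
    """Toy block: every earlier feature map is skip-connected forward and
    convolved with its own dilation factor before being summed."""

    def __init__(self, channels=64, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleList(
                nn.Conv1d(channels, channels, kernel_size=3,
                          padding=2 ** j, dilation=2 ** j)  # keeps length T
                for j in range(i + 1))                      # i+1 skip sources
            for i in range(num_layers))

    def forward(self, x):                  # x: (B, C, T)
        feats = [x]
        for convs in self.layers:
            out = sum(conv(f) for conv, f in zip(convs, feats))
            feats.append(torch.relu(out))
        return feats[-1]

print(MultiDilatedDenseBlock()(torch.randn(2, 64, 25)).shape)
```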
Authors:Liangjing Shao, Linxin Bai, Chenkang Du, Xinrong Chen
Abstract:
Monocular depth estimation and ego-motion estimation are significant tasks for scene perception and navigation in stable, accurate and efficient robot-assisted endoscopy. To tackle lighting variations and sparse textures in endoscopic scenes, multiple techniques including optical flow, appearance flow and intrinsic image decomposition have been introduced into existing methods. However, an effective training strategy for the multiple modules remains critical for handling both illumination issues and information interference in self-supervised depth estimation in endoscopy. Therefore, a novel framework with multistep efficient finetuning is proposed in this work. In each epoch of end-to-end training, the process is divided into three steps, including optical flow registration, multiscale image decomposition and multiple transformation alignments. At each step, only the related networks are trained, without interference from irrelevant information. Based on parameter-efficient finetuning of the foundation model, the proposed method achieves state-of-the-art performance on self-supervised depth estimation on the SCARED dataset and zero-shot depth estimation on the Hamlyn dataset, with 4%–10% lower error. The evaluation code of this work has been published at https://github.com/BaymaxShao/EndoMUST.
中文: 本研究提出了一种新颖的多步微调框架,用于内窥镜中的自监督深度估计,通过分步训练相关网络有效应对光照变化和纹理稀疏问题,在基准数据集上实现了最先进的性能并显著降低了误差。
English: This study introduces a novel multistep fine-tuning framework for self-supervised depth estimation in endoscopy, which effectively addresses lighting variations and sparse textures by training related networks separately in each step, achieving state-of-the-art performance with significantly reduced errors on benchmark datasets.
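The three-step schedule amounts to toggling which sub-networks receive gradients within each epoch. A toy version with stand-in modules and a placeholder loss (the real framework's networks and objectives are more involved):

```python
import itertools
import torch

flow_net, decomp_net, depth_net = (torch.nn.Linear(8, 8) for _ in range(3))
all_nets = [flow_net, decomp_net, depth_net]
steps = [("optical_flow_registration", [flow_net]),
         ("multiscale_image_decomposition", [decomp_net]),
         ("multiple_transformation_alignments", [depth_net])]
opt = torch.optim.Adam(itertools.chain(*(n.parameters() for n in all_nets)), lr=1e-4)

for name, active in steps:               # one epoch = three isolated steps
    for n in all_nets:
        n.requires_grad_(n in active)    # only the related networks train
    x = torch.randn(4, 8)
    loss = sum(n(x).pow(2).mean() for n in active)   # placeholder loss
    opt.zero_grad(); loss.backward(); opt.step()
    print(name, float(loss))
```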
Authors:Boyu Li, Siyuan He, Hang Xu, Haoqi Yuan, Yu Zang, Liwei Hu, Junpeng Yue, Zhenxiong Jiang, Pengbo Hu, Börje F. Karlsson, Yehui Tang, Zongqing Lu
Abstract:
Developing embodied agents capable of performing complex interactive tasks in real-world scenarios remains a fundamental challenge in embodied AI. Although recent advances in simulation platforms have greatly enhanced task diversity to train embodied Vision Language Models (VLMs), most platforms rely on simplified robot morphologies and bypass the stochastic nature of low-level execution, which limits their transferability to real-world robots. To address these issues, we present a physics-based simulation platform DualTHOR for complex dual-arm humanoid robots, built upon an extended version of AI2-THOR. Our simulator includes real-world robot assets, a task suite for dual-arm collaboration, and inverse kinematics solvers for humanoid robots. We also introduce a contingency mechanism that incorporates potential failures through physics-based low-level execution, bridging the gap to real-world scenarios. Our simulator enables a more comprehensive evaluation of the robustness and generalization of VLMs in household environments. Extensive evaluations reveal that current VLMs struggle with dual-arm coordination and exhibit limited robustness in realistic environments with contingencies, highlighting the importance of using our simulator to develop more capable VLMs for embodied tasks. The code is available at https://github.com/ds199895/DualTHOR.git.
中文摘要:DualTHOR仿真平台通过引入真实双臂人形机器人模型、基于物理的低级执行容错机制及双臂协作任务套件,有效解决了现有具身智能训练中机器人形态简化与随机执行缺失的问题,为开发更具鲁棒性的视觉语言模型提供了关键支持。
English Summary: The DualTHOR simulation platform is introduced to address limitations in current embodied AI training by incorporating realistic dual-arm humanoid robots, physics-based execution with failure contingencies, and comprehensive task evaluation for improved real-world transferability.
Authors:Qianru Zhang, Honggang Wen, Ming Li, Dong Huang, Siu-Ming Yiu, Christian S. Jensen, Pietro Liò
Abstract:
Time series forecasting requires architectures that simultaneously achieve three competing objectives: (1) strict temporal causality for reliable predictions, (2) sub-quadratic complexity for practical scalability, and (3) multi-scale pattern recognition for accurate long-horizon forecasting. We introduce AutoHFormer, a hierarchical autoregressive transformer that addresses these challenges through three key innovations: 1) Hierarchical Temporal Modeling: Our architecture decomposes predictions into segment-level blocks processed in parallel, followed by intra-segment sequential refinement. This dual-scale approach maintains temporal coherence while enabling efficient computation. 2) Dynamic Windowed Attention: The attention mechanism employs learnable causal windows with exponential decay, reducing complexity while preserving precise temporal relationships. This design avoids both the anti-causal violations of standard transformers and the sequential bottlenecks of RNN hybrids. 3) Adaptive Temporal Encoding: a novel position encoding system is adopted to capture time patterns at multiple scales. It combines fixed oscillating patterns for short-term variations with learnable decay rates for long-term trends. Comprehensive experiments demonstrate that AutoHFormer achieves 10.76x faster training and a 6.06x memory reduction compared to PatchTST on PEMS08, while maintaining consistent accuracy across 96-720 step horizons in most cases. These breakthroughs establish new benchmarks for efficient and precise time series modeling. Implementations of our method and all baselines in the hierarchical autoregressive mechanism are available at https://github.com/lizzyhku/Autotime.
中文摘要:AutoHFormer是一种分层自回归变换器,通过多尺度时序建模、动态窗口注意力和自适应编码技术,在保持严格因果性的同时实现了高效准确的时间序列预测。
English Summary: AutoHFormer is a hierarchical autoregressive transformer that achieves efficient and accurate time series forecasting through multi-scale temporal modeling, dynamic windowed attention, and adaptive encoding while maintaining strict causality.
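The dynamic windowed attention component can be sketched directly: a causal mask combined with a learnable exponential decay over the query-key distance, which softly windows attention toward recent steps. A single-head toy version (the paper's segment-level parallelism and adaptive encoding are omitted):

```python
import torch
import torch.nn.functional as F

def decayed_causal_attention(q, k, v, decay):
    """q, k, v: (B, T, D); decay: positive learnable scalar."""
    B, T, D = q.shape
    scores = q @ k.transpose(-2, -1) / D ** 0.5            # (B, T, T)
    idx = torch.arange(T)
    dist = (idx[:, None] - idx[None, :]).float()           # query - key index
    bias = torch.where(dist >= 0, -decay * dist,           # exponential decay
                       torch.full_like(dist, float("-inf")))  # causal mask
    return F.softmax(scores + bias, dim=-1) @ v

q = k = v = torch.randn(1, 6, 16)
decay = torch.nn.Parameter(torch.tensor(0.1))
print(decayed_causal_attention(q, k, v, decay).shape)      # (1, 6, 16)
```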
Authors:Jianzhu Huai, Yuxin Shao, Yujia Zhang, Alper Yilmaz
Abstract:
The rapid advancement of the metaverse, digital twins, and robotics underscores the demand for low-cost, portable mapping systems for reality capture. Current mobile solutions, such as the Leica BLK2Go and lidar-equipped smartphones, either come at a high cost or are limited in range and accuracy. Leveraging the proliferation and technological evolution of mobile devices alongside recent advancements in lidar technology, we introduce a novel, low-cost, portable mobile mapping system. Our system integrates a lidar unit, an Android smartphone, and an RTK-GNSS stick. Running on the Android platform, it features lidar-inertial odometry built with the NDK, and logs data from the lidar, wide-angle camera, IMU, and GNSS. With a total bill of materials (BOM) cost under 2,000 USD and a weight of about 1 kilogram, the system achieves a good balance between affordability and portability. We detail the system design, multisensor calibration, synchronization, and evaluate its performance for tracking and mapping. To further contribute to the community, the system's design and software are made open source at: https://github.com/OSUPCVLab/marslogger_android/releases/tag/v2.1
English Summary: A novel, low-cost portable mobile mapping system integrating lidar, smartphone, and RTK-GNSS technology has been developed for under $2,000, offering an open-source design and software to advance affordable reality capture solutions.
Authors:Markus Frohmann, Gabriel Meseguer-Brocal, Markus Schedl, Elena V. Epure
Abstract:
The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
中文: 提出的DE-detect方法采用多模态方案,结合从音频中提取的转录歌词和语音特征,能有效识别AI生成的音乐,对音频干扰具有更强鲁棒性,在实际应用中优于现有检测器。
English: The proposed DE-detect method uses a multimodal approach combining transcribed sung lyrics and speech features from audio to effectively identify AI-generated music, offering greater robustness against audio perturbations and outperforming existing detectors in real-world applications.
Authors:Cong Wang, Zexuan Deng, Zhiwei Jiang, Fei Shen, Yafeng Yin, Shiwei Gan, Zifeng Cheng, Shiping Ge, Qing Gu
Abstract:
Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on a single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by the FSQ Autoencoder, which are subsequently injected into the Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.
中文: SignViP提出了一种新颖的手语视频生成框架,通过离散化标记整合多种细粒度条件,在视频质量和语义保真度方面实现了最先进的性能。
English: SignViP introduces a novel framework for sign language video generation by integrating multiple fine-grained conditions through discrete tokenization, achieving state-of-the-art performance in video quality and semantic fidelity.
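Finite Scalar Quantization itself is compact enough to sketch: bound each latent dimension, round it to a few levels, and pass gradients straight through. Odd level counts keep the rounding grid symmetric; the level choice below is an illustrative assumption, not SignViP's configuration:

```python
import torch

def fsq(z, levels=(7, 7, 7)):
    """Quantize each latent dim to a small set of integers (codebook-free)."""
    half = (torch.tensor(levels, dtype=z.dtype) - 1) / 2
    bounded = torch.tanh(z) * half          # per-dim range (-half, half)
    quantized = torch.round(bounded)        # one of `levels` integer values
    # straight-through estimator: round in forward, identity in backward
    return bounded + (quantized - bounded).detach()

z = torch.randn(4, 3, requires_grad=True)   # 3-dim latents -> 7^3 codes
tokens = fsq(z)
tokens.sum().backward()                     # gradients flow through tanh
print(tokens)
```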
Authors:Vinicius Yuiti Fukase, Heitor Gama, Barbara Bueno, Lucas Libanio, Anna Helena Reali Costa, Artur Jordao
Abstract:
Critical Learning Periods constitute an important phenomenon in deep learning, where early epochs play a decisive role in the success of many training recipes, such as data augmentation. Existing works confirm the existence of this phenomenon and provide useful insights. However, the literature lacks efforts to precisely identify when critical periods occur. In this work, we fill this gap by introducing a systematic approach for identifying critical periods during the training of deep neural networks, focusing on eliminating computationally intensive regularization techniques and effectively applying mechanisms for reducing computational costs, such as data pruning. Our method leverages generalization prediction mechanisms to pinpoint critical phases where training recipes yield maximum benefits to the predictive ability of models. By halting resource-intensive recipes beyond these periods, we significantly accelerate the learning phase and achieve reductions in training time, energy consumption, and CO₂ emissions. Experiments on standard architectures and benchmarks confirm the effectiveness of our method. Specifically, we achieve significant milestones by reducing the training time of popular architectures by up to 59.67%, leading to a 59.47% decrease in CO₂ emissions and a 60% reduction in financial costs, without compromising performance. Our work enhances understanding of training dynamics and paves the way for more sustainable and efficient deep learning practices, particularly in resource-constrained environments. In the era of the race for foundation models, we believe our method emerges as a valuable framework. The repository is available at https://github.com/baunilhamarga/critical-periods
中文: 本研究提出了一种系统性方法,用于识别深度神经网络训练中的关键学习期,从而可在这些阶段后停止资源密集型技术,实现训练时间减少高达59.67%、二氧化碳排放降低59.47%、成本下降60%,且不影响模型性能。
English: This study introduces a systematic method to identify critical learning periods in deep neural network training, enabling the cessation of resource-intensive techniques after these phases to reduce training time by up to 59.67%, cut CO₂ emissions by 59.47%, and lower costs by 60% without performance loss.
Authors:Kowndinya Boyalakuntla, Abdeslam Boularias, Jingjin Yu
Abstract:
We present Kalman-filter Assisted Reinforcement Learner (KARL) for dynamic object tracking and grasping over eye-on-hand (EoH) systems, significantly expanding such systems' capabilities in challenging, realistic environments. In comparison to the previous state-of-the-art, KARL (1) incorporates a novel six-stage RL curriculum that doubles the system's motion range, thereby greatly enhancing the system's grasping performance, (2) integrates a robust Kalman filter layer between the perception and reinforcement learning (RL) control modules, enabling the system to maintain an uncertain but continuous 6D pose estimate even when the target object temporarily exits the camera's field-of-view or undergoes rapid, unpredictable motion, and (3) introduces mechanisms to allow retries to gracefully recover from unavoidable policy execution failures. Extensive evaluations conducted in both simulation and real-world experiments qualitatively and quantitatively corroborate KARL's advantage over earlier systems, achieving higher grasp success rates and faster robot execution speed. Source code and supplementary materials for KARL will be made available at: https://github.com/arc-l/karl.
Chinese: KARL系统通过结合六阶段强化学习课程和卡尔曼滤波器,增强了手眼系统的动态目标跟踪与抓取能力,在复杂环境中提升了运动范围和鲁棒性。
English: The KARL system enhances dynamic object tracking and grasping in eye-on-hand setups by integrating a six-stage RL curriculum and a Kalman filter, improving motion range and robustness in challenging environments.
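The Kalman filter layer's role, keeping a continuous state estimate alive through dropped detections, is easy to see in one dimension; a constant-velocity model like the one below extends per-axis to the full 6D pose. A numpy sketch under assumed noise settings:

```python
import numpy as np

dt = 0.05
F = np.array([[1, dt], [0, 1]])            # constant-velocity transition
H = np.array([[1.0, 0.0]])                 # we only observe position
Q, R = 1e-3 * np.eye(2), np.array([[1e-2]])
x, P = np.zeros(2), np.eye(2)

def step(z=None):
    """Predict every tick; update only when a detection arrives, so the
    estimate persists while the object leaves the camera's field of view."""
    global x, P
    x, P = F @ x, F @ P @ F.T + Q          # predict
    if z is not None:                      # update
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.atleast_1d(z) - H @ x)
        P = (np.eye(2) - K @ H) @ P
    return x[0]

for t, z in enumerate([0.0, 0.06, None, None, 0.25]):   # None = occluded
    print(t, round(step(z), 4))
```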
Authors:Zhongchen Zhao, Chaodong Xiao, Hui Lin, Qi Xie, Lei Zhang, Deyu Meng
Abstract:
Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
Chinese: 本文提出的Polyline Path Masked Attention (PPMA)通过将Mamba2的结构化掩码与创新的二维折线路径扫描策略相结合,增强了视觉Transformer的空间邻接建模能力,在图像分类、目标检测和分割任务中取得了最先进的性能。
English: This paper introduces Polyline Path Masked Attention (PPMA), which enhances Vision Transformers by integrating Mamba2's structured mask with a novel 2D polyline path scanning strategy to better model spatial adjacency, achieving state-of-the-art results in image classification, object detection, and segmentation tasks.
Authors:Hasan Balci, Augustin Luna
Abstract:
Visual analysis of relational data is essential for many real-world analytics tasks, with layout quality being key to interpretability. However, existing layout algorithms often require users to navigate complex parameters to express their intent. We present a user-guided force-directed layout approach that enables intuitive control through freehand sketching. Our method uses classical image analysis techniques to extract structural information from sketches, which is then used to generate positional constraints that guide the layout process. We evaluate the approach on various real and synthetic graphs ranging from small to medium scale, demonstrating its ability to produce layouts aligned with user expectations. An implementation of our method along with documentation and a demo page is freely available on GitHub at https://github.com/sciluna/uggly.
Chinese: 该研究提出了一种用户引导的力导向布局方法,通过手绘草图实现直观控制,利用图像分析提取结构信息并生成位置约束,经多种图验证有效,且已在GitHub上开源。
English: The study introduces a user-guided force-directed layout method that allows intuitive control via freehand sketching, using image analysis to extract structural cues and generate positional constraints, which is validated on various graphs and made available on GitHub.
Authors:Fatmah AlHindaassi, Mohammed Talha Alam, Fakhri Karray
Abstract:
Adverse weather conditions, particularly fog, pose a significant challenge to autonomous vehicles, surveillance systems, and other safety-critical applications by severely degrading visual information. We introduce ADAM-Dehaze, an adaptive, density-aware dehazing framework that jointly optimizes image restoration and object detection under varying fog intensities. A lightweight Haze Density Estimation Network (HDEN) classifies each input as light, medium, or heavy fog. Based on this score, the system dynamically routes the image through one of three CORUN branches: Light, Medium, or Complex, each tailored to its haze regime. A novel adaptive loss balances physical-model coherence and perceptual fidelity, ensuring both accurate defogging and preservation of fine details. On Cityscapes and the real-world RTTS benchmark, ADAM-Dehaze improves PSNR by up to 2.1 dB, reduces FADE by 30 percent, and increases object detection mAP by up to 13 points, while cutting inference time by 20 percent. These results highlight the importance of intensity-specific processing and seamless integration with downstream vision tasks. Code available at: https://github.com/talha-alam/ADAM-Dehaze.
中文: ADAM-Dehaze是一种自适应去雾框架,能根据雾浓度动态处理图像,在提升图像复原和物体检测效果的同时显著改善清晰度与检测精度,并减少处理时间。
English: ADAM-Dehaze is an adaptive dehazing framework that dynamically processes images based on fog density to enhance both image restoration and object detection, achieving significant performance improvements in clarity and detection accuracy while reducing processing time.
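The density-aware routing is the framework's simplest moving part: a lightweight classifier picks one of three restoration branches per image. A stand-in sketch (the real HDEN and CORUN branches are full networks):

```python
import torch
import torch.nn as nn

hden = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                     nn.Linear(3 * 8 * 8, 3))        # light / medium / heavy
branches = nn.ModuleList(nn.Conv2d(3, 3, 3, padding=1) for _ in range(3))

def dehaze(img):                                     # img: (B, 3, H, W)
    labels = hden(img).argmax(dim=1)                 # per-image density class
    out = torch.empty_like(img)
    for b, cls in enumerate(labels):
        out[b] = branches[int(cls)](img[b:b + 1])[0] # route to matching branch
    return out

print(dehaze(torch.randn(2, 3, 64, 64)).shape)
```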
Authors:Zhe Wang, Yuhua Ru, Aladine Chetouani, Tina Shiang, Fang Chen, Fabian Bauer, Liping Zhang, Didier Hans, Rachid Jennane, William Ewing Palmer, Mohamed Jarraya, Yung Hsin Chen
Abstract:
Automated grading of Knee Osteoarthritis (KOA) from radiographs is challenged by significant inter-observer variability and the limited robustness of deep learning models, particularly near critical decision boundaries. To address these limitations, this paper proposes a novel framework, Diffusion-based Counterfactual Augmentation (DCA), which enhances model robustness and interpretability by generating targeted counterfactual examples. The method navigates the latent space of a diffusion model using a Stochastic Differential Equation (SDE), governed by balancing a classifier-informed boundary drive with a manifold constraint. The resulting counterfactuals are then used within a self-corrective learning strategy to improve the classifier by focusing on its specific areas of uncertainty. Extensive experiments on the public Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis Study (MOST) datasets demonstrate that this approach significantly improves classification accuracy across multiple model architectures. Furthermore, the method provides interpretability by visualizing minimal pathological changes and revealing that the learned latent space topology aligns with clinical knowledge of KOA progression. The DCA framework effectively converts model uncertainty into a robust training signal, offering a promising pathway to developing more accurate and trustworthy automated diagnostic systems. Our code is available at https://github.com/ZWang78/DCA.
Chinese: 本文提出了一种基于扩散的反事实增强框架,通过生成针对性反事实样本来提高膝关节骨关节炎分级的模型鲁棒性和可解释性,在临床数据集上显著提升了分类准确性。
English: This paper introduces a Diffusion-based Counterfactual Augmentation (DCA) framework that enhances knee osteoarthritis grading by generating targeted counterfactual examples to improve model robustness and interpretability, significantly boosting classification accuracy on clinical datasets.
Authors:Fangzhou Lin, Zilin Dai, Rigved Sanku, Songlin Hou, Kazunori D Yamada, Haichong K. Zhang, Ziming Zhang
Abstract:
The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that only takes partial point clouds as input, view-free. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at https://github.com/Zhang-VISLab.
Chinese: 本研究提出了一种无需视角图像的单视角图像引导点云补全基线方法,仅依赖部分点云输入,通过分层自融合机制超越现有技术,并对图像指导的必要性提出质疑。
English: This study introduces a view-free baseline for single-view image guided point cloud completion that relies solely on partial point clouds, using a hierarchical self-fusion mechanism to outperform existing methods and questioning the necessity of image guidance.
Authors:Wangzhi Zhan, Jianpeng Chen, Dongqi Fu, Dawei Zhou
Abstract:
Metamaterials are artificial materials designed to exhibit properties not found in nature, such as ultra-stiffness and negative material indices. In mechanical metamaterial design, three key modalities are typically involved, i.e., 3D topology, density condition, and mechanical property. Real-world complex application scenarios place demanding requirements on machine learning models to consider all three modalities together. However, a comprehensive literature review indicates that most existing works only consider two modalities, e.g., predicting mechanical properties given the 3D topology or generating 3D topology given the required properties. Therefore, a significant gap remains for state-of-the-art machine learning models in capturing all three. Hence, we propose a unified model named UNIMATE, which consists of a modality alignment module and a synergetic diffusion generation module. Experiments indicate that UNIMATE outperforms the other baseline models in the topology generation, property prediction, and condition confirmation tasks by up to 80.2%, 5.1%, and 50.2%, respectively. We opensource our proposed UNIMATE model and corresponding results at https://github.com/wzhan24/UniMate.
Chinese: UNIMATE作为一种统一的机器学习模型,通过同时处理3D拓扑、密度条件和力学性能,填补了机械超材料设计领域的空白,在生成和预测任务中性能超越基线模型高达80.2%。
English: UNIMATE is a unified machine learning model that addresses the gap in mechanical metamaterial design by simultaneously handling 3D topology, density condition, and mechanical property, outperforming baseline models by up to 80.2% in generation and prediction tasks.
Authors:Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, Biqing Qi
Abstract:
Heterogeneous Large Language Model (LLM) fusion integrates the strengths of multiple source LLMs with different architectures into a target LLM with low computational overhead. While promising, existing methods suffer from two major limitations: 1) reliance on real data from a limited domain for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, failing to dynamically adjust according to the target LLM's varying capabilities across domains, leading to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM's performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during the target LLM's updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM's capabilities. Our code is available at https://github.com/gjq100/Bohdi.git.
中文: Bohdi是一个仅使用合成数据的异构大语言模型融合框架,通过自动领域探索和自适应数据采样克服了现有方法的局限,显著提升了目标模型的性能和数据效率,同时消除了能力不平衡问题。
English: Bohdi is a synthetic-data-only heterogeneous LLM fusion framework that overcomes limitations of existing methods by enabling automatic domain exploration and adaptive data sampling, significantly enhancing the target LLM's performance and data efficiency while eliminating capability imbalance.
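The adaptive allocation can be pictured as a bandit over knowledge domains: sample more from domains where the target model's measured improvement is still high. A UCB-flavored sketch with mock feedback (DynaBranches additionally expands the domain tree and uses SWBLRT, omitted here):

```python
import math, random

class DomainBandit:
    def __init__(self, domains):
        self.n = {d: 1 for d in domains}       # sample counts
        self.gain = {d: 0.0 for d in domains}  # mean observed improvement

    def pick(self):
        total = sum(self.n.values())
        return max(self.n, key=lambda d: self.gain[d]
                   + math.sqrt(2 * math.log(total) / self.n[d]))

    def update(self, d, improvement):
        self.n[d] += 1
        self.gain[d] += (improvement - self.gain[d]) / self.n[d]

bandit = DomainBandit(["math", "code", "dialogue"])
for step in range(6):
    d = bandit.pick()                          # domain to synthesize data for
    feedback = random.random() * (0.9 if d == "math" else 0.3)  # mock signal
    bandit.update(d, feedback)
    print(step, d, round(bandit.gain[d], 3))
```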
Authors:Hanyu Pei, Jing-Xiao Liao, Qibin Zhao, Ting Gao, Shijun Zhang, Xiaoge Zhang, Feng-Lei Fan
Abstract:
Drawing inspiration from the human brain, which employs different neurons for different tasks, recent advances in deep learning have explored modifying a network's neurons to develop so-called task-driven neurons. Prototyping task-driven neurons (referred to as NeuronSeek) employs symbolic regression (SR) to discover the optimal neuron formulation and construct a network from these optimized neurons. Along this direction, this work replaces symbolic regression with tensor decomposition (TD) to discover optimal neuronal formulations, offering enhanced stability and faster convergence. Furthermore, we establish theoretical guarantees that modifying the aggregation functions with common activation functions can empower a network with a fixed number of parameters to approximate any continuous function with an arbitrarily small error, providing a rigorous mathematical foundation for the NeuronSeek framework. Extensive empirical evaluations demonstrate that our NeuronSeek-TD framework not only achieves superior stability, but is also competitive with state-of-the-art models across diverse benchmarks. The code is available at https://github.com/HanyuPei22/NeuronSeek.
中文摘要:本研究通过用张量分解替代符号回归来优化神经元结构,增强了NeuronSeek框架的稳定性和收敛速度,在多个基准测试中展现出竞争优势,同时为通用近似能力提供了理论保证。
English Summary: This study enhances the NeuronSeek framework by replacing symbolic regression with tensor decomposition to optimize neuron formulations, achieving improved stability, faster convergence, and competitive performance across various benchmarks while providing theoretical guarantees for universal approximation.
Authors:Guoqing Chao, Zhenghao Zhang, Lei Meng, Jie Wen, Dianhui Chu
Abstract:
Federated multi-view clustering has been proposed to mine the valuable information within multi-view data distributed across different devices and has achieved impressive results while preserving privacy. Despite great progress, most federated multi-view clustering methods only use global pseudo-labels to guide the downstream clustering process and fail to exploit global information when extracting features. In addition, the missing data problem in the federated multi-view clustering task is less explored. To address these problems, we propose a novel Federated Incomplete Multi-view Clustering method with globally Fused Graph guidance (FIMCFG). Specifically, we designed a dual-head graph convolutional encoder at each client to extract two kinds of underlying features containing global and view-specific information. Subsequently, under the guidance of the fused graph, the two underlying features are fused into high-level features, based on which clustering is conducted under the supervision of pseudo-labeling. Finally, the high-level features are uploaded to the server to refine the graph fusion and pseudo-labeling computation. Extensive experimental results demonstrate the effectiveness and superiority of FIMCFG. Our code is publicly available at https://github.com/PaddiHunter/FIMCFG.
中文:FIMCFG方法通过双头图卷积编码器提取全局与视图特定特征,在融合图指导下整合高层特征并进行伪标签监督聚类,有效解决了联邦不完全多视图聚类中的全局信息利用不足和数据缺失问题,实验证明了其优越性。
English: Federated Incomplete Multi-view Clustering with globally Fused Graph guidance (FIMCFG) addresses limitations in existing methods by integrating global information during feature extraction and tackling missing data through a dual-head graph encoder and fused graph supervision, demonstrating superior performance in experiments.
Authors:Haolin Pan, Hongyu Lin, Haoran Luo, Yang Liu, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu
Abstract:
Compiler auto-tuning optimizes pass sequences to improve performance metrics such as Intermediate Representation (IR) instruction count. Although recent advances leveraging Large Language Models (LLMs) have shown promise in automating compiler tuning, two significant challenges still remain: the absence of high-quality reasoning datasets for agents training, and limited effective interactions with the compilation environment. In this work, we introduce Compiler-R1, the first reinforcement learning (RL)-driven framework specifically augmenting LLM capabilities for compiler auto-tuning. Compiler-R1 features a curated, high-quality reasoning dataset and a novel two-stage end-to-end RL training pipeline, enabling efficient environment exploration and learning through an outcome-based reward. Extensive experiments across seven datasets demonstrate Compiler-R1 achieving an average 8.46% IR instruction count reduction compared to opt -Oz, showcasing the strong potential of RL-trained LLMs for compiler optimization. Our code and datasets are publicly available at https://github.com/Panhaolin2001/Compiler-R1.
Chinese: 本文提出Compiler-R1,一个通过强化学习增强大语言模型进行编译器自动调优的框架,它提供高质量推理数据集和两阶段训练流程,在七个数据集上平均减少8.46%的中间表示指令数。
English: This paper introduces Compiler-R1, a reinforcement learning framework that enhances LLMs for compiler auto-tuning by providing a high-quality reasoning dataset and a two-stage training pipeline, achieving an average 8.46% reduction in IR instruction count across seven datasets.
Authors:Xinxing Ren, Qianbo Zang, Zekun Guo
Abstract:
Recent advances in large language models (LLMs) have shown impressive performance in mathematical reasoning and code generation. However, LLMs still struggle in the simulation domain, particularly in generating Simulink models, which are essential tools in engineering and scientific research. Our preliminary experiments indicate that LLM agents often fail to produce reliable and complete Simulink simulation code from text-only inputs, likely due to the lack of Simulink-specific data in their pretraining. To address this challenge, we propose SimuGen, a multimodal agent-based framework that automatically generates accurate Simulink simulation code by leveraging both the visual Simulink diagram and domain knowledge. SimuGen coordinates several specialized agents, including an investigator, unit test reviewer, code generator, executor, debug locator, and report writer, supported by a domain-specific knowledge base. This collaborative and modular design enables interpretable, robust, and reproducible Simulink simulation generation. Our source code is publicly available at https://github.com/renxinxing123/SimuGen_beta.
中文摘要:大型语言模型在数学推理和代码生成方面表现出色,但在Simulink模型生成领域存在不足,因此提出SimuGen多模态智能体框架,通过结合视觉图表和领域知识来自动生成准确的仿真代码。
English Summary: Large language models excel in math and coding but struggle with Simulink model generation due to training data gaps, prompting the proposed SimuGen framework that uses multimodal agents and domain knowledge to produce reliable simulations.
Authors:Anirud Aggarwal, Abhinav Shrivastava, Matthew Gwilliam
Abstract:
Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project website is available at https://aniaggarwal.github.io/ecad and our code is available at https://github.com/aniaggarwal/ecad.
中文: ECAD提出一种遗传算法,可为扩散模型学习高效的缓存策略,在不修改网络参数或参考图像的情况下,显著提升推理速度并优化质量与延迟的权衡。
English: ECAD introduces a genetic algorithm that learns efficient caching schedules for diffusion models, achieving significant speedups and improved quality-latency trade-offs without modifying network parameters or requiring reference images.
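A caching schedule can be encoded as a bit per inference step (1 = reuse cached features), which makes the evolutionary search easy to sketch. The fitness function below is a mock stand-in for ECAD's prompt-based quality/latency evaluation, and the elitist selection is a crude simplification of maintaining a Pareto set:

```python
import random

STEPS = 20                                  # diffusion inference steps

def fitness(schedule):
    speedup = sum(schedule) / STEPS         # cached steps are (nearly) free
    stale = sum(a & b for a, b in zip(schedule, schedule[1:]))
    return speedup, -stale                  # penalize long cached runs

def mutate(s, p=0.1):
    return [b ^ (random.random() < p) for b in s]

def crossover(a, b):
    cut = random.randrange(STEPS)
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in range(STEPS)] for _ in range(30)]
for gen in range(40):
    elite = sorted(pop, key=fitness, reverse=True)[:10]
    pop = elite + [mutate(crossover(*random.sample(elite, 2)))
                   for _ in range(20)]
best = max(pop, key=fitness)
print(best, fitness(best))
```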
Authors:Tevin Wang, Chenyan Xiong
Abstract:
Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6% relative improvement in length-controlled win rate on AlpacaEval2.0, and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preference. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at https://github.com/cxcscmu/AutoRule.
Chinese: AutoRule通过从偏好反馈中自动提取规则并构建规则奖励,结合学习到的奖励模型强化训练,显著提升了模型在AlpacaEval2.0和MT-Bench等基准测试中的性能表现。
English: AutoRule automates the extraction of rules from preference feedback to create rule-based rewards, enhancing reinforcement learning by integrating these with learned reward models and significantly improving model performance on benchmarks like AlpacaEval2.0 and MT-Bench.
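The auxiliary reward itself reduces to a rule-satisfaction fraction added to the learned reward-model score. A toy sketch with hand-written stand-ins for the language-model verifiers:

```python
def combined_reward(output, rules, rm_score, weight=0.3):
    """Learned RM score plus the fraction of extracted rules satisfied."""
    return rm_score + weight * sum(r(output) for r in rules) / len(rules)

rules = [                                   # illustrative extracted rules
    lambda o: o.strip().endswith((".", "!", "?")),   # complete sentences
    lambda o: len(o.split()) < 120,                  # stays concise
    lambda o: "as an ai" not in o.lower(),           # avoids boilerplate
]
print(combined_reward("Sure, here is a concise answer.", rules, rm_score=0.42))
```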
Authors:Kyobin Choo, Hyunkyung Han, Jinyeong Kim, Chanyong Yoon, Seong Jae Hwang
Abstract:
In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR. Our code is available at https://github.com/MICV-yonsei/M2M-Reg.
中文: 提出的M2M-Reg框架通过仅使用单模态相似性训练模型并引入GradCyCon正则化,解决了多模态形变图像配准中的挑战,在无需真实变换数据的情况下,显著提升了PET-MRI和FA-MRI等高异质性模态的配准效果。
English: The proposed M2M-Reg framework addresses challenges in multi-modal deformable image registration by training models using mono-modal similarity and introducing GradCyCon regularization, achieving significantly improved alignment for highly heterogeneous modalities like PET-MRI and FA-MRI without requiring ground-truth data.
Authors:Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, Lei Zhang
Abstract:
It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
中文: 提出的双LoRA学习范式通过交替优化一致性和细节增强模块,训练基于稳定扩散的模型,在视频超分辨率中同时实现逼真的空间细节和时间一致性。
English: The proposed Dual LoRA Learning (DLoRAL) paradigm trains a stable diffusion-based model to simultaneously achieve realistic spatial details and temporal consistency in video super-resolution by alternately optimizing consistency and detail enhancement modules.
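The alternating scheme boils down to one frozen base layer with two LoRA branches whose gradients are enabled in turns. A toy linear version (ranks, dimensions, and the loss are placeholders; the real C-LoRA/D-LoRA sit inside a stable diffusion backbone):

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    def __init__(self, dim=64, rank=4):
        super().__init__()
        self.base = nn.Linear(dim, dim).requires_grad_(False)  # frozen base
        lora = lambda: nn.Sequential(nn.Linear(dim, rank, bias=False),
                                     nn.Linear(rank, dim, bias=False))
        self.c_lora, self.d_lora = lora(), lora()   # consistency / detail

    def forward(self, x):
        return self.base(x) + self.c_lora(x) + self.d_lora(x)

    def set_phase(self, phase):
        self.c_lora.requires_grad_(phase == "consistency")
        self.d_lora.requires_grad_(phase == "detail")

layer = DualLoRALinear()
for phase in ["consistency", "detail", "consistency"]:  # alternating phases
    layer.set_phase(phase)
    layer(torch.randn(8, 64)).pow(2).mean().backward()  # grads hit one branch
```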
Authors:Shaoqing Lin, Chong Teng, Fei Li, Donghong Ji, Lizhen Qu, Zhuang Li
Abstract:
Vision-Language Models (VLMs) generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers built for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. We introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), and release DiscoSG-DS, a dataset of 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. Fine-tuning GPT-4o on DiscoSG-DS yields over 40% higher SPICE than the strongest sentence-merging baseline. However, its high inference cost and licensing restrict open-source use, and smaller fine-tuned open-source models (e.g., Flan-T5) perform poorly on dense graph generation. To bridge this gap, we propose DiscoSG-Refiner, which drafts a base graph using a seed parser and iteratively refines it with a second model, improving robustness for complex graph generation. Using two small fine-tuned Flan-T5-Base models, DiscoSG-Refiner improves SPICE by approximately 30% over the baseline while achieving 86 times faster inference than GPT-4o. It also delivers consistent gains on downstream VLM tasks, including discourse-level caption evaluation and hallucination detection, outperforming alternative parsers. Code and data are available at https://github.com/ShaoqLin/DiscoSG.
中文: 为解决视觉语言模型中多句子描述解析的不足,研究提出了话语级场景图解析任务DiscoSG及相应数据集,并开发了DiscoSG-Refiner方法,该方法通过迭代优化显著提升了解析性能与效率,同时大幅优于现有基线模型。
English: Vision-Language Models require discourse-level scene graph parsing to overcome the limitations of sentence-merging approaches, leading to the introduction of DiscoSG task and dataset, and the development of DiscoSG-Refiner, which significantly enhances parsing performance and efficiency for downstream tasks.
Authors:Chang Liu, Yimeng Bai, Xiaoyan Zhao, Yang Zhang, Fuli Feng, Wenge Rong
Abstract:
Generative recommendation is emerging as a powerful paradigm that directly generates item predictions, moving beyond traditional matching-based approaches. However, current methods face two key challenges: token-item misalignment, where uniform token-level modeling ignores item-level granularity that is critical for collaborative signal learning, and semantic-collaborative signal entanglement, where collaborative and semantic signals exhibit distinct distributions yet are fused in a unified embedding space, leading to conflicting optimization objectives that limit the recommendation performance. To address these issues, we propose DiscRec, a novel framework that enables Disentangled Semantic-Collaborative signal modeling with flexible fusion for generative Recommendation. First, DiscRec introduces item-level position embeddings, assigned based on indices within each semantic ID, enabling explicit modeling of item structure in input token sequences. Second, DiscRec employs a dual-branch module to disentangle the two signals at the embedding layer: a semantic branch encodes semantic signals using original token embeddings, while a collaborative branch applies localized attention restricted to tokens within the same item to effectively capture collaborative signals. A gating mechanism subsequently fuses both branches while preserving the model's ability to model sequential dependencies. Extensive experiments on four real-world datasets demonstrate that DiscRec effectively decouples these signals and consistently outperforms state-of-the-art baselines. Our codes are available on https://github.com/Ten-Mao/DiscRec.
中文:DiscRec框架通过引入项目级位置嵌入和双分支门控融合机制,解决了生成式推荐中的标记-项目错位与语义-协同信号纠缠问题,在多个数据集上实现了最优性能。
English: The proposed DiscRec framework addresses token-item misalignment and semantic-collaborative signal entanglement in generative recommendation by introducing item-level position embeddings and a dual-branch module with gated fusion, achieving superior performance across multiple datasets.
Authors:Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
Abstract:
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
Chinese: Show-o2模型通过自回归建模和流匹配技术,在3D因果变分自编码器空间中构建统一视觉表征,采用双路径时空融合和两阶段训练方法,实现了跨图像与视频模态的可扩展多模态理解与生成能力。
English: The Show-o2 model introduces a unified multimodal framework using autoregressive modeling and flow matching within a 3D causal variational autoencoder space, enabling scalable image and video understanding and generation through a two-stage training process.
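The flow head's training signal follows the standard flow-matching recipe: regress a velocity field along the straight path between noise and data. A generic sketch (Show-o2's actual head operates on 3D causal VAE latents with a unified backbone, not the toy MLP below):

```python
import torch

def flow_matching_loss(model, x1):
    x0 = torch.randn_like(x1)                 # noise endpoint
    t = torch.rand(x1.size(0), 1)             # one timestep per sample
    xt = (1 - t) * x0 + t * x1                # linear interpolation path
    target_v = x1 - x0                        # constant velocity along path
    pred_v = model(torch.cat([xt, t], dim=1)) # condition on (x_t, t)
    return (pred_v - target_v).pow(2).mean()

model = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 16))
loss = flow_matching_loss(model, torch.randn(32, 16))
loss.backward()
print(float(loss))
```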
Authors:Xingrui Qin, Wentao Zhao, Chuan Cao, Yihe Niu, Tianchen Deng, Houcheng Jiang, Rui Guo, Jingchuan Wang
Abstract:
Dense depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, for guiding the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. RaCalNet is composed of two key modules. The Radar Recalibration module performs radar point screening and pixel-wise displacement refinement, producing accurate and reliable depth priors from sparse radar inputs. These priors are then used by the Metric Depth Optimization module, which learns to infer scene-level scale priors and fuses them with monocular depth predictions to achieve metrically accurate outputs. This modular design enhances structural consistency and preserves fine-grained geometric details. Despite relying solely on sparse supervision, RaCalNet produces depth maps with clear object contours and fine-grained textures, demonstrating superior visual quality compared to state-of-the-art dense-supervised methods. Quantitatively, it achieves performance comparable to existing methods on the ZJU-4DRadarCam dataset and yields a 34.89% RMSE reduction in real-world deployment scenarios. We plan to gradually release the code and models in the future at https://github.com/818slam/RaCalNet.git.
中文: RaCalNet提出了一种新颖框架,通过稀疏激光雷达监督替代密集监督,从雷达和图像中实现精确的密集深度估计,大幅降低成本的同时保持甚至超越了现有方法的性能。
English: RaCalNet introduces a novel framework that replaces dense LiDAR supervision with sparse supervision to achieve accurate dense depth estimation from radar and images, significantly reducing costs while maintaining or even surpassing the performance of existing methods.
Authors:Farheen Ramzan, Yusuf Kiberu, Nikesh Jathanna, Shahnaz Jamil-Copley, Richard H. Clayton, Chen Chen
Abstract:
Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM (Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation), a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, aiming to enhance both the realism of synthesized scars and the accuracy of the scar segmentation performance. Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks. Code is available at https://github.com/farheenjabeen/CLAIM-Scar-Synthesis.
中文:CLAIM框架通过临床引导和扩散模型生成解剖结构准确且多样化的心肌瘢痕图像,有效提升分割模型的鲁棒性和准确性,实验证明其生成的瘢痕模式更真实且与真实分布一致性更高。
English: CLAIM is a novel framework that uses clinical guidance and diffusion models to generate realistic, anatomically accurate myocardial scar images for robust segmentation, outperforming baseline methods in producing diverse and coherent scar patterns.
Authors:Alaa Anani, Tobias Lorenz, Mario Fritz, Bernt Schiele
Abstract:
Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at https://github.com/AlaaAnani/certified-attributions.
中文摘要:本文首次提出基于随机平滑的认证框架,为任何黑盒归因方法提供像素级鲁棒性保证,确保归因结果在对抗扰动下的可靠性和可解释性。
English Summary: This paper introduces the first certification framework using randomized smoothing to guarantee pixel-level robustness for any black-box attribution method against adversarial perturbations, ensuring reliable and interpretable explanations.
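The core mechanic, sparsify each attribution map and vote over Gaussian-perturbed inputs, can be sketched in a few lines. This is a hedged illustration rather than the paper's exact procedure: `attr_fn`, the top-k sparsification, and the 0.5 vote threshold are stand-ins, and the actual certificates convert vote margins into $\ell_2$ radii.

```python
import numpy as np

def smoothed_topk_attribution(attr_fn, x, k, sigma=0.25, n_samples=100):
    """Vote, per pixel, on top-k membership of attribution maps computed
    on Gaussian-perturbed copies of the input.

    attr_fn: any black-box attribution method, image -> (H, W) importance map
    """
    H, W = x.shape[:2]
    votes = np.zeros((H, W))
    for _ in range(n_samples):
        noisy = x + sigma * np.random.randn(*x.shape)
        attr = attr_fn(noisy)
        thresh = np.partition(attr.ravel(), -k)[-k]   # sparsify to top-k
        votes += attr >= thresh
    return votes / n_samples   # vote rates; margins back the certificates

# Usage with a dummy attribution method.
x = np.random.rand(32, 32)
rates = smoothed_topk_attribution(lambda im: im, x, k=50)
certified_important = rates > 0.5   # smoothed "important pixel" set
```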
Authors:Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
Abstract:
Training LLMs on decentralized and wimpy computation nodes, e.g., multiple on-spot instances, lowers the training cost and enables model democratization. The inevitable challenge here is the churn of nodes due to failures and the operator's scheduling policies, leading to the loss of a stage, a part of the model. The conventional approaches to recover from failures are to either use checkpointing, where periodically a copy of the entire model is sent to additional storage, or redundant computation. These approaches yield significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper, we propose CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because of the nature of averaging neighboring stages, it can only recover failures of intermediate stages. We further extend our method to CheckFree+ with out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, the behavior of those stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by simply copying the weights from the immediate neighbor. To be able to recover the (de)embedding layers, CheckFree+ copies those layers to the neighboring stages, which requires relatively small storage overhead. We extensively evaluate our method on LLaMa models of sizes from 124M to 1.5B with varying failure frequencies. In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence in wall-clock time by over 12%. Both of our proposals can be run via our code available at: https://github.com/gensyn-ai/CheckFree.
中文: CheckFree是一种高效的恢复方法,通过用相邻阶段的加权平均值替代故障阶段,无需额外计算或存储,其增强版CheckFree+通过乱序流水线执行进一步支持首尾阶段故障的恢复。
English: CheckFree is an efficient recovery method that replaces failed stages with weighted averages of neighboring stages, eliminating the need for additional computation or storage, and its enhanced version CheckFree+ extends this capability to handle first and last stage failures through out-of-order pipeline execution.
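The recovery rule itself is tiny: a lost intermediate stage is rebuilt as a weighted average of its surviving neighbors' parameters. A minimal sketch, with `alpha` as a stand-in for the paper's weighting scheme:

```python
import copy

def recover_stage(prev_stage, next_stage, alpha=0.5):
    """Rebuild a lost intermediate pipeline stage as a weighted average of
    its two surviving neighbors' parameters (the CheckFree recovery rule)."""
    recovered = copy.deepcopy(prev_stage)
    for name, w_prev in prev_stage.items():
        recovered[name] = alpha * w_prev + (1.0 - alpha) * next_stage[name]
    return recovered

# Toy usage with scalar "weights"; real stages are dicts of tensors.
stage2 = recover_stage({"w": 1.0}, {"w": 3.0})   # -> {"w": 2.0}
```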
Authors:Zhouhong Gu, Xiaoxuan Zhu, Yin Cai, Hao Shen, Xingzhou Chen, Qingyi Wang, Jialin Li, Xiaoran Shi, Haoran Guo, Wenxuan Huang, Hongwei Feng, Yanghua Xiao, Zheyu Ye, Yao Hu, Shaosheng Cao
Abstract:
Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures, enabling dependency management and distributed concurrent processing; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; and (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems, where improvements exceed 11 percentage points compared to state-of-the-art baselines. These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. Code is available at https://github.com/MikeGu721/AgentGroupChat-V2.
Chinese: AgentGroupChat-V2提出了一种具有并行架构和自适应协作引擎的新型多智能体框架,在多个基准测试的复杂推理任务中展现出卓越性能。
English: AgentGroupChat-V2 introduces a novel multi-agent framework with a parallel architecture and adaptive collaboration engine, demonstrating superior performance in complex reasoning tasks across multiple benchmarks.
Authors:Guoguo Ai, Hezhe Qiao, Hui Yan, Guansong Pang
Abstract:
Semi-supervised graph anomaly detection (GAD) utilizes a small set of labeled normal nodes to identify abnormal nodes from a large set of unlabeled nodes in a graph. Current methods in this line posit that 1) normal nodes share a similar level of homophily and 2) the labeled normal nodes can well represent the homophily patterns in the normal class. However, this assumption often does not hold well since normal nodes in a graph can exhibit diverse homophily in real-world GAD datasets. In this paper, we propose RHO, namely Robust Homophily Learning, to adaptively learn such homophily patterns. RHO consists of two novel modules, adaptive frequency response filters (AdaFreq) and graph normality alignment (GNA). AdaFreq learns a set of adaptive spectral filters that capture different frequency components of the labeled normal nodes with varying homophily in the channel-wise and cross-channel views of node attributes. GNA is introduced to enforce consistency between the channel-wise and cross-channel homophily representations to robustify the normality learned by the filters in the two views. Experiments on eight real-world GAD datasets show that RHO can effectively learn varying, often under-represented, homophily in the small normal node set and substantially outperforms state-of-the-art competing methods. Code is available at https://github.com/mala-lab/RHO.
中文: 提出的RHO框架通过自适应频率响应滤波器和图正态对齐模块,有效学习正常节点中多样化的同质性模式,在八个真实世界图异常检测数据集上显著优于现有最优方法。
English: The proposed RHO framework addresses limitations in semi-supervised graph anomaly detection by introducing adaptive frequency response filters and graph normality alignment to effectively capture diverse homophily patterns in normal nodes, demonstrating superior performance across eight real-world datasets.
Authors:Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, Haolin Liu, Yunfei Zhao, Qingxiang Lin, Zeqiang Lai, Xianghui Yang, Huiwen Shi, Zibo Zhao, Bowen Zhang, Hongyu Yan, Lifu Wang, Sicong Liu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Dongyuan Guo, Junlin Yu, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Shida Wei, Chao Zhang, Yonghao Tan, Yifu Sun, Lin Niu, Shirui Huang, Bojian Zheng, Shu Liu, Shilin Chen, Xiang Yuan, Xiaofeng Yang, Kai Liu, Jianchen Zhu, Peng Chen, Tian Liu, Di Wang, Yuhong Liu, Linus, Jie Jiang, Jingwei Huang, Chunchao Guo
Abstract:
3D AI-generated content (AIGC) is a rapidly growing field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D models. To address these challenges, we introduce Hunyuan3D 2.1 as a case study in this tutorial. This tutorial offers a comprehensive, step-by-step guide on processing 3D data, training a 3D generative model, and evaluating its performance using Hunyuan3D 2.1, an advanced system for producing high-resolution, textured 3D assets. The system comprises two core components: the Hunyuan3D-DiT for shape generation and the Hunyuan3D-Paint for texture synthesis. We will explore the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. By the conclusion of this tutorial, you will have the knowledge to finetune or develop a robust 3D generative model suitable for applications in gaming, virtual reality, and industrial design.
中文: 本教程以Hunyuan3D 2.1为例,系统介绍三维数据处理、生成模型训练与性能评估的全流程,帮助开发者掌握创建适用于游戏和设计领域的高质量三维资产的能力。
English: This tutorial presents Hunyuan3D 2.1 as a comprehensive guide for processing 3D data, training generative models, and evaluating performance to create high-resolution 3D assets for gaming and design applications.
Authors:Damin Kühn, Michael T. Schaub
Abstract:
Optimal transport provides a robust framework for comparing probability distributions. Its effectiveness is significantly influenced by the choice of the underlying ground metric. Traditionally, the ground metric has either been (i) predefined, e.g., as the Euclidean distance, or (ii) learned in a supervised way, by utilizing labeled data to learn a suitable ground metric for enhanced task-specific performance. Yet, predefined metrics typically cannot account for the inherent structure and varying importance of different features in the data, and existing supervised approaches to ground metric learning often do not generalize across multiple classes or are restricted to distributions with shared supports. To address these limitations, we propose a novel approach for learning metrics for arbitrary distributions over a shared metric space. Our method provides a distance between individual points like a global metric, but requires only class labels on a distribution-level for training. The learned global ground metric enables more accurate optimal transport distances, leading to improved performance in embedding, clustering and classification tasks. We demonstrate the effectiveness and interpretability of our approach using patient-level scRNA-seq data spanning multiple diseases.
Chinese: 该方法仅利用分布层面的类别标签学习最优传输的全局基础度量,无需预定义或监督学习即可提升聚类和分类等任务的准确性。
English: The proposed method learns a global ground metric for optimal transport using only distribution-level class labels, enhancing accuracy in tasks like clustering and classification without requiring predefined or supervised metrics.
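To see where a learned global ground metric enters, note that it only changes the cost matrix fed to the optimal transport solver. Below is a self-contained sketch: a plain Sinkhorn solver plus a Mahalanobis-style linear map `L` as one possible parameterization of a learned metric. The paper's actual parameterization and its label-driven training are not shown; everything here is illustrative.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, n_iter=200):
    """Entropy-regularized OT between histograms a, b under cost matrix C."""
    K = np.exp(-C / reg)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]      # transport plan
    return (P * C).sum()                 # OT cost under this ground metric

def learned_ground_cost(X, Y, L):
    """Pairwise squared distances after a linear map x -> Lx (a
    Mahalanobis-style global metric is one simple parameterization)."""
    XL, YL = X @ L.T, Y @ L.T
    return ((XL[:, None, :] - YL[None, :, :]) ** 2).sum(-1)

# Two point clouds (e.g., cells from two patients), uniform weights.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(30, 5)), rng.normal(size=(40, 5))
L = np.eye(5)                            # would be learned from class labels
C = learned_ground_cost(X, Y, L)
d = sinkhorn(np.full(30, 1 / 30), np.full(40, 1 / 40), C / C.max())
```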
Authors:J. Thorben Frank, Winfried Ripken, Gregor Lied, Klaus-Robert Müller, Oliver T. Unke, Stefan Chmiela
Abstract:
Diffusion Transformers (DiTs) have demonstrated strong performance in generative modeling, particularly in image synthesis, making them a compelling choice for molecular conformer generation. However, applying DiTs to molecules introduces novel challenges, such as integrating discrete molecular graph information with continuous 3D geometry, handling Euclidean symmetries, and designing conditioning mechanisms that generalize across molecules of varying sizes and structures. We propose DiTMC, a framework that adapts DiTs to address these challenges through a modular architecture that separates the processing of 3D coordinates from conditioning on atomic connectivity. To this end, we introduce two complementary graph-based conditioning strategies that integrate seamlessly with the DiT architecture. These are combined with different attention mechanisms, including both standard non-equivariant and SO(3)-equivariant formulations, enabling flexible control over the trade-off between accuracy and computational efficiency. Experiments on standard conformer generation benchmarks (GEOM-QM9, -DRUGS, -XL) demonstrate that DiTMC achieves state-of-the-art precision and physical validity. Our results highlight how architectural choices and symmetry priors affect sample quality and efficiency, suggesting promising directions for large-scale generative modeling of molecular structures. Code available at https://github.com/ML4MolSim/dit_mc.
中文: DiTMC通过结合图结构条件与等变注意力机制,将扩散变换器应用于分子构象生成,在标准测试中实现了最优精度和物理有效性。
English: DiTMC adapts Diffusion Transformers for molecular conformer generation by integrating graph-based conditioning and equivariant attention, achieving state-of-the-art precision and physical validity on benchmarks.
Authors:Niki Amini-Naieni, Andrew Zisserman
Abstract:
We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://github.com/niki-amini-naieni/CountVid/.
中文摘要:本文提出CountVid模型,通过文本或图像查询在视频中实现开放世界的物体计数,能有效追踪并统计不同物体实例,并在新构建的VideoCount数据集上验证了其优越性能。
English Summary: This paper introduces CountVid, a model for open-world object counting in videos that uses text or image queries to identify and track unique object instances across frames, and demonstrates its superior performance on the new VideoCount dataset.
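Once a promptable segmenter/tracker links detections across frames, the counting step reduces to counting distinct track identities, which is exactly what prevents double counting of reappearing objects. A toy sketch of that reduction (the tracker itself is the hard part and is not shown):

```python
def count_unique_objects(frame_detections):
    """Count distinct track IDs across a video; an object that reappears
    keeps its ID and is therefore counted only once."""
    return len({tid for frame in frame_detections for tid in frame})

# Two frames; object 1 reappears in frame 2 but is counted once.
print(count_unique_objects([[1, 2], [1, 3]]))  # -> 3
```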
Authors:A. S. Stankevich, I. B. Petrov
Abstract:
Recent developments in the application of deep learning models to acoustic Full Waveform Inversion (FWI) are marked by the use of diffusion models as prior distributions for Bayesian-like inference procedures. The advantage of these methods is the ability to generate high-resolution samples, which are otherwise unattainable with classical inversion methods or other deep learning-based solutions. However, the iterative and stochastic nature of sampling from diffusion models, along with the heuristic nature of output control, remains a limiting factor for their applicability. For instance, an optimal way to include the approximate velocity model into a diffusion-based inversion scheme remains unclear, even though it is considered an essential part of the FWI pipeline. We address the issue by employing a Schrödinger Bridge that interpolates between the distributions of ground truth and smoothed velocity models. To facilitate the learning of nonlinear drifts that transfer samples between distributions, we extend the concept of Image-to-Image Schrödinger Bridge ($\text{I}^2\text{SB}$) to conditional sampling, resulting in a conditional Image-to-Image Schrödinger Bridge (c$\text{I}^2\text{SB}$) framework. To validate our method, we assess its effectiveness in reconstructing the reference velocity model from its smoothed approximation, coupled with the observed seismic signal of fixed shape. Our experiments demonstrate that the proposed solution outperforms our reimplementation of the conditional diffusion model suggested in earlier works, while requiring only a few neural function evaluations (NFEs) to achieve sample fidelity superior to that attained with a supervised learning-based approach. The supplementary code implementing the algorithms described in this paper can be found in the repository https://github.com/stankevich-mipt/seismic_inversion_via_I2SB.
Chinese: 深度学习在全波形反演中的应用近期采用扩散模型作为先验分布,但受限于采样效率和启发式控制,本文通过引入条件图像到图像薛定谔桥(cI²SB)框架,以更少神经网络评估实现优于现有方法的样本保真度。
English: Recent advances in deep learning for acoustic Full Waveform Inversion utilize diffusion models as priors, but face limitations in sampling efficiency and control, which are addressed by introducing a conditional Image-to-Image Schrödinger Bridge (cI²SB) framework that outperforms prior methods with fewer neural evaluations.
Authors:Lanfeng Zhong, Xin Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang
Abstract:
Pathology image classification plays a crucial role in accurate medical diagnosis and treatment planning. Training high-performance models for this task typically requires large-scale annotated datasets, which are both expensive and time-consuming to acquire. Active Learning (AL) offers a solution by iteratively selecting the most informative samples for annotation, thereby reducing the labeling effort. However, most AL methods are designed under the assumption of a closed-set scenario, where all the unannotated images belong to target classes. In real-world clinical environments, the unlabeled pool often contains a substantial amount of Out-Of-Distribution (OOD) data, leading to low annotation efficiency in traditional AL methods. Furthermore, most existing AL methods start with random selection in the first query round, leading to a significant waste of labeling costs in open-set scenarios. To address these challenges, we propose OpenPath, a novel open-set active learning approach for pathological image classification leveraging a pre-trained Vision-Language Model (VLM). In the first query, we propose task-specific prompts that combine target and relevant non-target class prompts to effectively select In-Distribution (ID) and informative samples from the unlabeled pool. In subsequent queries, Diverse Informative ID Sampling (DIS), which includes Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS), is proposed to ensure both purity and informativeness in a query, avoiding the selection of OOD samples. Experiments on two public pathology image datasets show that OpenPath significantly enhances the model's performance due to the high purity of selected samples, and outperforms several state-of-the-art open-set AL methods. The code is available at \href{https://github.com/HiLab-git/OpenPath}{https://github.com/HiLab-git/OpenPath}.
中文: 本文提出OpenPath这一新型开放集主动学习方法,利用预训练视觉语言模型在病理图像分类中有效筛选信息丰富的分布内样本并排除分布外数据,实验证明其性能优于现有先进方法。
English: This paper introduces OpenPath, a novel open-set active learning method that utilizes a pre-trained vision-language model to efficiently select informative in-distribution samples while avoiding out-of-distribution data in pathological image classification, demonstrating superior performance over existing methods.
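A rough sketch of the two-step query logic, prototype-based ID filtering followed by entropy-guided stochastic sampling, is below. All names, the pool-size factor, and the entropy-weighted draw are illustrative assumptions; the paper's PIS and EGSS modules define the precise criteria.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(-1)

def select_query_batch(probs, id_prototype_sims, budget, id_pool_factor=3):
    """Pick informative in-distribution samples (OpenPath-style sketch).

    probs: (N, K) VLM class probabilities for unlabeled images
    id_prototype_sims: (N,) similarity of each sample to ID prototypes
    """
    # Prototype-based ID candidate selection: keep the most ID-like pool.
    pool = np.argsort(-id_prototype_sims)[: budget * id_pool_factor]
    # Entropy-guided stochastic sampling: prefer uncertain candidates.
    H = entropy(probs[pool])
    p = H / H.sum()                      # entropy-weighted random draw
    return np.random.choice(pool, size=budget, replace=False, p=p)

probs = np.random.dirichlet(np.ones(5), size=200)
sims = np.random.rand(200)
batch = select_query_batch(probs, sims, budget=16)
```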
Authors:Leonid Ivanov, Vasily Yuryev, Dmitry Yudin
Abstract:
In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced end-to-end model named MapFM for online vectorized HD map generation. We show that incorporating a powerful foundation model to encode camera images significantly boosts feature representation quality. To further enrich the model's understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. The source code is available at https://github.com/LIvanoff/MapFM.
Chinese: 本文提出MapFM增强型端到端模型,通过融合基础模型进行图像编码并在鸟瞰图表示中集成辅助语义分割头,显著提升了在线矢量化高精地图生成的准确性和质量。
English: This paper presents MapFM, an enhanced end-to-end model that improves online vectorized HD map generation by integrating foundation models for image encoding and auxiliary semantic segmentation heads in BEV representation, resulting in higher accuracy and quality.
Authors:Han Wu, Junyao Li, Kangbo Zhao, Sen Zhang, Yukai Shi, Liang Lin
Abstract:
Face sketch synthesis is a technique aimed at converting face photos into sketches. Existing face sketch synthesis research mainly relies on training with numerous photo-sketch sample pairs from existing datasets. However, these large-scale discriminative learning methods face problems such as data scarcity and high human labor costs. Once the training data becomes scarce, their generative performance significantly degrades. In this paper, we propose a one-shot face sketch synthesis method based on diffusion models. We optimize text instructions on a diffusion model using face photo-sketch image pairs. Then, the instructions derived through gradient-based optimization are used for inference. To simulate real-world scenarios more accurately and evaluate method effectiveness more comprehensively, we introduce a new benchmark named One-shot Face Sketch Dataset (OS-Sketch). The benchmark consists of 400 pairs of face photo-sketch images, including sketches with different styles and photos with different backgrounds, ages, sexes, expressions, illumination, etc. For a solid out-of-distribution evaluation, we select only one pair of images for training at each time, with the rest used for inference. Extensive experiments demonstrate that the proposed method can convert various photos into realistic and highly consistent sketches in a one-shot context. Compared to other methods, our approach offers greater convenience and broader applicability. The dataset will be available at: https://github.com/HanWu3125/OS-Sketch
中文: 本文提出了一种基于扩散模型的单次人脸素描合成方法,通过优化文本指令实现仅需一对样本即可将照片转化为逼真素描,并建立了新的基准数据集验证其优越性能。
English: This paper introduces a one-shot face sketch synthesis method using diffusion models, which optimizes text instructions for converting photos into realistic sketches with minimal training data, and validates its effectiveness through a new benchmark dataset.
Authors:Zihao Li, Qiang Chen, Lixin Zou, Aixin Sun, Chenliang Li
Abstract:
Existing recommendation methods often struggle to model users' multifaceted preferences due to the diversity and volatility of user behavior, as well as the inherent uncertainty and ambiguity of item attributes in practical scenarios. Multi-interest recommendation addresses this challenge by extracting multiple interest representations from users' historical interactions, enabling fine-grained preference modeling and more accurate recommendations. It has drawn broad interest in recommendation research. However, current recommendation surveys have either specialized in frontier recommendation methods or delved into specific tasks and downstream applications. In this work, we systematically review the progress, solutions, challenges, and future directions of multi-interest recommendation by answering the following three questions: (1) Why is multi-interest modeling significantly important for recommendation? (2) What aspects are focused on by multi-interest modeling in recommendation? and (3) How can multi-interest modeling be applied, along with the technical details of the representative modules? We hope that this survey establishes a fundamental framework and delivers a preliminary overview for researchers interested in this field and committed to further exploration. The implementation of multi-interest recommendation summarized in this survey is maintained at https://github.com/WHUIR/Multi-Interest-Recommendation-A-Survey.
中文摘要:本综述通过阐述多兴趣建模的重要性、关注方向和技术实现,系统回顾了多兴趣推荐的研究进展,为研究者理解这种通过多重兴趣表征来建模用户多样化偏好的方法提供了基础框架。
English summary: This survey systematically reviews multi-interest recommendation by addressing its importance, focus areas, and technical implementations to help researchers understand this approach that models users' diverse preferences through multiple interest representations.
Authors:Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic
Abstract:
The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set, known to be absent from training, that closely matches the compromised dataset's distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigation. Our code is available at https://github.com/sprintml/PostHocDatasetInference.
中文摘要:本研究提出一种生成合成数据的方法,使数据集推断能够有效检测大型语言模型中未经授权的训练数据使用,克服了需要不可得保留数据集的限制,并在多样实验中展现出高准确率和低误报率。
English Summary: This study introduces a method to generate synthetic data that enables Dataset Inference to effectively detect unauthorized use of training data in Large Language Models, overcoming the limitation of requiring unavailable held-out datasets and demonstrating high accuracy with low false positives in diverse experiments.
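Conceptually, the test compares the model's likelihoods on the suspect set against a synthetic held-out set after calibrating away the systematic real-vs-synthetic likelihood gap. A minimal sketch under that reading, with the gap assumed to have been estimated beforehand on data known to be unseen; the paper's calibration and test statistics may differ.

```python
import numpy as np
from scipy import stats

def dataset_inference_pvalue(suspect_nll, synth_heldout_nll, gap):
    """Minimal calibrated DI test: after subtracting the systematic
    real-vs-synthetic NLL gap (estimated on known-unseen data),
    trained-on data should show significantly *lower* NLL."""
    calibrated = synth_heldout_nll - gap
    _, p = stats.ttest_ind(suspect_nll, calibrated, alternative="less")
    return p

rng = np.random.default_rng(0)
trained = rng.normal(2.0, 0.3, 500)      # memorized -> lower NLL
synthetic = rng.normal(2.6, 0.3, 500)    # synthetic held-out stand-in
print(dataset_inference_pvalue(trained, synthetic, gap=0.4))  # small p
```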
Authors:Jan van Delden, Julius Schultz, Sebastian Rothe, Christian Libner, Sabine C. Langer, Timo Lüddecke
Abstract:
Structural vibrations are a source of unwanted noise in engineering systems like cars, trains or airplanes. Minimizing these vibrations is crucial for improving passenger comfort. This work presents a novel design optimization approach based on guided flow matching for reducing vibrations by placing beadings (indentations) in plate-like structures. Our method integrates a generative flow matching model and a surrogate model trained to predict structural vibrations. During the generation process, the flow matching model pushes towards manufacturability while the surrogate model pushes to low-vibration solutions. The flow matching model and its training data implicitly define the design space, enabling a broader exploration of potential solutions as no optimization of manually-defined design parameters is required. We apply our method to a range of differentiable optimization objectives, including direct optimization of specific eigenfrequencies through careful construction of the objective function. Results demonstrate that our method generates diverse and manufacturable plate designs with reduced structural vibrations compared to designs from random search, a criterion-based design heuristic and genetic optimization. The code and data are available from https://github.com/ecker-lab/Optimizing_Vibrating_Plates.
Chinese: 本研究提出了一种基于引导流匹配的新颖设计优化方法,通过整合生成模型和代理模型来减少板状结构中的振动,相比传统方法,能更有效地生成可制造且低振动的设计。
English: This study introduces a novel design optimization method using guided flow matching to reduce structural vibrations in plates by integrating generative and surrogate models, which produces manufacturable, low-vibration designs more effectively than traditional approaches.
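The generation loop can be read as ODE integration with two pulls: the flow model's velocity keeps samples manufacturable while the surrogate's gradient pushes toward low vibration. A minimal Euler-integration sketch follows; the guidance weight, schedule, and callable interfaces are assumptions, not the paper's exact scheme.

```python
import torch

def guided_flow_sample(flow_model, surrogate, x, steps=50, guidance=1.0):
    """Euler integration of a flow-matching ODE with gradient guidance.

    flow_model(x, t) -> velocity toward manufacturable designs
    surrogate(x)     -> predicted vibration energy per design (lower = better)
    """
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        with torch.no_grad():
            v = flow_model(x, t)                      # generative pull
        x_req = x.detach().requires_grad_(True)
        g = torch.autograd.grad(surrogate(x_req).sum(), x_req)[0]
        x = (x + dt * (v - guidance * g)).detach()    # low-vibration pull
    return x
```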
Authors:Yuchuan Fu, Xiaohan Yuan, Dongxia Wang
Abstract:
The rapid deployment of Large language model (LLM) agents in critical domains like healthcare and finance necessitates robust security frameworks. To address the absence of standardized evaluation benchmarks for these agents in dynamic environments, we introduce RAS-Eval, a comprehensive security benchmark supporting both simulated and real-world tool execution. RAS-Eval comprises 80 test cases and 3,802 attack tasks mapped to 11 Common Weakness Enumeration (CWE) categories, with tools implemented in JSON, LangGraph, and Model Context Protocol (MCP) formats. We evaluate 6 state-of-the-art LLMs across diverse scenarios, revealing significant vulnerabilities: attacks reduced agent task completion rates (TCR) by 36.78% on average and achieved an 85.65% success rate in academic settings. Notably, scaling laws held for security capabilities, with larger models outperforming smaller counterparts. Our findings expose critical risks in real-world agent deployments and provide a foundational framework for future security research. Code and data are available at https://github.com/lanzer-tree/RAS-Eval.
中文: RAS-Eval基准测试揭示了LLM代理的严重安全风险,攻击使任务完成率下降36.78%,同时较大模型展现出更优的安全防护能力。
English: The RAS-Eval benchmark exposes critical vulnerabilities in LLM agents, showing attacks reduce task completion by 36.78% while larger models demonstrate stronger security capabilities.
Authors:Liangjie Meng, Danxia Li, Jinrong He, Lili Ma, Zhixin Li
Abstract:
Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave and advanced signal processing. Currently, SAR has found extensive applications in critical maritime domains such as ship detection. However, SAR ship detection faces several challenges, including significant scale variations among ships, the presence of small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. To address these issues, this paper proposes a novel feature enhancement and fusion framework named C-AFBiFPN. C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network, aiming to enrich feature representation and enhance the ability to capture and represent local details and contextual information. Furthermore, C-AFBiFPN innovatively integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network. AFBiFPN improves the global modeling capability of cross-scale feature fusion and can adaptively focus on critical feature regions. The experimental results on SAR Ship Detection Dataset (SSDD) indicate that the proposed approach substantially enhances detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features.
Chinese: 合成孔径雷达(SAR)能够实现全天候高分辨率成像,但在船舶检测中面临目标尺度多变和背景复杂等挑战;本文提出的C-AFBiFPN框架通过增强特征表征与融合能力,有效提升了检测精度和鲁棒性。
English: Synthetic Aperture Radar (SAR) enables high-resolution, all-weather imaging but faces challenges in ship detection due to scale variations and complex backgrounds, which the proposed C-AFBiFPN framework addresses by enhancing feature representation and fusion to improve accuracy and robustness.
Authors:Yufeng Zhang, Wenrui Dai, Hang Yu, Shizhan Liu, Junhui Hou, Jianguo Li, Weiyao Lin
Abstract:
Neural Image Compression (NIC) has revolutionized image compression with its superior rate-distortion performance and multi-task capabilities, supporting both human visual perception and machine vision tasks. However, its widespread adoption is hindered by substantial computational demands. While existing approaches attempt to address this challenge through module-specific optimizations or pre-defined complexity levels, they lack comprehensive control over computational complexity. We present ABC (Adaptive BayesNet structure learning for computationally scalable multi-task image Compression), a novel, comprehensive framework that achieves computational scalability across all NIC components through Bayesian network (BayesNet) structure learning. ABC introduces three key innovations: (i) a heterogeneous bipartite BayesNet (inter-node structure) for managing neural backbone computations; (ii) a homogeneous multipartite BayesNet (intra-node structure) for optimizing autoregressive unit processing; and (iii) an adaptive control module that dynamically adjusts the BayesNet structure based on device capabilities, input data complexity, and downstream task requirements. Experiments demonstrate that ABC enables full computational scalability with better complexity adaptivity and a broader complexity control span, while maintaining competitive compression performance. Furthermore, the framework's versatility allows integration with various NIC architectures that employ BayesNet representations, making it a robust solution for ensuring computational scalability in NIC applications. Code is available at https://github.com/worldlife123/cbench_BaSIC.
中文: ABC通过贝叶斯网络结构学习提出了一种新颖框架,实现了神经图像压缩的计算可扩展性,可在所有组件中自适应控制复杂度,同时保持优异的压缩性能。
English: ABC introduces a novel framework using Bayesian network structure learning to achieve computational scalability in neural image compression, enabling adaptive complexity control across all components while maintaining competitive performance.
Authors:Quanjun Zhang, Chunrong Fang, Siqi Gu, Ye Shang, Zhenyu Chen, Liang Xiao
Abstract:
Unit testing is a fundamental practice in modern software engineering, with the aim of ensuring the correctness, maintainability, and reliability of individual software components. Very recently, with the advances in Large Language Models (LLMs), a rapidly growing body of research has leveraged LLMs to automate various unit testing tasks, demonstrating remarkable performance and significantly reducing manual effort. However, due to ongoing explorations in the LLM-based unit testing field, it is challenging for researchers to understand existing achievements, open challenges, and future opportunities. This paper presents the first systematic literature review on the application of LLMs in unit testing until March 2025. We analyze \numpaper{} relevant papers from the perspectives of both unit testing and LLMs. We first categorize existing unit testing tasks that benefit from LLMs, e.g., test generation and oracle generation. We then discuss several critical aspects of integrating LLMs into unit testing research, including model usage, adaptation strategies, and hybrid approaches. We further summarize key challenges that remain unresolved and outline promising directions to guide future research in this area. Overall, our paper provides a systematic overview of the research landscape to the unit testing community, helping researchers gain a comprehensive understanding of achievements and promote future research. Our artifacts are publicly available at the GitHub repository: https://github.com/iSEngLab/AwesomeLLM4UT.
中文: 本文首次系统综述了大型语言模型在单元测试中的应用,通过分析研究趋势、分类应用场景,并指出未来挑战与机遇,为该领域提供全面指导。
English: This paper presents the first systematic literature review on applying Large Language Models (LLMs) to automate unit testing tasks, analyzing research trends, categorizing applications, and identifying future challenges and opportunities in the field.
Authors:Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Abstract:
Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at \href{https://github.com/bytedance/video-SALMONN-2}{https://github.com/bytedance/video-SALMONN-2}.
Chinese: Video-SALMONN 2 通过多轮直接偏好优化(MrDPO)将视频字幕错误率降低了28%,其仅有70亿参数的最终模型在视频字幕任务上超越GPT-4o和Gemini-1.5-Pro,同时在视频问答基准上保持与同规模最先进模型相当的竞争力。
English: Video-SALMONN 2 applies multi-round direct preference optimization (MrDPO) to cut captioning error rates by 28%, and the final 7-billion-parameter model surpasses GPT-4o and Gemini-1.5-Pro in video captioning while remaining highly competitive on video question-answering benchmarks among models of similar size.
Authors:Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Abstract:
We present video-SALMONN 2, a family of audio-visual large language models that set new state-of-the-art (SOTA) results in video description and question answering (QA). Our core contribution is multi-round direct preference optimisation (MrDPO), paired with a caption-quality objective that jointly rewards completeness and factual accuracy. Unlike standard DPO with a fixed reference policy, MrDPO periodically refreshes the reference by bootstrapping from a newly re-initialised lightweight adapter trained on the latest preferences, avoiding reference staleness and enabling continual improvement. This strategy produces captions that are consistently more detailed and accurate than those from proprietary systems such as GPT-4o and Gemini-1.5 Pro. We further distil these gains by using our model to generate a high-quality video-caption corpus for supervised fine-tuning of new models, transferring benefits beyond captioning to strong performance on complex video-QA tasks. Across widely used audio-visual and visual-only understanding benchmarks (including Video-MME, WorldSense, AVUT, Video-Holmes, DailyOmni, MLVU, and LVBench), our 3B and 7B models achieve SOTA results at comparable scales, while the 72B model surpasses all other open-source systems. Our source code, models, and data are released at \href{https://github.com/bytedance/video-SALMONN-2}{https://github.com/bytedance/video-SALMONN-2}.
Chinese: Video-SALMONN 2 通过多轮直接偏好优化(MrDPO)和字幕质量目标,在视频描述和问答任务中实现了最先进的性能,超越了GPT-4o和Gemini-1.5 Pro等专有系统,并在多个基准测试中表现优异。
English: Video-SALMONN 2 introduces multi-round direct preference optimization (MrDPO) with a caption-quality objective, achieving state-of-the-art results in video description and question answering across multiple benchmarks while outperforming proprietary systems like GPT-4o and Gemini-1.5 Pro.
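For reference, the per-pair objective underlying both video-SALMONN 2 abstracts is standard DPO; MrDPO's contribution is refreshing the reference policy between rounds rather than changing this loss. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on (winner, loser) caption pairs.

    logp_*: policy log-probs of the preferred (w) / rejected (l) captions
    ref_logp_*: same quantities under the reference policy. MrDPO's twist
    is to periodically refresh this reference from a re-initialized LoRA
    adapter trained on the latest preferences, instead of keeping it fixed.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```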
Authors:Dan He, Weisheng Li, Guofen Wang, Yuping Huang, Shiqiang Liu
Abstract:
Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the tissues. However, existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, leading to suboptimal fused image quality. To address these issues, this study proposes a two-stage diffusion model-based fusion network (DM-FNet) to achieve unified MMIF. In Stage I, a diffusion process trains UNet for image reconstruction. UNet captures detailed information through progressive denoising and represents multilevel data, providing a rich set of feature representations for the subsequent fusion network. In Stage II, noisy images at various steps are input into the fusion network to enhance the model's feature recognition capability. Three key fusion modules are also integrated to process medical images from different modalities adaptively. Ultimately, the robust network structure and a hybrid loss function are integrated to harmonize the fused image's brightness, color, contrast, and detail, enhancing its quality and information density. The experimental results across various medical image types demonstrate that the proposed method performs exceptionally well on objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges. The code is available at https://github.com/HeDan-11/DM-FNet.
Chinese: 本研究提出了一种基于扩散模型的两阶段融合网络(DM-FNet),通过增强特征提取和跨模态交互来优化多模态医学图像融合,使融合图像在亮度、对比度和细节方面表现卓越。
English: This study introduces a two-stage diffusion model-based fusion network (DM-FNet) that enhances multimodal medical image fusion by improving feature capture and cross-modal interaction, resulting in superior brightness, contrast, and detail in fused images.
Authors:Xianliang Yang, Ling Zhang, Haolong Qian, Lei Song, Jiang Bian
Abstract:
Heuristic algorithms play a vital role in solving combinatorial optimization (CO) problems, yet traditional designs depend heavily on manual expertise and struggle to generalize across diverse instances. We introduce \textbf{HeurAgenix}, a two-stage hyper-heuristic framework powered by large language models (LLMs) that first evolves heuristics and then selects among them automatically. In the heuristic evolution phase, HeurAgenix leverages an LLM to compare seed heuristic solutions with higher-quality solutions and extract reusable evolution strategies. During problem solving, it dynamically picks the most promising heuristic for each problem state, guided by the LLM's perception ability. For flexibility, this selector can be either a state-of-the-art LLM or a fine-tuned lightweight model with lower inference cost. To mitigate the scarcity of reliable supervision caused by CO complexity, we fine-tune the lightweight heuristic selector with a dual-reward mechanism that jointly exploits signals from selection preferences and state perception, enabling robust selection under noisy annotations. Extensive experiments on canonical benchmarks show that HeurAgenix not only outperforms existing LLM-based hyper-heuristics but also matches or exceeds specialized solvers. Code is available at https://github.com/microsoft/HeurAgenix.
中文: HeurAgenix是一种创新的两阶段超启发式框架,利用大语言模型进化启发式规则并动态选择最优策略来解决组合优化问题,在标准测试中不仅超越了现有基于大语言模型的方法,甚至媲美或优于专业求解器。
English: HeurAgenix is a novel two-stage hyper-heuristic framework that uses large language models to evolve heuristics and dynamically select the most effective ones for solving combinatorial optimization problems, demonstrating superior performance over existing methods and specialized solvers in benchmarks.
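At solve time, such a framework is essentially a select-then-apply loop over evolved heuristics. The skeleton below shows only that control flow, with a hypothetical `state` interface; the interesting parts (the LLM selector and the evolved heuristics) are passed in as callables and are not modeled here.

```python
def solve_with_hyper_heuristic(state, heuristics, selector, max_steps=100):
    """Generic two-stage hyper-heuristic loop: at each problem state the
    selector (an LLM or a fine-tuned lightweight model in HeurAgenix)
    picks the most promising evolved heuristic to apply next."""
    for _ in range(max_steps):
        if state.is_complete():          # hypothetical state interface
            break
        heuristic = selector(state, heuristics)  # returns one callable
        state = heuristic(state)                 # apply it to the state
    return state
```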
Authors:Jiaqi Shi, Jin Xiao, Xiaoguang Hu, Boyang Song, Hao Jiang, Tianyou Chen, Baochang Zhang
Abstract:
Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, they suffer from irrelevant point interference and feature hierarchy gaps due to the limitations of local coordinates. Although some works address these limitations by refining the spatial description through explicit modeling of cross-stage structure, such enhancement methods based on direct geometric structure encoding incur high computational overhead and noise sensitivity. To overcome these problems, we propose the Point Distribution Set Abstraction module (PDSA), which utilizes correlation in the high-dimensional space to correct the feature distribution during aggregation, improving computational efficiency and robustness. PDSA distinguishes point correlation based on a lightweight cross-stage structural descriptor, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long-distance modeling. Additionally, we introduce a key point mechanism to optimize the computational overhead. Experimental results on semantic segmentation and classification tasks with different baselines verify the generalization of the proposed method, which achieves significant performance improvements at a lower parameter cost. The corresponding ablation and visualization results demonstrate the effectiveness and rationality of our method. The code and training weights are available at: https://github.com/AGENT9717/PointDistribution
中文摘要:提出的点分布集合抽象(PDSA)模块通过在高维空间校正特征分布,提升了点云分析的计算效率和鲁棒性,并以更少参数实现了显著性能提升。
English Summary: The proposed Point Distribution Set Abstraction (PDSA) module improves point cloud analysis by correcting feature distribution in high-dimensional space, enhancing computational efficiency and robustness while achieving significant performance gains with fewer parameters.
Authors:Yushi Wang, Penghui Chen, Xinyu Han, Feng Wu, Mingguo Zhao
Abstract:
Recent advancements in reinforcement learning (RL) have led to significant progress in humanoid robot locomotion, simplifying the design and training of motion policies in simulation. However, the numerous implementation details make transferring these policies to real-world robots a challenging task. To address this, we have developed a comprehensive code framework that covers the entire process from training to deployment, incorporating common RL training methods, domain randomization, reward function design, and solutions for handling parallel structures. This library is made available as a community resource, with detailed descriptions of its design and experimental results. We validate the framework on the Booster T1 robot, demonstrating that the trained policies seamlessly transfer to the physical platform, enabling capabilities such as omnidirectional walking, disturbance resistance, and terrain adaptability. We hope this work provides a convenient tool for the robotics community, accelerating the development of humanoid robots. The code can be found in https://github.com/BoosterRobotics/booster_gym.
中文摘要:本研究开发了一个完整的强化学习框架,有效解决了人形机器人运动策略从仿真到实物的迁移难题,并在Booster T1机器人上成功验证了全向行走、抗干扰和地形适应等能力。
English Summary: This study presents a comprehensive reinforcement learning framework that facilitates the transfer of simulated humanoid locomotion policies to real-world robots, validated through successful deployment on the Booster T1 platform enabling robust movement capabilities.
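Domain randomization is one of the implementation details such sim-to-real pipelines hinge on: physics and sensing parameters are resampled per episode so the policy cannot overfit a single simulator instance. A generic sketch; the parameter names and ranges are illustrative assumptions, not the library's actual configuration.

```python
import random

def randomize_physics(sim_cfg):
    """Per-episode domain randomization of the kind sim-to-real humanoid
    pipelines rely on; names and ranges here are illustrative only."""
    return {
        **sim_cfg,
        "ground_friction": random.uniform(0.4, 1.2),
        "base_mass_scale": random.uniform(0.9, 1.1),    # payload variation
        "motor_strength_scale": random.uniform(0.8, 1.2),
        "obs_latency_steps": random.randint(0, 2),      # sensor delay
        "push_force_newton": random.uniform(0.0, 50.0), # external disturbance
    }

# Resample at the start of every training episode.
episode_cfg = randomize_physics({"sim_dt": 0.002})
```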
Authors:Junke Wang, Hongshun Ling, Li Zhang, Longqian Zhang, Fang Wang, Yuan Gao, Zhi Li
Abstract:
Electronic Health Records (EHR)-based disease prediction models have demonstrated significant clinical value in promoting precision medicine and enabling early intervention. However, existing large language models face two major challenges: insufficient representation of medical knowledge and low efficiency in clinical deployment. To address these challenges, this study proposes the CKD-EHR (Clinical Knowledge Distillation for EHR) framework, which achieves efficient and accurate disease risk prediction through knowledge distillation techniques. Specifically, the large language model Qwen2.5-7B is first fine-tuned on medical knowledge-enhanced data to serve as the teacher model. It then generates interpretable soft labels through a multi-granularity attention distillation mechanism. Finally, the distilled knowledge is transferred to a lightweight BERT student model. Experimental results show that on the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model: diagnostic accuracy is increased by 9%, F1-score is improved by 27%, and a 22.2 times inference speedup is achieved. This innovative solution not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. The code and data for this research are available at https://github.com/209506702/CKD_EHR.
Chinese: 本研究提出CKD-EHR框架,通过知识蒸馏技术将经过医学知识增强的大型语言模型作为教师模型,将其知识迁移至轻量级BERT学生模型,在MIMIC-III数据集上显著提升了诊断准确率、F1分数和推理速度。
English: This study introduces the CKD-EHR framework, which uses knowledge distillation to enhance disease prediction by fine-tuning a large language model as a teacher and transferring its knowledge to a lightweight BERT model, achieving significant improvements in accuracy, F1-score, and inference speed on the MIMIC-III dataset.
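The transfer step follows the canonical soft-label distillation recipe: the student matches temperature-softened teacher distributions while still fitting the hard labels. A minimal sketch of that mix; CKD-EHR's multi-granularity attention distillation adds further terms not shown here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Canonical soft/hard-label knowledge distillation objective."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),   # teacher's softened labels
        reduction="batchmean",
    ) * (T * T)                                  # standard temperature scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```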
Authors:Paige Tuttösí, Shivam Mehta, Zachary Syvenky, Bermet Burkanova, Gustav Eje Henter, Angelica Lim
Abstract:
Humans vary their expressivity when speaking for extended periods to maintain engagement with their listener. Although social robots tend to be deployed with "expressive" joyful voices, they lack this long-term variation found in human speech. Foundation model text-to-speech systems are beginning to mimic the expressivity in human speech, but they are difficult to deploy offline on robots. We present EmojiVoice, a free, customizable text-to-speech (TTS) toolkit that allows social roboticists to build temporally variable, expressive speech on social robots. We introduce emoji-prompting to allow fine-grained control of expressivity at the phrase level and use the lightweight Matcha-TTS backbone to generate speech in real-time. We explore three case studies: (1) a scripted conversation with a robot assistant, (2) a storytelling robot, and (3) an autonomous speech-to-speech interactive agent. We found that using varied emoji prompting improved the perception and expressivity of speech over a long period in a storytelling task, but the expressive voice was not preferred in the assistant use case.
中文摘要:人类在长时间说话时会自然变化表达力以保持听众参与,而社交机器人常缺乏这种动态特性;EmojiVoice通过表情符号提示提供可定制的实时语音合成工具,能实现富有表现力的机器人语音,但其效果因应用场景而异。
English Summary: Humans naturally vary their speech expressivity over time to engage listeners, while social robots often lack this dynamic quality, but EmojiVoice offers a customizable, real-time text-to-speech toolkit using emoji prompts to enable expressive, variable robotic speech, with effectiveness varying by application context.
Authors:Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber
Abstract:
Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
Chinese: PrefBERT是一种新颖的评分模型,旨在通过提供语义奖励反馈来评估开放式生成长文本,其表现优于ROUGE-L和BERTScore等传统指标,并在训练策略模型时更好地符合人类偏好。
English: PrefBERT is a novel scoring model designed to evaluate open-ended long-form generation by providing semantic reward feedback, outperforming traditional metrics like ROUGE-L and BERTScore and aligning better with human preferences in training policy models.
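Using a BERT-sized scorer as the GRPO reward amounts to mapping a (prompt, response) pair to a scalar. The sketch below shows only that plumbing, with an off-the-shelf base checkpoint and an untrained single-logit head as stand-ins; the released PrefBERT checkpoint and its exact head live in the repository linked above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in checkpoint; the actual trained scorer is in the PrefBERT repo.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
scorer = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)

def semantic_reward(prompt: str, response: str) -> float:
    """Score a long-form response; this is the slot where PrefBERT
    replaces ROUGE-L/BERTScore as the GRPO reward signal."""
    enc = tok(prompt, response, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logit = scorer(**enc).logits.squeeze()
    return torch.sigmoid(logit).item()   # reward in [0, 1]
```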
Authors:Yijun Lin, Yao-Yi Chiang
Abstract:
Text on historical maps contains valuable information providing georeferenced historical, political, and cultural contexts. However, text extraction from historical maps is challenging due to the lack of (1) effective methods and (2) training data. Previous approaches use ad-hoc steps tailored to only specific map styles. Recent machine learning-based text spotters (e.g., for scene images) have the potential to solve these challenges because of their flexibility in supporting various types of text instances. However, these methods still face challenges in extracting precise image features for predicting every sub-component (boundary points and characters) in a text instance. This is critical because map text can be lengthy and highly rotated with complex backgrounds, posing difficulties in detecting relevant image features from a rough text region. This paper proposes PALETTE, an end-to-end text spotter for scanned historical maps of a wide variety. PALETTE introduces a novel hyper-local sampling module to explicitly learn localized image features around the target boundary points and characters of a text instance for detection and recognition. PALETTE also enables hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. In addition, this paper presents a novel approach to automatically generate synthetic map images, SynthMap+, for training text spotters for historical maps. The experiment shows that PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. We have deployed PALETTE with SynthMap+ to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels to support map searching. The project is released at https://github.com/kartta-foundation/mapkurator-palette-doc.
中文: 历史地图文本提取因样式多样和背景复杂而困难,但PALETTE采用超局部采样模块和SynthMap+合成训练数据,能有效检测和识别文本,性能优于现有方法,并已成功处理超过6万张地图。
English: Historical map text extraction is challenging due to diverse styles and complex backgrounds, but PALETTE introduces a hyper-local sampling module and SynthMap+ synthetic training data to effectively detect and recognize text, outperforming existing methods and successfully processing over 60,000 maps.
Authors:Marissa Dominijanni, Alexander Ororbia, Kenneth W. Regan
Abstract:
Synaptic delays play a crucial role in biological neuronal networks, where their modulation has been observed in mammalian learning processes. In the realm of neuromorphic computing, although spiking neural networks (SNNs) aim to emulate biology more closely than traditional artificial neural networks do, synaptic delays are rarely incorporated into their simulation. We introduce a novel learning rule for simultaneously learning synaptic connection strengths and delays, by extending spike-timing dependent plasticity (STDP), a Hebbian method commonly used for learning synaptic weights. We validate our approach by extending a widely-used SNN model for classification trained with unsupervised learning. Then we demonstrate the effectiveness of our new method by comparing it against other existing methods for co-learning synaptic weights and delays, as well as against STDP without synaptic delays. Results demonstrate that our proposed method consistently achieves superior performance across a variety of test scenarios. Furthermore, our experimental results yield insight into the interplay between synaptic efficacy and delay.
中文: 本研究提出了一种新颖的学习规则,通过扩展脉冲时序依赖可塑性来同时学习脉冲神经网络的突触权重和延迟,实验结果表明该方法在多种测试场景中均表现优异,并揭示了突触效能与延迟之间的相互作用。
English: This study introduces a novel learning rule that extends spike-timing dependent plasticity to simultaneously learn synaptic weights and delays in spiking neural networks, demonstrating superior performance and providing insights into synaptic efficacy-delay interactions across various test scenarios.
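One way to picture a joint weight-delay rule: measure spike timing relative to the delayed arrival of the presynaptic spike, potentiate and nudge the delay toward coincidence for causal pairs, and do the opposite for anti-causal ones. The update below is an illustrative pair-based form, not the paper's exact rule; all constants are assumptions.

```python
import numpy as np

def stdp_weight_delay_update(w, d, t_pre, t_post,
                             a_w=0.01, a_d=0.1, tau=20.0):
    """Pair-based STDP extended to delays. Timing is measured against the
    *delayed* presynaptic arrival t_pre + d."""
    dt = t_post - (t_pre + d)
    if dt >= 0:                       # causal: potentiate, lengthen delay
        w += a_w * np.exp(-dt / tau)  #   so the arrival nears the post spike
        d += a_d * np.exp(-dt / tau)
    else:                             # anti-causal: depress, shorten delay
        w -= a_w * np.exp(dt / tau)
        d -= a_d * np.exp(dt / tau)
    return w, max(d, 0.0)             # delays stay non-negative

print(stdp_weight_delay_update(w=0.5, d=2.0, t_pre=10.0, t_post=15.0))
```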
Authors:Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
Abstract:
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
中文: Guru语料库通过涵盖六个领域的9.2万个多样化推理实例,解决了强化学习在语言模型中应用范围有限的问题,其Guru-7B和Guru-32B模型实现了最先进性能,证明强化学习既能激发既有知识,也能在预训练不足的领域促成真正的技能习得。
English: The Guru corpus introduces 92K diverse reasoning examples across six domains to address the limited scope of reinforcement learning in language models, enabling models like Guru-7B and Guru-32B to achieve state-of-the-art performance by demonstrating that RL can both elicit existing knowledge and foster genuine skill acquisition in underrepresented areas.
Authors:Adriana Watson
Abstract:
The decentralized finance (DeFi) community has grown rapidly in recent years, pushed forward by cryptocurrency enthusiasts interested in the vast untapped potential of new markets. The surge in popularity of cryptocurrency has ushered in a new era of financial crime. Unfortunately, the novelty of the technology makes the task of catching and prosecuting offenders particularly challenging. Thus, it is necessary to implement automated detection tools, together with supporting policies, to address the growing criminality in the cryptocurrency realm.
中文: 去中心化金融的快速发展导致金融犯罪增加,由于技术新颖性带来的追诉挑战,亟需实施自动化检测工具来应对加密货币领域的犯罪增长。
English: The rapid growth of decentralized finance (DeFi) has led to increased financial crime, necessitating automated detection tools to combat challenges in prosecuting offenders due to the technology's novelty.
Authors:Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, Maksym Andriushchenko
Abstract:
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
中文: OS-Harm 是一个新基准,旨在通过涉及多种操作系统应用的150项任务和与人工评估高度一致的自动化评判器,测试基于大语言模型的计算机使用代理在故意滥用、提示注入攻击和模型不当行为这三类危害中的安全性。
English: OS-Harm is a new benchmark designed to evaluate the safety of LLM-based computer use agents by testing them across three categories of harm—deliberate misuse, prompt injections, and model misbehavior—through 150 tasks involving various OS applications and an automated judge that aligns closely with human evaluations.
Authors:Bharath Dandala, Michael M. Danziger, Ella Barkan, Tanwi Biswas, Viatcheslav Gurev, Jianying Hu, Matthew Madgwick, Akira Koseki, Tal Kozlovski, Michal Rosen-Zvi, Yishai Shimoni, Ching-Huei Tsou
Abstract:
Transcriptomic foundation models (TFMs) have recently emerged as powerful tools for analyzing gene expression in cells and tissues, supporting key tasks such as cell-type annotation, batch correction, and perturbation prediction. However, the diversity of model implementations and training strategies across recent TFMs, though promising, makes it challenging to isolate the contribution of individual design choices or evaluate their potential synergies. This hinders the field's ability to converge on best practices and limits the reproducibility of insights across studies. We present BMFM-RNA, an open-source, modular software package that unifies diverse TFM pretraining and fine-tuning objectives within a single framework. Leveraging this capability, we introduce a novel training objective, whole cell expression decoder (WCED), which captures global expression patterns using an autoencoder-like CLS bottleneck representation. In this paper, we describe the framework, supported input representations, and training objectives. We evaluated four model checkpoints pretrained on CELLxGENE using combinations of masked language modeling (MLM), WCED and multitask learning. Using the benchmarking capabilities of BMFM-RNA, we show that WCED-based models achieve performance that matches or exceeds state-of-the-art approaches like scGPT across more than a dozen datasets in both zero-shot and fine-tuning tasks. BMFM-RNA, available as part of the biomed-multi-omics project ( https://github.com/BiomedSciAI/biomed-multi-omic ), offers a reproducible foundation for systematic benchmarking and community-driven exploration of optimal TFM training strategies, enabling the development of more effective tools to leverage the latest advances in AI for understanding cell biology.
中文:BMFM-RNA作为一个开源模块化框架,整合了转录组基础模型的训练目标并创新引入WCED方法,在多项基准测试中展现卓越性能,同时为优化模型策略提供了可复现的研究基础。
English: BMFM-RNA is an open-source modular framework that unifies transcriptomic foundation model training and introduces the WCED objective, demonstrating state-of-the-art performance across diverse datasets while providing reproducible benchmarking for optimal model development.
Authors:Lukas Schiesser, Cornelius Wolff, Sophie Haas, Simon Pukrop
Abstract:
Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that the training success and the out-of-domain performance are highly dependent on how the embedding models are pretrained. Consequently, PictSure manages to outperform existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.
中文:PictSure是一个强调图像嵌入关键作用的上下文学习框架,通过系统分析表明预训练策略显著影响小样本图像分类的跨域性能,在挑战性基准测试中取得了优异成果。
English: PictSure is an in-context learning framework that emphasizes the critical role of image embeddings, demonstrating through systematic analysis that pretraining strategies significantly influence out-of-domain performance in few-shot image classification, achieving superior results on challenging benchmarks.
Authors:Jenny Schmalfuss, Nadine Chang, Vibashan VS, Maying Shen, Andres Bruhn, Jose M. Alvarez
Abstract:
Vision language models (VLMs) respond to user-crafted text prompts and visual inputs, and are applied to numerous real-world problems. VLMs integrate visual modalities with large language models (LLMs), which are well known to be prompt-sensitive. Hence, it is crucial to determine whether VLMs inherit this instability to varying prompts. We therefore investigate which prompt variations VLMs are most sensitive to and which VLMs are most agnostic to prompt variations. To this end, we introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework built on three pillars: (1) plausible prompt variations in both the language and vision domain, (2) a novel model reliability score with built-in guarantees, and (3) a calibration step that enables dataset- and prompt-spanning prompt variation analysis. Regarding prompt variations, PARC's evaluation shows that VLMs mirror LLM language prompt sensitivity in the vision domain, and most destructive variations change the expected answer. Regarding models, outstandingly robust VLMs among 22 evaluated models come from the InternVL2 family. We further find indications that prompt sensitivity is linked to training data. The code will be at https://github.com/NVlabs/PARC.
Chinese: 视觉语言模型对提示变化具有敏感性,类似于大语言模型,其中InternVL2系列在22个评估模型中表现出最强的鲁棒性,这是通过PARC框架在语言和视觉领域进行可靠性和校准分析得出的结论。
English: Vision language models (VLMs) exhibit sensitivity to prompt variations similar to large language models, with the InternVL2 family showing the most robustness among 22 evaluated models, as analyzed through the PARC framework that assesses reliability and calibration across language and vision domains.
Authors:Evdoxia Taka, Debadyuti Bhattacharya, Joanne Garde-Hansen, Sanjay Sharma, Tanaya Guha
Abstract:
Recent advances in AI have made automated analysis of complex media content at scale possible while generating actionable insights regarding character representation along such dimensions as gender and age. Past works focused on quantifying representation from audio/video/text using AI models, but without having the audience in the loop. We ask, even if character distributions along demographic dimensions are available, how useful are those to the general public? Do they actually trust the numbers generated by AI models? Our work addresses these open questions by proposing a new AI-based character representation tool and performing a thorough user study. Our tool has two components: (i) An analytics extraction model based on the Contrastive Language Image Pretraining (CLIP) foundation model that analyzes visual screen data to quantify character representation across age and gender; (ii) A visualization component effectively designed for presenting the analytics to a lay audience. The user study seeks empirical evidence on the usefulness and trustworthiness of the AI-generated results for carefully chosen movies presented in the form of our visualizations. We found that participants were able to understand the analytics in our visualizations, and deemed the tool "overall useful". Participants also indicated a need for more detailed visualizations to include more demographic categories and contextual information of the characters. Participants' trust in AI-based gender and age models is seen to be moderate to low, although they were not against the use of AI in this context. Our tool including code, benchmarking, and the user study data can be found at https://github.com/debadyuti0510/Character-Representation-Media.
中文摘要:本研究开发了一种基于CLIP模型分析媒体角色表征的AI工具,用户研究显示该工具被认为具有实用性,但参与者对AI生成的年龄性别数据信任度普遍中等偏低。
English Summary: This research introduces an AI-powered tool that quantifies character representation in media through CLIP-based analytics and visualizations, finding it useful but revealing moderate to low trust in AI-generated demographics among users.
Authors:Zhangyang Gao, Hao Wang, Cheng Tan, Chenrui Xu, Mengdi Liu, Bozhen Hu, Linlin Chao, Xiaoming Zhang, Stan Z. Li
Abstract:
This study investigates the current landscape and future directions of protein foundation model research. While recent advancements have transformed protein science and engineering, the field lacks a comprehensive benchmark for fair evaluation and in-depth understanding. Since ESM-1B, numerous protein foundation models have emerged, each with unique datasets and methodologies. However, evaluations often focus on limited tasks tailored to specific models, hindering insights into broader generalization and limitations. Specifically, researchers struggle to understand the relationships between tasks, assess how well current models perform across them, and determine the criteria for developing new foundation models. To fill this gap, we present PFMBench, a comprehensive benchmark evaluating protein foundation models across 38 tasks spanning 8 key areas of protein science. Through hundreds of experiments on 17 state-of-the-art models across 38 tasks, PFMBench reveals the inherent correlations between tasks, identifies top-performing models, and provides a streamlined evaluation protocol. Code is available at https://github.com/biomap-research/PFMBench.
中文: 本研究推出PFMBench基准测试,通过评估17个蛋白质基础模型在38项任务中的表现,填补了该领域缺乏标准化评估的空白,揭示了任务间关联并识别出最优模型。
English: This study introduces PFMBench, a comprehensive benchmark addressing the lack of standardized evaluation for protein foundation models by assessing 17 models across 38 tasks to reveal task correlations and top performers.
Authors:Li-Wei Chen, Takuya Higuchi, Zakaria Aldeneh, Ahmed Hussen Abdelaziz, Alexander Rudnicky
Abstract:
The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.
Chinese Summary: 本文提出了一种端到端的变分方法,能自动学习将连续语音属性编码到语义标记中,无需手动特征工程,从而提高了生成语音的自然度。
English Summary: This paper introduces an end-to-end variational method that automatically learns to encode continuous speech attributes into semantic tokens, eliminating manual feature engineering and improving speech naturalness in generated outputs.
Authors:Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou
Abstract:
Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as "invalid thinking" -- models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.
中文: LC-R1采用基于群组相对策略优化的后训练方法,通过简洁性和充分性原则结合长度与压缩奖励,在推理链长度减少约50%的同时仅造成约2%的准确率下降。
English: LC-R1, a post-training method using Group Relative Policy Optimization, significantly reduces reasoning chain length by 50% with minimal accuracy loss by applying Brevity and Sufficiency principles through Length and Compress Rewards.
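A hedged sketch of how the abstract's two reward terms could be combined: a Length Reward for overall conciseness and a Compress Reward penalizing the "invalid thinking" tokens that follow the first correct answer. The function name, weights, and formulas are illustrative assumptions, not the paper's exact definitions.

```python
# Illustrative combination of Length and Compress rewards (all assumed).
def lc_r1_reward(tokens, answer_span_end, is_correct, max_len=4096,
                 w_len=0.5, w_compress=0.5):
    if not is_correct:
        return 0.0                      # no reward for wrong answers (assumed)
    length_reward = 1.0 - len(tokens) / max_len          # shorter is better
    # fraction of the chain spent *after* deriving the answer is "invalid"
    invalid_frac = (len(tokens) - answer_span_end) / max(len(tokens), 1)
    compress_reward = 1.0 - invalid_frac
    return w_len * length_reward + w_compress * compress_reward
```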
Authors:Dahang Wan, Rongsheng Lu, Yang Fang, Xianli Lang, Shuangbao Shu, Jingjing Chen, Siyuan Shen, Ting Xu, Zecong Ye
Abstract:
Multispectral object detection, which integrates information from multiple bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and model lightweighting, there are still challenges like the lack of a unified single-stage framework, difficulty in balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these, based on the YOLOv11 framework, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. Experiments show our framework excels on three major open-source multispectral object detection datasets, such as LLVIP and FLIR. Particularly, the multispectral controllable fine-tuning strategy significantly enhanced model adaptability and robustness. On the FLIR dataset, it consistently improved YOLOv11 models' mAP by 3.41%-5.65%, reaching a maximum of 47.61%, verifying the framework and strategies' effectiveness. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.
Chinese: YOLOv11-RGBT框架提出了六种多光谱融合模式和创新的中融合与可控微调策略,显著提升了多光谱目标检测的精度和鲁棒性,在FLIR数据集上mAP最高提升达5.65%。
English: The YOLOv11-RGBT framework introduces six multispectral fusion modes and innovative strategies like mid-fusion and controllable fine-tuning, significantly enhancing detection accuracy and robustness across datasets, with mAP improvements up to 5.65% on FLIR.
Authors:Hengyuan Zhang, Xinrong Chen, Yingmin Qiu, Xiao Liang, Ziyue Li, Guanyu Wang, Weiping Li, Tong Mo, Hayden Kwok-Hay So, Ngai Wong
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), offer an efficient way to adapt large language models with reduced computational costs. However, their performance is limited by the small number of trainable parameters. Recent work combines LoRA with the Mixture-of-Experts (MoE), i.e., LoRA-MoE, to enhance capacity, but two limitations remain that hinder the full exploitation of its potential: 1) the influence of downstream tasks when assigning expert numbers, and 2) the uniform rank assignment across all LoRA experts, which restricts representational diversity. To mitigate these gaps, we propose GuiLoMo, a fine-grained layer-wise expert number and rank allocation strategy with Guided Selection Vectors (GSVs). GSVs are learned via a prior bilevel optimization process to capture both model- and task-specific needs, and are then used to allocate optimal expert numbers and ranks. Experiments on three backbone models across diverse benchmarks show that GuiLoMo consistently achieves superior or comparable performance to all baselines. Further analysis offers key insights into how expert numbers and ranks vary across layers and tasks, highlighting the benefits of adaptive expert configuration. Our code is available at https://github.com/Liar406/Gui-LoMo.git.
中文: GuiLoMo提出了一种基于引导选择向量的细粒度策略,逐层自适应分配专家数量和秩,从而在保持效率的同时提升了LoRA-MoE在不同任务和模型上的性能表现。
English: GuiLoMo introduces a fine-grained strategy using Guided Selection Vectors to adaptively allocate expert numbers and ranks per layer, enhancing LoRA-MoE's performance across diverse tasks and models while maintaining efficiency.
Authors:Yuke Xing, Jiarui Wang, Peizhi Niu, Wenjie Huang, Guangtao Zhai, Yiling Xu
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framework to evaluate their perceptual impact. Therefore, we present 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset specifically designed for compressed 3DGS representations. Our dataset encompasses 15,200 images rendered from 10 real-world scenes through 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels leading to various distortion effects. Through controlled subjective experiments, we collect human perception data from 60 viewers. We validate dataset quality through scene diversity and MOS distribution analysis, and establish a comprehensive benchmark with 30 representative IQA metrics covering diverse types. As the largest-scale 3DGS quality assessment dataset to date, our work provides a foundation for developing 3DGS specialized IQA metrics, and offers essential data for investigating view-dependent quality distribution patterns unique to 3DGS. The database is publicly available at https://github.com/YukeXing/3DGS-IEval-15K.
中文: 3D高斯泼溅技术虽能实现高质量实时新视角合成,但存储需求巨大,为此我们构建了首个大规模图像质量评估数据集3DGS-IEval-15K,通过主观实验和多元指标系统评估压缩3DGS的感知质量。
English: 3D Gaussian Splatting enables real-time novel view synthesis with high visual quality but faces storage challenges, leading to the creation of 3DGS-IEval-15K, the first large-scale dataset for evaluating perceptual quality in compressed 3DGS representations through human assessments and diverse metrics.
Authors:Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Shahanur Rahman Bappy, Md Asiful Islam, Swakkhar Shatabda
Abstract:
Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, we tested a range of large vision-language models (LVLMs) in both zero-shot and few-shot settings. Our fine-tuned Mosquito-LLaMA3-8B model achieved the best results, with a final loss of 0.0028, a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.85. This dataset and model framework emphasize the theme "Prevention is Better than Cure", showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito
中文: 本文介绍了VisText-Mosquito多模态数据集,它整合视觉与文本数据以改进蚊虫孳生地的自动检测、分割和推理,先进模型实现了高精度并支持"预防胜于治疗"的主动疾病防控理念。
English: This paper introduces VisText-Mosquito, a multimodal dataset combining visual and textual data to enhance automated detection, segmentation, and reasoning for mosquito breeding sites, with advanced models achieving high precision and supporting proactive disease prevention.
Authors:Ziyu Gong, Jim Lim, David I. Inouye
Abstract:
Distribution matching (DM) is a versatile domain-invariant representation learning technique that has been applied to tasks such as fair classification, domain adaptation, and domain translation. Non-parametric DM methods struggle with scalability and adversarial DM approaches suffer from instability and mode collapse. While likelihood-based methods are a promising alternative, they often impose unnecessary biases through fixed priors or require explicit density models (e.g., flows) that can be challenging to train. We address this limitation by introducing a novel approach to training likelihood-based DM using expressive score-based prior distributions. Our key insight is that gradient-based DM training only requires the prior's score function -- not its density -- allowing us to train the prior via denoising score matching. This approach eliminates biases from fixed priors (e.g., in VAEs), enabling more effective use of geometry-preserving regularization, while avoiding the challenge of learning an explicit prior density model (e.g., a flow-based prior). Our method also demonstrates better stability and computational efficiency compared to other diffusion-based priors (e.g., LSGM). Furthermore, experiments demonstrate superior performance across multiple tasks, establishing our score-based method as a stable and effective approach to distribution matching. Source code available at https://github.com/inouye-lab/SAUB.
中文: 本文提出了一种基于分数的先验分布新方法,用于基于似然的分布匹配,避免了固定先验和显式密度模型的偏差,并在多个任务中展现出更优的稳定性、效率和性能。
English: This paper introduces a novel score-based prior distribution method for likelihood-based distribution matching, which avoids biases from fixed priors and explicit density models, demonstrating improved stability, efficiency, and performance across various tasks.
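The abstract's key insight is that gradient-based distribution matching only needs the prior's score function, which can be learned via denoising score matching. Below is a minimal, generic DSM objective sketch; the network shape and noise scale are assumptions, not the paper's configuration.

```python
import torch

def dsm_loss(score_net, z, sigma=0.1):
    """Denoising score matching: score_net(z_noisy) learns the score of
    the sigma-smoothed data distribution."""
    eps = torch.randn_like(z)
    z_noisy = z + sigma * eps
    # For Gaussian smoothing, the regression target is -(z_noisy - z)/sigma^2.
    target = -eps / sigma
    return ((score_net(z_noisy) - target) ** 2).sum(dim=-1).mean()

# Toy usage with a small MLP standing in for the score network.
net = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 8))
loss = dsm_loss(net, torch.randn(32, 8))
```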
Authors:Giacomo Meanti, Thomas Ryckeboer, Michael Arbel, Julien Mairal
Abstract:
This work addresses image restoration tasks through the lens of inverse problems using unpaired datasets. In contrast to traditional approaches -- which typically assume full knowledge of the forward model or access to paired degraded and ground-truth images -- the proposed method operates under minimal assumptions and relies only on small, unpaired datasets. This makes it particularly well-suited for real-world scenarios, where the forward model is often unknown or misspecified, and collecting paired data is costly or infeasible. The method leverages conditional flow matching to model the distribution of degraded observations, while simultaneously learning the forward model via a distribution-matching loss that arises naturally from the framework. Empirically, it outperforms both single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks. It also matches state-of-the-art performance on blind super-resolution. We also showcase the effectiveness of our method with a proof of concept for lens calibration: a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort.
中文: 本研究提出一种基于非配对数据集和条件流匹配的图像复原方法,能够同时学习退化模型和复原过程,在去模糊和点扩散函数校准任务中表现优异,且仅需少量数据即可实现。
English: This study introduces a novel image restoration method that uses unpaired datasets and conditional flow matching to learn both the degradation model and restoration process, achieving superior performance in deblurring and PSF calibration tasks with minimal data requirements.
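For readers unfamiliar with conditional flow matching, the building block the abstract relies on, here is a minimal sketch of the standard loss with a straight-line probability path. How the method conditions the velocity network and couples it to the learned forward model is not shown and left as an assumption.

```python
import torch

def cfm_loss(v_net, x0, x1, cond=None):
    """x0: noise samples, x1: target observations; v_net predicts velocity."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * x0 + t * x1           # straight-line probability path
    target_v = x1 - x0                    # constant velocity along the path
    pred_v = v_net(x_t, t, cond) if cond is not None else v_net(x_t, t)
    return ((pred_v - target_v) ** 2).mean()

# Shape-check usage with a placeholder velocity network.
toy_v = lambda x, t: x
loss = cfm_loss(toy_v, torch.randn(16, 3), torch.randn(16, 3))
```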
Authors:Ming Xu, Xu Zhang
Abstract:
Existing monocular 3D pose estimation methods primarily rely on joint positional features, while overlooking intrinsic directional and angular correlations within the skeleton. As a result, they often produce implausible poses under joint occlusions or rapid motion changes. To address these challenges, we propose the PoseGRAF framework. We first construct a dual graph convolutional structure that separately processes joint and bone graphs, effectively capturing their local dependencies. A Cross-Attention module is then introduced to model interdependencies between bone directions and joint features. Building upon this, a dynamic fusion module is designed to adaptively integrate both feature types by leveraging the relational dependencies between joints and bones. An improved Transformer encoder is further incorporated in a residual manner to generate the final output. Experimental results on the Human3.6M and MPI-INF-3DHP datasets show that our method exceeds state-of-the-art approaches. Additional evaluations on in-the-wild videos further validate its generalizability. The code is publicly available at https://github.com/iCityLab/PoseGRAF.
中文:PoseGRAF框架通过双图卷积和跨注意力机制捕捉骨骼关节间的方向关联,解决了现有单目三维姿态估计方法的局限性,在多个基准数据集上实现了最先进的性能。
English: The PoseGRAF framework addresses limitations in monocular 3D pose estimation by employing dual graph convolutions and cross-attention mechanisms to capture directional correlations between joints and bones, achieving state-of-the-art performance on benchmark datasets.
Authors:Ren Xin, Hongji Liu, Xiaodong Mei, Wenru Liu, Maosheng Ye, Zhili Chen, Jun Ma
Abstract:
Integrating General Models (GMs) such as Large Language Models (LLMs), with Specialized Models (SMs) in autonomous driving tasks presents a promising approach to mitigating challenges in data diversity and model capacity of existing specialized driving models. However, this integration leads to problems of asynchronous systems, which arise from the distinct characteristics inherent in GMs and SMs. To tackle this challenge, we propose NetRoller, an adapter that incorporates a set of novel mechanisms to facilitate the seamless integration of GMs and specialized driving models. Specifically, our mechanisms for interfacing the asynchronous GMs and SMs are organized into three key stages. NetRoller first harvests semantically rich and computationally efficient representations from the reasoning processes of LLMs using an early stopping mechanism, which preserves critical insights on driving context while maintaining low overhead. It then applies learnable query embeddings, nonsensical embeddings, and positional layer embeddings to facilitate robust and efficient cross-modality translation. Finally, it employs computationally efficient Query Shift and Feature Shift mechanisms to enhance the performance of SMs through few-epoch fine-tuning. Based on the mechanisms formalized in these three stages, NetRoller enables specialized driving models to operate at their native frequencies while maintaining situational awareness of the GM. Experiments conducted on the nuScenes dataset demonstrate that integrating GM through NetRoller significantly improves human similarity and safety in planning tasks, and it also achieves noticeable precision improvements in detection and mapping tasks for end-to-end autonomous driving. The code and models are available at https://github.com/Rex-sys-hk/NetRoller .
Chinese: NetRoller是一种创新的适配器,通过三阶段机制将通用模型与自动驾驶专用模型整合,确保无缝操作,并在规划、检测和绘图任务中显著提升性能表现。
English: NetRoller is an innovative adapter that integrates General Models like LLMs with Specialized Models in autonomous driving by employing a three-stage mechanism to ensure seamless operation and enhanced performance across planning, detection, and mapping tasks.
Authors:David Wan, Eran Hirsch, Elias Stengel-Eskin, Ido Dagan, Mohit Bansal
Abstract:
Recent large language models (LLMs) achieve impressive performance in source-conditioned text generation but often fail to correctly provide fine-grained attributions for their outputs, undermining verifiability and trust. Moreover, existing attribution methods do not explain how and why models leverage the provided source documents to generate their final responses, limiting interpretability. To overcome these challenges, we introduce a modular generation framework, GenerationPrograms, inspired by recent advancements in executable "code agent" architectures. Unlike conventional generation methods that simultaneously generate outputs and attributions or rely on post-hoc attribution, GenerationPrograms decomposes the process into two distinct stages: first, creating an executable program plan composed of modular text operations (such as paraphrasing, compression, and fusion) explicitly tailored to the query, and second, executing these operations following the program's specified instructions to produce the final response. Empirical evaluations demonstrate that GenerationPrograms significantly improves attribution quality at both the document level and sentence level across two long-form question-answering tasks and a multi-document summarization task. We further demonstrate that GenerationPrograms can effectively function as a post-hoc attribution method, outperforming traditional techniques in recovering accurate attributions. In addition, the interpretable programs generated by GenerationPrograms enable localized refinement through modular-level improvements that further enhance overall attribution quality.
中文:GenerationPrograms作为模块化生成框架,通过将文本生成分解为程序规划与执行两个阶段,在多项任务中显著提升了归因准确性和可解释性。
English: GenerationPrograms is a modular framework that decomposes text generation into program planning and execution stages, significantly improving attribution accuracy and interpretability across various tasks.
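A toy sketch of the two-stage decomposition: a program of modular text operations (paraphrase, compression, fusion) is planned first, then executed over source sentences, with attribution falling out of the indices each step reads. The operator implementations and plan format here are illustrative placeholders; the real system presumably backs each operator with an LLM.

```python
# Toy modular text operations (placeholders for LLM-backed operators).
def paraphrase(sents): return [f"(paraphrased) {s}" for s in sents]
def compress(sents):   return [s if len(s) < 80 else s[:77] + "..." for s in sents]
def fusion(sents):     return [" ".join(sents)]

OPS = {"paraphrase": paraphrase, "compress": compress, "fusion": fusion}

def execute_program(plan, source_sentences):
    """plan: list of (op_name, sentence_indices). Fine-grained attribution
    is a byproduct of the indices each step reads from the sources."""
    output, attributions = [], []
    for op_name, idxs in plan:
        picked = [source_sentences[i] for i in idxs]
        output.extend(OPS[op_name](picked))
        attributions.append((op_name, idxs))
    return " ".join(output), attributions

text, attr = execute_program(
    [("compress", [0]), ("fusion", [1, 2])],
    ["Sentence one is long.", "Fact A.", "Fact B."],
)
```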
Authors:Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia
Abstract:
Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem. To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance, from which closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward. This formulation enables different tokens to exhibit varying degrees of deviation from reference policy based on their respective rewards. Experiment results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard. Code is available at https://github.com/dvlab-research/TGDPO.
Chinese: 本研究通过将序列级PPO分解为令牌级问题,提出了一种将令牌级奖励指导融入直接偏好优化的方法,在多个基准测试中相比标准DPO实现了显著性能提升。
English: This study introduces a method to integrate token-level reward guidance into Direct Preference Optimization by decomposing sequence-level PPO into token-level problems, achieving significant performance gains over standard DPO across multiple benchmarks.
Authors:Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
Abstract:
Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. Our code is available at https://github.com/hed-ucas/AlphaDecay.
中文摘要:AlphaDecay是一种自适应权重衰减方法,根据大语言模型各模块的光谱特性分配不同的衰减强度,相比传统均匀衰减方法能有效提升模型性能。
English Summary: AlphaDecay is an adaptive weight decay method that assigns varying decay strengths to different modules of large language models based on their spectral properties, improving performance over uniform decay approaches.
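To make the HT-SR idea concrete, the sketch below estimates a crude power-law (heavy-tail) exponent per weight matrix from its spectrum and maps heavier tails (smaller alpha) to weaker decay. The Hill-style estimator and the linear mapping are assumptions; the paper's exact estimator may differ.

```python
import torch

def tail_alpha(W, k_frac=0.1):
    """Crude power-law exponent estimate over the top-k eigenvalues of W W^T."""
    ev = torch.linalg.svdvals(W.float()) ** 2
    ev, _ = torch.sort(ev, descending=True)
    k = max(int(k_frac * ev.numel()), 2)
    logs = torch.log(ev[:k] / ev[k - 1]).sum().clamp_min(1e-8)
    return (1.0 + k / logs).item()     # Hill/MLE-style estimate

def module_decay(alphas, wd_min=0.01, wd_max=0.1):
    """Heavier-tailed modules (small alpha) get weaker weight decay."""
    a = torch.tensor(alphas)
    norm = (a - a.min()) / (a.max() - a.min() + 1e-8)
    return (wd_min + norm * (wd_max - wd_min)).tolist()

alphas = [tail_alpha(torch.randn(256, 256)) for _ in range(4)]
decays = module_decay(alphas)          # one decay strength per module
```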
Authors:Zhiwen Shao, Yifan Cheng, Feiran Li, Yong Zhou, Xuequan Lu, Yuan Xie, Lizhuang Ma
Abstract:
Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames like onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution. In particular, we propose a novel F5C block composed of fully-connected convolution and channel correspondence convolution to directly extract local-global features from a sequence of raw frames, without the prior knowledge of key frames. The transformer-style fully-connected convolution is proposed to extract local features while maintaining global receptive fields, and the graph-style channel correspondence convolution is introduced to model the correlations among feature patterns. Moreover, MER, optical flow estimation, and facial landmark detection are jointly trained by sharing the local-global features. The two latter tasks contribute to capturing facial subtle action information for MER, which can alleviate the impact of insufficient training data. Extensive experiments demonstrate that our framework (i) outperforms the state-of-the-art MER methods on CASME II, SAMM, and SMIC benchmarks, (ii) works well for optical flow estimation and facial landmark detection, and (iii) can capture facial subtle muscle actions in local regions associated with MEs. The code is available at https://github.com/CYF-cuber/MOL.
中文摘要:本文提出了一种端到端的微动作感知深度学习框架,结合变换器、图卷积和普通卷积的优势,直接从原始视频帧中提取局部-全局特征,无需关键帧先验知识,在面部微表情识别任务中实现了最优性能。
English Summary: The paper introduces an end-to-end micro-action-aware deep learning framework that integrates transformer, graph, and vanilla convolutions to extract local-global features directly from raw video frames, achieving state-of-the-art performance in facial micro-expression recognition without requiring key frame annotations.
Authors:Nitesh Subedi, Adam Haroon, Shreyan Ganguly, Samuel T. K. Tetteh, Prajwal Koirala, Cody Fleming, Soumik Sarkar
Abstract:
Foundation models have revolutionized robotics by providing rich semantic representations without task-specific training. While many approaches integrate pretrained vision-language models (VLMs) with specialized navigation architectures, the fundamental question remains: can these pretrained embeddings alone successfully guide navigation without additional fine-tuning or specialized modules? We present a minimalist framework that decouples this question by training a behavior cloning policy directly on frozen vision-language embeddings from demonstrations collected by a privileged expert. Our approach achieves a 74% success rate in navigation to language-specified targets, compared to 100% for the state-aware expert, though requiring 3.2 times more steps on average. This performance gap reveals that pretrained embeddings effectively support basic language grounding but struggle with long-horizon planning and spatial reasoning. By providing this empirical baseline, we highlight both the capabilities and limitations of using foundation models as drop-in representations for embodied tasks, offering critical insights for robotics researchers facing practical design tradeoffs between system complexity and performance in resource-constrained scenarios. Our code is available at https://github.com/oadamharoon/text2nav
中文: 本研究提出了一种极简框架,直接在冻结的视觉语言嵌入上训练行为克隆策略,实现了74%的语言引导导航成功率,但与专家性能相比,揭示了其在长程规划和空间推理方面的局限性。
English: This study introduces a minimalist framework that trains a behavior cloning policy directly on frozen vision-language embeddings, achieving 74% success in language-guided navigation but revealing limitations in long-horizon planning and spatial reasoning compared to expert performance.
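The minimalist setup described in the abstract, behavior cloning directly on frozen vision-language embeddings, can be sketched as below. Embedding dimensions, the discrete action space, and fusion by concatenation are assumptions.

```python
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Small policy head trained by behavior cloning; the VLM encoder that
    produces the embeddings is frozen and not part of this module."""
    def __init__(self, img_dim=512, txt_dim=512, n_actions=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, img_emb, txt_emb):
        # embeddings come from a frozen encoder; no gradients flow there
        return self.head(torch.cat([img_emb, txt_emb], dim=-1))

policy = BCPolicy()
loss_fn = nn.CrossEntropyLoss()        # imitate the expert's discrete action
logits = policy(torch.randn(8, 512), torch.randn(8, 512))
loss = loss_fn(logits, torch.randint(0, 4, (8,)))
```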
Authors:Paolo Franceschi, Marco Faroni, Stefano Baraldo, Anna Valente
Abstract:
This paper introduces the ROS2 control and the Hardware Interface (HW) integration for the Fanuc CRX robot family. It explains basic implementation details and communication protocols, and its integration with the Moveit2 motion planning library. We conducted a series of experiments with the developed ros2_fanuc_interface to evaluate performance in four relevant robotics cases: step response, trajectory tracking, collision avoidance integrated with Moveit2, and dynamic velocity scaling. Results show that, despite a non-negligible delay between command and feedback, the robot can track the defined path with negligible errors (if it complies with joint velocity limits), ensuring collision avoidance. Full code is open source and available at https://github.com/paolofrance/ros2_fanuc_interface.
中文: 本文介绍了针对发那科CRX机器人系列的ROS2控制与硬件接口集成,详述了实现细节、通信协议及与Moveit2的整合,实验表明尽管存在轻微指令延迟,机器人仍能有效跟踪路径并实现避障。
English: This paper presents the ROS2 control and hardware interface integration for the Fanuc CRX robot family, detailing implementation, communication protocols, and Moveit2 integration, with experiments showing effective path tracking and collision avoidance despite minor command delays.
Authors:Jingqi Yang, Zhilong Song, Jiawei Chen, Mingli Song, Sheng Zhou, linjun sun, Xiaogang Ouyang, Chun Chen, Can Wang
Abstract:
The development of high-quality datasets is crucial for benchmarking and advancing research in Graphical User Interface (GUI) agents. Despite their importance, existing datasets are often constructed under idealized conditions, overlooking the diverse anomalies frequently encountered in real-world deployments. To address this limitation, we introduce GUI-Robust, a novel dataset designed for comprehensive GUI agent evaluation, explicitly incorporating seven common types of anomalies observed in everyday GUI interactions. Furthermore, we propose a semi-automated dataset construction paradigm that collects user action sequences from natural interactions via RPA tools and then generates corresponding step and task descriptions for these actions with the assistance of MLLMs. This paradigm reduces annotation time cost by a factor of over 19. Finally, we assess state-of-the-art GUI agents using the GUI-Robust dataset, revealing their substantial performance degradation in abnormal scenarios. We anticipate that our work will highlight the importance of robustness in GUI agents and inspire more future research in this direction. The dataset and code are available at https://github.com/chessbean1/GUI-Robust.
中文: GUI-Robust数据集通过RPA工具和MLLMs实现半自动化构建,专门包含真实GUI交互中的常见异常,将标注时间减少19倍以上,并揭示现有GUI代理在异常场景下性能显著下降的问题。
English: The GUI-Robust dataset introduces a semi-automated construction method using RPA tools and MLLMs to incorporate real-world GUI anomalies, reducing annotation time by over 19 times and revealing significant performance drops in current GUI agents under abnormal conditions.
Authors:Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, Hongteng Xu
Abstract:
Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel "model MoE-ization" strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts' orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to the conflicts among the new tasks and the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE outperforms existing multi-task adaptation methods consistently, showing its superiority in terms of conflict- and oblivion-resistance. The code of the experiments is available at https://github.com/DaShenZi721/MoORE.
中文摘要:本文提出MoORE方法,通过奇异值分解和可学习路由器构建正交专家模型,有效解决多任务适应中的任务冲突与遗忘问题,在各项实验中均优于现有方法。
English Summary: This paper introduces MoORE, a novel multi-task adaptation method that uses singular value decomposition and learnable routers to create orthogonal experts, effectively preventing task conflicts and preserving original task knowledge while outperforming existing approaches.
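A minimal sketch of the "MoE-ization" idea: SVD decomposes a frozen weight matrix into rank-one experts u_i v_i^T, and a learnable router rescales the singular values per task or sample. The router architecture and gating form are assumptions, and the paper's optional orthogonal transform on the right singular vectors is omitted.

```python
import torch
import torch.nn as nn

class MoORELayer(nn.Module):
    """Rank-one experts from SVD; a router adjusts singular values."""
    def __init__(self, W, task_dim):
        super().__init__()
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("U", U)    # frozen left singular vectors
        self.register_buffer("Vh", Vh)  # frozen right singular vectors
        self.register_buffer("S0", S.clone())
        self.router = nn.Linear(task_dim, S.numel())  # per-expert gates

    def forward(self, x, task_emb):
        # unbatched task embedding for simplicity
        gate = torch.sigmoid(self.router(task_emb))   # in (0, 1) per expert
        s = self.S0 * (1.0 + gate)                    # adjusted singular values
        # the column space of W is preserved because U stays unchanged
        return x @ self.Vh.t() @ torch.diag(s) @ self.U.t()

layer = MoORELayer(torch.randn(64, 32), task_dim=16)
y = layer(torch.randn(4, 32), torch.randn(16))
```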
Authors:Eric Jeangirard
Abstract:
The transition to Open Science necessitates robust and reliable metadata. While national initiatives, such as the French Open Science Monitor, aim to track this evolution using open data, reliance on proprietary databases persists in many places. Open platforms like OpenAlex still require significant human intervention for data accuracy. This paper introduces Works-magnet, a project by the French Ministry of Higher Education and Research (MESR) Data Science & Engineering Team. Works-magnet is designed to accelerate the curation of bibliographic and research data metadata, particularly affiliations, by making automated AI calculations visible and correctable. It addresses challenges related to metadata heterogeneity, complex processing chains, and the need for human curation in a diverse research landscape. The paper details Works-magnet's concepts and observed limitations, and outlines future directions for enhancing open metadata quality and reusability. The Works-magnet app is open source on GitHub: https://github.com/dataesr/works-magnet
中文:法国高等教育与研究部的Works-magnet项目利用人工智能自动优化文献元数据管理,通过可视化修正机制应对开放科学中的元数据异质性挑战,旨在提升数据质量与复用性。
English: The Works-magnet project by the French MESR team uses AI to automate and improve the curation of bibliographic metadata, addressing challenges in Open Science by making corrections visible and enhancing data quality and reusability.
Authors:Xiaoran Liu, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Abstract:
Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Moreover, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first length extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs. The code is available at https://github.com/OpenMOSS/LongLLaDA.
中文: 本研究首次系统分析了扩散大语言模型的长上下文能力,揭示了其在上下文外推中保持稳定困惑度的特性及独特的局部感知现象,并提出无需训练的LongLLaDA方法,验证了扩展上下文窗口的有效缩放规律。
English: This study conducts the first systematic analysis of long-context capabilities in diffusion LLMs, revealing their stable perplexity during context extrapolation and unique local perception phenomenon, while proposing LongLLaDA as an effective training-free method for context extension with validated scaling laws.
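For reference, the NTK-based RoPE extrapolation that the abstract combines with LLaDA typically rescales the rotary base so that a model trained at one context length handles a longer one; a common form of the formula is sketched below (applying it to diffusion LLMs is the paper's contribution, and the exact variant used may differ).

```python
def ntk_rope_base(base, scale, head_dim):
    """NTK-aware RoPE base rescaling: stretch rotary frequencies so a model
    trained at length L can operate at roughly scale * L."""
    return base * scale ** (head_dim / (head_dim - 2))

# e.g. base 10000, 4x context extension, 128-dim heads -> base ~ 4.1e4
new_base = ntk_rope_base(10000.0, scale=4, head_dim=128)
```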
Authors:Huan Kang, Hui Li, Xiao-Jun Wu, Tianyang Xu, Rui Wang, Chunyang Cheng, Josef Kittler
Abstract:
In the field of image fusion, promising progress has been made by modeling data from different modalities as linear subspaces. However, in practice, the source images are often located in a non-Euclidean space, where Euclidean methods usually cannot encapsulate the intrinsic topological structure. Typically, the inner product performed in Euclidean space calculates algebraic similarity rather than semantic similarity, which results in undesired attention output and a decrease in fusion performance. Meanwhile, the balance of low-level details and high-level semantics should be considered in the infrared and visible image fusion task. To address this issue, in this paper, we propose a novel attention mechanism based on the Grassmann manifold for infrared and visible image fusion (GrFormer). Specifically, our method constructs a low-rank subspace mapping through projection constraints on the Grassmann manifold, compressing attention features into subspaces of varying rank levels. This forces the features to decouple into high-frequency details (local low-rank) and low-frequency semantics (global low-rank), thereby achieving multi-scale semantic fusion. Additionally, to effectively integrate the significant information, we develop a cross-modal fusion strategy (CMS) based on a covariance mask to maximise the complementary properties between different modalities and to suppress the features with high correlation, which are deemed redundant. The experimental results demonstrate that our network outperforms SOTA methods both qualitatively and quantitatively on multiple image fusion benchmarks. The codes are available at https://github.com/Shaoyun2023.
中文摘要:本文提出基于格拉斯曼流形的新型注意力机制GrFormer,通过将特征解耦为高频细节和低频语义实现多尺度语义融合,在红外与可见光图像融合任务中定性定量均优于现有最优方法。
English Summary: This paper introduces GrFormer, a novel attention mechanism using Grassmann manifold to achieve multi-scale semantic fusion by separating image features into high-frequency details and low-frequency semantics, outperforming state-of-the-art methods in infrared and visible image fusion.
Authors:Xiaoqi Wang, Yi Wang, Lap-Pui Chau
Abstract:
Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) Excessive pre-training cost arising from multi-stage pre-training pipelines, 2) Ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) Imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. EVA02-AT first efficiently transfers an image-based CLIP model into a unified video encoder via a single-stage pretraining. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along with joint attention, which can effectively encode both spatial and temporal information on the entire hidden dimension. This joint encoding of spatial-temporal features enables the model to learn cross-axis relationships, which are crucial for accurately modeling motion and interaction in videos. Third, focusing on multi-instance video-language retrieval tasks, we introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs, providing a more precise learning objective. Extensive experiments on Ego4D, EPIC-Kitchens-100, and Charades-Ego under zero-shot and fine-tuning settings demonstrate that EVA02-AT achieves state-of-the-art performance across diverse egocentric video-language tasks with fewer parameters. Models with our SMS loss also show significant performance gains on multi-instance retrieval benchmarks. Our code and models are publicly available at https://github.com/xqwang14/EVA02-AT .
中文摘要:EVA02-AT通过单阶段预训练实现了统一视频编码器,采用联合注意力增强时空编码,并引入对称多重相似度损失函数,以更少参数量在自我中心视频语言任务中达到最优性能。
English Summary: EVA02-AT introduces a unified video encoder through single-stage pretraining, enhanced spatial-temporal encoding with joint attention, and a Symmetric Multi-Similarity loss, achieving state-of-the-art performance in egocentric video-language tasks with fewer parameters.
Authors:Anas Abdelkarim, Holger Voos, Daniel Görges
Abstract:
Factor graphs have demonstrated remarkable efficiency for robotic perception tasks, particularly in localization and mapping applications. However, their application to optimal control problems -- especially Model Predictive Control (MPC) -- has remained limited due to fundamental challenges in constraint handling. This paper presents a novel integration of the Barrier Interior Point Method (BIPM) with factor graphs, implemented as an open-source extension to the widely adopted g2o framework. Our approach introduces specialized inequality factor nodes that encode logarithmic barrier functions, thereby overcoming the quadratic-form limitations of conventional factor graph formulations. To the best of our knowledge, this is the first g2o-based implementation capable of efficiently handling both equality and inequality constraints within a unified optimization backend. We validate the method through a multi-objective adaptive cruise control application for autonomous vehicles. Benchmark comparisons with state-of-the-art constraint-handling techniques demonstrate faster convergence and improved computational efficiency. (Code repository: https://github.com/snt-arg/bipm_g2o)
中文: 本文提出了一种将障碍内点法与因子图相结合的新方法,实现了在最优控制问题中高效处理等式和不等式约束,并通过自动驾驶应用验证了其更快的收敛速度和更高的计算效率。
English: This paper introduces a novel integration of the Barrier Interior Point Method with factor graphs, enabling efficient handling of both equality and inequality constraints in optimal control problems, validated through autonomous vehicle applications with improved computational efficiency.
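The g2o extension itself is C++, but the log-barrier mechanism behind the inequality factor nodes is easy to illustrate: a constraint g(x) <= 0 contributes -mu * log(-g(x)) to the cost, which grows without bound at the constraint boundary. The Python sketch below shows only the factor's cost and gradient under that standard formulation; it is not the paper's implementation.

```python
import numpy as np

def barrier_cost(g_val, mu=1.0):
    """Log-barrier term for a constraint g(x) <= 0."""
    assert g_val < 0.0, "iterate must stay strictly feasible"
    return -mu * np.log(-g_val)

def barrier_grad(g_val, g_grad, mu=1.0):
    # d/dx [-mu * log(-g(x))] = -mu * g'(x) / g(x)
    return -mu * g_grad / g_val

# Example: a speed limit v <= v_max encoded as g(v) = v - v_max <= 0
v, v_max = 8.0, 10.0
cost = barrier_cost(v - v_max)             # small while well inside the limit
grad = barrier_grad(v - v_max, g_grad=1.0) # positive: cost rises toward v_max
```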
Authors:Qingyu Song, Wei Lin, Juncheng Wang, Hong Xu
Abstract:
Learning to optimize (L2O) is an emerging technique to solve mathematical optimization problems with learning-based methods. Although L2O has achieved great success in many real-world scenarios such as wireless communications, computer networks, and electronic design, existing L2O works lack a theoretical demonstration of their performance and robustness in out-of-distribution (OOD) scenarios. We address this gap by providing comprehensive proofs. First, we prove a sufficient condition for a robust L2O model with homogeneous convergence rates over all In-Distribution (InD) instances. We assume an L2O model achieves robustness for an InD scenario. Based on our proposed methodology of aligning OOD problems to InD problems, we also demonstrate that the L2O model's convergence rate in OOD scenarios deteriorates according to an expression involving the L2O model's input features. Moreover, we propose an L2O model with a concise gradient-only feature construction and a novel gradient-based history modeling method. Numerical simulation demonstrates that our proposed model outperforms the state-of-the-art baseline in both InD and OOD scenarios and achieves up to 10× convergence speedup. The code of our method can be found at https://github.com/NetX-lab/GoMathL2O-Official.
中文: 本文针对学习优化模型在分布外场景中缺乏理论保证的问题,通过证明其收敛性并提出一种新颖的仅使用梯度特征的模型,实现了比现有方法快10倍的收敛速度。
English: This paper addresses the lack of theoretical guarantees for Learning to Optimize (L2O) models in out-of-distribution scenarios by providing proofs of their convergence rates and proposing a novel model with gradient-only features that achieves up to 10 times faster convergence than existing methods.
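A hedged sketch of a learned optimizer with gradient-only input features, in the spirit of the abstract's "concise gradient-only feature construction". The network shape and the exponential-moving-average history feature are stand-in assumptions for the paper's exact history modeling.

```python
import torch
import torch.nn as nn

class GradL2O(nn.Module):
    """Learned optimizer: maps per-coordinate gradient features to updates."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def step(self, x, grad, ema, beta=0.9):
        ema = beta * ema + (1 - beta) * grad        # gradient history feature
        feats = torch.stack([grad, ema], dim=-1)    # per-coordinate features
        update = self.net(feats).squeeze(-1)        # learned per-coord update
        return x - update, ema

opt = GradL2O()
x, ema = torch.randn(10), torch.zeros(10)
x, ema = opt.step(x, grad=2 * x, ema=ema)           # one step on f(x) = ||x||^2
```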
Authors:Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong, Chun-Ming Xia, Fei Dai
Abstract:
Training data has been proven to be one of the most critical components in training generative AI. However, obtaining high-quality data remains challenging, with data privacy issues presenting a significant hurdle. To address the need for high-quality data, synthetic data has emerged as a mainstream solution, demonstrating impressive performance in areas such as images, audio, and video. However, generating mixed-type data, especially high-quality tabular data, still faces significant challenges. These primarily include its inherently heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. In this paper, we introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed tabular data containing both numerical and categorical features, while being more flexible in capturing complex interactions among variables. We further propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion. This approach adaptively controls the weight of causal regularization, enhancing the model's performance without compromising its generative capabilities. Comprehensive experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics. Our code is publicly available at: https://github.com/Godz-z/CausalDiffTab.
Chinese: 训练生成式AI高度依赖高质量数据,而CausalDiffTab作为一种采用自适应因果正则化的扩散模型,通过处理异构数据类型和复杂变量关系,有效生成混合表格数据,在全面实验中表现优于基准方法。
English: Training generative AI heavily relies on high-quality data, and CausalDiffTab, a diffusion model with adaptive causal regularization, effectively generates mixed tabular data by addressing its heterogeneity and complex relationships, outperforming baselines in comprehensive experiments.
Authors:Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song
Abstract:
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. Our pipeline begins with an LLM-based task proposer guided by a persona, followed by an execution agent that completes the task and logs the trajectory. This process is repeated iteratively to form a sequence of subtasks, which are then summarized by a separate agent into a composite task of controllable difficulty. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are publicly available at https://github.com/sunblaze-ucb/AgentSynth
中文:AgentSynth是一种可扩展且成本高效的流程,通过利用信息不对称和迭代子任务组合,自动为通用计算机使用代理生成多样化的真实任务,每条轨迹成本仅0.60美元,同时能显著挑战最先进的大语言模型代理——其任务成功率随难度提升从18%骤降至4%。
English: AgentSynth is a scalable and cost-effective pipeline that automatically generates diverse and realistic tasks for computer-use agents by leveraging information asymmetry and iterative subtask composition, achieving a low cost of $0.60 per trajectory while significantly challenging state-of-the-art LLM agents with performance dropping from 18% to 4% as task difficulty increases.
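The iterative subtask composition can be summarized in a few lines. The sketch below treats the proposer, execution agent, and summarizer as opaque callables (all hypothetical interfaces), with difficulty controlled by the number of composed subtasks.

```python
from typing import Callable

def synthesize_task(propose: Callable[[str, list], str],
                    execute: Callable[[str], str],
                    summarize: Callable[[list], str],
                    persona: str, n_subtasks: int) -> str:
    """Compose simple subtasks into one long-horizon composite task."""
    subtasks, trajectories = [], []
    for _ in range(n_subtasks):                 # difficulty ~ n_subtasks
        task = propose(persona, trajectories)   # next subtask, in context
        trajectories.append(execute(task))      # complete it, log trajectory
        subtasks.append(task)
    return summarize(subtasks)                  # composite task description
```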
Authors:Juho Bai, Inwook Shim
Abstract:
Accurate prediction of pedestrian trajectories is essential for applications in robotics and surveillance systems. While existing approaches primarily focus on social interactions between pedestrians, they often overlook the rich environmental context that significantly shapes human movement patterns. In this paper, we propose SceneAware, a novel framework that explicitly incorporates scene understanding to enhance trajectory prediction accuracy. Our method leverages a Vision Transformer (ViT) scene encoder to process environmental context from static scene images, while Multi-modal Large Language Models (MLLMs) generate binary walkability masks that distinguish between accessible and restricted areas during training. We combine a Transformer-based trajectory encoder with the ViT-based scene encoder, capturing both temporal dynamics and spatial constraints. The framework integrates collision penalty mechanisms that discourage predicted trajectories from violating physical boundaries, ensuring physically plausible predictions. SceneAware is implemented in both deterministic and stochastic variants. Comprehensive experiments on the ETH/UCY benchmark datasets show that our approach outperforms state-of-the-art methods, with more than 50% improvement over previous models. Our analysis based on different trajectory categories shows that the model performs consistently well across various types of pedestrian movement. This highlights the importance of using explicit scene information and shows that our scene-aware approach is both effective and reliable in generating accurate and physically plausible predictions. Code is available at: https://github.com/juho127/SceneAware.
中文摘要:SceneAware框架通过结合视觉变换器的场景理解和可通行性掩码,显著提升了行人轨迹预测的准确性,在基准数据集上性能超越现有方法50%以上,同时确保预测路径符合物理约束。
English Summary: The SceneAware framework enhances pedestrian trajectory prediction by integrating scene understanding through Vision Transformers and walkability masks, achieving over 50% improvement in accuracy on benchmark datasets while ensuring physically plausible paths.
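A collision penalty of the kind described above can be sketched as a lookup of predicted positions in the binary walkability mask. The hard rounding here is a simplification (a differentiable variant would use bilinear sampling), and all names are illustrative.

```python
import torch

def collision_penalty(traj: torch.Tensor, walkable: torch.Tensor) -> torch.Tensor:
    """Fraction of predicted steps that land on non-walkable pixels.

    traj:     (T, 2) predicted (row, col) positions in pixel coordinates
    walkable: (H, W) binary mask, 1 = accessible, 0 = restricted
    """
    rows = traj[:, 0].round().long().clamp(0, walkable.shape[0] - 1)
    cols = traj[:, 1].round().long().clamp(0, walkable.shape[1] - 1)
    return (1.0 - walkable[rows, cols]).mean()
```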
Authors:Nafiz Sadman, Farhana Zulkernine, Benjamin Kwan
Abstract:
In this paper, we construct two research objectives: i) explore the learned embedding space of BiomedCLIP, an open-source large vision language model, to analyse meaningful class separations, and ii) quantify the limitations of BiomedCLIP when applied to a highly imbalanced, out-of-distribution multi-label medical dataset. We experiment on IU-xray dataset, which exhibits the aforementioned criteria, and evaluate BiomedCLIP in classifying images (radiographs) in three contexts: zero-shot inference, full finetuning, and linear probing. The results show that the model under zero-shot settings over-predicts all labels, leading to poor precision and inter-class separability. Full fine-tuning improves classification of distinct diseases, while linear probing detects overlapping features. We demonstrate visual understanding of the model using Grad-CAM heatmaps and compare with 15 annotations by a radiologist. We highlight the need for careful adaptations of the models to foster reliability and applicability in a real-world setting. The code for the experiments in this work is available and maintained on GitHub.
中文: 本研究评估了BiomedCLIP在非平衡医疗数据集上的表现,揭示了其零样本预测在精度和类别区分上的局限性,同时通过全微调和线性探测展示了改进效果,强调了模型需针对性优化以确保实际应用的可靠性。
English: This study evaluates BiomedCLIP's performance on an imbalanced medical dataset, revealing its zero-shot limitations in precision and class separation while demonstrating improvements through fine-tuning and linear probing, emphasizing the need for model adaptations to ensure real-world reliability.
Authors:Chunyu Cao, Jintao Cheng, Zeyu Chen, Linfan Zhan, Rui Fan, Zhijian He, Xiaoyu Tang
Abstract:
Motion Object Segmentation (MOS) is crucial for autonomous driving, as it enhances localization, path planning, map construction, scene flow estimation, and future state prediction. While existing methods achieve strong performance, balancing accuracy and real-time inference remains a challenge. To address this, we propose a logits-based knowledge distillation framework for MOS, aiming to improve accuracy while maintaining real-time efficiency. Specifically, we adopt a Bird's Eye View (BEV) projection-based model as the student and a non-projection model as the teacher. To handle the severe imbalance between moving and non-moving classes, we decouple them and apply tailored distillation strategies, allowing the student model to better learn key motion-related features from the teacher. This approach significantly reduces false positives and false negatives. Additionally, we introduce dynamic upsampling, optimize the network architecture, and achieve a 7.69% reduction in parameter count, mitigating overfitting. Our method achieves a notable IoU of 78.8% on the hidden test set of the SemanticKITTI-MOS dataset and delivers competitive results on the Apollo dataset. The KDMOS implementation is available at https://github.com/SCNU-RISLAB/KDMOS.
中文摘要:本文提出了一种基于逻辑的知识蒸馏框架用于运动目标分割,通过解耦运动与非运动类别并应用针对性蒸馏策略,在保持实时效率的同时显著提升精度,在SemanticKITTI-MOS数据集上获得了78.8%的交并比。
English Summary: This paper introduces a logits-based knowledge distillation framework for Motion Object Segmentation (MOS) that enhances accuracy while maintaining real-time efficiency by decoupling moving and non-moving classes and applying tailored distillation strategies, achieving a 78.8% IoU on SemanticKITTI-MOS.
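The class-decoupled logits distillation can be pictured as two re-normalized KL terms with separate weights for the moving and non-moving groups. The temperatures and weights below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def decoupled_kd_loss(s_logits: torch.Tensor, t_logits: torch.Tensor,
                      moving: torch.Tensor, w_mov: float = 2.0,
                      w_stat: float = 1.0, T: float = 2.0) -> torch.Tensor:
    """Logits distillation with moving and static classes decoupled.

    s_logits, t_logits: (N, C) per-point student/teacher logits;
    `moving` is a boolean mask over the C classes. Each group is
    re-normalized and distilled with its own weight.
    """
    def group_kl(mask: torch.Tensor) -> torch.Tensor:
        s = F.log_softmax(s_logits[:, mask] / T, dim=-1)
        t = F.softmax(t_logits[:, mask] / T, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * T * T

    return w_mov * group_kl(moving) + w_stat * group_kl(~moving)
```

Weighting the rare moving classes more heavily is one simple way to counter the class imbalance the abstract describes.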
Authors:Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh
Abstract:
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect languages such as Chinese as well as code generation, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals. Code is available at https://github.com/SewoongLab/byte-sampler.
中文: 本文提出了一种推理时方法,可将采用BPE分词器的自回归语言模型转换为字符级或字节级模型,有效解决了提示边界问题,并实现了不同分词器模型间的集成与代理调优。
English: This paper introduces an inference-time method that converts autoregressive language models with BPE tokenizers into character-level or byte-level models, effectively solving the Prompt Boundary Problem and enabling model ensembling and proxy-tuning across different tokenizers.
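At its simplest, converting a next-token distribution into a next-byte distribution is a marginalization over tokens that share a leading byte. The sketch below shows only this prefix-marginalization step; the paper's full method additionally handles tokenizations that straddle the prompt boundary, which is omitted here.

```python
from collections import defaultdict

def byte_marginals(token_probs: dict) -> dict:
    """Collapse a next-token distribution to a next-byte distribution.

    Simplified sketch: the probability of the next byte is the total
    probability of all candidate tokens whose byte encoding begins
    with that byte.
    """
    marginals = defaultdict(float)
    for token, p in token_probs.items():   # token is a bytes object
        if token:                          # skip empty byte strings
            marginals[token[0]] += p
    return dict(marginals)

# Toy example with a three-token vocabulary.
print(byte_marginals({b" the": 0.5, b" them": 0.3, b"a": 0.2}))
# {32: 0.8, 97: 0.2} -> byte 0x20 (space) vs. byte 'a'
```

Because two models' byte-level distributions live over the same 256-symbol alphabet, they can be averaged for ensembling regardless of the underlying tokenizers.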
Authors:Taehee Jeong
Abstract:
Retrieval-Augmented Generation (RAG) addresses limitations of large language models (LLMs) by leveraging a vector database to provide more accurate and up-to-date information. When a user submits a query, RAG executes a vector search to find relevant documents, which are then used to generate a response. However, ensuring the relevance of retrieved documents to a query remains a significant challenge. To address this, a secondary model, known as a relevance grader, can be used to verify relevance. To reduce the computational requirements of the relevance grader, a lightweight small language model is preferred. In this work, we finetuned llama-3.2-1b as a relevance grader and achieved a significant increase in precision from 0.1301 to 0.7750, comparable to the precision of llama-3.1-70b. Our code is available at https://github.com/taeheej/Lightweight-Relevance-Grader-in-RAG.
中文: 本研究通过微调轻量级模型llama-3.2-1b作为相关性评估器,将检索增强生成的精度从0.1301显著提升至0.7750,其性能可与更大模型相媲美。
English: This work enhances Retrieval-Augmented Generation by fine-tuning the lightweight llama-3.2-1b model as a relevance grader, significantly improving precision from 0.1301 to 0.7750 while maintaining performance comparable to larger models.
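The grader slots into a RAG pipeline as a post-retrieval filter. A minimal sketch, with the fine-tuned small LM abstracted as a scoring callable (hypothetical signature):

```python
from typing import Callable

def graded_retrieval(query: str, docs: list,
                     grade: Callable[[str, str], float],
                     threshold: float = 0.5) -> list:
    """Keep only retrieved documents the grader judges relevant.

    `grade` stands in for the fine-tuned small LM, returning a
    relevance score in [0, 1] for a (query, document) pair.
    """
    return [d for d in docs if grade(query, d) >= threshold]
```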
Authors:Xinglei Wang, Tao Cheng, Stephen Law, Zichao Zeng, Ilya Ilyankou, Junyuan Liu, Lu Yin, Weiming Huang, Natchapon Jongwiriyanurak
Abstract:
Predicting individuals' next locations is a core task in human mobility modelling, with wide-ranging implications for urban planning, transportation, public policy and personalised mobility services. Traditional approaches largely depend on location embeddings learned from historical mobility patterns, limiting their ability to encode explicit spatial information, integrate rich urban semantic context, and accommodate previously unseen locations. To address these challenges, we explore the application of CaLLiPer -- a multimodal representation learning framework that fuses spatial coordinates and semantic features of points of interest through contrastive learning -- for location embedding in individual mobility prediction. CaLLiPer's embeddings are spatially explicit, semantically enriched, and inductive by design, enabling robust prediction performance even in scenarios involving emerging locations. Through extensive experiments on four public mobility datasets under both conventional and inductive settings, we demonstrate that CaLLiPer consistently outperforms strong baselines, particularly excelling in inductive scenarios. Our findings highlight the potential of multimodal, inductive location embeddings to advance the capabilities of human mobility prediction systems. We also release the code and data (https://github.com/xlwang233/Into-the-Unknown) to foster reproducibility and future research.
中文:CaLLiPer框架通过融合空间坐标与语义特征的对比学习,构建了具有空间显式性和语义丰富性的位置嵌入,在多个公开移动数据集上验证了其优于传统方法的预测性能,特别在处理新出现地点时表现突出。
English: The CaLLiPer framework addresses limitations in traditional mobility prediction by creating multimodal, inductive location embeddings that integrate spatial and semantic data, demonstrating superior performance in both conventional and emerging location scenarios across multiple datasets.
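The contrastive fusion of coordinates and POI semantics can be illustrated with a standard InfoNCE objective over matched (location, semantic) embedding pairs. This is a generic sketch of the training signal, not CaLLiPer's exact loss.

```python
import torch
import torch.nn.functional as F

def infonce(loc_emb: torch.Tensor, sem_emb: torch.Tensor,
            tau: float = 0.07) -> torch.Tensor:
    """Contrastive alignment of coordinate and POI-semantic embeddings.

    Row i of loc_emb and sem_emb form a positive pair; all other
    combinations in the batch serve as negatives.
    """
    loc = F.normalize(loc_emb, dim=-1)
    sem = F.normalize(sem_emb, dim=-1)
    logits = loc @ sem.T / tau               # (N, N) similarity matrix
    labels = torch.arange(len(loc))          # diagonal = positives
    return F.cross_entropy(logits, labels)
```

Because the location encoder consumes raw coordinates rather than a fixed vocabulary of place IDs, embeddings for previously unseen locations come for free, which is what makes the approach inductive.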
Authors:Yash Vekaria, Yohan Beugin, Shaoor Munir, Gunes Acar, Nataliia Bielova, Steven Englehardt, Umar Iqbal, Alexandros Kapravelos, Pierre Laperdrix, Nick Nikiforakis, Jason Polakis, Franziska Roesner, Zubair Shafiq, Sebastian Zimmeck
Abstract:
Web tracking is a pervasive and opaque practice that enables personalized advertising, retargeting, and conversion tracking. Over time, it has evolved into a sophisticated and invasive ecosystem, employing increasingly complex techniques to monitor and profile users across the web. The research community has a long track record of analyzing new web tracking techniques, designing and evaluating the effectiveness of countermeasures, and assessing compliance with privacy regulations. Despite a substantial body of work on web tracking, the literature remains fragmented across distinctly scoped studies, making it difficult to identify overarching trends, connect new but related techniques, and identify research gaps in the field. Today, web tracking is undergoing a once-in-a-generation transformation, driven by fundamental shifts in the advertising industry, the adoption of anti-tracking countermeasures by browsers, and the growing enforcement of emerging privacy regulations. This Systematization of Knowledge (SoK) aims to consolidate and synthesize this wide-ranging research, offering a comprehensive overview of the technical mechanisms, countermeasures, and regulations that shape the modern and rapidly evolving web tracking landscape. This SoK also highlights open challenges and outlines directions for future research, aiming to serve as a unified reference and introductory material for researchers, practitioners, and policymakers alike.
中文: 网络追踪已发展为复杂的用户画像生态系统,本文通过系统化梳理相关研究,整合其技术机制、防护措施与法规现状,并指明未来研究挑战。
English: Web tracking has evolved into a complex ecosystem for user profiling, prompting this Systematization of Knowledge to consolidate research on its mechanisms, countermeasures, and regulations while identifying future challenges.
Authors:Chelsi Jain, Yiran Wu, Yifan Zeng, Jiale Liu, Shengyu Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang
Abstract:
Document Visual Question Answering (DocVQA) is a practical yet challenging task that asks questions based on documents while referring to multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval-Augmented Generation (RAG) pipeline, but utilize Visual Language Model (VLM)-based embedding models to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept an image as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA. It boosts evidence page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average on 4 DocVQA datasets with far fewer pages retrieved. Our code is available at https://github.com/ag2ai/SimpleDoc.
中文:SimpleDoc是一种轻量级检索增强框架,通过双重线索过滤和迭代推理增强文档证据检索能力,在四个DocVQA数据集上平均性能提升3.2%且检索页面更少。
English: SimpleDoc is a lightweight retrieval-augmented framework for Document Visual Question Answering that enhances evidence retrieval through dual-cue filtering and iterative reasoning, achieving a 3.2% average improvement on four datasets with fewer retrieved pages.
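The dual-cue retrieve-then-reason loop reduces to a few lines when the embedding retriever, summary-based re-ranker, and VLM reasoner are treated as callables (hypothetical interfaces):

```python
from typing import Callable

def simpledoc_answer(question: str, retrieve: Callable, rerank: Callable,
                     reason: Callable, max_rounds: int = 3) -> str:
    """Iteratively pull fresh pages into working memory until confident."""
    memory = []
    response = ""
    for _ in range(max_rounds):
        candidates = retrieve(question, exclude=memory)  # embedding cue
        memory += rerank(question, candidates)           # summary cue
        response, confident = reason(question, memory)   # VLM reasoner
        if confident:
            break
    return response
```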
Authors:Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, Sijia Liu
Abstract:
Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, a simple supervised classifier can reliably determine whether a model has undergone unlearning based solely on its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we show that forget-relevant prompts enable over 90% accuracy in detecting unlearning traces across all model sizes. Even with forget-irrelevant inputs, large LLMs maintain high detectability, demonstrating the broad applicability of unlearning trace detection. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned given an input query. Codes are available at https://github.com/OPTML-Group/Unlearn-Trace.
中文摘要:大语言模型的机器遗忘会在模型输出和内部激活中留下可检测的痕迹,使得遗忘事件能够被高精度识别,这揭示了一种新的安全风险:被遗忘的信息可能通过逆向工程被推测出来。
English Summary: Machine unlearning in large language models leaves detectable traces in model outputs and internal activations, enabling high-accuracy detection of unlearning events and revealing a new vulnerability where forgotten information could potentially be reverse-engineered.
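The "simple supervised classifier" finding is easy to picture: a bag-of-words model over textual outputs. The snippet below is a toy illustration with invented example responses, not the paper's experimental setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpora (invented): responses from an unlearned vs. original model.
unlearned = ["I'm not able to recall that detail.",
             "That information is unavailable."]
original = ["The capital of France is Paris.",
            "It was released in 2014."]

texts = unlearned + original
labels = [1] * len(unlearned) + [0] * len(original)

# Learn the behavioral "fingerprint" from textual outputs alone.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["Sorry, I cannot recall that."]))  # likely flags unlearned
```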
Authors:Stas Bekman, Samyam Rajbhandari, Michael Wyatt, Jeff Rasley, Tunji Ruwase, Zhewei Yao, Aurick Qiao, Yuxiong He
Abstract:
Long sequences are critical for applications like RAG, long document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support max sequence length of up to 10 million tokens. However, outside of enterprise labs, long sequence training is challenging for the AI community with limited system support in the open-source space.
Out of the box, even on a modern NVIDIA H100 80GB GPU cluster, training a Llama 8B model with sequences over 32K runs out of memory with a basic Hugging Face (HF) model, for two reasons: i) LLM training workloads are not optimized to fully leverage a single GPU's memory, and ii) existing solutions for leveraging multiple GPUs' memory are not easily available to HF models, making long sequence training inaccessible.
We address this with Arctic Long Sequence Training (ALST). It offers a combination of attention-agnostic single-GPU and multi-GPU memory optimizations that enables out-of-the-box training with multi-million-token sequence lengths for a wide variety of HF models.
ALST supports training Meta's Llama 8B model with 500K sequence length on a single H100 GPU, 3.7M on a single 8xH100 GPU node, and over 15M on a 4 node cluster, an increase of over 400x compared to the 32K baseline for the latter. ALST is fully compatible with HF models and open-sourced via Deepspeed https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-pallellism/ and Arctic Training https://github.com/snowflakedb/ArcticTraining/blob/main/projects/sequence-parallelism/README.md.
中文: 北极长序列训练(ALST)方法通过优化单GPU和多GPU内存使用,支持Hugging Face模型开箱即用的数百万标记序列训练,相比32K基准实现了高达400倍的序列长度提升。
English: The Arctic Long Sequence Training (ALST) method enables out-of-box training of multi-million token sequences for Hugging Face models by optimizing single and multi-GPU memory usage, achieving up to 400x longer sequences than the 32K baseline.
Authors:Shiting Huang, Zhen Fang, Zehui Chen, Siyu Yuan, Junjie Ye, Yu Zeng, Lin Chen, Qi Mao, Feng Zhao
Abstract:
The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building upon a novel evolutionary strategy for dataset construction, CRITICTOOL holds diverse tool-use errors with varying complexities, which better reflects real-world scenarios. We conduct extensive experiments on CRITICTOOL and validate the generalization and effectiveness of our benchmark construction strategy. We also provide an in-depth analysis of the tool reflection ability of various LLMs, offering a new perspective on the field of tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.
中文: 随着大语言模型在复杂任务中使用工具时频繁出错,CRITICTOOL基准应运而生,它通过评估模型的错误处理与反思能力,为工具学习领域提供了新的研究方向。
English: Large language models' growing use of tools for complex tasks introduces various errors, prompting the development of CRITICTOOL, a benchmark that evaluates error handling and reflection abilities to advance tool learning.
Authors:Katherine Mao, Hongzhan Yu, Ruipeng Zhang, Igor Spasojevic, M Ani Hsieh, Sicun Gao, Vijay Kumar
Abstract:
Time-optimal trajectories drive quadrotors to their dynamic limits, but computing such trajectories involves solving non-convex problems via iterative nonlinear optimization, making them prohibitively costly for real-time applications. In this work, we investigate learning-based models that imitate a model-based time-optimal trajectory planner to accelerate trajectory generation. Given a dataset of collision-free geometric paths, we show that modeling architectures can effectively learn the patterns underlying time-optimal trajectories. We introduce a quantitative framework to analyze local analytic properties of the learned models, and link them to the Backward Reachable Tube of the geometric tracking controller. To enhance robustness, we propose a data augmentation scheme that applies random perturbations to the input paths. Compared to classical planners, our method achieves substantial speedups, and we validate its real-time feasibility on a hardware quadrotor platform. Experiments demonstrate that the learned models generalize to previously unseen path lengths. The code for our approach can be found here: https://github.com/maokat12/lbTOPPQuad
中文: 本研究开发了基于学习的模型来模拟四旋翼飞行器的时间最优轨迹规划器,通过数据增强提升鲁棒性,在硬件平台上验证了实时可行性,并实现了轨迹生成速度的显著提升。
English: This research develops learning-based models that emulate a time-optimal trajectory planner for quadrotors, achieving significant acceleration in trajectory generation while maintaining robustness through data augmentation and demonstrating real-time feasibility on hardware.
Authors:Christel Sirocchi, Damiano Verda
Abstract:
In domains where transparency and trustworthiness are crucial, such as healthcare, rule-based systems are widely used and often preferred over black-box models for decision support systems due to their inherent interpretability. However, as rule-based models grow complex, discerning crucial features, understanding their interactions, and comparing feature contributions across different rule sets becomes challenging. To address this, we propose a comprehensive framework for estimating feature contributions in rule-based systems, introducing a graph-based feature visualisation strategy, a novel feature importance metric agnostic to rule-based predictors, and a distance metric for comparing rule sets based on feature contributions. By experimenting on two clinical datasets and four rule-based methods (decision trees, logic learning machines, association rules, and neural networks with rule extraction), we showcase our method's capability to uncover novel insights on the combined predictive value of clinical features, both at the dataset and class-specific levels. These insights can aid in identifying new risk factors, signature genes, and potential biomarkers, and determining the subset of patient information that should be prioritised to enhance diagnostic accuracy. Comparative analysis of the proposed feature importance score with state-of-the-art methods on 15 public benchmarks demonstrates competitive performance and superior robustness. The method implementation is available on GitHub: https://github.com/ChristelSirocchi/rule-graph.
中文: 本研究提出了一种全面的规则系统特征贡献分析框架,通过基于图的可视化和新型度量方法提升复杂模型(如医疗领域应用)的可解释性,在临床数据集和公共基准测试中验证了其优越的稳健性能。
English: This study introduces a comprehensive framework for analyzing feature contributions in rule-based systems, featuring graph-based visualization and novel metrics to enhance interpretability in complex models like those used in healthcare, validated on clinical datasets and public benchmarks with robust performance.
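A predictor-agnostic feature score over a rule set can be sketched in a few lines. The scoring rule below (coverage times accuracy, accumulated per feature) is an illustrative stand-in for the paper's metric, not its exact definition, and the data format is assumed.

```python
from collections import defaultdict

def rule_feature_importance(rules: list) -> dict:
    """Aggregate a feature score over a rule set (a sketch).

    Each rule is a dict like:
      {"features": ["age", "glucose"], "coverage": 0.3, "accuracy": 0.9}
    A feature's score accumulates coverage * accuracy over every rule
    that uses it, then scores are normalized to sum to 1.
    """
    scores = defaultdict(float)
    for rule in rules:
        w = rule["coverage"] * rule["accuracy"]
        for f in rule["features"]:
            scores[f] += w
    total = sum(scores.values()) or 1.0
    return {f: s / total for f, s in scores.items()}
```

Because the score depends only on the extracted rules, not on how they were produced, the same computation applies to decision trees, association rules, or rules extracted from neural networks.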
Authors:Ryuki Matsuura, Shikhar Bharadwaj, Jiarui Liu, Dhatchi Kunde Govindarajan
Abstract:
We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose a subjective evaluation scale for emotional SDSs and assess the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of emotion regulation and engagement. These results suggest the critical role of speech emotion in more engaging conversations. All our source code is open-sourced at https://github.com/dhatchi711/espnet-emotional-news/tree/emo-sds/egs2/emo_news_sds/sds1
中文: 本研究开发了一种面向任务的口语对话系统,通过情感分析和情境化语音合成技术提升新闻对话的共情能力,实验证明其在情感调节和用户参与度方面优于基线系统,并建立了相应的主观评价标准。
English: This study introduces a task-oriented spoken dialogue system that leverages sentiment analysis and emotional speech synthesis to enhance empathetic engagement in news conversations, demonstrating superior emotion regulation and user engagement compared to baseline systems through proposed evaluation metrics.
Authors:Runtao Liu, Jiahao Zhan, Yingqing He, Chen Wei, Alan Yuille, Qifeng Chen
Abstract:
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate GAN-RM's effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering, and post-training approaches such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Code and data will be released at https://github.com/Visualignment/GAN-RM.
Chinese: 本文提出GAN-RM,一种高效奖励建模框架,通过仅使用少量目标样本与模型生成输出进行判别训练,无需人工偏好标注和显式质量维度设计,并在样本筛选和微调等关键应用中验证了其有效性。
English: This paper introduces GAN-RM, an efficient reward modeling framework that eliminates the need for manual preference annotations and explicit quality engineering by training on a small set of target samples and model-generated outputs, demonstrating effectiveness across various applications like sample filtering and fine-tuning.
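The adversarial reward modeling reduces to training a discriminator between the small Preference Proxy Data set and model generations. The sketch below operates on pooled feature vectors with illustrative names and sizes; it is not the paper's implementation.

```python
import torch
import torch.nn as nn

class GANRewardModel(nn.Module):
    """Discriminator-style reward head over pooled sample features."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats).squeeze(-1)  # higher = closer to proxy data

def rm_loss(model: GANRewardModel, proxy_feats: torch.Tensor,
            gen_feats: torch.Tensor) -> torch.Tensor:
    """BCE: proxy (target) samples labeled 1, model generations labeled 0."""
    logits = torch.cat([model(proxy_feats), model(gen_feats)])
    labels = torch.cat([torch.ones(len(proxy_feats)),
                        torch.zeros(len(gen_feats))])
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)
```

At inference, the discriminator's logit serves directly as a reward, e.g., for ranking Best-of-N candidates.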
Authors:Miho Koda, Yu Zheng, Ruixian Ma, Mingyang Sun, Devesh Pansare, Fabio Duarte, Paolo Santi
Abstract:
Recent advances in large language models (LLMs), particularly those enhanced through reinforced post-training, have demonstrated impressive reasoning capabilities, as exemplified by models such as OpenAI o1 and DeepSeek-R1. However, these capabilities are predominantly benchmarked on domains like mathematical problem solving and code generation, leaving open the question of whether such reasoning skills generalize to complex real-world scenarios. In this paper, we introduce LocationReasoner, a benchmark designed to evaluate LLMs' reasoning abilities in the context of real-world site selection, where models must identify feasible locations by reasoning over diverse and complicated spatial, environmental, and logistical constraints. The benchmark covers carefully crafted queries of varying difficulty levels and is supported by a sandbox environment with in-house tools for constraint-based location search. Automated verification further guarantees the scalability of the benchmark, enabling the addition of an arbitrary number of queries. Extensive evaluations on real-world site selection data from Boston, New York, and Tampa reveal that state-of-the-art reasoning models offer limited improvement over their non-reasoning predecessors in real-world contexts, with even the latest OpenAI o4 model failing on 30% of site selection tasks. Moreover, agentic strategies such as ReAct and Reflexion often suffer from over-reasoning, leading to worse outcomes than direct prompting. With key limitations of LLMs in holistic and non-linear reasoning highlighted, we release LocationReasoner to foster the development of LLMs and agents capable of robust, grounded reasoning in real-world decision-making tasks. Codes and data for our benchmark are available at https://github.com/miho-koda/LocationReasoner.
中文: 本文提出LocationReasoner基准,用于评估大语言模型在现实世界选址中的推理能力,发现即使如OpenAI o4等先进模型在处理复杂空间和物流约束时仍表现不佳,其表现甚至常逊于非推理模型。
English: This paper introduces LocationReasoner, a benchmark for evaluating large language models' reasoning in real-world site selection, revealing that even advanced models like OpenAI o4 struggle significantly with complex spatial and logistical constraints, often performing worse than non-reasoning models.
Authors:Florian Kofler, Marcel Rosier, Mehdi Astaraki, Ujjwal Baid, Hendrik Möller, Josef A. Buchner, Felix Steinbauer, Eva Oswald, Ezequiel de la Rosa, Ivan Ezhov, Constantin von See, Jan Kirschke, Anton Schmick, Sarthak Pati, Akis Linardos, Carla Pitarch, Sanyukta Adap, Jeffrey Rudie, Maria Correia de Verdier, Rachit Saluja, Evan Calabrese, Dominic LaBella, Mariam Aboian, Ahmed W. Moawad, Nazanin Maleki, Udunna Anazodo, Maruf Adewole, Marius George Linguraru, Anahita Fathi Kazerooni, Zhifan Jiang, Gian Marco Conte, Hongwei Li, Juan Eugenio Iglesias, Spyridon Bakas, Benedikt Wiestler, Marie Piraud, Bjoern Menze
Abstract:
The Brain Tumor Segmentation (BraTS) cluster of challenges has significantly advanced brain tumor image analysis by providing large, curated datasets and addressing clinically relevant tasks. However, despite its success and popularity, algorithms and models developed through BraTS have seen limited adoption in both scientific and clinical communities. To accelerate their dissemination, we introduce BraTS orchestrator, an open-source Python package that provides seamless access to state-of-the-art segmentation and synthesis algorithms for diverse brain tumors from the BraTS challenge ecosystem. Available on GitHub (https://github.com/BrainLesion/BraTS), the package features intuitive tutorials designed for users with minimal programming experience, enabling both researchers and clinicians to easily deploy winning BraTS algorithms for inference. By abstracting the complexities of modern deep learning, BraTS orchestrator democratizes access to the specialized knowledge developed within the BraTS community, making these advances readily available to broader neuro-radiology and neuro-oncology audiences.
Chinese: BraTS orchestrator 是一个开源Python包,它简化了对BraTS挑战赛中先进脑肿瘤分割与合成算法的访问,使研究人员和临床医生能够轻松部署这些技术,无需深厚的编程经验。
English: The BraTS orchestrator is an open-source Python package that simplifies access to advanced brain tumor segmentation and synthesis algorithms from the BraTS challenges, enabling easy deployment for researchers and clinicians with minimal programming skills.
Authors:Boshen Shi, Yongqing Wang, Fangda Guo, Jiangli Shao, Huawei Shen, Xueqi Cheng
Abstract:
Transferring extensive knowledge from relevant social networks has emerged as a promising solution to overcome label scarcity in detecting social bots and other anomalies with GNN-based models. However, effective transfer faces two critical challenges. Firstly, the network heterophily problem, which is caused by bots hiding malicious behaviors via indiscriminately interacting with human users, hinders the model's ability to learn sufficient and accurate bot-related knowledge from source domains. Secondly, single-source transfer might lead to inferior and unstable results, as the source network may embody weak relevance to the task and provide limited knowledge. To address these challenges, we explore multiple source domains and propose a multi-source graph domain adaptation model named BotTrans. We initially leverage the labeling knowledge shared across multiple source networks to establish a cross-source-domain topology with increased network homophily. We then aggregate cross-domain neighbor information to enhance the discriminability of source node embeddings. Subsequently, we integrate the relevance between each source-target pair with model optimization, which facilitates knowledge transfer from source networks that are more relevant to the detection task. Additionally, we propose a refinement strategy to improve detection performance by utilizing semantic knowledge within the target domain. Extensive experiments on real-world datasets demonstrate that BotTrans outperforms the existing state-of-the-art methods, revealing its efficacy in leveraging multi-source knowledge when the target detection task is unlabeled.
Chinese: BotTrans是一种多源图域自适应模型,通过利用多个源网络间的共享标签知识增强网络同质性,并整合源域与目标域的相关性,在目标域无标签情况下有效提升社交机器人检测性能。
English: BotTrans is a multi-source graph domain adaptation model that enhances social bot detection by leveraging shared labeling knowledge across multiple source networks to increase network homophily and integrating source-target relevance for effective knowledge transfer when target labels are unavailable.
Authors:Zongxian Yang, Jiayu Qian, Zegao Peng, Haoyu Zhang, Zhi-An Huang
Abstract:
Large reasoning models have recently made significant strides in mathematical and code reasoning, yet their success has not transferred smoothly to the medical domain. While multiple factors contribute to this disparity, a critical issue is the inadequate focus on the quality of intermediate reflection steps, which is particularly crucial in high-stakes medical scenarios. To address this challenge, we propose Med-REFL, a Medical Reasoning Enhancement via self-corrected Fine-grained refLection. Our method leverages a tree-of-thought approach to decompose medical questions into fine-grained reasoning paths, quantitatively evaluating each step and its subsequent reflections. These assessments enable automatic construction of direct preference optimization data, reducing reliance on expensive expert annotations while guiding models to identify and correct reasoning errors. Experimental results on the MedQA-USMLE benchmark demonstrate Med-REFL achieves consistent improvements, with average gains up to 4.11%. Notably, it further boosts the state-of-the-art performance of 7B/8B models by an additional 4.13%. Furthermore, Med-REFL exhibits strong generalization capabilities and robustness across several challenging medical question-answering datasets. Our work illustrates that prioritizing reflection quality leads to more accurate and trustworthy reasoning in medical AI applications. Checkpoints, code, and data can be found in https://github.com/TianYin123/Med-REFL.
中文: 提出的Med-REFL方法通过将医学问题分解为细粒度推理路径并进行自我修正,在减少对专家标注依赖的同时,显著提升了多个医学问答基准的性能表现。
English: The proposed Med-REFL method enhances medical reasoning by decomposing questions into fine-grained steps with self-correction, achieving significant performance gains on benchmarks while reducing reliance on expert annotations.
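Automatic preference-pair construction from scored reasoning paths might look like the following sketch. The margin rule and data format are assumptions for illustration, not the paper's exact procedure.

```python
def build_dpo_pairs(paths: list, margin: float = 0.2) -> list:
    """Turn scored reasoning paths into DPO preference pairs.

    `paths` holds (reasoning_text, quality_score) tuples, where the
    score aggregates step-level and reflection-level assessments; any
    pair whose scores differ by at least `margin` yields a
    (chosen, rejected) training example.
    """
    ranked = sorted(paths, key=lambda p: p[1], reverse=True)
    pairs = []
    for i, (good, gs) in enumerate(ranked):
        for bad, bs in ranked[i + 1:]:
            if gs - bs >= margin:
                pairs.append({"chosen": good, "rejected": bad})
    return pairs
```

Because the pairs are derived from automatic step scores rather than human labels, the preference data scales without expert annotation.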
Authors:Ke Wang, Bo Pan, Yingchaojie Feng, Yuwei Wu, Jieyi Chen, Minfeng Zhu, Wei Chen
Abstract:
Graph-based Retrieval-Augmented Generation (RAG) has shown great capability in enhancing Large Language Model (LLM) answers with an external knowledge base. Compared to traditional RAG, it introduces a graph as an intermediate representation to better capture structured relational knowledge in the corpus, elevating the precision and comprehensiveness of generation results. However, developers usually face challenges in analyzing the effectiveness of GraphRAG on their dataset due to GraphRAG's complex information processing pipeline and the overwhelming number of LLM invocations involved during graph construction and query, which limits GraphRAG's interpretability and accessibility. This research proposes a visual analysis framework that helps RAG developers identify critical recalls of GraphRAG and trace these recalls through the GraphRAG pipeline. Based on this framework, we develop XGraphRAG, a prototype system incorporating a set of interactive visualizations to facilitate users' analysis process, boosting failure case collection and improvement opportunity identification. Our evaluation demonstrates the effectiveness and usability of our approach. Our work is open-sourced and available at https://github.com/Gk0Wk/XGraphRAG.
中文: GraphRAG通过图结构提升大语言模型的知识检索能力,但其复杂性限制了分析,因此本研究提出XGraphRAG可视化框架,以增强可解释性和可用性。
English: GraphRAG enhances LLM responses by using a graph structure for better knowledge retrieval, but its complexity hinders analysis, so this research introduces XGraphRAG, a visual framework to improve interpretability and accessibility.
Authors:Kevin L. Wei, Patricia Paskov, Sunishchal Dev, Michael J. Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande
Abstract:
In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines
中文摘要:本立场文件主张在基础模型评估中采用更严谨透明的人类基线以实现有效的人机性能比较,通过提供建议框架和报告清单来改进现有基线方法的不足。
English Summary: This position paper advocates for more rigorous and transparent human baselines in foundation model evaluations to enable accurate human-AI performance comparisons, offering a framework with recommendations and a reporting checklist to address current methodological shortcomings.
Authors:Runpeng Yu, Qi Li, Xinchao Wang
Abstract:
In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm using full attention and a denoising-based generation strategy. This paradigm naturally enables parallel generation, fine-grained output control, and dynamic perception. These capabilities were previously difficult to achieve with AR models. A growing number of industrial-scale proprietary d(M)LLMs, as well as a large number of open-source academic d(M)LLMs, have demonstrated performance comparable to their autoregressive counterparts, while achieving up to 10× acceleration in inference speed. These developments position discrete diffusion models as a promising alternative to the traditional autoregressive approach. In this work, we present a comprehensive overview of the research in the dLLM and dMLLM domains. We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, list commonly used modeling methods, and categorize representative models. We further analyze key techniques for training, inference, and quantization. We also discuss trustworthiness issues and summarize emerging applications across language, vision-language, and biological domains. We conclude by discussing future directions for research and deployment. Related papers are collected at https://github.com/LiQiiiii/Awesome-Discrete-Diffusion-LLM_MLLM
中文: 本文系统综述了离散扩散语言与多模态模型,阐明了其相比自回归模型在并行解码速度、精细化控制和动态感知方面的优势,并剖析了其理论框架、关键技术及应用领域。
English: This survey comprehensively explores discrete diffusion language and multimodal models, highlighting their parallel decoding advantages over autoregressive models in speed, control, and perception while analyzing their frameworks, techniques, and applications.
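The parallel, denoising-based decoding paradigm can be pictured with a mask-predict style loop: start fully masked, predict all positions at once, and re-mask the least-confident ones each round. This is a generic sketch with an assumed model interface, not any specific dLLM's sampler.

```python
import torch

@torch.no_grad()
def mask_predict_decode(model, length: int, steps: int, mask_id: int):
    """Iterative parallel decoding in the style of discrete diffusion LMs.

    `model(tokens)` is assumed to return (length, vocab) logits computed
    with full (bidirectional) attention over the sequence.
    """
    tokens = torch.full((length,), mask_id)
    for step in range(steps):
        logits = model(tokens)                     # predict all positions
        probs, preds = logits.softmax(-1).max(-1)  # per-position confidence
        tokens = preds
        n_mask = int(length * (1 - (step + 1) / steps))
        if n_mask > 0:                             # re-mask the least confident
            idx = probs.topk(n_mask, largest=False).indices
            tokens[idx] = mask_id
    return tokens
```

Each round refines many tokens simultaneously, which is the source of the inference speedups discussed above.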
Authors:Junyan Li, Wenshuo Zhao, Yang Zhang, Chuang Gan
Abstract:
Recent deep-thinking large language models often reason extensively to improve performance, but such lengthy reasoning is not always desirable, as it incurs excessive inference costs with disproportionate performance gains. Controlling reasoning length without sacrificing performance is therefore important, but remains challenging, especially under tight thinking budgets. We propose budget guidance, a simple yet effective method for steering the reasoning process of LLMs toward a target budget without requiring any LLM fine-tuning. Our approach introduces a lightweight predictor that models a Gamma distribution over the remaining thinking length during next-token generation. This signal is then used to guide generation in a soft, token-level manner, ensuring that the overall reasoning trace adheres to the specified thinking budget. Budget guidance enables natural control of the thinking length, along with significant token efficiency improvements over baseline methods on challenging math benchmarks. For instance, it achieves up to a 26% accuracy gain on the MATH-500 benchmark under tight budgets compared to baseline methods, while maintaining competitive accuracy with only 63% of the thinking tokens used by the full-thinking model. Budget guidance also generalizes to broader task domains and exhibits emergent capabilities, such as estimating question difficulty. The source code is available at: https://github.com/UMass-Embodied-AGI/BudgetGuidance.
中文摘要:预算引导是一种无需微调即可有效控制大语言模型推理长度的方法,在严格预算下显著提升令牌效率与任务性能。
English Summary: Budget guidance is a novel method that enables large language models to control reasoning length effectively without fine-tuning, achieving significant token efficiency and performance gains under tight budgets.
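One way to picture the token-level steering: a Gamma model of the remaining thinking length boosts the end-of-thinking logit as the budget runs out. Everything below (names, the boost rule, hyperparameters) is an illustrative assumption, not the paper's predictor.

```python
import torch
from torch.distributions import Gamma

def budget_adjusted_logits(logits: torch.Tensor, tokens_used: int,
                           budget: int, alpha: float, rate: float,
                           eot_id: int, strength: float = 5.0) -> torch.Tensor:
    """Soft, token-level steering toward a thinking budget (a sketch).

    A Gamma(alpha, rate) distribution models the remaining thinking
    length. When that distribution puts most of its mass beyond the
    budget that is left, the end-of-thinking token's logit is boosted
    proportionally, nudging the trace to wrap up without a hard cutoff.
    """
    left = max(budget - tokens_used, 1)
    p_over = 1.0 - Gamma(alpha, rate).cdf(torch.tensor(float(left)))
    logits = logits.clone()
    logits[eot_id] += strength * p_over  # budget-aware soft guidance
    return logits
```

Because the adjustment is soft and applied per token, the base model needs no fine-tuning.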
Authors:Yuheng Yuan, Qiuhong Shen, Shizun Wang, Xingyi Yang, Xinchao Wang
Abstract:
Dense matching methods like DUSt3R regress pairwise pointmaps for 3D reconstruction. However, the reliance on pairwise prediction and the limited generalization capability inherently restrict the global geometric consistency. In this work, we introduce Test3R, a surprisingly simple test-time learning technique that significantly boosts geometric accuracy. Using image triplets ($I_1,I_2,I_3$), Test3R generates reconstructions from pairs ($I_1,I_2$) and ($I_1,I_3$). The core idea is to optimize the network at test time via a self-supervised objective: maximizing the geometric consistency between these two reconstructions relative to the common image $I_1$. This ensures the model produces cross-pair consistent outputs, regardless of the inputs. Extensive experiments demonstrate that our technique significantly outperforms previous state-of-the-art methods on the 3D reconstruction and multi-view depth estimation tasks. Moreover, it is universally applicable and nearly cost-free, making it easily applied to other models and implemented with minimal test-time training overhead and parameter footprint. Code is available at https://github.com/nopQAQ/Test3R.
中文: Test3R是一种简单的测试时学习方法,通过使用图像三元组优化网络,最大化共享图像对重建间的几何一致性,从而显著提升三维重建精度。
English: Test3R is a simple test-time learning method that enhances 3D reconstruction accuracy by optimizing networks using image triplets to maximize geometric consistency between reconstructions from shared image pairs.
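The self-supervised test-time objective is essentially a discrepancy between the two reconstructions of the shared view. A minimal sketch, with the pointmap model's interface assumed:

```python
import torch

def cross_pair_consistency(pm_12: torch.Tensor,
                           pm_13: torch.Tensor) -> torch.Tensor:
    """Test-time loss sketched from the abstract.

    pm_12 and pm_13 are the pointmaps predicted for the common image I1
    from pairs (I1, I2) and (I1, I3); minimizing their discrepancy
    enforces cross-pair geometric consistency.
    """
    return (pm_12 - pm_13).norm(dim=-1).mean()

# Test-time loop (sketch): a few optimizer steps on this loss alone,
# assuming model(i, j) returns the pointmaps for a pair.
# for _ in range(n_steps):
#     loss = cross_pair_consistency(model(i1, i2)[0], model(i1, i3)[0])
#     loss.backward(); opt.step(); opt.zero_grad()
```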
Authors:Spiros Gkousis, Evina Katsou
Abstract:
This article describes lcpy, an open-source Python package for advanced parametric Life Cycle Assessment (LCA) and Life Cycle Costing (LCC) analysis. The package is designed to let the user model a process with a flexible, modular design based on dictionaries and lists. The modeling can capture time-dependent variations and uncertainty, supporting dynamic analysis and uncertainty assessment as well as conventional static LCA and LCC. The package is compatible with optimization and uncertainty analysis libraries as well as Python packages for prospective LCA. Its goal is to allow for easy implementation of dynamic LCA and LCC and for simple integration with tools for uncertainty assessment and optimization, towards a wider adoption of advanced enviro-economic analysis. The open-source code can be found at https://github.com/spirdgk/lcpy.
中文: lcpy是一个开源Python软件包,用于高级参数化生命周期评估和生命周期成本分析,支持动态、不确定性和静态分析,具有模块化设计并能与优化库兼容。
English: The lcpy package is an open-source Python tool for advanced parametric Life Cycle Assessment and Life Cycle Costing, enabling dynamic, uncertain, and static analyses with modular design and compatibility with optimization libraries.
Authors:Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Abstract:
Large Language Models (LLMs) are central to many contemporary AI applications, yet their extensive parameter counts pose significant challenges for deployment in memory- and compute-constrained environments. Recent works in eXplainable AI (XAI), particularly on attribution methods, suggest that interpretability can also enable model compression by identifying and removing components irrelevant to inference. In this paper, we leverage Layer-wise Relevance Propagation (LRP) to perform attribution-guided pruning of LLMs. While LRP has shown promise in structured pruning for vision models, we extend it to unstructured pruning in LLMs and demonstrate that it can substantially reduce model size with minimal performance loss. Our method is especially effective in extracting task-relevant subgraphs, so-called "circuits", which can represent core functions (e.g., indirect object identification). Building on this, we introduce a technique for model correction, by selectively removing circuits responsible for spurious behaviors (e.g., toxic outputs). Finally, we gather these techniques into a unified, holistic framework and showcase its effectiveness and limitations through extensive experiments for compression, circuit discovery, and model correction on Llama and OPT models, highlighting its potential for improving both model efficiency and safety. Our code is publicly available at https://github.com/erfanhatefi/SparC3.
中文摘要:本文提出了一个利用层级相关性传播进行归因引导剪枝的统一框架,能够在大幅压缩大语言模型规模的同时保持性能,并实现核心功能电路发现和模型安全修正。
English Summary: This paper presents a holistic framework using Layer-wise Relevance Propagation for attribution-guided pruning of Large Language Models, enabling efficient model compression, circuit discovery, and safety correction while maintaining performance.
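Attribution-guided unstructured pruning differs from magnitude pruning only in the scoring criterion. The sketch below assumes per-weight relevance scores (e.g., LRP relevances aggregated over a calibration set) have already been computed; it is not the paper's pipeline.

```python
import torch

def prune_by_relevance(weight: torch.Tensor, relevance: torch.Tensor,
                       sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the least relevant fraction of weights (a sketch).

    Unlike magnitude pruning, which drops the smallest-magnitude
    weights, this drops the weights with the lowest attribution scores.
    """
    k = max(1, int(weight.numel() * sparsity))
    thresh = relevance.flatten().kthvalue(k).values
    mask = (relevance > thresh).to(weight.dtype)
    return weight * mask
```

Thresholding the same relevance scores per attention head or MLP block, instead of per weight, would recover a structured variant of the idea.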
Authors:Junfeng Fang, Zijun Yao, Ruipeng Wang, Haokai Ma, Xiang Wang, Tat-Seng Chua
Abstract:
The development of large language models (LLMs) has entered an experience-driven era, marked by the emergence of environment feedback-driven learning via reinforcement learning and tool-using agents. This has encouraged the emergence of the model context protocol (MCP), which defines the standard for how an LLM should interact with external services, such as APIs and data. However, as MCP becomes the de facto standard for LLM agent systems, it also introduces new safety risks. In particular, MCP introduces third-party services, which are not controlled by the LLM developers, into agent systems. These third-party MCP service providers are potentially malicious and have economic incentives to exploit vulnerabilities and sabotage user-agent interactions. In this position paper, we advocate that the LLM safety research community pay close attention to the new safety risks introduced by MCP and develop new techniques to build safe MCP-powered agent systems. To establish our position, we argue in three key parts. (1) We first construct a controlled framework to examine safety issues in MCP-powered agent systems. (2) We then conduct a series of pilot experiments to demonstrate that the safety risks in MCP-powered agent systems are a real threat and that defending against them is not trivial. (3) Finally, we give our outlook by presenting a roadmap for building safe MCP-powered agent systems. In particular, we call for researchers to pursue the following research directions: red teaming, MCP-safe LLM development, MCP safety evaluation, MCP safety data accumulation, MCP service safeguards, and MCP-safe ecosystem construction. We hope this position paper can raise the awareness of the research community regarding MCP safety and encourage more researchers to join this important research direction. Our code is available at https://github.com/littlelittlenine/SafeMCP.git.
中文: 模型上下文协议(MCP)作为大语言模型与外部服务交互的标准,引入了第三方恶意服务的安全风险,因此呼吁研究界关注并开发安全的MCP驱动智能体系统。
English: The emergence of model context protocol (MCP) as a standard for LLM interactions with external services introduces significant safety risks from potentially malicious third-party providers, prompting a call for research into developing secure MCP-powered agent systems.
Authors:Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
Abstract:
The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns the vision and speech modalities to the text based on their relationships. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
中文: 本文提出Stream-Omni大语言-视觉-语音模型,通过序列维度拼接实现视觉-文本对齐、基于CTC的层维度映射实现语音-文本对齐,以更少数据需求在多模态任务中实现优异性能。
English: This paper introduces Stream-Omni, a large language-vision-speech model that achieves efficient modality alignments through tailored methods—sequence-dimension concatenation for vision-text and CTC-based mapping for speech-text—enabling strong performance across multimodal tasks with reduced data requirements.
Authors:Bohao Yang, Hainiu Xu, Jinhua Du, Ze Li, Yulan He, Chenghua Lin
Abstract:
A compelling portrayal of characters is essential to the success of narrative writing. For readers, appreciating a character's traits requires the ability to infer their evolving beliefs, desires, and intentions over the course of a complex storyline, a cognitive skill known as Theory-of-Mind (ToM). Performing ToM reasoning in prolonged narratives requires readers to integrate historical context with current narrative information, a task at which humans excel but Large Language Models (LLMs) often struggle. To systematically evaluate LLMs' ToM reasoning capability in long narratives, we construct LitCharToM, a benchmark of character-centric questions across four ToM dimensions from classic literature. Further, we introduce EvolvTrip, a perspective-aware temporal knowledge graph that tracks psychological development throughout narratives. Our experiments demonstrate that EvolvTrip consistently enhances performance of LLMs across varying scales, even in challenging extended-context scenarios. EvolvTrip proves to be particularly valuable for smaller models, partially bridging the performance gap with larger LLMs and showing great compatibility with lengthy narratives. Our findings highlight the importance of explicit representation of temporal character mental states in narrative comprehension and offer a foundation for more sophisticated character understanding. Our data and code are publicly available at https://github.com/Bernard-Yang/EvolvTrip.
Chinese: 本研究引入LitCharToM基准和EvolvTrip知识图谱来提升大语言模型在长篇叙事中的心理理论推理能力,实验表明明确追踪角色心理状态能显著增强模型表现,尤其对小型模型效果更为明显。
English: The study introduces LitCharToM and EvolvTrip to enhance LLMs' Theory-of-Mind reasoning in long narratives, showing that explicit tracking of character mental states significantly improves model performance, especially for smaller models.
Authors:Zhiyi Shi, Binjie Wang, Chongjie Si, Yichen Wu, Junsik Kim, Hanspeter Pfister
Abstract:
Model editing aims to efficiently update a pre-trained model's knowledge without the need for time-consuming full retraining. While existing pioneering editing methods achieve promising results, they primarily focus on editing single-modal language models (LLMs). However, for vision-language models (VLMs), which involve multiple modalities, the role and impact of each modality on editing performance remain largely unexplored. To address this gap, we explore the impact of textual and visual modalities on model editing and find that: (1) textual and visual representations reach peak sensitivity at different layers, reflecting their varying importance; and (2) editing both modalities can efficiently update knowledge, but this comes at the cost of compromising the model's original capabilities. Based on our findings, we propose DualEdit, an editor that modifies both textual and visual modalities at their respective key layers. Additionally, we introduce a gating module within the more sensitive textual modality, allowing DualEdit to efficiently update new knowledge while preserving the model's original information. We evaluate DualEdit across multiple VLM backbones and benchmark datasets, demonstrating its superiority over state-of-the-art VLM editing baselines as well as adapted LLM editing methods on different evaluation metrics. Codes are available at https://github.com/zhiyiscs/DualEdit
Chinese Summary: 针对视觉语言模型(VLM)的模型编辑需兼顾文本和视觉模态,DualEdit通过在各模态关键层进行修改,并采用门控机制保护原始信息,实现了高效的知识更新与性能保持。
English Summary: Model editing for vision-language models (VLMs) requires addressing both textual and visual modalities, with DualEdit effectively updating knowledge by targeting key layers in each modality while preserving original capabilities through a gating mechanism.
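A minimal sketch of the gating idea described in the DualEdit abstract, blending edited and original hidden states; the module name and shapes are illustrative assumptions, not the paper's implementation:

import torch
import torch.nn as nn

class EditGate(nn.Module):
    # Hypothetical gate: decides, per token, how much edited knowledge to apply.
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, h_original, h_edited):
        g = self.gate(h_original)                  # per-token gate in (0, 1)
        return g * h_edited + (1 - g) * h_original

layer = EditGate(d_model=768)
h = torch.randn(2, 16, 768)                        # (batch, tokens, hidden)
out = layer(h, h + 0.1 * torch.randn_like(h))      # preserves h where the gate stays low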
Authors:Chia-Heng Yu, Yen-Lung Tsai
Abstract:
Traditional Retrieval-Augmented Generation (RAG) systems employ brute-force inner product search to retrieve the top-k most similar documents, which are then combined with the user query and passed to a language model. This allows the model to access external knowledge and reduce hallucinations. However, selecting an appropriate k value remains a significant challenge in practical applications: a small k may fail to retrieve sufficient information, while a large k can introduce excessive and irrelevant content. To address this, we propose a hierarchical clustering-based retrieval method that eliminates the need to predefine k. Our approach maintains the accuracy and relevance of system responses while adaptively selecting semantically relevant content. In our experiments, we applied our method to a Taiwanese legal dataset with expert-graded queries. The results show that our approach achieves superior performance in expert evaluations and maintains high precision while eliminating the need to predefine k, demonstrating improved accuracy and interpretability in legal text retrieval tasks. Our framework is simple to implement and easily integrates with existing RAG pipelines, making it a practical solution for real-world applications under limited resources.
中文摘要:本研究提出的基于层次聚类的检索方法无需预设k值,能自适应选择语义相关内容,在保持系统响应准确性的同时提升了法律文本检索的可解释性,且易于与现有RAG系统集成。
English Summary: The proposed hierarchical clustering-based retrieval method for RAG systems eliminates the need to predefine the k value, adaptively selecting relevant content to maintain high accuracy and interpretability while being easily integrable with existing pipelines.
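A minimal sketch of retrieval without a fixed k, under the assumption that documents are pre-embedded and that clustering the query-similarity scores approximates the paper's hierarchical grouping; the threshold and names are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def retrieve_adaptive(query_emb, doc_embs, distance_threshold=0.5):
    # Score documents by inner product, then cluster the scores so that
    # semantically close documents fall into the same group.
    sims = doc_embs @ query_emb
    Z = linkage(sims.reshape(-1, 1), method="ward")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")
    best_cluster = labels[np.argmax(sims)]        # cluster of the top-scoring document
    return np.where(labels == best_cluster)[0]    # adaptively sized result set

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
q = rng.normal(size=384)
q /= np.linalg.norm(q)
print(retrieve_adaptive(q, docs))                 # result size depends on the score structure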
Authors:Zhucun Xue, Jiangning Zhang, Xurong Xie, Yuxuan Cai, Yong Liu, Xiangtai Li, Dacheng Tao
Abstract:
Multimodal Large Language Models (MLLMs) struggle with long videos due to fixed context windows and weak long-term dependency modeling. Existing Retrieval-Augmented Generation (RAG) methods for videos use static retrieval strategies, leading to inefficiencies for simple queries and information loss for complex tasks. To address this, we propose AdaVideoRAG, a novel framework that dynamically adapts retrieval granularity based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs, enabling optimal resource allocation across tasks. We also introduce the HiVU benchmark for comprehensive evaluation. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs. AdaVideoRAG establishes a new paradigm for adaptive retrieval in video analysis. Codes will be open-sourced at https://github.com/xzc-zju/AdaVideoRAG.
Chinese: AdaVideoRAG提出了一种动态检索框架,根据查询复杂度自适应调整检索粒度,通过分层索引和轻量级分类器提升多模态大语言模型在长视频理解中的效率与准确性。
English: AdaVideoRAG introduces a dynamic retrieval framework that adapts granularity based on query complexity, utilizing hierarchical indexing and a lightweight classifier to enhance efficiency and accuracy in long-video understanding for MLLMs.
Authors:MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Zhu, Jian Sun, Jiaqi Zhuang, Jiaren Cai, Jiayuan Song, Jin Zhu, Jingyang Li, Jinhao Tian, Jinli Liu, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kaiyi Feng, Ke Yang, Kecheng Xiao, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Li, Lin Zheng, Linge Du, Lingyu Yang, Lunbin Zeng, Minghui Yu, Mingliang Tao, Mingyuan Chi, Mozhi Zhang, Mujie Lin, Nan Hu, Nongyu Di, Peng Gao, Pengfei Li, Pengyu Zhao, Qibing Ren, Qidi Xu, Qile Li, Qin Wang, Rong Tian, Ruitao Leng, Shaoxiang Chen, Shaoyu Chen, Shengmin Shi, Shitong Weng, Shuchang Guan, Shuqi Yu, Sichen Li, Songquan Zhu, Tengfei Li, Tianchi Cai, Tianrun Liang, Weiyu Cheng, Weize Kong, Wenkai Li, Xiancai Chen, Xiangjun Song, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xinzhu Hou, Xuan Lu, Xun Zou, Xuyang Shen, Yan Gong, Yan Ma, Yang Wang, Yiqi Shi, Yiran Zhong, Yonghong Duan, Yongxiang Fu, Yongyi Hu, Yu Gao, Yuanxiang Fan, Yufeng Yang, Yuhao Li, Yulin Hu, Yunan Huang, Yunji Li, Yunzhi Xu, Yuxin Mao, Yuxuan Shi, Yuze Wenren, Zehan Li, Zelin Li, Zhanxu Tian, Zhengmao Zhu, Zhenhua Fan, Zhenzhen Wu, Zhichao Xu, Zhihang Yu, Zhiheng Lyu, Zhuo Jiang, Zibo Gao, Zijia Wu, Zijian Song, Zijun Sun
Abstract:
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
中文: MiniMax-M1是全球首个开放权重的混合注意力推理模型,具备百万令牌上下文和闪电注意力机制,通过仅需三周的高效强化学习在复杂任务中实现卓越性能。
English: MiniMax-M1 is the world's first open-weight hybrid-attention reasoning model featuring a million-token context and lightning attention mechanism, achieving superior performance in complex tasks through efficient reinforcement learning completed in just three weeks.
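A hedged sketch of the CISPO idea as the abstract describes it (clipping the importance-sampling weight itself rather than the token update); the exact objective in the paper may differ:

import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    ratio = torch.exp(logp_new - logp_old)               # importance-sampling weight
    clipped_w = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()
    # REINFORCE-style term scaled by the clipped, non-differentiable weight,
    # so clipping caps the weight without zeroing the token's gradient.
    return -(clipped_w * advantages * logp_new).mean()

logp_new = torch.randn(8, requires_grad=True)
loss = cispo_loss(logp_new, torch.randn(8), torch.randn(8))
loss.backward()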
Authors:Jonathan Hoss, Felix Schelling, Noah Klarmann
Abstract:
The classical Job Shop Scheduling Problem (JSSP) focuses on optimizing makespan under deterministic constraints. Real-world production environments introduce additional complexities that cause traditional scheduling approaches to be less effective. Reinforcement learning (RL) holds potential in addressing these challenges, as it allows agents to learn adaptive scheduling strategies. However, there is a lack of comprehensive, general-purpose frameworks for effectively training and evaluating RL agents under real-world constraints. To address this gap, we propose a modular framework that extends classical JSSP formulations by incorporating key real-world constraints inherent to the shopfloor, including transport logistics, buffer management, machine breakdowns, setup times, and stochastic processing conditions, while also supporting multi-objective optimization. The framework is a customizable solution that offers flexibility in defining problem instances and configuring simulation parameters, enabling adaptation to diverse production scenarios. A standardized interface ensures compatibility with various RL approaches, providing a robust environment for training RL agents and facilitating the standardized comparison of different scheduling methods under dynamic and uncertain conditions. We release JobShopLab as an open-source tool for both research and industrial applications, accessible at: https://github.com/proto-lab-ro/jobshoplab
中文:本文提出了JobShopLab这一模块化开源框架,通过整合实际生产约束和多目标优化来扩展经典作业车间调度问题,为强化学习智能体训练和动态条件下调度方法的标准化比较提供了统一环境。
English: This paper introduces JobShopLab, a modular and open-source framework that extends classical job shop scheduling by incorporating real-world constraints and multi-objective optimization, providing a standardized environment for training reinforcement learning agents and comparing scheduling methods under dynamic conditions.
Authors:YuQing Xie, Ameya Daigavane, Mit Kotak, Tess Smidt
Abstract:
$E(3)$-equivariant neural networks have demonstrated success across a wide range of 3D modelling tasks. A fundamental operation in these networks is the tensor product, which interacts two geometric features in an equivariant manner to create new features. Due to the high computational complexity of the tensor product, significant effort has been invested to optimize the runtime of this operation. For example, Luo et al. (2024) recently proposed the Gaunt tensor product (GTP) which promises a significant speedup. In this work, we provide a careful, systematic analysis of a number of tensor product operations. In particular, we emphasize that different tensor products are not performing the same operation. The reported speedups typically come at the cost of expressivity. We introduce measures of expressivity and interactability to characterize these differences. In addition, we find that the original implementation of GTP can be greatly simplified by directly using a spherical grid at no cost in asymptotic runtime. This spherical grid approach is faster on our benchmarks and speeds up actual training of the MACE interatomic potential by 30%. Finally, we provide the first systematic microbenchmarks of the various tensor product operations. We find that the theoretical runtime guarantees can differ wildly from empirical performance, demonstrating the need for careful application-specific benchmarking. Code is available at https://github.com/atomicarchitects/PriceofFreedom.
中文: E(3)等变神经网络中的张量积操作虽经优化提速却以牺牲表达能力为代价,我们的研究发现理论运行时间与实测性能差异显著,并提出一种简化的球面网格方法,在实际训练中效率提升30%。
English: E(3)-equivariant neural networks rely on tensor products for geometric feature interactions, but current optimizations like the Gaunt tensor product sacrifice expressivity for speed, with our analysis revealing that theoretical runtime claims often mismatch empirical performance and proposing a simplified spherical grid method that boosts training efficiency by 30%.
Authors:Yihui Li, Chengxin Lv, Hongyu Yang, Di Huang
Abstract:
Reconstructing 3D scenes from unconstrained image collections poses significant challenges due to variations in appearance. In this paper, we propose Scalable Micro-macro Wavelet-based Gaussian Splatting (SMW-GS), a novel method that enhances 3D reconstruction across diverse scales by decomposing scene representations into global, refined, and intrinsic components. SMW-GS incorporates the following innovations: Micro-macro Projection, which enables Gaussian points to sample multi-scale details with improved diversity; and Wavelet-based Sampling, which refines feature representations using frequency-domain information to better capture complex scene appearances. To achieve scalability, we further propose a large-scale scene promotion strategy, which optimally assigns camera views to scene partitions by maximizing their contributions to Gaussian points, achieving consistent and high-quality reconstructions even in expansive environments. Extensive experiments demonstrate that SMW-GS significantly outperforms existing methods in both reconstruction quality and scalability, particularly excelling in large-scale urban environments with challenging illumination variations. Project is available at https://github.com/Kidleyh/SMW-GS.
中文摘要:SMW-GS提出了一种创新的3D重建方法,通过微宏观投影和小波采样技术实现多尺度细节优化,在大规模城市场景中显著提升了复杂光照条件下的重建质量与扩展性。
English Summary: SMW-GS introduces a novel 3D reconstruction method using micro-macro projection and wavelet-based sampling to achieve superior multi-scale detail capture and scalability, particularly excelling in large-scale urban environments with challenging lighting conditions.
Authors:Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, Amit Dhurandhar
Abstract:
As Large Language Models (LLMs) increasingly power applications used by children and adolescents, ensuring safe and age-appropriate interactions has become an urgent ethical imperative. Despite progress in AI safety, current evaluations predominantly focus on adults, neglecting the unique vulnerabilities of minors engaging with generative AI. We introduce Safe-Child-LLM, a comprehensive benchmark and dataset for systematically assessing LLM safety across two developmental stages: children (7-12) and adolescents (13-17). Our framework includes a novel multi-part dataset of 200 adversarial prompts, curated from red-teaming corpora (e.g., SG-Bench, HarmBench), with human-annotated labels for jailbreak success and a standardized 0-5 ethical refusal scale. Evaluating leading LLMs -- including ChatGPT, Claude, Gemini, LLaMA, DeepSeek, Grok, Vicuna, and Mistral -- we uncover critical safety deficiencies in child-facing scenarios. This work highlights the need for community-driven benchmarks to protect young users in LLM interactions. To promote transparency and collaborative advancement in ethical AI development, we are publicly releasing both our benchmark datasets and evaluation codebase at https://github.com/The-Responsible-AI-Initiative/Safe_Child_LLM_Benchmark.git
中文摘要:Safe-Child-LLM基准通过包含对抗性提示和标准化评估指标的全面框架,揭示了主流大语言模型在面向儿童场景中的重大安全隐患,填补了未成年人AI安全评估的空白。
English Summary: The Safe-Child-LLM benchmark addresses critical safety gaps in LLMs for young users by introducing a comprehensive evaluation framework with adversarial prompts and standardized metrics, revealing significant vulnerabilities in child-facing AI interactions.
Authors:José A. Pardo, Alicia Gómez-Pascual, José T. Palma, Juan A. Botía
Abstract:
The growing volume of omics and clinical data generated for neurodegenerative diseases (NDs) requires new approaches for their curation so they can be ready-to-use in bioinformatics. NeuroEmbed is an approach for the engineering of semantically accurate embedding spaces to represent cohorts and samples. The NeuroEmbed method comprises four stages: (1) extraction of ND cohorts from public repositories; (2) semi-automated normalization and augmentation of metadata of cohorts and samples using biomedical ontologies and clustering on the embedding space; (3) automated generation of a natural language question-answering (QA) dataset for cohorts and samples based on randomized combinations of standardized metadata dimensions and (4) fine-tuning of a domain-specific embedder to optimize queries. We illustrate the approach using the GEO repository and the PubMedBERT pretrained embedder. Applying NeuroEmbed, we semantically indexed 2,801 repositories and 150,924 samples. Amongst many biology-relevant categories, we normalized more than 1,700 heterogeneous tissue labels from GEO into 326 unique ontology-aligned concepts and enriched annotations with new ontology-aligned terms, increasing the size of the metadata terms by between 2.7- and 20-fold. After fine-tuning PubMedBERT with the QA training data augmented with the enlarged metadata, the model increased its mean Retrieval Precision from 0.277 to 0.866 and its mean Percentile Rank from 0.355 to 0.896. The NeuroEmbed methodology for the creation of electronic catalogues of omics cohorts and samples will foster the automated construction of bioinformatics pipelines. The NeuroEmbed catalogue of cohorts and samples is available at https://github.com/JoseAdrian3/NeuroEmbed.
中文: NeuroEmbed是一种创新方法,通过本体对齐和优化嵌入器为神经退行性疾病队列及样本构建语义精准的嵌入空间,显著提升元数据标准化程度与检索精度,从而推动生物信息学流程的自动化构建。
English: NeuroEmbed is a novel method that creates semantically precise embedding spaces for neurodegenerative disease cohorts and samples, enhancing metadata normalization and retrieval precision through ontology alignment and fine-tuned embedders, thereby facilitating automated bioinformatics workflows.
Authors:Zerui Gong, Zhonghua Wu, Qingyi Tao, Qinyue Li, Chen Change Loy
Abstract:
Photorealistic style transfer (PST) enables real-world color grading by adapting reference image colors while preserving content structure. Existing methods mainly follow one of two approaches: generation-based methods that prioritize stylistic fidelity at the cost of content integrity and efficiency, or global color transformation methods such as LUTs, which preserve structure but lack local adaptability. To bridge this gap, we propose Spatial Adaptive 4D Look-Up Table (SA-LUT), combining LUT efficiency with neural network adaptability. SA-LUT features: (1) a Style-guided 4D LUT Generator that extracts multi-scale features from the style image to predict a 4D LUT, and (2) a Context Generator using content-style cross-attention to produce a context map. This context map enables spatially-adaptive adjustments, allowing our 4D LUT to apply precise color transformations while preserving structural integrity. To establish a rigorous evaluation framework for photorealistic style transfer, we introduce PST50, the first benchmark specifically designed for PST assessment. Experiments demonstrate that SA-LUT substantially outperforms state-of-the-art methods, achieving a 66.7% reduction in LPIPS score compared to 3D LUT approaches, while maintaining real-time performance at 16 FPS for video stylization. Our code and benchmark are available at https://github.com/Ry3nG/SA-LUT
中文摘要:提出的空间自适应4D查找表(SA-LUT)结合了LUT的高效性与神经网络的适应性,在保持实时性能的同时,实现了优于现有方法的超写实风格迁移效果。
English Summary: The proposed Spatial Adaptive 4D Look-Up Table (SA-LUT) combines LUT efficiency with neural network adaptability to achieve superior photorealistic style transfer, outperforming existing methods while maintaining real-time performance.
Authors:Laiyan Ding, Hualie Jiang, Jiwei Chen, Rui Huang
Abstract:
Depth map enhancement using paired high-resolution RGB images offers a cost-effective solution for improving low-resolution depth data from lightweight ToF sensors. Nevertheless, naively adopting a depth estimation pipeline to fuse the two modalities requires groundtruth depth maps for supervision. To address this, we propose a self-supervised learning framework, SelfToF, which generates detailed and scale-aware depth maps. Starting from an image-based self-supervised depth estimation pipeline, we add low-resolution depth as input, design a new depth consistency loss, propose a scale-recovery module, and finally obtain a large performance boost. Furthermore, since the ToF signal sparsity varies in real-world applications, we upgrade SelfToF to SelfToF* with submanifold convolution and guided feature fusion. Consequently, SelfToF* maintains robust performance across varying sparsity levels in ToF data. Overall, our proposed method is both efficient and effective, as verified by extensive experiments on the NYU and ScanNet datasets. The code is available at https://github.com/denyingmxd/selftof.
Chinese: SelfToF框架是一种自监督学习方法,通过融合高分辨率RGB图像来增强低分辨率深度图,无需真实深度数据即可实现鲁棒且尺度感知的深度估计。
English: The SelfToF framework is a self-supervised learning method that enhances low-resolution depth maps by integrating high-resolution RGB images, achieving robust and scale-aware depth estimation without requiring groundtruth depth data.
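The scale-recovery step can be pictured with a common median-ratio alignment against the sparse metric ToF signal; this standard heuristic stands in for the paper's learned module:

import numpy as np

def recover_scale(pred_depth, tof_depth, tof_mask):
    # Align a scale-ambiguous prediction to metric ToF measurements.
    ratios = tof_depth[tof_mask] / np.clip(pred_depth[tof_mask], 1e-6, None)
    return pred_depth * np.median(ratios)

pred = np.random.rand(240, 320) + 0.5     # scale-ambiguous depth prediction
tof = pred * 3.0                          # toy metric ToF on the same grid
mask = np.random.rand(240, 320) < 0.01    # sparse valid ToF pixels
metric = recover_scale(pred, tof, mask)   # recovers the 3x scale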
Authors:Kang Chen, Bin Huang, Xuebin Yang, Junyan Zhang, Yongbo Wang, Qiegen Liu
Abstract:
Synthetic CT projection data is crucial for advancing imaging research, yet its generation remains challenging. Current image domain methods are limited as they cannot simulate the physical acquisition process or utilize the complete statistical information present in projection data, restricting their utility and fidelity. In this work, we present PRO, a projection domain synthesis foundation model for CT imaging. To the best of our knowledge, this is the first study that performs CT synthesis in the projection domain. Unlike previous approaches that operate in the image domain, PRO learns rich structural representations from projection data and leverages anatomical text prompts for controllable synthesis. Projection data generation models can utilize complete measurement signals and simulate the physical processes of scanning, including material attenuation characteristics, beam hardening, scattering, and projection geometry, and support research on downstream imaging tasks. Moreover, PRO functions as a foundation model, capable of generalizing across diverse downstream tasks by adjusting its generative behavior via prompt inputs. Experimental results demonstrated that incorporating our synthesized data significantly improves performance across multiple downstream tasks, including low-dose and sparse-view reconstruction. These findings underscore the versatility and scalability of PRO in data generation for various CT applications. These results highlight the potential of projection domain synthesis as a powerful tool for data augmentation and robust CT imaging. Our source code is publicly available at: https://github.com/yqx7150/PRO.
中文: PRO是首个在投影域进行合成的CT成像基础模型,通过学习投影数据的结构表征并利用解剖文本提示实现可控合成,能有效模拟扫描物理过程并显著提升低剂量重建等下游任务的性能。
English: PRO is a pioneering projection domain synthesis foundation model for CT imaging that generates synthetic projection data by learning structural representations and utilizing anatomical text prompts, enabling controllable synthesis and improved performance in downstream tasks like low-dose reconstruction.
Authors:Jiang Wang, Yaozhong Kang, Linya Fu, Kazuhiro Nakadai, He Kong
Abstract:
Accurate calibration of sensor extrinsic parameters for ground robotic systems (i.e., relative poses) is crucial for ensuring spatial alignment and achieving high-performance perception. However, existing calibration methods typically require complex and often human-operated processes to collect data. Moreover, most frameworks neglect acoustic sensors, thereby limiting the associated systems' auditory perception capabilities. To alleviate these issues, we propose an observability-aware active calibration method for ground robots with multimodal sensors, including a microphone array, a LiDAR (exteroceptive sensors), and wheel encoders (proprioceptive sensors). Unlike traditional approaches, our method enables active trajectory optimization for online data collection and calibration, contributing to the development of more intelligent robotic systems. Specifically, we leverage the Fisher information matrix (FIM) to quantify parameter observability and adopt its minimum eigenvalue as an optimization metric for trajectory generation via B-spline curves. Through planning and replanning of robot trajectory online, the method enhances the observability of multi-sensor extrinsic parameters. The effectiveness and advantages of our method have been demonstrated through numerical simulations and real-world experiments. For the benefit of the community, we have also open-sourced our code and data at https://github.com/AISLAB-sustech/Multisensor-Calibration.
Chinese: 本文提出了一种面向地面机器人多模态传感器的可观测性主动标定方法,通过在线优化轨迹利用费舍尔信息矩阵提升外参可观测性,并已通过仿真和实验验证其有效性。
English: This paper introduces an observability-aware active calibration method for ground robots with multimodal sensors, which optimizes trajectories online using the Fisher information matrix to enhance extrinsic parameter observability and has been validated through simulations and experiments.
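A toy version of the observability metric from the abstract: accumulate Fisher information from measurement Jacobians along a candidate trajectory and score it by the smallest eigenvalue, which the planner then maximizes over B-spline control points; the Jacobians below are random stand-ins:

import numpy as np

def observability_score(jacobians, noise_cov_inv):
    p = jacobians[0].shape[1]
    fim = np.zeros((p, p))
    for J in jacobians:                    # information accumulates along the path
        fim += J.T @ noise_cov_inv @ J
    return np.linalg.eigvalsh(fim)[0]      # smallest eigenvalue: worst-observed direction

rng = np.random.default_rng(1)
Js = [rng.normal(size=(3, 6)) for _ in range(20)]   # 20 poses, 6 extrinsic parameters
print(observability_score(Js, np.eye(3)))           # higher is better for calibration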
Authors:Xiang Yu, Yayan Chen, Guannan He, Qing Zeng, Yue Qin, Meiling Liang, Dandan Luo, Yimei Liao, Zeyu Ren, Cheng Kang, Delong Yang, Bocheng Liang, Bin Pu, Ying Yuan, Shengli Li
Abstract:
While modern segmentation models often prioritize performance over practicality, we advocate a design philosophy that prioritizes simplicity and efficiency, and attempt high-performance segmentation model design under it. This paper presents SimpleUNet, a scalable ultra-lightweight medical image segmentation model with three key innovations: (1) A partial feature selection mechanism in skip connections for redundancy reduction while enhancing segmentation performance; (2) A fixed-width architecture that prevents exponential parameter growth across network stages; (3) An adaptive feature fusion module achieving enhanced representation with minimal computational overhead. With a record-breaking 16 KB parameter configuration, SimpleUNet outperforms LBUNet and other lightweight benchmarks across multiple public datasets. The 0.67 MB variant achieves superior efficiency (8.60 GFLOPs) and accuracy, attaining a mean DSC/IoU of 85.76%/75.60% on multi-center breast lesion datasets, surpassing both U-Net and TransUNet. Evaluations on skin lesion datasets (ISIC 2017/2018: mDice 84.86%/88.77%) and endoscopic polyp segmentation (KVASIR-SEG: 86.46%/76.48% mDice/mIoU) confirm consistent dominance over state-of-the-art models. This work demonstrates that extreme model compression need not compromise performance, providing new insights for efficient and accurate medical image segmentation. Codes can be found at https://github.com/Frankyu5666666/SimpleUNet.
中文: 本文提出SimpleUNet超轻量医学图像分割模型,通过特征选择机制、固定宽度结构和自适应融合模块三大创新,在仅16KB参数量下实现卓越性能,证明了极致模型压缩无需以精度为代价。
English: This paper introduces SimpleUNet, an ultra-lightweight medical image segmentation model that achieves state-of-the-art performance with minimal parameters through innovations in feature selection, fixed-width architecture, and adaptive fusion, demonstrating that extreme compression need not sacrifice accuracy.
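A minimal sketch of a partial-feature-selection skip connection; selecting a fixed channel slice is our simplifying assumption, and the paper's selection mechanism may be learned:

import torch
import torch.nn as nn

class PartialSkip(nn.Module):
    # Forward only a fraction of encoder channels to the decoder.
    def __init__(self, channels, keep_ratio=0.25):
        super().__init__()
        self.k = max(1, int(channels * keep_ratio))

    def forward(self, enc_feat, dec_feat):
        selected = enc_feat[:, : self.k]              # reduced-redundancy skip features
        return torch.cat([selected, dec_feat], dim=1)

skip = PartialSkip(channels=64)
fused = skip(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))  # (1, 80, 32, 32)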
Authors:Pengzuo Wu, Yuhang Yang, Guangcheng Zhu, Chao Ye, Hong Gu, Xu Lu, Ruixuan Xiao, Bowen Bao, Yijing He, Liangyu Zha, Wentao Ye, Junbo Zhao, Haobo Wang
Abstract:
With the rapid advancement of Large Language Models (LLMs), there is an increasing need for challenging benchmarks to evaluate their capabilities in handling complex tabular data. However, existing benchmarks are either based on outdated data setups or focus solely on simple, flat table structures. In this paper, we introduce RealHiTBench, a comprehensive benchmark designed to evaluate the performance of both LLMs and Multimodal LLMs (MLLMs) across a variety of input formats for complex tabular data, including LaTeX, HTML, and PNG. RealHiTBench also includes a diverse collection of tables with intricate structures, spanning a wide range of task types. Our experimental results, using 25 state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. Moreover, we also develop TreeThinker, a tree-based pipeline that organizes hierarchical headers into a tree structure for enhanced tabular reasoning, validating the importance of improving LLMs' perception of table hierarchies. We hope that our work will inspire further research on tabular data reasoning and the development of more robust models. The code and data are available at https://github.com/cspzyy/RealHiTBench.
中文: 本文提出RealHiTBench这一评估大语言模型处理复杂表格数据能力的挑战性基准,并开发了TreeThinker树状结构方法来提升模型对表格层次结构的感知能力。
English: This paper introduces RealHiTBench, a challenging benchmark for evaluating LLMs and MLLMs on complex tabular data across multiple formats, and proposes TreeThinker, a tree-based method to enhance table hierarchy perception.
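The tree organization of hierarchical headers can be pictured as below; the field names are ours, not the TreeThinker code:

from dataclasses import dataclass, field

@dataclass
class HeaderNode:
    name: str
    children: list = field(default_factory=list)

def build_header_tree(header_paths):
    # header_paths: tuples such as ("Revenue", "Q1") read off a two-row header.
    root = HeaderNode("<root>")
    for path in header_paths:
        node = root
        for name in path:
            match = next((c for c in node.children if c.name == name), None)
            if match is None:
                match = HeaderNode(name)
                node.children.append(match)
            node = match
    return root

tree = build_header_tree([("Revenue", "Q1"), ("Revenue", "Q2"), ("Costs", "Q1")])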
Authors:Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren
Abstract:
This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE). MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but with uncertain scales which hinders downstream applications. To this end, we aim to build up a framework to solve scale uncertainty and transfer relative depth to metric depth. Previous methods used language as input and estimated two factors for conducting rescaling. Our approach, TR2M, utilizes both text description and image as inputs and estimates two rescale maps to transfer relative depth to metric depth at pixel level. Features from two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning to utilize depth distribution as guidance to enforce the model learning about intrinsic knowledge aligning with the scale distribution. TR2M exploits only a small number of trainable parameters for training on datasets in various domains, and experiments not only demonstrate TR2M's strong performance on seen datasets but also reveal superior zero-shot capabilities on five unseen datasets. We show the huge potential in pixel-wise transferring relative depth to metric depth with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
中文摘要:本研究提出TR2M框架,通过融合文本和图像输入,在像素级别将相对深度转换为度量深度,在多个数据集上表现出优异性能,并展现出卓越的零样本泛化能力。
English Summary: This study introduces TR2M, a novel framework that leverages text and image inputs to convert relative depth into metric depth at the pixel level, demonstrating strong performance across diverse datasets and superior zero-shot generalization capabilities.
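The core operation described in the abstract reduces to a pixel-wise affine map; the constant rescale maps below are placeholders for the model's predictions:

import torch

def to_metric(relative_depth, scale_map, shift_map):
    # Pixel-level rescaling: metric = scale * relative + shift
    return scale_map * relative_depth + shift_map

rel = torch.rand(1, 1, 240, 320)       # relative depth in [0, 1]
scale = torch.full_like(rel, 8.0)      # would be predicted from image + text
shift = torch.full_like(rel, 0.5)
metric = to_metric(rel, scale, shift)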
Authors:Ciro Beneduce, Tania Gullón Muñoz-Repiso, Bruno Lepri, Massimiliano Luca
Abstract:
Mobility patterns play a critical role in a wide range of societal challenges, from epidemic modeling and emergency response to transportation planning and regional development. Yet, access to high-quality, timely, and openly available mobility data remains limited. In response, the Spanish Ministry of Transportation and Sustainable Mobility has released daily mobility datasets based on anonymized mobile phone data, covering districts, municipalities, and greater urban areas from February 2020 to June 2021 and again from January 2022 onward. This paper presents pySpainMobility, a Python package that simplifies access to these datasets and their associated study areas through a standardized, well-documented interface. By lowering the technical barrier to working with large-scale mobility data, the package enables reproducible analysis and supports applications across research, policy, and operational domains. The library is available at https://github.com/pySpainMobility.
中文: pySpainMobility这一Python软件包简化了对西班牙官方匿名手机移动数据集的访问,为研究和政策领域的可重复分析提供了便利工具。
English: The pySpainMobility Python package provides easy access to Spain's official mobility datasets derived from anonymized mobile phone data, facilitating reproducible analysis across research and policy applications.
Authors:Yan Chen, Hanlin Shang, Ce Liu, Yuxuan Chen, Hui Li, Weihao Yuan, Hao Zhu, Zilong Dong, Siyu Zhu
Abstract:
Video face restoration faces a critical challenge in maintaining temporal consistency while recovering fine facial details from degraded inputs. This paper presents a novel approach that extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality portraits, into a video restoration framework through variational latent space modeling. Our key innovation lies in reformulating discrete codebook representations as Dirichlet-distributed continuous variables, enabling probabilistic transitions between facial features across frames. A spatio-temporal Transformer architecture jointly models inter-frame dependencies and predicts latent distributions, while a Laplacian-constrained reconstruction loss combined with perceptual (LPIPS) regularization enhances both pixel accuracy and visual quality. Comprehensive evaluations on blind face restoration, video inpainting, and facial colorization tasks demonstrate state-of-the-art performance. This work establishes an effective paradigm for adapting intensive image priors, pretrained on high-quality images, to video restoration while addressing the critical challenge of flicker artifacts. The source code has been open-sourced and is available at https://github.com/fudan-generative-vision/DicFace.
中文: 本文提出一种新颖的视频人脸修复方法,通过变分潜在建模将VQ-VAE扩展至视频领域,采用狄利克雷连续表示和时空Transformer架构实现时序一致性和高质量细节恢复,在多项任务中展现出领先性能。
English: This paper introduces a novel video face restoration method that extends VQ-VAEs to video through variational latent modeling, using a Dirichlet-based continuous representation and spatio-temporal Transformer to achieve temporal consistency and high-quality detail recovery, demonstrating state-of-the-art performance across multiple tasks.
Authors:Mae Younes, Adnane Boukhayma
Abstract:
Gaussian Splatting has demonstrated remarkable novel view synthesis performance at high rendering frame rates. Optimization-based inverse rendering within complex capture scenarios remains, however, a challenging problem. A particular case is modelling complex surface light interactions for highly reflective scenes, which results in intricate high-frequency specular radiance components. We hypothesize that such challenging settings can benefit from increased representation power. We hence propose a method that tackles this issue through a geometrically and physically grounded Gaussian Splatting-borne radiance field, where normals and material properties are spatially variable in the primitive's local space. Using per-primitive texture maps for this purpose, we also propose to harness the GPU hardware to accelerate rendering at test time via a unified material texture atlas.
Chinese: 高斯泼溅技术在新视角合成中实现了高速渲染,但在处理反射场景的反向渲染时存在挑战,因此提出一种方法,通过高斯辐射场中局部空间的可变法线和材质属性来增强表示能力。
English: Gaussian Splatting achieves high-speed rendering for novel view synthesis but struggles with inverse rendering in reflective scenes, leading to a proposed method that enhances representation through localized normals and material properties in a Gaussian-based radiance field.
Authors:Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, Youngjae Yu
Abstract:
Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16\% of ambiguous or incorrectly labeled data substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest using a systematic pipeline utilizing LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers
中文: 本研究评估了12个预训练大语言模型和一个专业事实核查器,发现解决数据集模糊性、利用少样本示例的前沿模型以及通过合成推理数据增强小型模型的能力,对构建可靠的事实核查系统至关重要。
English: This study evaluates 12 pre-trained LLMs and a specialized fact-verifier, revealing that addressing dataset ambiguities, leveraging frontier LLMs with few-shot examples, and enhancing small models with synthetic reasoning data are crucial for developing robust fact verification systems.
Authors:Zhongqian Fu, Ning Ding, Kai Han, Xianzhi Yu, Xiaosong Li, Xinghao Chen, Yehui Tang, Yunhe Wang
Abstract:
Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning by efficiently distributing computation and enhancing performance. However, their unique architecture, characterized by sparse expert activation and dynamic routing mechanisms, introduces inherent complexities that challenge conventional quantization techniques. Existing post-training quantization (PTQ) methods struggle to address activation outliers, router consistency and sparse expert calibration, leading to significant performance degradation. To bridge this gap, we propose EAQuant, a novel PTQ framework tailored for MoE architectures. Our method systematically tackles these challenges through three key innovations: (1) expert-aware smoothing aggregation to suppress activation outliers and stabilize quantization, (2) router logits distribution alignment to preserve expert selection consistency post-quantization, and (3) expert-level calibration data balance to optimize sparsely activated experts. Extensive experiments across W4A4 and extreme W3A4 quantization configurations demonstrate that EAQuant significantly outperforms existing methods, achieving average score improvements of 1.15 - 2.28% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at https://github.com/darren-fzq1/EAQuant.
中文: EAQuant是一种专为专家混合模型设计的新型训练后量化框架,通过三项关键技术解决激活异常值、路由器一致性和专家校准问题,在各种量化配置下实现了最先进的性能提升。
English: EAQuant is a novel post-training quantization framework designed for Mixture-of-Experts models that addresses activation outliers, router consistency, and expert calibration through three key innovations, achieving state-of-the-art performance improvements across various quantization settings.
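A hedged sketch of outlier smoothing in the expert-aware spirit: SmoothQuant-style per-channel scales migrate activation outliers into weights, with a max over experts standing in (as our assumption) for the paper's aggregation:

import torch

def smoothing_scales(act_absmax, weight_absmax, alpha=0.5):
    # s_j = a_j^alpha / w_j^(1-alpha); activations are divided by s, weights multiplied by s.
    return (act_absmax ** alpha) / (weight_absmax ** (1 - alpha) + 1e-8)

act = torch.rand(4, 1024) * 10                  # per-channel activation maxima, 4 experts
w = torch.rand(4, 1024)                         # per-channel weight maxima per expert
shared = smoothing_scales(act.max(0).values, w.max(0).values)  # one scale shared by all experts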
Authors:Bo Pan, Yixiao Fu, Ke Wang, Junyu Lu, Lunke Pan, Ziyang Qian, Yuhan Chen, Guoliang Wang, Yitao Zhou, Li Zheng, Yinghao Tang, Zhen Wen, Yuchen Wu, Junhua Lu, Biao Zhu, Minfeng Zhu, Bo Zhang, Wei Chen
Abstract:
Data visualization generation using Large Language Models (LLMs) has shown promising results but often produces suboptimal visualizations that require human intervention for improvement. In this work, we introduce VIS-Shepherd, a specialized Multimodal Large Language Model (MLLM)-based critic to evaluate and provide feedback for LLM-generated data visualizations. At the core of our approach is a framework to construct a high-quality visualization critique dataset, where we collect human-created visualization instances, synthesize corresponding LLM-generated instances, and construct high-quality critiques. We conduct both model-based automatic evaluation and human preference studies to evaluate the effectiveness of our approach. Our experiments show that even small (7B parameters) open-source MLLM models achieve substantial performance gains by leveraging our high-quality visualization critique dataset, reaching levels comparable to much larger open-source or even proprietary models. Our work demonstrates significant potential for MLLM-based automated visualization critique and indicates promising directions for enhancing LLM-based data visualization generation. Our project page: https://github.com/bopan3/VIS-Shepherd.
中文: VIS-Shepherd提出了一种基于多模态大语言模型的专门评估器,通过高质量的可视化评述数据集来改进大语言模型生成的数据可视化效果,使得较小模型也能达到与大型模型相当的性能水平。
English: VIS-Shepherd introduces a specialized MLLM-based critic that evaluates and improves LLM-generated data visualizations through high-quality critique datasets, enabling smaller models to achieve performance comparable to larger ones.
Authors:Wenlong Wan, Weiying Zheng, Tianyi Xiang, Guiqing Li, Shengfeng He
Abstract:
We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflectional movements; for example, collisions that produce sound often involve abrupt changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that estimates inflectional flow using the second derivative of motion to determine collision timings without relying on audio input. $TA^{2}Net$ also integrates a self-supervised spatial localization strategy during training, combining contrastive learning with spatial analysis. This dual design improves temporal localization accuracy and simultaneously identifies sound sources within video frames. To support this task, we introduce a new benchmark dataset, $Audible623$, derived from Kinetics and UCF101 by removing non-essential vocalization subsets. Extensive experiments confirm the effectiveness of our approach on $Audible623$ and show strong generalizability to other domains, such as repetitive counting and sound source localization. Code and dataset are available at https://github.com/WenlongWan/Audible623.
中文: 本文提出可听动作时序定位任务,旨在识别可听动作的时空坐标,并开发了TA²Net架构,通过运动二阶导数无需音频输入即可检测碰撞时刻,在新基准数据集Audible623上验证了其有效性。
English: This paper introduces Audible Action Temporal Localization, a task identifying spatio-temporal coordinates of audible movements, and proposes TA²Net, a novel architecture using motion's second derivative to detect collision timings without audio input, validated on the new Audible623 benchmark dataset.
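The abstract's premise, that collisions show up as spikes in the second derivative of motion, can be seen in one dimension; TA^2Net estimates this "inflectional flow" over full flow fields:

import numpy as np

def collision_timings(motion_mag, threshold=0.5):
    accel = np.gradient(np.gradient(motion_mag))     # second derivative of motion
    return np.where(np.abs(accel) >= threshold)[0]   # frames with abrupt change

t = np.arange(100, dtype=float)
motion = np.abs(t - 50)              # object approaches a wall, bounces at t = 50
print(collision_timings(motion))     # flags frames around the bounce: [49 50 51]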
Authors:Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe
Abstract:
Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy (particularly under context-length extrapolation), but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.
中文摘要:SeqPE提出了一种统一且完全可学习的位置编码框架,通过符号序列表示位置索引并采用对比学习与知识蒸馏目标,显著提升了外推能力和多模态适应性。
English Summary: SeqPE introduces a unified, learnable position encoding framework that uses symbolic sequences and complementary objectives to enhance extrapolation and adaptability across various tasks and modalities.
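A small sketch of the symbolic-sequence idea: a position index is rendered as digit tokens and encoded by a lightweight sequential model (a GRU here; the paper's encoder and tokenization may differ):

import torch
import torch.nn as nn

class SeqPositionEncoder(nn.Module):
    def __init__(self, d_model=64, vocab=11):   # digits 0-9 plus a dimension separator
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def encode_index(self, *coords):
        # e.g. (3, 12) -> tokens [3, SEP, 1, 2]
        tokens = []
        for i, c in enumerate(coords):
            if i > 0:
                tokens.append(10)                     # separator token
            tokens.extend(int(d) for d in str(c))     # decimal digits
        return torch.tensor(tokens).unsqueeze(0)

    def forward(self, *coords):
        seq = self.embed(self.encode_index(*coords))
        _, h = self.rnn(seq)
        return h.squeeze(0)       # one learned embedding per n-D position

pe = SeqPositionEncoder()
emb = pe(3, 12)                   # embedding for the 2-D position (3, 12)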
Authors:Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, Diange Yang
Abstract:
World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data. Existing methods struggle to disentangle ego-vehicle motion (perspective shifts) from scene evolution (agent interactions), leading to suboptimal predictions. Instead, we propose to separate environmental changes from ego-motion by leveraging scene-centric coordinate systems. In this paper, we introduce COME: a framework that integrates scene-centric forecasting Control into the Occupancy world ModEl. Specifically, COME first generates ego-irrelevant, spatially consistent future features through a scene-centric prediction branch, which are then converted into scene condition using a tailored ControlNet. These condition features are subsequently injected into the occupancy world model, enabling more accurate and controllable future occupancy predictions. Experimental results on the nuScenes-Occ3D dataset show that COME achieves consistent and significant improvements over state-of-the-art (SOTA) methods across diverse configurations, including different input sources (ground-truth, camera-based, fusion-based occupancy) and prediction horizons (3s and 8s). For example, under the same settings, COME achieves 26.3% better mIoU than DOME and 23.7% better mIoU than UniScene. These results highlight the efficacy of disentangled representation learning in enhancing spatio-temporal prediction fidelity for world models. Code and videos will be available at https://github.com/synsin0/COME.
中文: COME框架通过场景中心化预测将自车运动与环境变化解耦,在nuScenes-Occ3D数据集上相比现有最优方法实现了显著性能提升,证明了分离表征学习对提升世界模型预测精度的有效性。
English: The COME framework enhances autonomous driving world models by disentangling ego-motion from scene dynamics through scene-centric forecasting, achieving significant performance improvements over state-of-the-art methods on the nuScenes-Occ3D dataset.
Authors:Jiashu Dai, Along Wang, Binfan Ni, Tao Cao
Abstract:
Facial texture generation is crucial for high-fidelity 3D face reconstruction from a single image. However, existing methods struggle to generate UV albedo maps with high-frequency details. To address this challenge, we propose a novel end-to-end coarse-to-fine approach for UV albedo map generation. Our method first utilizes a UV Albedo Parametric Model (UVAPM), driven by low-dimensional coefficients, to generate coarse albedo maps with skin tones and low-frequency texture details. To capture high-frequency details, we train a detail generator using a decoupled albedo map dataset, producing high-resolution albedo maps. Extensive experiments demonstrate that our method can generate high-fidelity textures from a single image, outperforming existing methods in terms of texture quality and realism. The code and pre-trained model are publicly available at https://github.com/MVIC-DAI/UVAPM, facilitating reproducibility and further research.
中文: 本文提出一种新颖的由粗到精方法,先通过参数化模型生成基础面部纹理,再借助细节生成器增强高频细节,在单图像3D人脸重建中实现了卓越的纹理质量和真实感。
English: This paper introduces a novel coarse-to-fine approach that first generates basic facial textures using a parametric model and then enhances them with high-frequency details through a detail generator, achieving superior texture quality and realism in single-image 3D face reconstruction.
Authors:Philipp Spohn, Leander Girrbach, Jessica Bader, Zeynep Akata
Abstract:
As large language models (LLMs) are trained on massive datasets, they have raised significant privacy and ethical concerns due to their potential to inadvertently retain sensitive information. Unlearning seeks to selectively remove specific data from trained models, such as personal information or copyrighted content. Current approaches targeting specific output sequences at the token level often fail to achieve complete forgetting and remain susceptible to prompt rephrasing. We propose Align-then-Unlearn, a novel framework that performs unlearning in the semantic embedding space rather than directly on output tokens. Align-then-Unlearn first augments the LLM with an embedding prediction module trained to anticipate future context representations. Unlearning is then achieved by fine-tuning the model to minimize the similarity between these predicted embeddings and a target embedding that represents the concept to be removed. Initial results show that Align-then-Unlearn effectively removes targeted knowledge with minimal degradation in overall model utility. These findings suggest that embedding-based unlearning offers a promising and robust approach to removing conceptual knowledge. Our code is available at https://github.com/ExplainableML/align-then-unlearn.
Chinese: Align-then-Unlearn框架通过在语义嵌入空间执行遗忘操作,有效移除大型语言模型中的特定知识,同时保持模型整体性能,为解决隐私问题提供了新途径。
English: The Align-then-Unlearn framework addresses privacy concerns in large language models by performing unlearning in the semantic embedding space, effectively removing targeted knowledge while preserving overall model utility.
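The unlearning objective reduces to pushing predicted context embeddings away from the target concept; a minimal sketch, with the embedding-prediction head abstracted away:

import torch
import torch.nn.functional as F

def unlearn_loss(predicted_embs, target_concept_emb):
    # predicted_embs: (batch, d) from the embedding-prediction module;
    # target_concept_emb: (d,) embedding of the concept to remove.
    sims = F.cosine_similarity(predicted_embs, target_concept_emb.unsqueeze(0), dim=-1)
    return sims.mean()    # descending this similarity drives the concept out

preds = torch.randn(4, 256, requires_grad=True)
loss = unlearn_loss(preds, torch.randn(256))
loss.backward()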
Authors:Ting Qiao, Yiming Li, Jianbin Li, Yingjia Wang, Leyi Qi, Junfeng Guo, Ruili Feng, Dacheng Tao
Abstract:
Deep neural networks (DNNs) rely heavily on high-quality open-source datasets (e.g., ImageNet) for their success, making dataset ownership verification (DOV) crucial for protecting public dataset copyrights. In this paper, we find existing DOV methods (implicitly) assume that the verification process is faithful, where the suspicious model will directly verify ownership by using the verification samples as input and returning their results. However, this assumption may not necessarily hold in practice and their performance may degrade sharply when subjected to intentional or unintentional perturbations. To address this limitation, we propose the first certified dataset watermark (i.e., CertDW) and a CertDW-based certified dataset ownership verification method that ensures reliable verification even under malicious attacks, under certain conditions (e.g., constrained pixel-level perturbation). Specifically, inspired by conformal prediction, we introduce two statistical measures, principal probability (PP) and watermark robustness (WR), to assess model prediction stability on benign and watermarked samples under noise perturbations. We prove a lower bound relating PP and WR, enabling ownership verification when a suspicious model's WR value significantly exceeds the PP values of multiple benign models trained on watermark-free datasets. If the number of PP values smaller than WR exceeds a threshold, the suspicious model is regarded as having been trained on the protected dataset. Extensive experiments on benchmark datasets verify the effectiveness of our CertDW method and its resistance to potential adaptive attacks. Our codes are available at https://github.com/NcepuQiaoTing/CertDW.
中文: 本文提出CertDW认证数据集水印方法,通过统计指标评估模型预测稳定性,在恶意攻击下仍能实现可靠的数据集所有权验证,并提供可证明的保障。
English: This paper introduces CertDW, a certified dataset watermarking method that ensures reliable dataset ownership verification under malicious attacks by leveraging statistical measures to assess model prediction stability and providing provable guarantees.
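The decision rule in the abstract amounts to a counting test; the values below are hypothetical:

import numpy as np

def certdw_verify(wr_suspicious, pp_benign, threshold):
    # Flag dataset use when enough benign models have PP below the suspect's WR.
    count = int(np.sum(np.asarray(pp_benign) < wr_suspicious))
    return count > threshold

pp_values = [0.62, 0.58, 0.65, 0.60, 0.63]   # PP of benign models (hypothetical)
print(certdw_verify(wr_suspicious=0.81, pp_benign=pp_values, threshold=3))   # True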
Authors:Thomas Möllenhoff, Siddharth Swaroop, Finale Doshi-Velez, Mohammad Emtiyaz Khan
Abstract:
ADMM, which originated in the 1970s, is a popular method for federated deep learning; even though many new variants of it have been proposed since then, its core algorithmic structure has remained unchanged. Here, we take a major departure from the old structure and present a fundamentally new way to derive and extend federated ADMM. We propose to use a structure called Bayesian Duality which exploits a duality of the posterior distributions obtained by solving a variational-Bayesian reformulation of the original problem. We show that this naturally recovers the original ADMM when isotropic Gaussian posteriors are used, and yields non-trivial extensions for other posterior forms. For instance, full-covariance Gaussians lead to Newton-like variants of ADMM, while diagonal covariances result in a cheap Adam-like variant. This is especially useful to handle heterogeneity in federated deep learning, giving up to 7% accuracy improvements over recent baselines. Our work opens a new Bayesian path to improve primal-dual methods.
Chinese: 本研究提出了一种基于贝叶斯对偶的新方法,从根本上重构了联邦ADMM,通过引入牛顿式和Adam式变体,在异构联邦学习环境中将准确率提升高达7%。
English: The study introduces a novel Bayesian duality approach to fundamentally reformulate federated ADMM, enabling extensions like Newton-like and Adam-like variants that improve accuracy by up to 7% in heterogeneous federated learning settings.
Authors:Jinguang Tong, Xuesong Li, Fahira Afzal Maken, Sundaram Muthu, Lars Petersson, Chuong Nguyen, Hongdong Li
Abstract:
3D modeling of highly reflective objects remains challenging due to strong view-dependent appearances. While previous SDF-based methods can recover high-quality meshes, they are often time-consuming and tend to produce over-smoothed surfaces. In contrast, 3D Gaussian Splatting (3DGS) offers the advantage of high speed and detailed real-time rendering, but extracting surfaces from the Gaussians can be noisy due to the lack of geometric constraints. To bridge the gap between these approaches, we propose a novel reconstruction method called GS-2DGS for reflective objects based on 2D Gaussian Splatting (2DGS). Our approach combines the rapid rendering capabilities of Gaussian Splatting with additional geometric information from foundation models. Experimental results on synthetic and real datasets demonstrate that our method significantly outperforms Gaussian-based techniques in terms of reconstruction and relighting and achieves performance comparable to SDF-based methods while being an order of magnitude faster. Code is available at https://github.com/hirotong/GS2DGS
中文摘要:本文提出GS-2DGS新方法,结合高斯泼溅的快速渲染与基础模型的几何约束,实现了对高反光物体的高质量高效三维重建,在速度和细节上均优于现有技术。
English Summary: This paper introduces GS-2DGS, a novel method that combines Gaussian Splatting's fast rendering with geometric constraints from foundation models to achieve high-quality, efficient 3D reconstruction of reflective objects, outperforming existing techniques in speed and detail.
Authors:Shahram Najam Syed, Ishir Roongta, Kavin Ravie, Gangadhar Nageswar
Abstract:
Visual simultaneous localization and mapping (SLAM) must remain accurate under extreme viewpoint, scale and illumination variations. The widely adopted ORB-SLAM3 falters in these regimes because it relies on hand-crafted ORB keypoints. We introduce SuperPoint-SLAM3, a drop-in upgrade that (i) replaces ORB with the self-supervised SuperPoint detector--descriptor, (ii) enforces spatially uniform keypoints via adaptive non-maximal suppression (ANMS), and (iii) integrates a lightweight NetVLAD place-recognition head for learning-based loop closure.
On the KITTI Odometry benchmark SuperPoint-SLAM3 reduces mean translational error from 4.15% to 0.34% and mean rotational error from 0.0027 deg/m to 0.0010 deg/m. On the EuRoC MAV dataset it roughly halves both errors across every sequence (e.g., V2\_03: 1.58% -> 0.79%). These gains confirm that fusing modern deep features with a learned loop-closure module markedly improves ORB-SLAM3 accuracy while preserving its real-time operation.
Implementation, pretrained weights and reproducibility scripts are available at https://github.com/shahram95/SuperPointSLAM3.
中文:SuperPoint-SLAM3通过融合SuperPoint检测器、自适应非极大值抑制和NetVLAD闭环模块,显著降低了在KITTI和EuRoC数据集上的定位误差,同时保持了ORB-SLAM3的实时运行特性。
English: SuperPoint-SLAM3 enhances ORB-SLAM3 by integrating the SuperPoint detector, adaptive non-maximal suppression, and a NetVLAD loop-closure module, significantly reducing localization errors on KITTI and EuRoC benchmarks while maintaining real-time performance.
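As a sketch of step (ii), a common form of adaptive non-maximal suppression assigns every keypoint a suppression radius equal to its distance to the nearest stronger detection and keeps the k largest radii; the numpy version below follows this standard recipe, which may differ in details from the repository's implementation.

```python
import numpy as np

def anms(xy: np.ndarray, scores: np.ndarray, k: int) -> np.ndarray:
    """Adaptive non-maximal suppression: each keypoint's suppression radius
    is its distance to the nearest strictly stronger keypoint; keeping the
    k largest radii yields spatially uniform detections.
    xy is (n, 2), scores is (n,). Returns indices of kept keypoints."""
    n = len(scores)
    radii = np.full(n, np.inf)  # strongest point keeps an infinite radius
    d2 = np.sum((xy[:, None, :] - xy[None, :, :]) ** 2, axis=-1)
    for i in range(n):
        stronger = scores > scores[i]
        if stronger.any():
            radii[i] = np.sqrt(d2[i, stronger].min())
    return np.argsort(-radii)[:k]  # largest suppression radii first

rng = np.random.default_rng(0)
pts, sc = rng.uniform(0, 640, size=(500, 2)), rng.uniform(size=500)
keep = anms(pts, sc, k=100)  # 100 spatially spread keypoints
```

Ranking by suppression radius rather than raw score is what spreads keypoints evenly across the image instead of letting them bunch up in high-texture regions.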
Authors:Qingfeng Chen, Shiyuan Li, Yixin Liu, Shirui Pan, Geoffrey I. Webb, Shichao Zhang
Abstract:
Graph neural networks (GNNs) excel in graph representation learning by integrating graph structure and node features. Existing GNNs, unfortunately, fail to account for the uncertainty of class probabilities that vary with the depth of the model, leading to unreliable and risky predictions in real-world scenarios. To bridge the gap, in this paper, we propose a novel Evidence Fusing Graph Neural Network (EFGNN for short) to achieve trustworthy prediction, enhance node classification accuracy, and make explicit the risk of wrong predictions. In particular, we integrate the evidence theory with multi-hop propagation-based GNN architecture to quantify the prediction uncertainty of each node with the consideration of multiple receptive fields. Moreover, a parameter-free cumulative belief fusion (CBF) mechanism is developed to leverage the changes in prediction uncertainty and fuse the evidence to improve the trustworthiness of the final prediction. To effectively optimize the EFGNN model, we carefully design a joint learning objective composed of evidence cross-entropy, dissonance coefficient, and false confident penalty. The experimental results on various datasets and theoretical analyses demonstrate the effectiveness of the proposed model in terms of accuracy and trustworthiness, as well as its robustness to potential attacks. The source code of EFGNN is available at https://github.com/Shiy-Li/EFGNN.
Chinese: 本文提出了一种新颖的证据融合图神经网络(EFGNN),通过将证据理论与多跳传播相结合,提高了节点分类的准确性并量化预测不确定性,从而增强了预测的可靠性和抗攻击鲁棒性。
English: This paper introduces a novel Evidence Fusing Graph Neural Network (EFGNN) that integrates evidence theory with multi-hop propagation to enhance node classification accuracy and quantify prediction uncertainty, achieving improved trustworthiness and robustness against attacks.
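To ground the evidential machinery, the sketch below uses the standard subjective-logic mapping from non-negative per-class evidence to belief and uncertainty, and fuses evidence from multiple receptive fields by simple summation as one parameter-free reading of cumulative fusion; EFGNN's actual CBF rule and learned evidence heads may differ.

```python
import numpy as np

def dirichlet_uncertainty(evidence: np.ndarray):
    """Subjective-logic view used by evidential models: non-negative class
    evidence parameterizes a Dirichlet (alpha = e + 1); belief mass and a
    scalar uncertainty fall out in closed form."""
    alpha = evidence + 1.0
    S = alpha.sum()
    belief = evidence / S
    uncertainty = len(evidence) / S  # K / S: shrinks as total evidence grows
    return belief, uncertainty

# Evidence from three receptive fields (hops) for a 4-class node;
# numbers are illustrative only.
hops = [np.array([2.0, 0.5, 0.1, 0.1]),
        np.array([3.5, 0.2, 0.3, 0.0]),
        np.array([1.0, 0.1, 0.2, 0.1])]
fused = np.sum(hops, axis=0)          # parameter-free cumulative fusion
belief, u = dirichlet_uncertainty(fused)
print(belief.round(3), round(float(u), 3))  # fusing evidence lowers uncertainty
```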
Authors:Qidi Fang, Hang Yu, Shijie Fang, Jindan Huang, Qiuyu Chen, Reuben M. Aronson, Elaine S. Short
Abstract:
Reinforcement Learning from Human Feedback has recently achieved significant success in various fields, and its performance is highly related to feedback quality. While much prior work has acknowledged that human teachers' characteristics affect human feedback patterns, little work has closely investigated the actual effects. In this work, we designed an exploratory study investigating how human feedback patterns are associated with human characteristics. We conducted a public space study with two long-horizon tasks and 46 participants. We found that feedback patterns are not only correlated with task statistics, such as rewards, but also correlated with participants' characteristics, especially robot experience and educational background. Additionally, we demonstrated that human feedback value can be predicted more accurately from human characteristics than from task statistics alone. All human feedback and characteristics we collected, together with the code for data collection and feedback prediction, are available at https://github.com/AABL-Lab/CHARM.
中文摘要:强化学习中的人类反馈模式不仅与任务统计数据相关,还受到参与者个体特征的影响,结合人类特征比仅使用任务数据能更准确地预测反馈价值。
English Summary: Human feedback patterns in reinforcement learning are influenced by both task statistics and individual characteristics, with the inclusion of human traits enabling more accurate prediction of feedback value than relying solely on task data.
Authors:Xuhui Zhu, Jing Xu, Bingjie Wang, Huikang Dai, Hao Lu
Abstract:
Video Individual Counting (VIC) is a recently introduced task that aims to estimate pedestrian flux from a video. It extends conventional Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC that only learns to count repeated pedestrian patterns across frames, the key problem of VIC is how to identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, mainly follow a one-to-one (O2O) matching strategy where the same pedestrian must be exactly matched between frames, leading to sensitivity to appearance variations or missing detections. In this work, we show that the O2O matching could be relaxed to a one-to-many (O2M) matching problem, which better fits the problem nature of VIC and can leverage the social grouping behavior of walking pedestrians. We therefore introduce OMAN, a simple but effective VIC model with implicit One-to-Many mAtchiNg, featuring an implicit context generator and a one-to-many pairwise matcher. Experiments on the SenseCrowd and CroHD benchmarks show that OMAN achieves the state-of-the-art performance. Code is available at \href{https://github.com/tiny-smart/OMAN}{OMAN}.
中文: 视频个体计数(VIC)通过解决跨帧识别行人的对应问题扩展了视频人群计数,所提出的OMAN模型采用一对多匹配策略,在基准测试中实现了最先进的性能。
English: Video Individual Counting (VIC) extends Video Crowd Counting by addressing the correspondence problem of identifying pedestrians across frames, and the proposed OMAN model with one-to-many matching achieves state-of-the-art performance on benchmarks.
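A toy version of the one-to-many intuition: rather than forcing each current-frame pedestrian into a single best match, aggregate its match probability against all previous-frame pedestrians (e.g., members of the same walking group) and count it as new inflow only when that aggregate evidence stays low. This is illustrative only; OMAN's context generator and pairwise matcher are learned modules.

```python
import numpy as np

def count_inflow(feat_prev: np.ndarray, feat_curr: np.ndarray, thr: float = 0.5) -> int:
    """One-to-many matching sketch for VIC: each current pedestrian may match
    several previous-frame pedestrians, so we aggregate pairwise match
    probabilities instead of committing to a one-to-one assignment."""
    sim = feat_curr @ feat_prev.T            # (n_curr, n_prev) similarities
    p = 1.0 / (1.0 + np.exp(-sim))           # pairwise match probabilities
    p_any = 1.0 - np.prod(1.0 - p, axis=1)   # prob. of matching at least one
    return int((p_any < thr).sum())          # low evidence => new pedestrian

rng = np.random.default_rng(1)
prev, curr = rng.normal(size=(6, 16)), rng.normal(size=(8, 16))
print(count_inflow(prev, curr))              # inflow count for this frame pair
```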
Authors:Kai Tang, Ji Zhang, Hua Meng, Minbo Ma, Qi Xiong, Fengmao Lv, Jie Xu, Tianrui Li
Abstract:
Multivariate time series forecasting (MTSF) is a critical task with broad applications in domains such as meteorology, transportation, and economics. Nevertheless, pervasive missing values caused by sensor failures or human errors significantly degrade forecasting accuracy. Prior efforts usually employ an impute-then-forecast paradigm, leading to suboptimal predictions due to error accumulation and misaligned objectives between the two stages. To address this challenge, we propose the Collaborative Imputation-Forecasting Network (CoIFNet), a novel framework that unifies imputation and forecasting to achieve robust MTSF in the presence of missing values. Specifically, CoIFNet takes the observed values, mask matrix and timestamp embeddings as input, processing them sequentially through the Cross-Timestep Fusion (CTF) and Cross-Variate Fusion (CVF) modules to capture temporal dependencies that are robust to missing values. We provide theoretical justifications on how our CoIFNet learning objective improves the performance bound of MTSF with missing values. Through extensive experiments on challenging MTSF benchmarks, we demonstrate the effectiveness and computational efficiency of our proposed approach across diverse missing-data scenarios, e.g., CoIFNet outperforms the state-of-the-art method by $\underline{\textbf{24.40}}$% ($\underline{\textbf{23.81}}$%) at a point (block) missing rate of 0.6, while improving memory and time efficiency by $\underline{\boldsymbol{4.3\times}}$ and $\underline{\boldsymbol{2.1\times}}$, respectively. Our code is available at: https://github.com/KaiTang-eng/CoIFNet.
Chinese: 提出的CoIFNet框架将填补与预测统一起来,在存在缺失值的多元时间序列预测中提升了准确性,相比现有方法实现了显著的性能提升和计算效率。
English: The proposed CoIFNet framework unifies imputation and forecasting to enhance multivariate time series forecasting accuracy in the presence of missing values, achieving significant performance improvements and computational efficiency over existing methods.
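The input handling described in the abstract, observed values plus mask matrix plus timestamp embeddings flowing through cross-timestep and cross-variate mixing, can be sketched as below; layer sizes and the choice of plain linear mixing are assumptions for illustration, not CoIFNet's actual modules.

```python
import torch
import torch.nn as nn

class CoIFNetSketch(nn.Module):
    """Minimal sketch: values, missingness mask, and timestamp embeddings
    enter together, then pass through cross-variate and cross-timestep
    mixing before a forecast head."""
    def __init__(self, n_vars: int, seq_len: int, horizon: int, t_emb: int = 8):
        super().__init__()
        self.cvf = nn.Linear(2 * n_vars + t_emb, n_vars)  # cross-variate fusion
        self.ctf = nn.Linear(seq_len, seq_len)            # cross-timestep fusion
        self.head = nn.Linear(seq_len, horizon)

    def forward(self, x, mask, ts_emb):
        # x, mask: (B, T, V); ts_emb: (B, T, t_emb). Missing entries zeroed.
        h = torch.cat([x * mask, mask, ts_emb], dim=-1)   # (B, T, 2V + t_emb)
        h = self.cvf(h)                                   # mix across variates
        h = self.ctf(h.transpose(1, 2))                   # mix across time: (B, V, T)
        return self.head(h).transpose(1, 2)               # (B, horizon, V)

net = CoIFNetSketch(n_vars=7, seq_len=96, horizon=24)
x, m = torch.randn(4, 96, 7), torch.bernoulli(torch.full((4, 96, 7), 0.6))
print(net(x, m, torch.randn(4, 96, 8)).shape)             # torch.Size([4, 24, 7])
```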
Authors:Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
Abstract:
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. We design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens. We evaluate our method using emerging LRMs such as Qwen-8B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5$\times$ speedup for attention in long-context reasoning applications. Our code is available at https://github.com/SqueezeAILab/MultipoleAttention.
Chinese: 提出的多极注意力方法通过仅对关键标记计算精确注意力并对其他标记进行聚类近似,加速大型推理模型的自回归推理过程,在保持精度的同时实现了长上下文应用中高达4.5倍的注意力加速。
English: The proposed Multipole Attention method accelerates autoregressive reasoning in Large Reasoning Models by computing exact attention only for key tokens and approximating others through clustering, maintaining accuracy while achieving up to 4.5× speedup in long-context applications.
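The core approximation can be sketched for a single query vector: cluster the keys, attend exactly to keys in the clusters most aligned with the query, and replace each remaining cluster by its centroid with a log-size correction so it stands in for all of its members in the softmax. This is a didactic reconstruction, not the released kernels, and it borrows scikit-learn's KMeans for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def multipole_attention(q, K, V, n_clusters=8, n_exact=2):
    """Centroid-approximated attention for one query q over keys K, values V."""
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(K)
    labels, cents = km.labels_, km.cluster_centers_
    order = np.argsort(-(cents @ q))          # clusters ranked by affinity to q
    logits, values = [], []
    for rank, c in enumerate(order):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        if rank < n_exact:                    # exact attention for near clusters
            logits.append(K[idx] @ q)
            values.append(V[idx])
        else:                                 # one size-weighted centroid term
            logits.append(np.array([cents[c] @ q + np.log(len(idx))]))
            values.append(V[idx].mean(axis=0, keepdims=True))
    logits = np.concatenate(logits)
    values = np.vstack(values)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values

rng = np.random.default_rng(0)
K, V = rng.normal(size=(256, 32)), rng.normal(size=(256, 32))
print(multipole_attention(rng.normal(size=32), K, V).shape)  # (32,)
```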
Authors:Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, Lin Ma
Abstract:
Recent advancements in large language models (LLMs) have witnessed a surge in the development of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods solely employing reinforcement learning (RL) can struggle with sample inefficiency and with activating entirely absent reasoning capabilities, while conventional pipelines that initiate with a cold-start supervised fine-tuning (SFT) phase before RL may restrict the model's exploratory capacity and face suboptimal convergence. In this work, we introduce \textbf{Metis-RISE} (\textbf{R}L \textbf{I}ncentivizes and \textbf{S}FT \textbf{E}nhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE distinctively omits an initial SFT stage, beginning instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. Subsequently, the targeted SFT stage addresses two key challenges identified during RL: (1) \textit{inefficient trajectory sampling} for tasks where the model possesses but inconsistently applies correct reasoning, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) \textit{fundamental capability absence}, which we address by injecting expert-augmented knowledge for prompts where the model entirely fails. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard demonstrate that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall. Please refer to our project page for open-source information.
Chinese: 本文提出Metis-RISE多模态推理模型,该方法先通过强化学习激发模型推理能力,再采用监督微调解决采样效率低和知识缺失问题,在评测中取得了同类模型的最佳性能。
English: This paper introduces Metis-RISE, a novel multimodal reasoning model that starts with reinforcement learning to activate reasoning capabilities before applying supervised fine-tuning to address inefficiencies and knowledge gaps, achieving state-of-the-art performance on benchmarks.
Authors:Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, Daniel Povey
Abstract:
Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference speeds due to massive parameters. To address this issue, this paper introduces ZipVoice, a high-quality flow-matching-based zero-shot TTS model with a compact model size and fast inference speed. Key designs include: 1) a Zipformer-based vector field estimator to maintain adequate modeling capabilities under constrained size; 2) average upsampling-based initial speech-text alignment and a Zipformer-based text encoder to improve speech intelligibility; 3) a flow distillation method to reduce sampling steps and eliminate the inference overhead associated with classifier-free guidance. Experiments on 100k hours of multilingual datasets show that ZipVoice matches state-of-the-art models in speech quality, while being 3 times smaller and up to 30 times faster than a DiT-based flow-matching baseline. Codes, model checkpoints and demo samples are publicly available at https://github.com/k2-fsa/ZipVoice.
中文: ZipVoice是一种紧凑快速的零样本文本转语音模型,在保持顶尖语音质量的同时,模型体积缩小3倍且推理速度提升高达30倍。
English: ZipVoice is a compact and fast zero-shot text-to-speech model that achieves state-of-the-art speech quality while being 3 times smaller and up to 30 times faster than existing models.
Authors:Can Polat, Hasan Kurban, Erchin Serpedin, Mustafa Kurban
Abstract:
Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision--language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
中文摘要:本研究提出一个多尺度多晶体数据集及两项基于物理原理的评估基准,通过空间排除和成分排除协议,系统测试多模态基础模型在晶体学推理中的泛化能力、物理一致性及可靠性。
English Summary: This study introduces a multiscale multicrystal dataset with two physics-grounded benchmarks to evaluate multimodal foundation models' crystallographic reasoning by testing their generalization, physical consistency, and reliability through structured exclusion protocols.
Authors:Adhrith Vutukuri, Akash Awasthi, David Yang, Carol C. Wu, Hien Van Nguyen
Abstract:
Chest radiography is widely used in diagnostic imaging. However, perceptual errors -- especially overlooked but visible abnormalities -- remain common and clinically significant. Current workflows and AI systems provide limited support for detecting such errors after interpretation and often lack meaningful human--AI collaboration. We introduce RADAR (Radiologist--AI Diagnostic Assistance and Review), a post-interpretation companion system. RADAR ingests finalized radiologist annotations and CXR images, then performs regional-level analysis to detect and refer potentially missed abnormal regions. The system supports a "second-look" workflow and offers suggested regions of interest (ROIs) rather than fixed labels to accommodate inter-observer variation. We evaluated RADAR on a simulated perceptual-error dataset derived from de-identified CXR cases, using F1 score and Intersection over Union (IoU) as primary metrics. RADAR achieved a recall of 0.78, precision of 0.44, and an F1 score of 0.56 in detecting missed abnormalities in the simulated perceptual-error dataset. Although precision is moderate, this reduces over-reliance on AI by encouraging radiologist oversight in human--AI collaboration. The median IoU was 0.78, with more than 90% of referrals exceeding 0.5 IoU, indicating accurate regional localization. RADAR effectively complements radiologist judgment, providing valuable post-read support for perceptual-error detection in CXR interpretation. Its flexible ROI suggestions and non-intrusive integration position it as a promising tool in real-world radiology workflows. To facilitate reproducibility and further evaluation, we release a fully open-source web implementation alongside a simulated error dataset. All code, data, demonstration videos, and the application are publicly available at https://github.com/avutukuri01/RADAR.
中文摘要:RADAR是一种放射科医生解读后的辅助系统,通过区域分析和兴趣区建议来检测胸部X光片中可能被忽略的异常区域,在模拟错误检测中实现了0.78的召回率和0.56的F1分数,有效促进了人机协作。
English Summary: RADAR is a post-interpretation AI system that assists radiologists by detecting potentially missed abnormalities in chest X-rays through regional analysis and ROI suggestions, achieving effective human-AI collaboration with 0.78 recall and 0.56 F1 score in simulated error detection.
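For readers unfamiliar with the localization metric, IoU between a referred ROI and a ground-truth region is computed as below; this is the standard formula, shown here only to make the reported 0.78 median concrete.

```python
def iou(box_a, box_b):
    """Intersection over Union for (x1, y1, x2, y2) boxes, the regional
    localization metric reported for RADAR's referred ROIs."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47
```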
Authors:Haiyang Guo, Fanhu Zeng, Fei Zhu, Jiayi Wang, Xukai Wang, Jingang Zhou, Hongbo Zhao, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Abstract:
The rapid advancement of generative models has empowered modern AI systems to comprehend and produce highly sophisticated content, even achieving human-level performance in specific domains. However, these models are fundamentally constrained by \emph{catastrophic forgetting}, i.e., a persistent challenge where models experience performance degradation on previously learned tasks when adapting to new tasks. To address this practical limitation, numerous approaches have been proposed to enhance the adaptability and scalability of generative AI in real-world applications. In this work, we present a comprehensive survey of continual learning methods for mainstream generative AI models, encompassing large language models, multimodal large language models, vision-language-action models, and diffusion models. Drawing inspiration from the memory mechanisms of the human brain, we systematically categorize these approaches into three paradigms: architecture-based, regularization-based, and replay-based methods, while elucidating their underlying methodologies and motivations. We further analyze continual learning setups for different generative models, including training objectives, benchmarks, and core backbones, thereby providing deeper insights into the field. The project page of this paper is available at https://github.com/Ghy0501/Awesome-Continual-Learning-in-Generative-Models.
中文: 本综述系统梳理了生成式AI模型的持续学习方法,通过借鉴人类记忆机制提出三种解决灾难性遗忘的范式,并深入分析了不同模型的训练设置与评估基准。
English: This survey systematically categorizes and analyzes continual learning methods for generative AI models to address catastrophic forgetting, drawing inspiration from human memory mechanisms and examining various training setups and benchmarks.
Authors:Christian Hilaire, Sima Didari
Abstract:
We propose a novel active learning framework for multi-view semantic segmentation. This framework relies on a new score that measures the discrepancy between point cloud distributions generated from the extra geometrical information derived from the model's prediction across different views. Our approach results in a data efficient and explainable active learning method. The source code is available at https://github.com/chilai235/viewpclAL.
Chinese: 本文提出了一种新颖的多视角语义分割主动学习框架,通过基于模型预测生成的点云分布差异评分,实现了数据高效且可解释的学习方法。
English: This paper introduces a novel active learning framework for multi-view semantic segmentation that utilizes a discrepancy score based on point cloud distributions from model predictions to achieve data efficiency and explainability.
Authors:Siqi Liang, Yudi Zhang, Yubo Wang
Abstract:
Sequential recommender systems aim to model users' evolving preferences by capturing patterns in their historical interactions. Recent advances in this area have leveraged deep neural networks and attention mechanisms to effectively represent sequential behaviors and time-sensitive interests. In this work, we propose C-TLSAN (Content-Enhanced Time-Aware Long- and Short-Term Attention Network), an extension of the TLSAN architecture that jointly models long- and short-term user preferences while incorporating semantic content associated with items, such as product descriptions.
C-TLSAN enriches the recommendation pipeline by embedding textual content linked to users' historical interactions directly into both long-term and short-term attention layers. This allows the model to learn from both behavioral patterns and rich item content, enhancing user and item representations across temporal dimensions. By fusing sequential signals with textual semantics, our approach improves the expressiveness and personalization capacity of recommendation systems.
We conduct extensive experiments on large-scale Amazon datasets, benchmarking C-TLSAN against state-of-the-art baselines, including recent sequential recommenders based on Large Language Models (LLMs), which represent interaction history and predictions in text form. Empirical results demonstrate that C-TLSAN consistently outperforms strong baselines in next-item prediction tasks. Notably, it improves AUC by 1.66%, Recall@10 by 93.99%, and Precision@10 by 94.80% on average over the best-performing baseline (TLSAN) across 10 Amazon product categories. These results highlight the value of integrating content-aware enhancements into temporal modeling frameworks for sequential recommendation. Our code is available at https://github.com/booml247/cTLSAN.
中文: 本文提出的C-TLSAN模型通过将物品语义内容融入长短时注意力层来增强时序偏好建模,在亚马逊数据集上的实验表明其显著超越了现有最优基线模型。
English: This paper introduces C-TLSAN, a sequential recommendation model that enhances temporal preference modeling by integrating item content into both long- and short-term attention layers, achieving significant performance improvements over state-of-the-art baselines on Amazon datasets.
Authors:Christian Zhou-Zheng, Philippe Pasquier
Abstract:
Existing work in automatic music generation has primarily focused on end-to-end systems that produce complete compositions or continuations. However, because musical composition is typically an iterative process, such systems make it difficult to engage in the back-and-forth between human and machine that is essential to computer-assisted creativity. In this study, we address the task of personalizable, multi-track, long-context, and controllable symbolic music infilling to enhance the process of computer-assisted composition. We present MIDI-RWKV, a novel model based on the RWKV-7 linear architecture, to enable efficient and coherent musical cocreation on edge devices. We also demonstrate that MIDI-RWKV admits an effective method of finetuning its initial state for personalization in the very-low-sample regime. We evaluate MIDI-RWKV and its state tuning on several quantitative and qualitative metrics, and release model weights and code at https://github.com/christianazinn/MIDI-RWKV.
Chinese: 本研究提出了MIDI-RWKV这一新型模型,通过其线性架构和状态调优功能,在边缘设备上实现个性化、可控的符号音乐填充,从而有效促进计算机辅助音乐创作中的协同创作过程。
English: This study introduces MIDI-RWKV, a novel model for personalized and controllable symbolic music infilling that enhances computer-assisted composition by enabling efficient musical co-creation on edge devices through its linear architecture and state tuning capabilities.
Authors:Xinyi Zhao, Congjing Zhang, Pei Guo, Wei Li, Lin Chen, Chaoyue Zhao, Shuai Huang
Abstract:
Video anomaly detection (VAD) is essential for enhancing safety and security by identifying unusual events across different environments. Existing VAD benchmarks, however, are primarily designed for general-purpose scenarios, neglecting the specific characteristics of smart home applications. To bridge this gap, we introduce SmartHome-Bench, the first comprehensive benchmark specially designed for evaluating VAD in smart home scenarios, focusing on the capabilities of multi-modal large language models (MLLMs). Our newly proposed benchmark consists of 1,203 videos recorded by smart home cameras, organized according to a novel anomaly taxonomy that includes seven categories, such as Wildlife, Senior Care, and Baby Monitoring. Each video is meticulously annotated with anomaly tags, detailed descriptions, and reasoning. We further investigate adaptation methods for MLLMs in VAD, assessing state-of-the-art closed-source and open-source models with various prompting techniques. Results reveal significant limitations in the current models' ability to detect video anomalies accurately. To address these limitations, we introduce the Taxonomy-Driven Reflective LLM Chain (TRLC), a new LLM chaining framework that achieves a notable 11.62% improvement in detection accuracy. The benchmark dataset and code are publicly available at https://github.com/Xinyi-0724/SmartHome-Bench-LLM.
中文摘要:本文提出了首个专为智能家居场景设计的视频异常检测基准SmartHome-Bench,并开发了新型TRLC框架,将检测准确率显著提升了11.62%。
English Summary: This paper introduces SmartHome-Bench, the first specialized benchmark for video anomaly detection in smart home environments, and proposes a novel framework called TRLC that significantly improves detection accuracy by 11.62%.
Authors:Xiaoya Tang, Bodong Zhang, Man Minh Ho, Beatrice S. Knudsen, Tolga Tasdizen
Abstract:
Despite the widespread adoption of transformers in medical applications, the exploration of multi-scale learning through transformers remains limited, while hierarchical representations are considered advantageous for computer-aided medical diagnosis. We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process, preserving the inherited multi-scale inductive biases. We also introduce a scale-wise attention mechanism that directly captures intra-scale and inter-scale associations. This mechanism complements patch-wise attention by enhancing spatial understanding and preserving global perception, which we refer to as local and global attention, respectively. Our model significantly outperforms baseline models in terms of classification accuracy, demonstrating its efficiency in bridging the gap between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The components are designed as plug-and-play for different CNN architectures and can be adapted for multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.
中文: 本研究提出了一种新颖的分层Transformer模型,融合了CNN的特征提取能力和ViT的表征优势,通过多尺度学习和注意力机制显著提升了医学图像分类的准确性。
English: This study introduces a novel hierarchical transformer model that combines CNNs' feature extraction with ViTs' representational power, enhancing medical image classification accuracy through multi-scale learning and attention mechanisms.
Authors:Naihao Deng, Kapotaksha Das, Rada Mihalcea, Vitaliy Popov, Mohamed Abouelenien
Abstract:
In clinical operations, teamwork can be the crucial factor that determines the final outcome, and prior studies have shown that sufficient collaboration is key to an operation's success. To understand how the team practices teamwork during the operation, we collected CliniDial from simulations of medical operations. CliniDial includes the audio data and its transcriptions, the simulated physiology signals of the patient manikins, and video of how the team operates from two camera angles. We annotate behavior codes following an existing framework to understand the teamwork process in CliniDial. We pinpoint three main characteristics of our dataset, namely its label imbalances, rich and natural interactions, and multiple modalities, and conduct experiments to test existing LLMs' capabilities in handling data with these characteristics. Experimental results show that CliniDial poses significant challenges to existing models, inviting future effort on developing methods that can deal with real-world clinical data. We open-source the codebase at https://github.com/MichiganNLP/CliniDial.
中文: 本研究介绍了来自医疗模拟的多模态数据集CliniDial,其标签不平衡、互动丰富且多模态的特点对现有大语言模型构成挑战,呼吁开发新方法以处理真实临床数据。
English: The study introduces CliniDial, a multimodal dataset from medical simulations, highlighting its label imbalances, rich interactions, and multiple modalities, which challenge existing LLMs and call for new methods to handle real-world clinical data.
Authors:Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Peijun Qing, Soroush Vosoughi, Jiang Gui
Abstract:
While large language models have demonstrated impressive reasoning abilities, their extension to the audio modality, particularly within large audio-language models (LALMs), remains underexplored. Addressing this gap requires a systematic approach that involves a capable base model, high-quality reasoning-oriented audio data, and effective training algorithms. In this work, we present a comprehensive solution for audio logical reasoning (ALR) tasks: we introduce SoundMind, a dataset of 6,446 audio-text annotated samples specifically curated to support complex reasoning. Building on this resource, we propose SoundMind-RL, a rule-based reinforcement learning (RL) algorithm designed to equip audio-language models with robust audio-text reasoning capabilities. By fine-tuning Qwen2.5-Omni-7B on the proposed SoundMind dataset using SoundMind-RL, we achieve strong and consistent improvements over state-of-the-art baselines on the SoundMind benchmark. This work highlights the benefit of combining high-quality, reasoning-focused datasets with specialized RL techniques, and contributes to advancing auditory intelligence in language models. The code and dataset introduced in this work are publicly available at https://github.com/xid32/SoundMind.
中文摘要:本研究提出了SoundMind专用数据集和基于规则的强化学习算法SoundMind-RL,共同增强了音频语言模型的逻辑推理能力,相比现有方法取得了显著提升。
English Summary: This research introduces SoundMind, a specialized dataset and a rule-based reinforcement learning algorithm called SoundMind-RL, which together enhance audio-language models' logical reasoning capabilities, achieving significant improvements over existing methods.
Authors:William Xia, Ishita Unde, Brian Ondov, Dina Demner-Fushman
Abstract:
Online medical literature has made health information more available than ever; however, the barrier of complex medical jargon prevents the general public from understanding it. Though parallel and comparable corpora for Biomedical Text Simplification have been introduced, these conflate the many syntactic and lexical operations involved in simplification. To enable more targeted development and evaluation, we present a fine-grained lexical simplification task and dataset, Jargon Explanations for Biomedical Simplification (JEBS, https://github.com/bill-from-ri/JEBS-data). The JEBS task involves identifying complex terms, classifying how to replace them, and generating replacement text. The JEBS dataset contains 21,595 replacements for 10,314 terms across 400 biomedical abstracts and their manually simplified versions. Additionally, we provide baseline results for a variety of rule-based and transformer-based systems for the three sub-tasks. The JEBS task, data, and baseline results pave the way for the development and rigorous evaluation of systems for replacing or explaining complex biomedical terms.
中文摘要:JEBS数据集通过提供细粒度词汇简化任务,识别、分类并生成医学术语的替换文本,解决了在线医学文献中复杂术语阻碍公众理解的问题,为开发精准的术语替换系统奠定了基础。
English Summary: The JEBS dataset addresses the challenge of simplifying complex medical jargon in online literature by providing a fine-grained lexical simplification task that identifies, classifies, and generates replacements for technical terms to improve public comprehension.
Authors:Larissa Mori, Carlos Sousa de Oliveira, Yuehwern Yih, Mario Ventresca
Abstract:
Legal passage retrieval is an important task that assists legal practitioners in the time-intensive process of finding relevant precedents to support legal arguments. This study investigates the task of retrieving legal passages or paragraphs from decisions of the Court of Justice of the European Union (CJEU), whose language is highly structured and formulaic, leading to repetitive patterns. Understanding when lexical or semantic models are more effective at handling the repetitive nature of legal language is key to developing retrieval systems that are more accurate, efficient, and transparent for specific legal domains. To this end, we explore when this routinized legal language is better suited for retrieval using methods that rely on lexical and statistical features, such as BM25, or dense retrieval models trained to capture semantic and contextual information. A qualitative and quantitative analysis with three complementary metrics shows that both lexical and dense models perform well in scenarios with more repetitive usage of language, whereas BM25 performs better than the dense models in more nuanced scenarios where repetition and verbatim quotes are less prevalent and in longer queries. Our experiments also show that BM25 is a strong baseline, surpassing off-the-shelf dense models in 4 out of 7 performance metrics. However, fine-tuning a dense model on domain-specific data led to improved performance, surpassing BM25 in most metrics, and we analyze the effect of the amount of data used in fine-tuning on the model's performance and temporal robustness. The code, dataset and appendix related to this work are available at: https://github.com/larimo/lexsem-legal-ir.
中文摘要:本研究探讨了词汇模型与语义模型在检索欧盟法院重复性法律段落时的效能,发现经过领域数据微调的语义模型在多数指标上优于传统BM25方法。
English Summary: This study examines the effectiveness of lexical versus semantic models for retrieving repetitive legal passages from CJEU decisions, finding that fine-tuned dense models generally outperform traditional BM25 methods in most metrics.
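For context, the BM25 baseline that proves so strong on formulaic CJEU language scores a document with the standard term-frequency/IDF saturation formula; a self-contained sketch (not the study's exact implementation or parameter settings) follows.

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Plain BM25 over tokenized documents: IDF-weighted, length-normalized,
    saturating term-frequency scoring. `corpus` is a list of token lists."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        n = sum(1 for d in corpus if term in d)        # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1.0)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["court", "ruled", "that"], ["the", "court", "of", "justice"],
        ["member", "states", "shall"]]
print(bm25_score(["court", "justice"], docs[1], docs))
```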
Authors:Matan Ben-Tov, Mor Geva, Mahmood Sharif
Abstract:
We study suffix-based jailbreaks--a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack (Zou et al., 2023), we observe that suffixes vary in efficacy: some markedly more universal--generalizing to many unseen harmful instructions--than others. We first show that GCG's effectiveness is driven by a shallow, critical mechanism, built on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon, with more universal suffixes being stronger hijackers. Subsequently, we show that these insights have practical implications: GCG universality can be efficiently enhanced (up to $\times$5 in some cases) at no additional computational cost, and can also be surgically mitigated, at least halving attack success with minimal utility loss. We release our code and data at http://github.com/matanbt/interp-jailbreak.
中文摘要:本研究揭示基于后缀的越狱攻击通过浅层机制劫持大语言模型的语境化过程,其中通用性越强的后缀劫持能力越强,并提出了既能有效增强攻击又能显著降低其成功率且不影响模型效用的方法。
English Summary: This research reveals that suffix-based jailbreak attacks on LLMs exploit a shallow mechanism to hijack the model's contextualization process, with more universal suffixes showing stronger hijacking effects, and demonstrates methods to both enhance and mitigate these attacks effectively.
Authors:Yan Sun, Qixin Zhang, Zhiyuan Yu, Xikun Zhang, Li Shen, Dacheng Tao
Abstract:
The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in practical deployment. To address this, semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights, thereby enabling hardware-friendly acceleration and reduced memory. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity through $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the super large combinatorial space, we propose a novel update method by introducing a moving average tracker of loss residuals instead of vanilla loss. Finally, we conduct comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples. Our code is available at https://github.com/woodenchild95/Maskpro.git.
中文摘要:该摘要提出了一种名为MaskPro的概率框架,通过学习权重分布的稀疏模式并采用创新的梯度稳定方法,有效提升了大型语言模型的推理效率,展现出卓越的性能和可扩展性。
English Summary: The abstract introduces MaskPro, a probabilistic framework designed to enhance inference efficiency in large language models by learning sparsity patterns through categorical distributions and a novel gradient stabilization method, achieving superior performance and scalability.
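One standard way to realize $N$-way sampling without replacement from a categorical distribution is the Gumbel-top-k trick: add Gumbel noise to the logits and keep the top $N$ indices. The sketch below shows only this sampling step under that assumption; MaskPro's policy-gradient training with the loss-residual tracker is not shown.

```python
import numpy as np

def sample_nm_mask(logits: np.ndarray, N: int) -> np.ndarray:
    """Draw an (N:M) mask for one group of M weights by N-way sampling
    without replacement from a categorical distribution, via the
    Gumbel-top-k trick: the top-N indices of (logits + Gumbel noise) are
    distributed as sequential sampling without replacement."""
    g = -np.log(-np.log(np.random.uniform(size=logits.shape)))  # Gumbel(0, 1)
    keep = np.argsort(-(logits + g))[:N]
    mask = np.zeros_like(logits)
    mask[keep] = 1.0
    return mask

logits = np.log(np.array([0.4, 0.3, 0.2, 0.05, 0.03, 0.01, 0.005, 0.005]))
print(sample_nm_mask(logits, N=2))  # e.g., a 2:8 mask over one weight group
```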
Authors:Hao Xu, Lechao Cheng, Yaxiong Wang, Shengeng Tang, Zhun Zhong
Abstract:
We present our solution to the MiGA Challenge at IJCAI 2025, which aims to recognize micro-gestures (MGs) from skeleton sequences for the purpose of hidden emotion understanding. MGs are characterized by their subtlety, short duration, and low motion amplitude, making them particularly challenging to model and classify. We adopt PoseC3D as the baseline framework and introduce three key enhancements: (1) a topology-aware skeleton representation specifically designed for the iMiGUE dataset to better capture fine-grained motion patterns; (2) an improved temporal processing strategy that facilitates smoother and more temporally consistent motion modeling; and (3) the incorporation of semantic label embeddings as auxiliary supervision to improve the model generalization. Our method achieves a Top-1 accuracy of 67.01\% on the iMiGUE test set. As a result of these contributions, our approach ranks third on the official MiGA Challenge leaderboard. The source code is available at \href{https://github.com/EGO-False-Sleep/Miga25_track1}{https://github.com/EGO-False-Sleep/Miga25\_track1}.
中文: 我们通过改进PoseC3D框架,引入拓扑感知骨骼表征、优化时序处理和语义标签嵌入三项关键技术,在微表情识别任务中取得67.01%的Top-1准确率,荣获MiGA挑战赛第三名。
English: We propose an enhanced PoseC3D framework with three key improvements—topology-aware skeleton representation, refined temporal processing, and semantic label embeddings—achieving 67.01% Top-1 accuracy and third place in the MiGA Challenge for micro-gesture-based emotion recognition.
Authors:Xinyuan Xia, Yuanyi Song, Haomin Ma, Jinyu Cai
Abstract:
With the rapid development of LLM-based agents, increasing attention has been given to their social interaction and strategic reasoning capabilities. However, existing Werewolf-based benchmarking platforms suffer from overly simplified game settings, incomplete evaluation metrics, and poor scalability. To address these limitations, we propose WereWolf-Plus, a multi-model, multi-dimensional, and multi-method benchmarking platform for evaluating multi-agent strategic reasoning in the Werewolf game. The platform offers strong extensibility, supporting customizable configurations for roles such as Seer, Witch, Hunter, Guard, and Sheriff, along with flexible model assignment and reasoning enhancement strategies for different roles. In addition, we introduce a comprehensive set of quantitative evaluation metrics for all special roles, werewolves, and the sheriff, and enrich the assessment dimensions for agent reasoning ability, cooperation capacity, and social influence. WereWolf-Plus provides a more flexible and reliable environment for advancing research on inference and strategic interaction within multi-agent communities. Our code is open sourced at https://github.com/MinstrelsyXia/WereWolfPlus.
中文: WereWolf-Plus 是一个先进的基准测试平台,通过提供可定制的角色、灵活的模型分配和全面的评估指标,解决了现有狼人杀游戏评估的局限性,以评估多智能体的策略推理、合作能力及社会影响力。
English: WereWolf-Plus is an advanced benchmarking platform that addresses the limitations of existing Werewolf game evaluations by offering customizable roles, flexible model assignments, and comprehensive metrics to assess multi-agent strategic reasoning, cooperation, and social influence.
Authors:Mustansar Fiaz, Mubashir Noman, Hiyam Debary, Kamran Ali, Hisham Cholakkal
Abstract:
Recently, convolution- and transformer-based change detection (CD) methods have provided promising performance. However, it remains unclear how the local and global dependencies interact to effectively alleviate the pseudo changes. Moreover, directly utilizing standard self-attention presents intrinsic limitations, including global feature representations that are limited in capturing subtle changes, quadratic complexity, and restricted training parallelism. To address these limitations, we propose a Siamese-based framework, called HyRet-Change, which can seamlessly integrate the merits of convolution and retention mechanisms at multi-scale features to preserve critical information and enhance adaptability in complex scenes. Specifically, we introduce a novel feature difference module to exploit both convolutions and multi-head retention mechanisms in a parallel manner to capture complementary information. Furthermore, we propose an adaptive local-global interactive context awareness mechanism that enables mutual learning and enhances discrimination capability through information exchange. We perform experiments on three challenging CD datasets and achieve state-of-the-art performance compared to existing methods. Our source code is publicly available at https://github.com/mustansarfiaz/HyRect-Change.
Chinese: HyRet-Change框架通过整合卷积和保留机制捕捉多尺度特征,并采用自适应局部-全局交互模块增强变化检测能力,在多个挑战性数据集上取得了领先性能。
English: The HyRet-Change framework integrates convolution and retention mechanisms to capture multi-scale features and uses an adaptive local-global interaction module to enhance change detection, achieving state-of-the-art results on challenging datasets.
Authors:Chenglin Wang, Yucheng Zhou, Qianning Wang, Zhe Wang, Kai Zhang
Abstract:
Text-driven image editing has achieved remarkable success in following single instructions. However, real-world scenarios often involve complex, multi-step instructions, particularly ``chain'' instructions where operations are interdependent. Current models struggle with these intricate directives, and existing benchmarks inadequately evaluate such capabilities. Specifically, they often overlook multi-instruction and chain-instruction complexities, and common consistency metrics are flawed. To address this, we introduce ComplexBench-Edit, a novel benchmark designed to systematically assess model performance on complex, multi-instruction, and chain-dependent image editing tasks. ComplexBench-Edit also features a new vision consistency evaluation method that accurately assesses non-modified regions by excluding edited areas. Furthermore, we propose a simple yet powerful Chain-of-Thought (CoT)-based approach that significantly enhances the ability of existing models to follow complex instructions. Our extensive experiments demonstrate ComplexBench-Edit's efficacy in differentiating model capabilities and highlight the superior performance of our CoT-based method in handling complex edits. The data and code are released at https://github.com/llllly26/ComplexBench-Edit.
中文摘要:本文提出了ComplexBench-Edit基准测试,用于评估图像编辑模型处理复杂多步指令的能力,并开发了一种思维链方法,显著提升了模型处理复杂编辑任务的性能。
English Summary: This paper introduces ComplexBench-Edit, a benchmark for evaluating image editing models on complex multi-step instructions, along with a Chain-of-Thought method that significantly improves model performance on such tasks.
Authors:Junbo Niu, Yuanhong Zheng, Ziyang Miao, Hejun Dong, Chunjiang Ge, Hao Liang, Ma Lu, Bohan Zeng, Qiahao Zheng, Conghui He, Wentao Zhang
Abstract:
Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks. Code is available at https://github.com/Niujunbo2002/NativeRes-LLaVA.
中文摘要:该摘要提出了RC-Bench这一针对极端视觉条件评估视觉语言模型性能的基准,以及NativeRes-LLaVA这一支持模型处理原生分辨率图像的开源框架,实验证明该方法能显著提升模型表现。
English Summary: The abstract introduces RC-Bench, a benchmark for evaluating Vision-Language Models under extreme visual conditions, and NativeRes-LLaVA, a framework enabling models to process images at native resolutions, demonstrating significant performance improvements.
Authors:Ruojing Li, Wei An, Xinyi Ying, Yingqian Wang, Yimian Dai, Longguang Wang, Miao Li, Yulan Guo, Li Liu
Abstract:
Infrared small target (IRST) detection is challenging in simultaneously achieving precise, universal, robust and efficient performance due to extremely dim targets and strong interference. Current learning-based methods attempt to leverage ``more'' information from both the spatial and the short-term temporal domains, but suffer from unreliable performance under complex conditions while incurring computational redundancy. In this paper, we explore the ``more essential'' information from a more crucial domain for the detection. Through theoretical analysis, we reveal that the global temporal saliency and correlation information in the temporal profile demonstrate significant superiority in distinguishing target signals from other signals. To investigate whether such superiority is preferentially leveraged by well-trained networks, we built the first prediction attribution tool in this field and verified the importance of the temporal profile information. Inspired by the above conclusions, we remodel the IRST detection task as a one-dimensional signal anomaly detection task, and propose an efficient deep temporal probe network (DeepPro) that only performs calculations in the time dimension for IRST detection. We conducted extensive experiments to fully validate the effectiveness of our method. The experimental results are exciting, as our DeepPro outperforms existing state-of-the-art IRST detection methods on widely-used benchmarks with extremely high efficiency, and achieves a significant improvement on dim targets and in complex scenarios. We provide a new modeling domain, a new insight, a new method, and a new performance, which can promote the development of IRST detection. Codes are available at https://github.com/TinaLRJ/DeepPro.
Chinese: 本文提出DeepPro,一种高效的深度时间探针网络,将红外小目标检测重塑为一维信号异常检测任务,通过利用全局时间显著性和相关性信息,在基准测试中实现了卓越性能和高效率。
English: This paper introduces DeepPro, an efficient deep temporal probe network that redefines infrared small target detection as a one-dimensional signal anomaly detection task, achieving superior performance and high efficiency on benchmarks by leveraging global temporal saliency and correlation information.
Authors:Joon Soo Yoo, Taeho Kim, Ji Won Yoon
Abstract:
Location-based services often require users to share sensitive locational data, raising privacy concerns due to potential misuse or exploitation by untrusted servers. In response, we present VeLoPIR, a versatile location-based private information retrieval (PIR) system designed to preserve user privacy while enabling efficient and scalable query processing. VeLoPIR introduces three operational modes (interval validation, coordinate validation, and identifier matching) that support a broad range of real-world applications, including information and emergency alerts. To enhance performance, VeLoPIR incorporates multi-level algorithmic optimizations with parallel structures, achieving significant scalability across both CPU and GPU platforms. We also provide formal security and privacy proofs, confirming the system's robustness under standard cryptographic assumptions. Extensive experiments on real-world datasets demonstrate that VeLoPIR achieves up to 11.55 times speed-up over a prior baseline. The implementation of VeLoPIR is publicly available at https://github.com/PrivStatBool/VeLoPIR.
中文: VeLoPIR是一种多功能的位置隐私信息检索系统,通过三种操作模式和多级优化技术保护用户隐私,在标准密码学假设下确保安全性,同时实现了显著的可扩展性和性能提升。
English: VeLoPIR is a versatile location-based private information retrieval system that protects user privacy through three operational modes and multi-level optimizations, achieving significant scalability and speed improvements while ensuring security under standard cryptographic assumptions.
Authors:David Guzman Piedrahita, Irene Strauss, Bernhard Schölkopf, Rada Mihalcea, Zhijing Jin
Abstract:
As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases continue to persist. While prior work has primarily examined socio-demographic and left--right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy--authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role-models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicit political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes. Our code is available at: https://github.com/irenestrauss/Democratic-Authoritarian-Bias-LLMs
中文摘要:本研究提出了一种评估大型语言模型在民主与专制价值观上倾向的新方法,发现模型虽普遍倾向民主价值观,但中文提示下对专制人物好感度上升,且常将其视为榜样。
English Summary: This study introduces a new method to evaluate how large language models align with the democracy-authoritarianism spectrum, finding they generally favor democratic values but show increased preference for authoritarian figures when prompted in Mandarin and often cite them as role models.
Authors:Rong Wu, Ziqi Chen, Liming Zhong, Heng Li, Hai Shu
Abstract:
Existing segmentation models trained on a single medical imaging dataset often lack robustness when encountering unseen organs or tumors. Developing a robust model capable of identifying rare or novel tumor categories not present during training is crucial for advancing medical imaging applications. We propose DSM, a novel framework that leverages diffusion and state space models to segment unseen tumor categories beyond the training data. DSM utilizes two sets of object queries trained within modified attention decoders to enhance classification accuracy. Initially, the model learns organ queries using an object-aware feature grouping strategy to capture organ-level visual features. It then refines tumor queries by focusing on diffusion-based visual prompts, enabling precise segmentation of previously unseen tumors. Furthermore, we incorporate diffusion-guided feature fusion to improve semantic segmentation performance. By integrating CLIP text embeddings, DSM captures category-sensitive classes to improve linguistic transfer knowledge, thereby enhancing the model's robustness across diverse scenarios and multi-label tasks. Extensive experiments demonstrate the superior performance of DSM in various tumor segmentation tasks. Code is available at https://github.com/Rows21/k-Means_Mask_Mamba.
中文: 提出的DSM框架结合扩散与状态空间模型,通过对象查询和CLIP文本嵌入技术,实现了对训练数据外未知肿瘤的精准分割,在医学影像任务中展现出卓越性能。
English: The proposed DSM framework combines diffusion and state space models with object queries and CLIP text embeddings to robustly segment unseen tumors beyond training data, demonstrating superior performance in medical imaging tasks.
Authors:Hang Xu, Wei Yu, Jiangtong Tan, Zhen Zou, Feng Zhao
Abstract:
Blind Super-Resolution (blind SR) aims to enhance the model's generalization ability with unknown degradation, yet it still encounters severe overfitting issues. Some previous methods inspired by dropout, which enhances generalization by regularizing features, have shown promising results in blind SR. Nevertheless, these methods focus solely on regularizing features before the final layer and overlook the need for generalization in features at intermediate layers. Without explicit regularization of features at intermediate layers, the blind SR network struggles to obtain well-generalized feature representations. However, the key challenge is that directly applying dropout to intermediate layers leads to a significant performance drop, which we attribute to the training-testing and cross-layer inconsistencies it introduces. Therefore, we propose Adaptive Dropout, a new regularization method for blind SR models, which mitigates the inconsistency and facilitates application across intermediate layers of networks. Specifically, for training-testing inconsistency, we re-design the form of dropout and integrate the features before and after dropout adaptively. For inconsistency in generalization requirements across different layers, we innovatively design an adaptive training strategy to strengthen feature propagation by layer-wise annealing. Experimental results show that our method outperforms all past regularization methods on both synthetic and real-world benchmark datasets, and is also highly effective in other image restoration tasks. Code is available at https://github.com/xuhang07/Adpative-Dropout.
Chinese: 本文提出自适应Dropout方法,通过自适应特征融合和逐层退火策略解决盲超分辨率中的训练-测试及跨层不一致性问题,在合成与真实基准数据集上超越现有正则化方法,并有效应用于其他图像复原任务。
English: This paper introduces Adaptive Dropout, a novel regularization method for blind super-resolution that addresses training-testing and cross-layer inconsistencies by adaptively integrating features and employing layer-wise annealing, achieving state-of-the-art performance on benchmark datasets and other image restoration tasks.
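To make the adaptive integration concrete, below is a minimal PyTorch sketch of a dropout layer that blends dropped and undropped features with a learnable gate; the gating form and test-time behavior are illustrative assumptions, not the authors' implementation, and the paper's layer-wise annealing schedule is omitted.

```python
import torch
import torch.nn as nn

class AdaptiveDropout(nn.Module):
    """Blend features before and after dropout with a learnable gate."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 to start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return x  # identity at test time, so no train-test mismatch
        dropped = nn.functional.dropout(x, p=self.p, training=True)
        alpha = torch.sigmoid(self.gate)
        # adaptively integrate the features before and after dropout
        return alpha * dropped + (1.0 - alpha) * x
```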
Authors:Xiangyang Li, Xiaopeng Li, Kuicai Dong, Quanhu Zhang, Rongju Ruan, Xinyi Dai, Xiaoshuang Liu, Shengchun Xu, Yasheng Wang, Ruiming Tang
Abstract:
Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPs and LiveCodeBench) contain questions of medium difficulty that pose no challenge to advanced LLMs. To better reflect advanced reasoning and code generation ability, we introduce Humanity's Last Code Exam (HLCE), comprising the 235 most challenging problems from the International Collegiate Programming Contest (ICPC World Finals) and the International Olympiad in Informatics (IOI) spanning 2010 - 2024. As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation. Through our comprehensive evaluation, we observe that even the strongest reasoning LLMs, o4-mini (high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively. Meanwhile, we propose a novel "self-recognition" task to measure LLMs' awareness of their own capabilities. Results indicate that LLMs' self-recognition abilities are not proportionally correlated with their code generation performance. Finally, our empirical validation of test-time scaling laws reveals that current advanced LLMs have substantial room for improvement on complex programming tasks. We expect HLCE to become a milestone challenge for code generation and to catalyze advances in high-performance reasoning and human-AI collaborative programming. Our code and dataset are publicly available (https://github.com/Humanity-s-Last-Code-Exam/HLCE).
中文总结:现有代码生成基准难以评估先进大语言模型,因此HLCE汇集了来自顶级编程竞赛的235道高难度题目,发现即使最强模型也表现不佳,揭示其在复杂编程任务上仍有巨大提升空间。
English Summary: Current code generation benchmarks fail to challenge advanced LLMs, so HLCE introduces 235 highly difficult problems from premier programming competitions, revealing that even top models achieve low success rates and have significant room for improvement.
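For reference, pass@1 figures like those above are usually computed with the unbiased pass@k estimator of Chen et al. (2021); whether HLCE uses exactly this estimator is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n total, c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```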
Authors:Xiaoyan Kui, Canwei Liu, Qinsong Li, Zhipeng Hu, Yangyang Shi, Weixin Si, Beiji Zou
Abstract:
Kolmogorov-Arnold Networks (KANs) are highly effective in long-term time series forecasting due to their ability to efficiently represent nonlinear relationships and exhibit local plasticity. However, prior research on KANs has predominantly focused on the time domain, neglecting the potential of the frequency domain. The frequency domain of time series data reveals recurring patterns and periodic behaviors, which complement the temporal information captured in the time domain. To address this gap, we explore the application of KANs in the frequency domain for long-term time series forecasting. By leveraging KANs' adaptive activation functions and their comprehensive representation of signals in the frequency domain, we can more effectively learn global dependencies and periodic patterns. To integrate information from both time and frequency domains, we propose the Time-Frequency KAN (TFKAN). TFKAN employs a dual-branch architecture that independently processes features from each domain, ensuring that the distinct characteristics of each domain are fully utilized without interference. Additionally, to account for the heterogeneity between domains, we introduce a dimension-adjustment strategy that selectively upscales only in the frequency domain, enhancing efficiency while capturing richer frequency information. Experimental results demonstrate that TFKAN consistently outperforms state-of-the-art (SOTA) methods across multiple datasets. The code is available at https://github.com/LcWave/TFKAN.
中文: 提出的时频KAN(TFKAN)通过双分支架构和选择性维度调整整合时域与频域信息,有效提升了长期时间序列预测性能,在多个数据集上超越了现有最优方法。
English: The proposed Time-Frequency KAN (TFKAN) integrates both time and frequency domains using a dual-branch architecture and selective dimension adjustment to enhance long-term time series forecasting, outperforming existing methods across multiple datasets.
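A minimal sketch of the dual-branch idea follows, with plain linear layers standing in for KAN layers and an upscaling factor applied only to the frequency branch; all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualBranchForecaster(nn.Module):
    """Time branch on the raw series, frequency branch on its rFFT."""

    def __init__(self, lookback: int, horizon: int, freq_scale: int = 2):
        super().__init__()
        n_freq = lookback // 2 + 1  # rFFT output length
        self.time_branch = nn.Linear(lookback, horizon)
        # dimension adjustment: upscale only in the frequency domain
        self.freq_branch = nn.Sequential(
            nn.Linear(2 * n_freq, freq_scale * 2 * n_freq),
            nn.GELU(),
            nn.Linear(freq_scale * 2 * n_freq, horizon),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback)
        spec = torch.fft.rfft(x, dim=-1)
        freq_feats = torch.cat([spec.real, spec.imag], dim=-1)
        return self.time_branch(x) + self.freq_branch(freq_feats)
```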
Authors:M. H. Maqbool, Moghis Fereidouni, Umar Farooq, A. B. Siddique, Hassan Foroosh
Abstract:
The mobile app market has expanded exponentially, offering millions of apps with diverse functionalities, yet research in mobile app recommendation remains limited. Traditional sequential recommender systems utilize the order of items in users' historical interactions to predict the next item for the users. Position embeddings, well-established in transformer-based architectures for natural language processing tasks, effectively distinguish token positions in sequences. In sequential recommendation systems, position embeddings can capture the order of items in a user's historical interaction sequence. Nevertheless, this ordering does not consider the time elapsed between two interactions of the same user (e.g., 1 day, 1 week, 1 month), referred to as "user rhythm". In mobile app recommendation datasets, the time between consecutive user interactions is notably longer compared to other domains like movies, posing significant challenges for sequential recommender systems. To address this phenomenon in the mobile app domain, we introduce INTERPOS, an Interaction Rhythm Guided Positional Morphing strategy for autoregressive mobile app recommender systems. INTERPOS incorporates rhythm-guided position embeddings, providing a more comprehensive representation that considers both the sequential order of interactions and the temporal gaps between them. This approach enables a deep understanding of users' rhythms at a fine-grained level, capturing the intricacies of their interaction patterns over time. We propose three strategies to incorporate the morphed positional embeddings in two transformer-based sequential recommendation system architectures. Our extensive evaluations show that INTERPOS outperforms state-of-the-art models using 7 mobile app recommendation datasets on NDCG@K and HIT@K metrics. The source code of INTERPOS is available at https://github.com/dlgrad/INTERPOS.
中文: 本研究提出INTERPOS策略,将节奏引导的位置嵌入整合到基于Transformer的序列推荐系统中,以更全面地捕捉用户交互的顺序和时间间隔,在移动应用推荐任务中展现出卓越性能。
English: The study introduces INTERPOS, a novel strategy that integrates rhythm-guided position embeddings into transformer-based sequential recommender systems to better capture both the order and temporal gaps in user interactions, demonstrating superior performance in mobile app recommendations.
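One way to picture rhythm-guided positional morphing: add a projection of the log-scaled time gap between consecutive interactions to the usual learned position embedding. This is a hedged sketch of the general idea, not one of the paper's three specific strategies.

```python
import torch
import torch.nn as nn

class RhythmPositionalEmbedding(nn.Module):
    """Position embedding morphed by each user's interaction rhythm."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_model)
        self.gap_proj = nn.Linear(1, d_model)

    def forward(self, item_emb: torch.Tensor, gaps: torch.Tensor) -> torch.Tensor:
        # item_emb: (batch, seq, d); gaps: (batch, seq) seconds since previous item
        idx = torch.arange(item_emb.size(1), device=item_emb.device)
        rhythm = self.gap_proj(torch.log1p(gaps).unsqueeze(-1))
        return item_emb + self.pos(idx) + rhythm
```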
Authors:Wenxiao Cai, Zongru Li, Iris Wang, Yu-Neng Wang, Thomas H. Lee
Abstract:
Machine learning has achieved remarkable advancements but at the cost of significant computational resources. This has created an urgent need for a novel and energy-efficient computational fabric and corresponding algorithms. CMOS Oscillator Networks (OscNet) are brain-inspired hardware specially designed for low energy consumption. In this paper, we propose a Hopfield Network based machine learning algorithm that can be implemented on OscNet. The network is trained using forward propagation alone to learn sparsely connected weights, yet achieves an 8% improvement in accuracy compared to conventional deep learning models on the MNIST dataset. OscNet v1.5 achieves competitive accuracy on MNIST and is well-suited for implementation using CMOS-compatible ring oscillator arrays with SHIL. In oscillator-based inference, we utilize only 24% of the connections used in a fully connected Hopfield network, with merely a 0.1% drop in accuracy. OscNet v1.5 relies solely on forward propagation and employs sparse connections, making it an energy-efficient machine learning pipeline designed for oscillator computing fabric. The repository for the OscNet family is: https://github.com/RussRobin/OscNet.
中文: OscNet v1.5 是一种基于CMOS振荡器硬件的高效能机器学习方案,仅通过前向传播和稀疏连接就在MNIST数据集上实现了与主流模型相当的精度,同时将连接数减少至24%且精度损失仅为0.1%。
English: OscNet v1.5 is an energy-efficient machine learning pipeline using forward propagation and sparse connections on CMOS oscillator hardware, achieving competitive accuracy on MNIST while using only 24% of the connections of a fully connected Hopfield network, with minimal performance loss.
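A rough sketch of the forward-only, sparsely connected Hopfield idea: Hebbian outer-product learning followed by magnitude pruning down to a 24% connection budget. The pruning rule here is an assumption of this sketch, not the paper's procedure.

```python
import numpy as np

def hebbian_sparse_weights(patterns: np.ndarray, keep: float = 0.24) -> np.ndarray:
    """Forward-only Hopfield learning: patterns is (P, N) with +/-1 entries."""
    W = patterns.T @ patterns / len(patterns)  # Hebbian outer-product rule
    np.fill_diagonal(W, 0.0)                   # no self-connections
    thresh = np.quantile(np.abs(W), 1.0 - keep)
    return np.where(np.abs(W) >= thresh, W, 0.0)  # keep largest 24% by magnitude
```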
Authors:Renjun Xu, Jingwen Peng
Abstract:
This survey examines the rapidly evolving field of Deep Research systems -- AI-powered applications that automate complex research workflows through the integration of large language models, advanced information retrieval, and autonomous reasoning capabilities. We analyze more than 80 commercial and non-commercial implementations that have emerged since 2023, including OpenAI/Deep Research, Gemini/Deep Research, Perplexity/Deep Research, and numerous open-source alternatives. Through comprehensive examination, we propose a novel hierarchical taxonomy that categorizes systems according to four fundamental technical dimensions: foundation models and reasoning engines, tool utilization and environmental interaction, task planning and execution control, and knowledge synthesis and output generation. We explore the architectural patterns, implementation approaches, and domain-specific adaptations that characterize these systems across academic, scientific, business, and educational applications. Our analysis reveals both the significant capabilities of current implementations and the technical and ethical challenges they present regarding information accuracy, privacy, intellectual property, and accessibility. The survey concludes by identifying promising research directions in advanced reasoning architectures, multimodal integration, domain specialization, human-AI collaboration, and ecosystem standardization that will likely shape the future evolution of this transformative technology. By providing a comprehensive framework for understanding Deep Research systems, this survey contributes to both the theoretical understanding of AI-augmented knowledge work and the practical development of more capable, responsible, and accessible research technologies. The paper resources can be viewed at https://github.com/scienceaix/deepresearch.
中文: 本综述深入分析了深度研究系统,通过提出新的分类法对其进行了系统归类,在指出其技术能力和伦理挑战的同时,为这一人工智能驱动技术的未来发展指明了研究方向。
English: This survey provides a comprehensive analysis of Deep Research systems, categorizing them through a novel taxonomy and highlighting their capabilities alongside technical and ethical challenges, while also identifying future research directions for advancing this AI-driven technology.
Authors:Darryl Ho, Samuel Madden
Abstract:
In recent years, large transformer-based video encoder models have greatly advanced state-of-the-art performance on video classification tasks. However, these large models typically process videos by averaging embedding outputs from multiple clips over time to produce fixed-length representations. This approach fails to account for a variety of time-related features, such as variable video durations, chronological order of events, and temporal variance in feature significance. While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. Our framework converts a video into a variable-length temporal sequence of embeddings, which we call a multivariate time series (MTS). An MTS naturally preserves temporal order and accommodates variable video durations. We then learn per-timestep, per-feature weights over the encoded MTS frames, allowing us to account for variations in feature importance over time. We introduce a new neural network architecture inspired by traditional time series alignment algorithms for this learning task. Our evaluation demonstrates that DejaVid substantially improves the performance of a state-of-the-art large encoder, achieving leading Top-1 accuracy of 77.2% on Something-Something V2, 89.1% on Kinetics-400, and 88.6% on HMDB51, while adding fewer than 1.8% additional learnable parameters and requiring less than 3 hours of training time. Our code is available at https://github.com/darrylho/DejaVid.
中文: DejaVid是一种与编码器无关的框架,通过将视频转换为带有时序特征权重的多元时间序列来提升分类性能,无需重新训练或修改结构即可实现最优准确率。
English: DejaVid is an encoder-agnostic framework that enhances video classification by converting videos into temporal sequences with learned feature weights, achieving state-of-the-art accuracy without retraining or architectural changes.
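The core mechanism, learned per-timestep, per-feature weights over a variable-length sequence of frozen clip embeddings, can be sketched as below; the masking and pooling scheme is illustrative, and DejaVid's alignment-inspired architecture is more involved.

```python
import torch
import torch.nn as nn

class WeightedMTSHead(nn.Module):
    """Pool a padded multivariate time series with learned timestep-feature weights."""

    def __init__(self, max_steps: int, d: int, n_classes: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(max_steps, d))
        self.cls = nn.Linear(d, n_classes)

    def forward(self, mts: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # mts: (batch, T, d) zero-padded clip embeddings; lengths: (batch,)
        T = mts.size(1)
        mask = (torch.arange(T, device=mts.device)[None, :] < lengths[:, None]).float()
        w = self.w[:T].unsqueeze(0) * mask.unsqueeze(-1)
        pooled = (mts * w).sum(dim=1) / w.sum(dim=1).clamp_min(1e-6)
        return self.cls(pooled)
```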
Authors:Chunjiang Wang, Kun Zhang, Yandong Liu, Zhiyang He, Xiaodong Tao, S. Kevin Zhou
Abstract:
The concept bottleneck model (CBM), as a technique improving interpretability via linking predictions to human-understandable concepts, makes high-risk and life-critical medical image classification credible. Typically, existing CBM methods associate the final layer of visual encoders with concepts to explain the model's predictions. However, we empirically discover the phenomenon of concept preference variation, that is, different concepts are preferentially associated with features at different layers rather than only the final layer; a blind last-layer-based association neglects this preference variation and thus weakens the accurate correspondences between features and concepts, impairing model interpretability. To address this issue, we propose a novel Multi-layer Visual Preference-enhanced Concept Bottleneck Model (MVP-CBM), which comprises two key novel modules: (1) intra-layer concept preference modeling, which captures the preferred association of different concepts with features at various visual layers, and (2) multi-layer concept sparse activation fusion, which sparsely aggregates concept activations from multiple layers to enhance performance. Thus, by explicitly modeling concept preferences, MVP-CBM can comprehensively leverage multi-layer visual information to provide a more nuanced and accurate explanation of model decisions. Extensive experiments on several public medical classification benchmarks demonstrate that MVP-CBM achieves state-of-the-art accuracy and interpretability, verifying its superiority. Code is available at https://github.com/wcj6/MVP-CBM.
中文: 提出的多层视觉偏好增强概念瓶颈模型(MVP-CBM)通过捕捉概念与多层视觉特征的关联,解决了概念偏好变化问题,显著提升了医学图像分类的可解释性和准确性。
English: The proposed Multi-layer Visual Preference-enhanced Concept Bottleneck Model (MVP-CBM) addresses concept preference variation by capturing associations between concepts and features across multiple visual layers, enhancing both interpretability and accuracy in medical image classification.
Authors:Zain Muhammad Mujahid, Dilshod Azizov, Maha Tufail Agro, Preslav Nakov
Abstract:
In an age characterized by the proliferation of mis- and disinformation online, it is critical to empower readers to understand the content they are reading. Important efforts in this direction rely on manual or automatic fact-checking, which can be challenging for emerging claims with limited information. Such scenarios can be handled by assessing the reliability and the political bias of the source of the claim, i.e., characterizing entire news outlets rather than individual claims or articles. This is an important but understudied research direction. Unlike prior work, which has examined linguistic and social contexts, we do not analyze individual articles or social media content. Instead, we propose a novel methodology that emulates the criteria that professional fact-checkers use to assess the factuality and political bias of an entire outlet. Specifically, we design a variety of prompts based on these criteria and elicit responses from large language models (LLMs), which we aggregate to make predictions. In addition to demonstrating sizable improvements over strong baselines via extensive experiments with multiple LLMs, we provide an in-depth error analysis of the effect of media popularity and region on model performance. Further, we conduct an ablation study to highlight the key components of our dataset that contribute to these improvements. To facilitate future research, we released our dataset and code at https://github.com/mbzuai-nlp/llm-media-profiling.
中文: 本研究提出了一种利用大型语言模型评估整个新闻媒体可靠性和政治偏见的新方法,通过关注信息来源特征而非单个声明,改进了传统的核实方式。
English: This study introduces a novel method using large language models to assess the reliability and political bias of entire news outlets, improving upon traditional fact-checking by focusing on source characterization rather than individual claims.
Authors:Catalin E. Brita, Hieu Nguyen, Lohithsai Yadala Chanchu, Domonkos Nagy, Maksim Zhdanov
Abstract:
Self-attention scales quadratically with input size, limiting its use for large-scale physical systems. Although sparse attention mechanisms provide a viable alternative, they are primarily designed for regular structures such as text or images, making them inapplicable for irregular geometries. In this work, we present Ball Sparse Attention (BSA), which adapts Native Sparse Attention (NSA) (Yuan et al., 2025) to unordered point sets by imposing regularity using the Ball Tree structure from the Erwin Transformer (Zhdanov et al., 2025). We modify NSA's components to work with ball-based neighborhoods, yielding a global receptive field at sub-quadratic cost. On an airflow pressure prediction task, we achieve accuracy comparable to Full Attention while significantly reducing the theoretical computational complexity. Our implementation is available at https://github.com/britacatalin/bsa.
中文:球稀疏注意力通过球树结构将稀疏注意力应用于不规则几何形状,在物理系统预测任务中以次二次复杂度实现了与全注意力相当的精度。
English: Ball Sparse Attention adapts sparse attention to irregular geometries using ball tree structures, achieving full-attention accuracy with sub-quadratic complexity on physical systems.
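To illustrate only the local component of ball-based sparse attention, here is a deliberately crude PyTorch sketch in which a sort along one coordinate stands in for a real ball tree and attention runs within fixed-size groups; NSA's compression and selection branches, which supply the global receptive field, are omitted.

```python
import torch
import torch.nn.functional as F

def ball_local_attention(x: torch.Tensor, coords: torch.Tensor, ball: int = 64):
    """x: (N, d) point features; coords: (N, 3) positions; single-head attention."""
    order = torch.argsort(coords[:, 0])  # crude spatial grouping proxy
    N, d = x.shape
    pad = (-N) % ball
    xs = F.pad(x[order], (0, 0, 0, pad)).view(-1, ball, d)  # (n_balls, ball, d)
    q = k = v = xs.unsqueeze(1)  # add a head dimension
    out = F.scaled_dot_product_attention(q, k, v).squeeze(1).reshape(-1, d)[:N]
    inv = torch.empty_like(order)
    inv[order] = torch.arange(N, device=x.device)  # undo the sort
    return out[inv]
```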
Authors:Shuo Yang, Yuqin Dai, Guoqing Wang, Xinran Zheng, Jinfeng Xu, Jinze Li, Zhenzhe Ying, Weiqiang Wang, Edith C. H. Ngai
Abstract:
Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty and balance between over-conservatism and over-confidence. Extensive experiments on 7 representative LLMs and 4 MLLMs reveal their limitations in real-world fact-checking and offer valuable insights for further research. RealFactBench is publicly available at https://github.com/kalendsyang/RealFactBench.git.
中文: RealFactBench是一个全面的基准测试,旨在评估大语言模型和多模态大语言模型在多样化现实任务中的事实核查能力,通过整合6000条高质量声明并引入未知率指标来弥补现有评估的不足,有效检验模型处理不确定性的表现。
English: RealFactBench is a comprehensive benchmark designed to evaluate the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, addressing the limitations of existing evaluations by incorporating 6K high-quality claims and introducing the Unknown Rate metric to assess uncertainty handling.
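An Unknown Rate style metric can be as simple as the fraction of abstaining verdicts; the abstention label set below is an assumption, not RealFactBench's exact protocol.

```python
def unknown_rate(verdicts: list[str]) -> float:
    """Fraction of model verdicts that abstain rather than commit."""
    abstain = {"unknown", "not enough information"}  # assumed label set
    if not verdicts:
        return 0.0
    return sum(v.strip().lower() in abstain for v in verdicts) / len(verdicts)
```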
Authors:Peng Wang, Minh Huy Pham, Zhihao Guo, Wei Zhou
Abstract:
Robotic task planning in real-world environments requires not only object recognition but also a nuanced understanding of spatial relationships between objects. We present a spatial-relationship-aware dataset of nearly 1,000 robot-acquired indoor images, annotated with object attributes, positions, and detailed spatial relationships. Captured using a Boston Dynamics Spot robot and labelled with a custom annotation tool, the dataset reflects complex scenarios with similar or identical objects and intricate spatial arrangements. We benchmark six state-of-the-art scene-graph generation models on this dataset, analysing their inference speed and relational accuracy. Our results highlight significant differences in model performance and demonstrate that integrating explicit spatial relationships into foundation models, such as ChatGPT 4o, substantially improves their ability to generate executable, spatially-aware plans for robotics. The dataset and annotation tool are publicly available at https://github.com/PengPaulWang/SpatialAwareRobotDataset, supporting further research in spatial reasoning for robotics.
Chinese: 本研究推出了一个包含1000张机器人拍摄室内图像的空间关系感知数据集,通过基准测试场景图模型,证明整合明确的空间关系能显著提升AI模型生成可执行机器人任务规划的能力。
English: This study introduces a spatial-relationship-aware dataset of 1,000 robot-captured indoor images to enhance robotic task planning by benchmarking scene-graph models and showing that integrating explicit spatial relationships significantly improves AI models' ability to generate executable plans.
Authors:Nuwan Bandara, Thivya Kandappu, Archan Misra
Abstract:
Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts, critical features for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework designed to enhance the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories based on velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments. Our code implementations can be found at https://github.com/eye-tracking-for-physiological-sensing/EyeLoRiN.
中文: 本文提出了一种与模型无关的优化框架,通过运动感知滤波和光流局部优化来改进基于事件的视线估计,显著提升了视线信号的稳定性以支持认知状态解码,并引入了评估时间平滑度的新型抖动指标。
English: This paper presents a model-agnostic framework that enhances event-based gaze estimation through motion-aware filtering and optical flow refinement, improving signal consistency for cognitive state inference while introducing a novel jitter metric for temporal smoothness evaluation.
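One plausible formalization of such a jitter score, the mean magnitude of the trajectory's second difference, is sketched below; the paper's metric, which also involves velocity regularity and local signal complexity, may differ.

```python
import numpy as np

def jitter_metric(gaze: np.ndarray) -> float:
    """gaze: (T, 2) predicted gaze points; lower values mean smoother trajectories."""
    accel = np.diff(gaze, n=2, axis=0)  # discrete second difference
    return float(np.linalg.norm(accel, axis=1).mean())
```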
Authors:Andrey Asadchev, Edward F. Valeev
Abstract:
We report an implementation of the McMurchie-Davidson evaluation scheme for 1- and 2-particle Gaussian AO integrals designed for processors with Single Instruction Multiple Data (SIMD) instruction sets. Like in our recent MD implementation for graphical processing units (GPUs) [J. Chem. Phys. 160, 244109 (2024)], variable-sized batches of shellsets of integrals are evaluated at a time. By optimizing for the floating point instruction throughput rather than minimizing the number of operations, this approach achieves up to 50% of the theoretical hardware peak FP64 performance for many common SIMD-equipped platforms (AVX2, AVX512, NEON), which translates to speedups of up to 30 over the state-of-the-art one-shellset-at-a-time implementation of Obara-Saika-type schemes in Libint for a variety of primitive and contracted integrals. As with our previous work, we rely on the standard C++ programming language -- such as the std::simd standard library feature to be included in the 2026 ISO C++ standard -- without any explicit code generation to keep the code base small and portable. The implementation is part of the open source LibintX library freely available at https://github.com/ValeevGroup/libintx.
中文: 本研究实现了针对SIMD处理器的McMurchie-Davidson高斯轨道积分计算方案,通过优化浮点指令吞吐量而非最小化操作数,在多种平台上实现了高达理论峰值50%的性能,比现有最优方法快30倍。
English: This study presents a SIMD-optimized implementation of the McMurchie-Davidson method for Gaussian AO integrals, achieving up to 50% of theoretical peak performance and 30x speedup over existing methods through floating-point throughput optimization.
Authors:Zhuocheng Zhang, Yang Feng, Min Zhang
Abstract:
Retrieval-Augmented Generation (RAG) plays a pivotal role in modern large language model applications, with numerous existing frameworks offering a wide range of functionalities to facilitate the development of RAG systems. However, we have identified several persistent challenges in these frameworks, including difficulties in algorithm reproduction and sharing, lack of new techniques, and high system overhead. To address these limitations, we introduce FlexRAG, an open-source framework specifically designed for research and prototyping. FlexRAG supports text-based, multimodal, and network-based RAG, providing comprehensive lifecycle support alongside efficient asynchronous processing and persistent caching capabilities. By offering a robust and flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and share advanced RAG systems. Our toolkit and resources are available at https://github.com/ictnlp/FlexRAG.
Chinese: FlexRAG 是一个开源框架,旨在解决现有 RAG 系统在算法重现、技术更新和系统开销方面的挑战,通过提供全生命周期支持与高效处理能力,助力研究人员快速开发和共享先进的 RAG 应用。
English: FlexRAG is an open-source framework designed to overcome the limitations of existing RAG systems, such as poor reproducibility and high overhead, by offering comprehensive lifecycle support and efficient processing for rapid development and sharing of advanced RAG applications.
Authors:Runhao Zeng, Qi Deng, Ronghao Zhang, Shuaicheng Niu, Jian Chen, Xiping Hu, Victor C. M. Leung
Abstract:
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.
Chinese: 本研究提出了一种音频辅助的视频测试时自适应方法,利用音频生成的伪标签和灵活的自适应周期,在多个受损数据集上显著提升了模型性能。
English: This study introduces an audio-assisted test-time adaptation method for video that leverages audio-generated pseudo-labels and a flexible adaptation cycle to enhance model performance across multiple corrupted datasets.
Authors:Suyeon Kim, SeongKu Kang, Dongwoo Kim, Jungseul Ok, Hwanjo Yu
Abstract:
Graph Neural Networks (GNNs) have achieved state-of-the-art performance in node classification tasks but struggle with label noise in real-world data. Existing studies on graph learning with label noise commonly rely on class-dependent label noise, overlooking the complexities of instance-dependent noise and falling short of capturing real-world corruption patterns. We introduce BeGIN (Benchmarking for Graphs with Instance-dependent Noise), a new benchmark that provides realistic graph datasets with various noise types and comprehensively evaluates noise-handling strategies across GNN architectures, noisy label detection, and noise-robust learning. To simulate instance-dependent corruptions, BeGIN introduces algorithmic methods and LLM-based simulations. Our experiments reveal the challenges of instance-dependent noise, particularly LLM-based corruption, and underscore the importance of node-specific parameterization to enhance GNN robustness. By comprehensively evaluating noise-handling strategies, BeGIN provides insights into their effectiveness, efficiency, and key performance factors. We expect that BeGIN will serve as a valuable resource for advancing research on label noise in graphs and fostering the development of robust GNN training methods. The code is available at https://github.com/kimsu55/BeGIN.
中文摘要:BeGIN基准通过引入实例相关噪声模拟和全面评估抗噪策略,弥补了现有图学习方法的不足,揭示了LLM模拟噪声的特殊挑战,并强调节点特定参数化对提升图神经网络鲁棒性的关键作用。
English Summary: The BeGIN benchmark addresses limitations in existing graph learning methods by introducing realistic instance-dependent noise simulations and comprehensively evaluating noise-handling strategies, revealing the particular challenges of LLM-based corruption while highlighting node-specific parameterization as key for GNN robustness.
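As one concrete instance of algorithmic instance-dependent noise, the sketch below corrupts the labels of the nodes where an auxiliary model is least confident in the true class, assigning each the most confusable alternative; this selection rule is an assumption, not BeGIN's exact procedure.

```python
import torch

def instance_dependent_flip(probs: torch.Tensor, labels: torch.Tensor,
                            rate: float = 0.2) -> torch.Tensor:
    """probs: (N, C) auxiliary-model class probabilities; labels: (N,) true labels."""
    true_conf = probs.gather(1, labels[:, None]).squeeze(1)
    n_flip = int(rate * labels.numel())
    idx = torch.topk(-true_conf, n_flip).indices  # least-confident nodes flip
    masked = probs.clone()
    masked.scatter_(1, labels[:, None], -1.0)     # exclude the true class
    noisy = labels.clone()
    noisy[idx] = masked[idx].argmax(dim=1)        # most confusable alternative
    return noisy
```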
Authors:Zonghao Ying, Siyang Wu, Run Hao, Peng Ying, Shixuan Sun, Pengyu Chen, Junze Chen, Hao Du, Kaiwen Shen, Shangkun Wu, Jiwei Wei, Shiyuan He, Yang Yang, Xiaohai Xu, Ke Ma, Qianqian Xu, Qingming Huang, Shi Lin, Xun Wang, Changting Lin, Meng Han, Yilei Jiang, Siqi Lai, Yaozhi Zheng, Yifei Song, Xiangyu Yue, Zonglei Jing, Tianyuan Zhang, Zhilei Zhu, Aishan Liu, Jiakai Wang, Siyuan Liang, Xianglong Kong, Hainan Li, Junjie Mu, Haotong Qin, Yue Yu, Lei Chen, Felix Juefei-Xu, Qing Guo, Xinyun Chen, Yew Soon Ong, Xianglong Liu, Dawn Song, Alan Yuille, Philip Torr, Dacheng Tao
Abstract:
Multimodal Large Language Models (MLLMs) have enabled transformative advancements across diverse applications but remain susceptible to safety threats, especially jailbreak attacks that induce harmful outputs. To systematically evaluate and improve their safety, we organized the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. This technical report presents findings from the competition, which involved 86 teams testing MLLM vulnerabilities via adversarial image-text attacks in two phases: white-box and black-box evaluations. The competition results highlight ongoing challenges in securing MLLMs and provide valuable guidance for developing stronger defense mechanisms. The challenge establishes new benchmarks for MLLM safety evaluation and lays groundwork for advancing safer multimodal AI systems. The code and data for this challenge are openly available at https://github.com/NY1024/ATLAS_Challenge_2025.
中文: ATLAS 2025挑战赛通过对抗性测试系统评估了多模态大语言模型的越狱攻击漏洞,在揭示持续安全挑战的同时,为开发更强防御机制建立了新基准。
English: The ATLAS 2025 challenge systematically evaluated multimodal large language models' vulnerabilities to jailbreak attacks through adversarial testing, revealing persistent safety challenges while establishing new benchmarks for developing stronger defense mechanisms.
Authors:Hyeonseo Lee, Juhyun Park, Jihyong Oh, Chanho Eom
Abstract:
Person Re-identification (ReID) aims to retrieve images of the same individual captured across non-overlapping camera views, making it a critical component of intelligent surveillance systems. Traditional ReID methods assume that the training and test domains share similar characteristics and primarily focus on learning discriminative features within a given domain. However, they often fail to generalize to unseen domains due to domain shifts caused by variations in viewpoint, background, and lighting conditions. To address this issue, Domain-Adaptive ReID (DA-ReID) methods have been proposed. These approaches incorporate unlabeled target domain data during training and improve performance by aligning feature distributions between source and target domains. Domain-Generalizable ReID (DG-ReID) tackles a more realistic and challenging setting by aiming to learn domain-invariant features without relying on any target domain data. Recent methods have explored various strategies to enhance generalization across diverse environments, but the field remains relatively underexplored. In this paper, we present a comprehensive survey of DG-ReID. We first review the architectural components of DG-ReID including the overall setting, commonly used backbone networks and multi-source input configurations. Then, we categorize and analyze domain generalization modules that explicitly aim to learn domain-invariant and identity-discriminative representations. To examine the broader applicability of these techniques, we further conduct a case study on a related task that also involves distribution shifts. Finally, we discuss recent trends, open challenges, and promising directions for future research in DG-ReID. To the best of our knowledge, this is the first systematic survey dedicated to DG-ReID.
中文: 本文首次系统综述了无需目标域数据的域泛化行人重识别方法,重点分析了其架构组成和泛化模块,并探讨了该领域的未来研究方向。
English: This paper provides the first systematic survey of Domain-Generalizable Person Re-identification (DG-ReID), which focuses on learning domain-invariant features without target domain data, reviewing its architecture, generalization modules, and discussing future research directions.
Authors:Hongbi Zhou, Zhangkai Ni
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis. However, existing methods struggle to adaptively optimize the distribution of Gaussian primitives based on scene characteristics, making it challenging to balance reconstruction quality and efficiency. Inspired by human perception, we propose scene-adaptive perceptual densification for Gaussian Splatting (Perceptual-GS), a novel framework that integrates perceptual sensitivity into the 3DGS training process to address this challenge. We first introduce a perception-aware representation that models human visual sensitivity while constraining the number of Gaussian primitives. Building on this foundation, we develop a perceptual sensitivity-adaptive distribution to allocate finer Gaussian granularity to visually critical regions, enhancing reconstruction quality and robustness. Extensive evaluations on multiple datasets, including BungeeNeRF for large-scale scenes, demonstrate that Perceptual-GS achieves state-of-the-art performance in reconstruction quality, efficiency, and robustness. The code is publicly available at: https://github.com/eezkni/Perceptual-GS
中文: Perceptual-GS提出了一种感知自适应框架,通过优化高斯基元分布来提升重建质量和效率,在多个数据集上实现了领先性能。
English: Perceptual-GS introduces a perception-aware framework that adaptively optimizes Gaussian primitives for enhanced reconstruction quality and efficiency, achieving state-of-the-art results across multiple datasets.
Authors:Chong Li, Yingzhuo Deng, Jiajun Zhang, Chengqing Zong
Abstract:
The curse of multilinguality phenomenon is a fundamental problem of multilingual Large Language Models (LLMs), where the competition between massive languages results in inferior performance. It mainly comes from limited capacity and negative transfer between dissimilar languages. To address this issue, we propose a method to dynamically group and scale up the parameters of a multilingual LLM while boosting positive transfer among similar languages. Specifically, the model is first tuned on monolingual corpora to determine the parameter deviation in each layer and quantify the similarity between languages. Layers with more deviation are extended to mixture-of-experts layers to reduce competition between languages, where one expert module serves one group of similar languages. Experimental results on 18 to 128 languages show that our method reduces the negative transfer between languages and significantly boosts multilingual performance with fewer parameters. Such language-group specialization of experts benefits new language adaptation and reduces interference with previously learned multilingual knowledge.
中文: 多语言大模型中的“多语言诅咒”源于模型容量有限和语言间负迁移,通过动态分组语言并利用专家混合层扩展参数,有效减少了语言竞争,以更少参数显著提升了多语言性能。
English: The curse of multilinguality in LLMs, caused by limited capacity and negative transfer, is addressed by dynamically grouping languages and scaling parameters through mixture-of-experts layers, which reduces competition and enhances performance with fewer parameters.
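The first step described above, quantifying how far each layer drifts after monolingual tuning to decide which layers become mixture-of-experts layers, might look like the following; using the relative Frobenius norm is an assumption of this sketch.

```python
import torch

def layer_deviation(base: dict, tuned: dict) -> dict:
    """Relative parameter deviation per tensor between base and tuned state dicts."""
    return {
        name: ((tuned[name].float() - w.float()).norm() / (w.float().norm() + 1e-12)).item()
        for name, w in base.items()
    }
```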
Authors:Zichuan Fu, Xian Wu, Guojing Li, Yingying Zhang, Yefeng Zheng, Tianshi Ming, Yejing Wang, Wanyu Wang, Xiangyu Zhao
Abstract:
Large Language Models (LLMs) require continuous updates to maintain accurate and current knowledge as the world evolves. While existing knowledge editing approaches offer various solutions for knowledge updating, they often struggle with sequential editing scenarios and harm the general capabilities of the model, thereby significantly hampering their practical applicability. This paper proposes a two-stage framework combining robust supervised fine-tuning (R-SFT) with model merging for knowledge editing. Our method first fine-tunes the LLM to internalize new knowledge fully, then merges the fine-tuned model with the original foundation model to preserve newly acquired knowledge and general capabilities. Experimental results demonstrate that our approach significantly outperforms existing methods in sequential editing while better preserving the original performance of the model, all without requiring any architectural changes. Code is available at: https://github.com/Applied-Machine-Learning-Lab/MM4KE.
中文: 本文提出了一种结合鲁棒监督微调和模型合并的两阶段框架,能在不改变架构的情况下有效更新大语言模型的知识并保持其通用能力,在连续编辑任务中显著优于现有方法。
English: This paper introduces a two-stage framework that combines robust supervised fine-tuning with model merging to effectively update knowledge in Large Language Models while preserving their general capabilities, outperforming existing methods in sequential editing without architectural modifications.
Authors:Zichuan Fu, Xian Wu, Yejing Wang, Wanyu Wang, Shanshan Ye, Hongzhi Yin, Yi Chang, Yefeng Zheng, Xiangyu Zhao
Abstract:
Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing (NLP) tasks. The release of open-source LLMs like LLaMA and Qwen has triggered the development of numerous fine-tuned models tailored for various tasks and languages. In this paper, we explore an important question: is it possible to combine these specialized models to create a unified model with multi-task capabilities? We introduce Hierarchical Iterative Merging (Hi-Merging), a training-free method for unifying different specialized LLMs into a single model. Specifically, Hi-Merging employs model-wise and layer-wise pruning and scaling, guided by contribution analysis, to mitigate parameter conflicts. Extensive experiments on multiple-choice and question-answering tasks in both Chinese and English validate Hi-Merging's ability for multi-task learning. The results demonstrate that Hi-Merging consistently outperforms existing merging techniques and surpasses the performance of models fine-tuned on combined datasets in most scenarios. Code is available at: https://github.com/Applied-Machine-Learning-Lab/Hi-Merging.
中文: 本文提出Hi-Merging方法,通过分层剪枝和缩放将专业大语言模型统一为单一模型,在跨语言多任务学习中优于现有技术且无需额外训练。
English: This paper introduces Hi-Merging, a training-free method that unifies specialized LLMs into a single model through hierarchical pruning and scaling, demonstrating superior multi-task performance across languages compared to existing techniques.
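A hedged sketch of pruning-and-scaling merging: for each tensor, keep only the largest-magnitude entries of each specialist's delta from the base model, then add back the scaled average delta. The threshold, the uniform averaging, and the float-parameter assumption are illustrative; Hi-Merging's contribution-guided analysis is not reproduced here.

```python
import torch

def merge_with_pruning(base: dict, experts: list[dict],
                       keep: float = 0.2, scale: float = 1.0) -> dict:
    """Merge specialist state dicts into the base via sparse, scaled task vectors."""
    merged = {}
    for name, w in base.items():
        deltas = []
        for sd in experts:
            d = sd[name] - w
            if d.numel() > 1:
                k = min(int((1.0 - keep) * d.numel()) + 1, d.numel())
                thresh = d.abs().flatten().kthvalue(k).values
                d = torch.where(d.abs() >= thresh, d, torch.zeros_like(d))
            deltas.append(d)
        merged[name] = w + scale * torch.stack(deltas).mean(dim=0)
    return merged
```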
Authors:Zhaochen Hong, Haofei Yu, Jiaxuan You
Abstract:
Evaluating consistency in large language models (LLMs) is crucial for ensuring reliability, particularly in complex, multi-step interactions between humans and LLMs. Traditional self-consistency methods often miss subtle semantic changes in natural language and functional shifts in code or equations, which can accumulate over multiple transformations. To address this, we propose ConsistencyChecker, a tree-based evaluation framework designed to measure consistency through sequences of reversible transformations, including machine translation tasks and AI-assisted programming tasks. In our framework, nodes represent distinct text states, while edges correspond to pairs of inverse operations. Dynamic and LLM-generated benchmarks ensure a fair assessment of the model's generalization ability and eliminate benchmark leakage. Consistency is quantified based on similarity across different depths of the transformation tree. Experiments on eight models from various families and sizes show that ConsistencyChecker can distinguish the performance of different models. Notably, our consistency scores, computed entirely without using WMT paired data, correlate strongly (r > 0.7) with WMT 2024 auto-ranking, demonstrating the validity of our benchmark-free approach. Our implementation is available at: https://github.com/ulab-uiuc/consistencychecker.
中文摘要:研究者提出了ConsistencyChecker这一基于树状结构的评估框架,通过可逆变换序列量化大语言模型的一致性,实验表明该无基准方法与传统评估指标具有高度相关性。
English Summary: The authors introduce ConsistencyChecker, a tree-based framework that evaluates LLM consistency through reversible transformations, demonstrating strong correlation with established benchmarks without requiring paired data.
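The depth-wise scoring idea reduces to repeatedly applying an invertible transformation pair and comparing each recovered text with the original. In the sketch below, forward and backward could be, for example, English-to-German and German-to-English translation calls; all three callables are caller-supplied assumptions, not the framework's API.

```python
from typing import Callable

def consistency_score(text: str, depth: int,
                      forward: Callable[[str], str],
                      backward: Callable[[str], str],
                      similarity: Callable[[str, str], float]) -> float:
    """Average similarity to the original across depths of inverse-operation pairs."""
    scores, current = [], text
    for _ in range(depth):
        current = backward(forward(current))  # one reversible transformation pair
        scores.append(similarity(text, current))
    return sum(scores) / len(scores)
```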
Authors:Mingjun Xu, Jinhan Dong, Jue Hou, Zehui Wang, Sihang Li, Zhifeng Gao, Renxin Zhong, Hengxing Cai
Abstract:
Multimodal document retrieval systems enable information access across text, images, and layouts, benefiting various domains like document-based question answering, report analysis, and interactive content summarization. Rerankers improve retrieval precision by reordering retrieved candidates. However, current multimodal reranking methods remain underexplored, with significant room for improvement in both training strategies and overall effectiveness. Moreover, the lack of explicit reasoning makes it difficult to analyze and optimize these methods further. In this paper, we propose MM-R5, a MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval, aiming to provide a more effective and reliable solution for multimodal reranking tasks. MM-R5 is trained in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we focus on improving instruction-following and guiding the model to generate complete and high-quality reasoning chains. To support this, we introduce a novel data construction strategy that produces rich, high-quality reasoning data. In the RL stage, we design a task-specific reward framework, including a reranking reward tailored for multimodal candidates and a composite template-based reward to further refine reasoning quality. We conduct extensive experiments on MMDocIR, a challenging public benchmark spanning multiple domains. MM-R5 achieves state-of-the-art performance on most metrics and delivers comparable results to much larger models on the remaining ones. Moreover, compared to the best retrieval-only method, MM-R5 improves recall@1 by over 4%. These results validate the effectiveness of our reasoning-enhanced training pipeline. Our code is available at https://github.com/i2vec/MM-R5.
中文: 本文提出MM-R5,一种基于强化学习的多模态推理增强重排器,通过生成高质量推理链提升文档检索效果,在基准测试中取得了领先性能。
English: The paper introduces MM-R5, a multimodal reasoning-enhanced reranker using reinforcement learning to improve document retrieval by generating high-quality reasoning chains and achieving state-of-the-art performance on benchmarks.
Authors:Yue Wan, Xiaowei Jia, Xiang Lorraine Li
Abstract:
Chain-of-thought (CoT) prompting has been widely adopted to enhance the reasoning capabilities of large language models (LLMs). However, the effectiveness of CoT reasoning is inconsistent across tasks with different reasoning types. This work presents a novel perspective to understand CoT behavior through the lens of confirmation bias in cognitive psychology. Specifically, we examine how model internal beliefs, approximated by direct question-answering probabilities, affect both reasoning generation (Q → R) and reasoning-guided answer prediction (QR → A) in CoT. By decomposing CoT into a two-stage process, we conduct a thorough correlation analysis in model beliefs, rationale attributes, and stage-wise performance. Our results provide strong evidence of confirmation bias in LLMs, such that model beliefs not only skew the reasoning process but also influence how rationales are utilized for answer prediction. Furthermore, the interplay between task vulnerability to confirmation bias and the strength of beliefs also provides explanations for CoT effectiveness across reasoning tasks and models. Overall, this study provides valuable insight into the need for better prompting strategies that mitigate confirmation bias to enhance reasoning performance. Code is available at https://github.com/yuewan2/biasedcot.
Chinese: 本研究揭示了大型语言模型中的确认偏见会扭曲思维链提示中的推理过程和答案预测,解释了其在不同任务中效果不一致的原因,并提出了需要减少偏见的策略。
English: This study reveals that confirmation bias in large language models skews both the reasoning process and answer prediction in chain-of-thought prompting, explaining its inconsistent effectiveness across tasks and suggesting the need for bias-mitigating strategies.
Authors:Worasit Sangjan, Piyush Pandey, Norman B. Best, Jacob D. Washburn
Abstract:
Accurate identification of individual plants from unmanned aerial vehicle (UAV) images is essential for advancing high-throughput phenotyping and supporting data-driven decision-making in plant breeding. This study presents MatchPlant, a modular, graphical user interface-supported, open-source Python pipeline for UAV-based single-plant detection and geospatial trait extraction. MatchPlant enables end-to-end workflows by integrating UAV image processing, user-guided annotation, Convolutional Neural Network model training for object detection, forward projection of bounding boxes onto an orthomosaic, and shapefile generation for spatial phenotypic analysis. In an early-season maize case study, MatchPlant achieved reliable detection performance (validation AP: 89.6%, test AP: 85.9%) and effectively projected bounding boxes, covering 89.8% of manually annotated boxes with 87.5% of projections achieving an Intersection over Union (IoU) greater than 0.5. Trait values extracted from predicted bounding instances showed high agreement with manual annotations (r = 0.87-0.97, IoU >= 0.4). Detection outputs were reused across time points to extract plant height and Normalized Difference Vegetation Index with minimal additional annotation, facilitating efficient temporal phenotyping. By combining modular design, reproducibility, and geospatial precision, MatchPlant offers a scalable framework for UAV-based plant-level analysis with broad applicability in agricultural and environmental monitoring.
中文:MatchPlant是一个开源Python流程,通过集成无人机图像处理与深度学习模型,实现了端到端的单株植物检测和性状提取,在玉米案例中展现出高精度检测能力,为农业表型分析提供了可扩展的解决方案。
English: MatchPlant is an open-source Python pipeline that provides end-to-end workflows for accurate single-plant detection and trait extraction from UAV imagery, demonstrating high reliability in maize studies and enabling efficient temporal phenotyping for agricultural applications.
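The IoU threshold used above to judge projected boxes (IoU > 0.5) is the standard axis-aligned overlap ratio:

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```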
Authors:Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk
Abstract:
A key challenge for the machine learning community is to understand and accelerate the training dynamics of deep networks that lead to delayed generalisation and emergent robustness to input perturbations, also known as grokking. Prior work has associated phenomena like delayed generalisation with the transition of a deep network from a linear to a feature learning regime, and emergent robustness with changes to the network's functional geometry, in particular the arrangement of the so-called linear regions in deep networks employing continuous piecewise affine nonlinearities. Here, we explain how grokking is realised in the Jacobian of a deep network and demonstrate that aligning a network's Jacobians with the training data (in the sense of cosine similarity) ensures grokking under a low-rank Jacobian assumption. Our results provide a strong theoretical motivation for the use of Jacobian regularisation in optimizing deep networks -- a method we introduce as GrokAlign -- which we show empirically to induce grokking much sooner than more conventional regularizers like weight decay. Moreover, we introduce centroid alignment as a tractable and interpretable simplification of Jacobian alignment that effectively identifies and tracks the stages of deep network training dynamics. Accompanying webpage (https://thomaswalker1.github.io/blog/grokalign.html) and code (https://github.com/ThomasWalker1/grokalign).
中文: 本研究阐释了深度网络中的"顿悟"现象是通过网络雅可比矩阵与训练数据的对齐实现的,并提出了GrokAlign这一雅可比正则化方法,相比传统权重衰减等方法能显著加快顿悟过程。
English: This study explains that grokking in deep networks is achieved through the alignment of the network's Jacobians with training data, leading to the introduction of GrokAlign, a Jacobian regularization method that accelerates grokking compared to traditional approaches like weight decay.
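At a high level, a Jacobian-alignment penalty rewards input gradients that point along the training input in cosine similarity; the sketch below uses the summed network output as a scalar proxy, a simplification relative to the paper's objective.

```python
import torch
import torch.nn.functional as F

def grokalign_penalty(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Penalty is low when the input-Jacobian aligns with the input itself."""
    x = x.detach().requires_grad_(True)
    out = model(x).sum()  # scalar proxy for the output
    (jac,) = torch.autograd.grad(out, x, create_graph=True)
    cos = F.cosine_similarity(jac.flatten(1), x.flatten(1), dim=1)
    return (1.0 - cos).mean()  # add as lambda * penalty to the task loss
```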
Authors:Wei Wang, Wangyou Zhang, Chenda Li, Jiatong Shi, Shinji Watanabe, Yanmin Qian
Abstract:
Speech quality assessment (SQA) aims to predict the perceived quality of speech signals under a wide range of distortions. It is inherently connected to speech enhancement (SE), which seeks to improve speech quality by removing unwanted signal components. While SQA models are widely used to evaluate SE performance, their potential to guide SE training remains underexplored. In this work, we investigate a training framework that leverages a SQA model, trained to predict multiple evaluation metrics from a public SE leaderboard, as a supervisory signal for SE. This approach addresses a key limitation of conventional SE objectives, such as SI-SNR, which often fail to align with perceptual quality and generalize poorly across evaluation metrics. Moreover, it enables training on real-world data where clean references are unavailable. Experiments on both simulated and real-world test sets show that SQA-guided training consistently improves performance across a range of quality metrics. Code and checkpoints are available at https://github.com/urgent-challenge/urgent2026_challenge_track2
中文摘要:本研究提出了一种利用语音质量评估模型指导语音增强的训练框架,通过无需纯净参考数据即可提升多种指标性能,克服了传统目标的局限性。
English Summary: This study introduces a training framework that uses a speech quality assessment model to guide speech enhancement, overcoming the limitations of traditional objectives by improving performance across various metrics without requiring clean reference data.
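The supervisory signal is simply the negated quality score a frozen SQA model assigns to the enhanced output, so no clean reference is required; both model interfaces below are assumptions of this sketch.

```python
import torch

def sqa_guided_loss(se_model: torch.nn.Module, sqa_model: torch.nn.Module,
                    noisy: torch.Tensor) -> torch.Tensor:
    """Train the enhancer to maximize the quality predicted by a frozen SQA model."""
    enhanced = se_model(noisy)
    return -sqa_model(enhanced).mean()  # higher predicted quality, lower loss
```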
Authors:Yijiang Li, Genpei Zhang, Jiacheng Cheng, Yi Li, Xiaojun Shan, Dashan Gao, Jiancheng Lyu, Yuan Li, Ning Bi, Nuno Vasconcelos
Abstract:
While the rapid proliferation of wearable cameras has raised significant concerns about egocentric video privacy, prior work has largely overlooked the unique privacy threats posed to the camera wearer. This work investigates the core question: How much privacy information about the camera wearer can be inferred from their first-person view videos? We introduce EgoPrivacy, the first large-scale benchmark for the comprehensive evaluation of privacy risks in egocentric vision. EgoPrivacy covers three types of privacy (demographic, individual, and situational), defining seven tasks that aim to recover private information ranging from fine-grained (e.g., wearer's identity) to coarse-grained (e.g., age group). To further emphasize the privacy threats inherent to egocentric vision, we propose Retrieval-Augmented Attack, a novel attack strategy that leverages ego-to-exo retrieval from an external pool of exocentric videos to boost the effectiveness of demographic privacy attacks. An extensive comparison of the different attacks possible under all threat models is presented, showing that private information of the wearer is highly susceptible to leakage. For instance, our findings indicate that foundation models can effectively compromise wearer privacy even in zero-shot settings by recovering attributes such as identity, scene, gender, and race with 70-80% accuracy. Our code and data are available at https://github.com/williamium3000/ego-privacy.
Chinese: 本研究推出了首个用于评估第一人称视角隐私风险的大规模基准EgoPrivacy,揭示即使零样本情况下,也能从穿戴者视频中高精度推断其身份、人口统计和情境等隐私信息。
English: This study introduces EgoPrivacy, the first large-scale benchmark for assessing privacy risks in egocentric vision, revealing that wearers' private information like identity, demographics, and situations can be inferred with high accuracy from first-person videos, even in zero-shot scenarios.
Authors:Ella Miray Rajaonson, Mahyar Rajabi Kochi, Luis Martin Mejia Mendoza, Seyed Mohamad Moosavi, Benjamin Sanchez-Lengeling
Abstract:
Developing improved predictive models for multi-molecular systems is crucial, as nearly every chemical product used results from a mixture of chemicals. While being a vital part of the industry pipeline, the chemical mixture space remains relatively unexplored by the Machine Learning community. In this paper, we introduce CheMixHub, a holistic benchmark for molecular mixtures, covering a corpus of 11 chemical mixtures property prediction tasks, from drug delivery formulations to battery electrolytes, totalling approximately 500k data points gathered and curated from 7 publicly available datasets. CheMixHub introduces various data splitting techniques to assess context-specific generalization and model robustness, providing a foundation for the development of predictive models for chemical mixture properties. Furthermore, we map out the modelling space of deep learning models for chemical mixtures, establishing initial benchmarks for the community. This dataset has the potential to accelerate chemical mixture development, encompassing reformulation, optimization, and discovery. The dataset and code for the benchmarks can be found at: https://github.com/chemcognition-lab/chemixhub
中文: 本文提出CheMixHub分子混合物基准平台,涵盖11项性质预测任务和50万数据点,旨在推动混合物建模发展并加速化学制剂的配方优化与发现进程。
English: This paper introduces CheMixHub, a comprehensive benchmark for molecular mixtures featuring 11 property prediction tasks and 500k data points, designed to advance predictive modeling and accelerate chemical mixture development.
Authors:Yuan-Sen Ting
Abstract:
This textbook provides a systematic treatment of statistical machine learning for astronomical research through the lens of Bayesian inference, developing a unified framework that reveals connections between modern data analysis techniques and traditional statistical methods. We show how these techniques emerge from familiar statistical foundations. The consistently Bayesian perspective prioritizes uncertainty quantification and statistical rigor essential for scientific inference in astronomy. The textbook progresses from probability theory and Bayesian inference through supervised learning including linear regression with measurement uncertainties, logistic regression, and classification. Unsupervised learning topics cover Principal Component Analysis and clustering methods. We then introduce computational techniques through sampling and Markov Chain Monte Carlo, followed by Gaussian Processes as probabilistic nonparametric methods and neural networks within the broader statistical context. Our theory-focused pedagogical approach derives each method from first principles with complete mathematical development, emphasizing statistical insight and complementing with astronomical applications. We prioritize understanding why algorithms work, when they are appropriate, and how they connect to broader statistical principles. The treatment builds toward modern techniques including neural networks through a solid foundation in classical methods and their theoretical underpinnings. This foundation enables thoughtful application of these methods to astronomical research, ensuring proper consideration of assumptions, limitations, and uncertainty propagation essential for advancing astronomical knowledge in the era of large astronomical surveys.
中文: 这本教材通过贝叶斯推断为天文研究构建了统计机器学习的统一框架,从概率论基础延伸到神经网络等现代技术,始终强调不确定性量化和理论严谨性。
English: This textbook establishes a unified Bayesian framework for statistical machine learning in astronomy, progressing from foundational probability theory to modern techniques like neural networks while emphasizing uncertainty quantification and theoretical rigor.
Authors:Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson
Abstract:
Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that SSL pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio SSL methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. To address this gap, we introduce Self-Supervised Learning from Audio Mixtures (SSLAM), a novel direction in audio SSL research designed to improve the model's ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets, which are predominantly monophonic, and conduct a comprehensive comparative analysis against SOTA methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9\% improvement on AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets a new SOTA in both linear evaluation and fine-tuning regimes with performance improvements of up to 9.1\% (mAP).
中文: 自监督音频模型因主要基于单声道音频训练而在处理现实多声道音频时表现不足,为此提出的SSLAM方法显著提升了多声道音频处理能力,同时保持了在标准基准测试中的优异性能。
English: Self-supervised audio models often underperform with polyphonic audio due to training on monophonic datasets, prompting the development of SSLAM, which enhances performance on complex audio while maintaining excellence on standard benchmarks.
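As a rough illustration of the mixture idea (not the paper's actual training pipeline), the sketch below overlays two waveforms at a chosen signal-to-noise ratio to create a polyphonic view; the function name and peak normalization are our own assumptions.

```python
import numpy as np

def mix_at_snr(clip_a: np.ndarray, clip_b: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay clip_b onto clip_a at a target SNR (dB). Both clips are
    equal-length 1-D float waveforms; clip_a is treated as the 'signal'."""
    power_a = np.mean(clip_a ** 2) + 1e-12
    power_b = np.mean(clip_b ** 2) + 1e-12
    # Scale clip_b so that power_a / power_of_scaled_b hits the target SNR.
    scale = np.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    mixture = clip_a + scale * clip_b
    # Renormalize only if the mixture would clip.
    peak = np.max(np.abs(mixture)) + 1e-12
    return mixture / max(1.0, peak)

# Toy usage: mix two random "clips" at 0 dB to simulate overlapping sources.
rng = np.random.default_rng(0)
a, b = rng.standard_normal(16000), rng.standard_normal(16000)
polyphonic_view = mix_at_snr(a, b, snr_db=0.0)
```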
Authors:Ilya Ilyankou, Natchapon Jongwiriyanurak, Tao Cheng, James Haworth
Abstract:
We present a CLIP-based, multi-modal, multi-label classifier for predicting geographical context tags from landscape photos in the Geograph dataset--a crowdsourced image archive spanning the British Isles, including remote regions lacking POIs and street-level imagery. Our approach addresses a Kaggle competition\footnote{https://www.kaggle.com/competitions/predict-geographic-context-from-landscape-photos} task based on a subset of Geograph's 8M images, with strict evaluation: exact match accuracy is required across 49 possible tags. We show that combining location and title embeddings with image features improves accuracy over using image embeddings alone. We release a lightweight pipeline\footnote{https://github.com/SpaceTimeLab/ClipTheLandscape} that trains on a modest laptop, using pre-trained CLIP image and text embeddings and a simple classification head. Predicted tags can support downstream tasks such as building location embedders for GeoAI applications, enriching spatial understanding in data-sparse regions.
中文: 本研究提出了一种基于CLIP的多模态分类器,通过融合位置和标题嵌入与图像特征,改进了景观照片的地理标签预测效果,其性能优于仅使用图像的方法,并为GeoAI应用提供了轻量级训练流程。
English: This study introduces a CLIP-based multi-modal classifier that enhances geographical tag prediction for landscape photos by integrating location and title embeddings with image features, outperforming image-only approaches and providing a lightweight training pipeline for GeoAI applications.
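A minimal sketch of the fused-embedding idea, under stated assumptions: precomputed CLIP image and title embeddings (512-d here) are concatenated with a small location feature and passed to a multi-label classification head. Dimensions, layer sizes, and names are illustrative, not the authors' exact pipeline.

```python
import torch
import torch.nn as nn

class LandscapeTagger(nn.Module):
    """Multi-label head over concatenated CLIP image/text embeddings plus
    a small location feature vector (e.g., projected lat/lon)."""

    def __init__(self, img_dim=512, txt_dim=512, loc_dim=2, n_tags=49):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim + loc_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_tags),  # logits; apply sigmoid at inference
        )

    def forward(self, img_emb, title_emb, loc):
        return self.head(torch.cat([img_emb, title_emb, loc], dim=-1))

model = LandscapeTagger()
logits = model(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 2))
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (8, 49)).float())
```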
Authors:Wenyue Hua, Dujian Ding, Yile Gu, Yujie Ren, Kai Mei, Minghua Ma, William Yang Wang
Abstract:
Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often do not prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. However, recent advances in language models enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling for requests to large language models (LLMs), where the semantics of the process guide the scheduling priorities. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM-based prompt scheduling. To illustrate its effectiveness, we present a medical emergency management application, underscoring the potential benefits of semantic scheduling for critical, time-sensitive tasks. The code and data are available at https://github.com/Wenyueh/latency_optimization_with_priority_constraints.
中文: 本文提出了一种针对大语言模型请求的语义调度算法,该算法根据任务内容的重要性确定优先级,并在医疗急救等关键应用中证明了其能有效缩短等待时间。
English: This paper introduces a semantic scheduling algorithm for large language model requests that prioritizes tasks based on their content significance, demonstrating its efficiency in minimizing waiting times for critical applications like medical emergencies.
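To make the idea concrete, here is a toy priority queue in which priority comes from a semantic urgency score. The `urgency_score` stub is hypothetical and stands in for a language-model call; this is a sketch of the concept, not the paper's optimal-complexity algorithm.

```python
import heapq
import itertools

def urgency_score(prompt: str) -> float:
    """Placeholder for an LLM-derived semantic urgency score in [0, 1];
    a real system would query a language model here."""
    return 1.0 if "emergency" in prompt.lower() else 0.1

counter = itertools.count()  # tie-breaker preserving arrival order
queue = []

def submit(prompt: str) -> None:
    # heapq is a min-heap, so negate urgency to pop the most urgent first.
    heapq.heappush(queue, (-urgency_score(prompt), next(counter), prompt))

for p in ["summarize meeting notes", "EMERGENCY: triage chest pain", "draft email"]:
    submit(p)

while queue:
    _, _, prompt = heapq.heappop(queue)
    print("scheduling:", prompt)
```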
Authors:Yujie Zhao, Zhijing Wu, Hejia Zhang, Zhongming Yu, Wentao Ni, Chia-Tung Ho, Haoxing Ren, Jishen Zhao
Abstract:
LLM-assisted hardware verification is gaining substantial attention due to its potential to significantly reduce the cost and effort of crafting effective testbenches. It also serves as a critical enabler for LLM-aided end-to-end hardware language design. However, existing LLMs often struggle with Register Transfer Level (RTL) code generation, resulting in testbenches that exhibit functional errors in Hardware Description Language (HDL) logic. Motivated by the strong performance of LLMs in Python code generation under inference-time sampling strategies, and by their promising capabilities as judge agents, we propose PRO-V, a fully program-generation multi-agent system for robust RTL verification. PRO-V incorporates an efficient best-of-n iterative sampling strategy to enhance the correctness of generated testbenches. Moreover, it introduces an LLM-as-a-judge validation framework featuring an automated prompt generation pipeline. By converting rule-based static analysis from the compiler into natural language through in-context learning, this pipeline enables LLMs to assist the compiler in determining whether verification failures stem from errors in the RTL design or the testbench. PRO-V attains a verification accuracy of 87.17% on golden RTL implementations and 76.28% on RTL mutants. Our code is open-sourced at https://github.com/stable-lab/Pro-V.
Chinese: PRO-V 是一个多智能体系统,通过迭代采样和LLM作为评判的框架来增强RTL验证,提高测试平台的准确性并识别设计或测试平台错误,验证准确率最高达87.17%。
English: PRO-V is a multi-agent system that enhances RTL verification by using iterative sampling and an LLM-as-a-judge framework to improve testbench accuracy and identify design or testbench errors, achieving up to 87.17% verification accuracy.
Authors:Jackson Eshbaugh
Abstract:
Neural networks excel as function approximators, but their complexity often obscures the types of functions they learn, making it difficult to explain their behavior. To address this, the linearity score $\lambda(f)$ is introduced, a simple and interpretable diagnostic that quantifies how well a regression network's output can be mimicked by a linear model. Defined as the $R^2$ value between the network's predictions and those of a trained linear surrogate, $\lambda(f)$ measures linear decodability: the extent to which the network's behavior aligns with a structurally simple model. This framework is evaluated on both synthetic and real-world datasets, using dataset-specific networks and surrogates. High $\lambda(f)$ scores reliably indicate alignment with the network's outputs; however, they do not guarantee accuracy with respect to the ground truth. These results highlight the risk of using surrogate fidelity as a proxy for model understanding, especially in high-stakes regression tasks.
中文: 线性度评分λ(f)作为一种可解释的诊断工具,用于量化神经网络输出与线性替代模型的近似程度,但高分值并不保证与真实数据的准确性,警示在高风险回归任务中过度依赖替代模型保真度的风险。
English: The linearity score λ(f) is introduced as an interpretable diagnostic to measure how closely a neural network's outputs align with a linear surrogate model, though high scores don't guarantee accuracy with ground truth, cautioning against over-reliance on surrogate fidelity for model understanding.
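The metric itself is straightforward to reproduce from its definition; a minimal sketch, assuming the surrogate is an ordinary least-squares fit to the network's own predictions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def linearity_score(network_predict, X: np.ndarray) -> float:
    """R^2 between the network's predictions and those of a linear
    surrogate trained to mimic the network on the same inputs."""
    y_net = network_predict(X)
    surrogate = LinearRegression().fit(X, y_net)
    return r2_score(y_net, surrogate.predict(X))

# Toy check: a nearly linear "network" should score close to 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
net = lambda X: X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.01 * np.sin(X[:, 0])
print(round(linearity_score(net, X), 4))
```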
Authors:Haoxiang Chen, Wei Zhao, Rufei Zhang, Nannan Li, Dongjin Li
Abstract:
In the context of multi-object tracking using video synthetic aperture radar (Video SAR), Doppler shifts induced by target motion result in artifacts that are easily mistaken for shadows caused by static occlusions. Moreover, appearance changes of the target caused by Doppler mismatch may lead to association failures and disrupt trajectory continuity. A major limitation in this field is the lack of public benchmark datasets for standardized algorithm evaluation. To address the above challenges, we collected and annotated 45 video SAR sequences containing moving targets, naming the result the Video SAR MOT Benchmark (VSMB). Specifically, to mitigate the effects of trailing and defocusing in moving targets, we introduce a line feature enhancement mechanism that emphasizes the positive role of motion shadows and reduces false alarms induced by static occlusions. In addition, to mitigate the adverse effects of target appearance variations, we propose a motion-aware clue discarding mechanism that substantially improves tracking robustness in Video SAR. The proposed model achieves state-of-the-art performance on the VSMB, and the dataset and model are released at https://github.com/softwarePupil/VSMB.
中文摘要:该研究提出了视频合成孔径雷达多目标跟踪基准(VSMB)数据集及新型跟踪模型,通过线特征增强和运动感知线索丢弃机制,有效缓解多普勒效应造成的伪影和目标外观变化,实现了最先进的性能。
English Summary: The study introduces a Video SAR MOT Benchmark (VSMB) dataset and a novel tracking model that uses line feature enhancement and motion-aware clue discarding to mitigate Doppler-induced artifacts and appearance changes, achieving state-of-the-art performance.
Authors:Wanjin Feng, Xingyu Gao, Wenqian Du, Hailong Shi, Peilin Zhao, Pengcheng Wu, Chunyan Miao
Abstract:
Spiking Neural Networks (SNNs) often suffer from high time complexity $O(T)$ due to the sequential processing of $T$ spikes, making training computationally expensive.
In this paper, we propose a novel Fixed-point Parallel Training (FPT) method to accelerate SNN training without modifying the network architecture or introducing additional assumptions.
FPT reduces the time complexity to $O(K)$, where $K$ is a small constant (usually $K=3$), by using a fixed-point iteration form of Leaky Integrate-and-Fire (LIF) neurons for all $T$ timesteps.
We provide a theoretical convergence analysis of FPT and demonstrate that existing parallel spiking neurons can be viewed as special cases of our proposed method.
Experimental results show that FPT effectively simulates the dynamics of original LIF neurons, significantly reducing computational time without sacrificing accuracy.
This makes FPT a scalable and efficient solution for real-world applications, particularly for long-term tasks.
Our code will be released at \href{https://github.com/WanjinVon/FPT}{\texttt{https://github.com/WanjinVon/FPT}}.
中文摘要:本文提出的定点并行训练(FPT)方法通过将LIF神经元重构为定点迭代形式,将SNN训练时间复杂度从O(T)降至O(K),在保持精度的同时大幅提升计算效率。
English Summary: The proposed Fixed-point Parallel Training (FPT) method reduces SNN training time complexity from O(T) to O(K) by reformulating LIF neurons into fixed-point iterations, achieving comparable accuracy with significantly faster computation.
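A deliberately simplified rendering of the fixed-point view, with our own assumptions about the reset form and threshold handling: all $T$ membrane updates are evaluated jointly, and the recurrence is iterated $K$ times using the previous iterate's shifted spikes. The paper's update rule and convergence argument are more refined than this sketch.

```python
import torch

def parallel_lif_fixed_point(I: torch.Tensor, lam=0.9, theta=1.0, K=3):
    """Schematic fixed-point iteration for LIF dynamics over all T steps.

    I: input currents of shape (T, N). Instead of unrolling
    v[t] = lam * v[t-1] * (1 - s[t-1]) + I[t] sequentially, all timesteps
    are updated jointly for K iterations using the previous iterate's
    spikes (details simplified relative to the paper)."""
    T, N = I.shape
    s = torch.zeros(T, N)  # spike estimates for every timestep
    v = torch.zeros(T, N)
    for _ in range(K):
        # Previous-step quantities, shifted by one timestep (zeros at t=0).
        v_prev = torch.cat([torch.zeros(1, N), v[:-1]], dim=0)
        s_prev = torch.cat([torch.zeros(1, N), s[:-1]], dim=0)
        v = lam * v_prev * (1 - s_prev) + I  # all T updates in parallel
        s = (v >= theta).float()             # hard threshold (surrogate in training)
    return s, v

spikes, volts = parallel_lif_fixed_point(torch.rand(16, 8))
```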
Authors:Shaba Shaon, Van-Dinh Nguyen, Dinh C. Nguyen
Abstract:
In this paper, we study a novel latency minimization problem in wireless federated learning (FL) across multi-hop networks. The system comprises multiple routes, each integrating leaf and relay nodes for FL model training. We explore a personalized learning and adaptive aggregation-aware FL (PAFL) framework that effectively addresses data heterogeneity across participating nodes by harmonizing individual and collective learning objectives. We formulate an optimization problem aimed at minimizing system latency through the joint optimization of leaf and relay nodes, as well as the relay routing indicator. We also incorporate an additional energy harvesting scheme for the relay nodes to help with their relay tasks. This formulation presents a computationally demanding challenge, and thus we develop a simple yet efficient algorithm based on block coordinate descent and successive convex approximation (SCA) techniques. Simulation results illustrate the efficacy of our proposed joint optimization approach for leaf and relay nodes with the relay routing indicator. We observe significant latency savings in the wireless multi-hop PAFL system, with reductions of up to 69.37% compared to schemes that optimize only one node type, a traditional greedy algorithm, and a scheme without the relay routing indicator.
中文: 本文提出一种面向多跳网络的个性化联邦学习框架,通过联合优化叶节点、中继节点及路由指示器来最小化系统延迟,基于块坐标下降和逐次凸逼近技术的高效算法实现了高达69.37%的延迟降低。
English: This paper introduces a personalized FL framework for multi-hop networks that jointly optimizes leaf and relay nodes with routing indicators to minimize latency, achieving up to 69.37% reduction through an efficient algorithm based on block coordinate descent and SCA techniques.
Authors:Nirmal Gelal, Chloe Snow, Ambyr Rios, Hande Küçük McGinty
Abstract:
The implementation of transformational pedagogy in secondary education classrooms requires a broad multiliteracy approach. Due to limited planning time and resources, high school English Literature teachers often struggle to curate diverse, thematically aligned literature text sets. This study addresses the critical need for a tool that provides scaffolds for novice educators in selecting literature texts that are diverse -- in terms of genre, theme, subtheme, and author -- yet similar in context and pedagogical merits. We have developed a recommendation system, Teaching Text Expansion for Teacher Scaffolding (T-TExTS), that suggests high school English Literature books based on pedagogical merits, genre, and thematic relevance using a knowledge graph. We constructed a domain-specific ontology using the KNowledge Acquisition and Representation Methodology (KNARM), transformed it into a knowledge graph, and embedded the graph using DeepWalk, biased random walk, and a hybrid of both approaches. The system was evaluated using link prediction and recommendation performance metrics, including Area Under the Curve (AUC), Mean Reciprocal Rank (MRR), Hits@K, and normalized Discounted Cumulative Gain (nDCG). DeepWalk performed best on most ranking metrics, with the highest AUC (0.9431), whereas the hybrid model offered balanced performance. These findings demonstrate the importance of semantic, ontology-driven approaches in recommendation systems and suggest that T-TExTS can significantly ease the burden of English Literature text selection for high school educators, promoting more informed and inclusive curricular decisions. The source code for T-TExTS is available at: https://github.com/koncordantlab/TTExTS
中文摘要:T-TExTS推荐系统通过知识图谱和语义嵌入技术,帮助高中英语教师高效筛选体裁多样且主题契合的文学作品,其中DeepWalk算法在评估中表现出最优性能指标。
English Summary: The T-TExTS recommendation system uses a knowledge graph and semantic embeddings to help high school English teachers efficiently select diverse, thematically aligned literature, with DeepWalk achieving the best performance metrics in evaluation.
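As a sketch of the embedding step only, the snippet below runs the standard DeepWalk recipe (uniform random walks fed to skip-gram) on a toy graph; the real system embeds a KNARM-derived knowledge graph and also evaluates biased walks and a hybrid of both.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def deepwalk_embeddings(G, walk_len=10, walks_per_node=5, dim=64):
    """Uniform random walks + skip-gram: the DeepWalk recipe."""
    walks = []
    for _ in range(walks_per_node):
        for start in G.nodes:
            walk, node = [str(start)], start
            for _ in range(walk_len - 1):
                nbrs = list(G.neighbors(node))
                if not nbrs:
                    break
                node = random.choice(nbrs)
                walk.append(str(node))
            walks.append(walk)
    model = Word2Vec(walks, vector_size=dim, window=5, min_count=0, sg=1)
    return {n: model.wv[str(n)] for n in G.nodes}

G = nx.karate_club_graph()  # toy graph, not the actual ontology
emb = deepwalk_embeddings(G)
print(len(emb), len(next(iter(emb.values()))))  # 34 nodes, 64-dim vectors
```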
Authors:Joydeep Chandra, Aleksandr Algazinov, Satyam Kumar Navneet, Rim El Filali, Matt Laing, Andrew Hanna
Abstract:
In the age of open and free information, a concerning trend of reliance on AI is emerging. However, existing AI tools struggle to evaluate the credibility of information and to justify their assessments. Hence, there is a growing need for systems that can help users evaluate the trustworthiness of online information. Although major search engines incorporate AI features, they often lack clear reliability indicators. We present TrueGL, a model that makes trustworthy search results more accessible. The model is a fine-tuned version of IBM's Granite-1B, trained on a custom dataset and integrated into a search engine with a reliability scoring system. We evaluate the system by using prompt engineering to assign each statement a continuous reliability score from 0.1 to 1 and instructing the model to return a textual explanation alongside the score. Each model's predicted scores are measured against the real scores using standard evaluation metrics. TrueGL consistently outperforms other small-scale LLMs and rule-based approaches across all experiments on key evaluation metrics, including MAE, RMSE, and R2. The model's high accuracy, broad content coverage, and ease of use make trustworthy information more accessible and help reduce the spread of false or misleading content online. Our code is publicly available at https://github.com/AlgazinovAleksandr/TrueGL, and our model is publicly released at https://huggingface.co/JoydeepC/trueGL.
中文摘要:TrueGL是一款经过优化的AI模型,旨在通过为搜索结果提供连续可信度评分和解释来提升在线信息的可靠性,在准确性和易用性方面优于其他模型。
English Summary: TrueGL is a fine-tuned AI model designed to enhance online information reliability by providing search results with continuous trust scores and explanations, outperforming other models in accuracy and accessibility.
Authors:Yewei Liu, Xiyuan Wang, Muhan Zhang
Abstract:
Network pruning, aimed at reducing network size while preserving accuracy, has attracted significant research interest. Numerous pruning techniques have been proposed over time. They have become increasingly effective, but also more complex and harder to interpret. Given the inherent complexity of neural networks, we argue that manually designing pruning criteria has reached a bottleneck. To address this, we propose a novel approach in which we "use a neural network to prune neural networks". More specifically, we introduce the newly developed idea of the metanetwork from meta-learning into pruning. A metanetwork is a network that takes another network as input and produces a modified network as output. In this paper, we first establish a bijective mapping between neural networks and graphs, and then employ a graph neural network as our metanetwork. We train a metanetwork that learns the pruning strategy automatically and can transform a network that is hard to prune into another network that is much easier to prune. Once the metanetwork is trained, our pruning needs nothing more than a feedforward through the metanetwork and the standard finetuning to prune at state-of-the-art. Our method achieved outstanding results on many popular and representative pruning tasks (including ResNet56 on CIFAR10, VGG19 on CIFAR100, and ResNet50 on ImageNet). Our code is available at https://github.com/Yewei-Liu/MetaPruning
Chinese: 我们提出了一种全新的元学习网络剪枝框架,通过元网络自动学习复杂剪枝规则,无需针对每个任务进行专门训练即可在各种网络上实现最先进的剪枝效果。
English: We introduce a novel meta-learning framework for network pruning that automatically learns complex pruning rules through a metanetwork, achieving state-of-the-art results across various networks without requiring special training for each task.
Authors:Yewei Liu, Xiyuan Wang, Muhan Zhang
Abstract:
We propose an entirely new meta-learning framework for network pruning. It is a general framework that can theoretically be applied to almost all types of networks with all kinds of pruning, and it has great generality and transferability. Experiments have shown that it achieves outstanding results on many popular and representative pruning tasks (including both CNNs and Transformers). Unlike prior works, which either rely on fixed, hand-crafted criteria to prune in a coarse manner or employ learning-to-prune schemes that require special training during each pruning and lack generality, our framework learns complex pruning rules automatically via a neural network (metanetwork) and can prune without any special training. More specifically, we introduce the newly developed idea of the metanetwork from meta-learning into pruning. A metanetwork is a network that takes another network as input and produces a modified network as output. In this paper, we first establish a bijective mapping between neural networks and graphs, and then employ a graph neural network as our metanetwork. We train a metanetwork that learns the pruning strategy automatically and can transform a network that is hard to prune into another network that is much easier to prune. Once the metanetwork is trained, our pruning needs nothing more than a feedforward through the metanetwork and some standard finetuning to prune at state-of-the-art. Our code is available at https://github.com/Yewei-Liu/MetaPruning.
Chinese: 我们提出了一种全新的元学习网络剪枝框架,通过元网络自动学习复杂剪枝规则,无需针对每个任务进行专门训练即可在各种网络上实现最先进的剪枝效果。
English: We introduce a novel meta-learning framework for network pruning that automatically learns complex pruning rules through a metanetwork, achieving state-of-the-art results across various networks without requiring special training for each task.
Authors:Hao Gu, Lujun Li, Zheyu Wang, Bei Liu, Qiyuan Zhu, Sirui Han, Yike Guo
Abstract:
Binary quantization represents the most extreme form of large language model (LLM) compression, reducing weights to $\pm$1 for maximal memory and computational efficiency. While recent sparsity-aware binarization methods achieve sub-1-bit compression by pruning redundant binary weights, they suffer from three critical challenges: performance deterioration, computational complexity from sparse mask management, and limited hardware compatibility. In this paper, we present BTC-LLM, a novel sub-1-bit LLM quantization framework that leverages adaptive weight transformation and binary pattern clustering to overcome these limitations, delivering both superior accuracy and efficiency. Our approach incorporates two key innovations: (1) a Learnable Transformation that optimizes invertible scaling and rotation matrices to align binarized weights with full-precision distributions, enabling incoherence processing to enhance layer-wise representation quality; (2) a Flash and Accurate Binary Codebook that identifies recurring binary vector clusters, compressing them into compact indices with tailored distance metrics and sign-based centroid updates. This eliminates the need for sparse masks, enabling efficient inference on standard hardware. Our code is available at https://github.com/Chooovy/BTC-LLM.
中文:BTC-LLM是一种新颖的亚1比特量化框架,通过自适应权重变换和二进制模式聚类克服了性能和硬件限制,无需稀疏掩码即可实现卓越效率。
English: BTC-LLM is a novel sub-1-bit quantization framework that overcomes performance and hardware limitations through adaptive weight transformation and binary pattern clustering, achieving superior efficiency without sparse masks.
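A toy sketch of the two ingredients, under stated assumptions: scaled sign binarization (using the classic mean-absolute-value scale) and a pattern codebook built here with exact matching via `np.unique`, rather than the paper's learned clustering with tailored distance metrics and sign-based centroid updates.

```python
import numpy as np

def binarize_with_scale(W: np.ndarray):
    """W ≈ alpha * sign(W), with per-row alpha set to the mean absolute
    value (the standard binary-weight-network scale)."""
    alpha = np.mean(np.abs(W), axis=1, keepdims=True)
    B = np.where(W >= 0, 1.0, -1.0)
    return alpha, B

def build_codebook(B: np.ndarray, block=8):
    """Group sign vectors into fixed-size blocks and replace recurring
    binary patterns with compact indices (a toy stand-in for the paper's
    clustered codebook; no learned centroids here)."""
    blocks = B.reshape(-1, block)
    codebook, inverse = np.unique(blocks, axis=0, return_inverse=True)
    return codebook, inverse.astype(np.uint16)

W = np.random.randn(64, 64)
alpha, B = binarize_with_scale(W)
codebook, indices = build_codebook(B)
print(codebook.shape, indices.shape)  # distinct patterns, per-block indices
```

Sub-1-bit storage emerges only when patterns repeat often enough that the index table is smaller than the raw sign matrix, which is the regime the clustered codebook targets.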
Authors:Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi, Dongping Chen
Abstract:
Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake\_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style.
中文:大型语言模型正显著影响现实世界的代码风格,通过对数万个GitHub仓库的分析,发现命名规范等编程风格正呈现与AI生成代码一致的变化趋势。
English: Large Language Models are measurably influencing real-world coding styles, as evidenced by trends in naming conventions and other style elements across thousands of GitHub repositories.
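A rough proxy for the style statistic can be computed directly from source code. The sketch below counts multi-word snake_case names among assigned variables using Python's `ast`; this is our simplification for illustration, not the authors' measurement pipeline (which must also handle scoping, generated files, and so on).

```python
import ast
import re

SNAKE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")

def snake_case_share(source: str) -> float:
    """Fraction of assigned variable names that are multi-word snake_case."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)
    if not names:
        return 0.0
    return sum(bool(SNAKE.match(n)) for n in names) / len(names)

print(snake_case_share("max_len = 3\ncamelCase = 1\ntotal_count = max_len"))
```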
Authors:Paul Setinek, Gianluca Galletti, Thomas Gross, Dominik Schnürer, Johannes Brandstetter, Werner Zellinger
Abstract:
Neural surrogates for Partial Differential Equations (PDEs) often suffer significant performance degradation when evaluated on unseen problem configurations, such as novel material types or structural dimensions. Meanwhile, Domain Adaptation (DA) techniques have been widely used in vision and language processing to generalize from limited information about unseen configurations. In this work, we address this gap through two focused contributions. First, we introduce SIMSHIFT, a novel benchmark dataset and evaluation suite composed of four industrial simulation tasks: hot rolling, sheet metal forming, electric motor design, and heatsink design. Second, we extend established domain adaptation methods to state-of-the-art neural surrogates and systematically evaluate them. These approaches use parametric descriptions and ground truth simulations from multiple source configurations, together with only parametric descriptions from target configurations. The goal is to accurately predict target simulations without access to ground truth simulation data. Extensive experiments on SIMSHIFT highlight the challenges of out-of-distribution neural surrogate modeling, demonstrate the potential of DA in simulation, and reveal open problems in achieving robust neural surrogates under distribution shifts in industrially relevant scenarios. Our codebase is available at https://github.com/psetinek/simshift
中文: 神经PDE代理模型在未见配置下表现不佳,而领域自适应技术可利用目标域的参数化数据(无需真实模拟数据)提升其泛化能力。
English: Neural PDE surrogates struggle with unseen configurations, but domain adaptation techniques can improve their generalization using parametric data from target domains without ground truth simulations.
Authors:Korbinian Pöppel, Richard Freinschlag, Thomas Schmied, Wei Lin, Sepp Hochreiter
Abstract:
Modern recurrent architectures, such as xLSTM and Mamba, have recently challenged the Transformer in language modeling. However, their structure constrains their applicability to sequences only or requires processing multi-dimensional data structures, such as images or molecular graphs, in a pre-defined sequential order. In contrast, Multi-Dimensional RNNs (MDRNNs) are well suited for data with a higher level structure, like 2D grids, trees, and directed acyclic graphs (DAGs). In this work, we extend the notion of multi-dimensionality to linear RNNs. We introduce parallelizable Linear Source Transition Mark networks (pLSTMs) using Source, Transition, and Mark gates that act on the line graph of a general DAG. This enables parallelization in analogy to parallel associative scans and the chunkwise-recurrent form of sequential linear RNNs, but for DAGs. For regular grids (1D and 2D), like images, this scheme can be efficiently implemented using einsum operations, concatenations, and padding in logarithmic time. pLSTMs tackle the vanishing/exploding activation/gradient problem for long distances in DAGs via two distinct modes: a directed propagation mode (P-mode) and a diffusive distribution mode (D-mode). To showcase the long-range capabilities of pLSTM, we introduce arrow-pointing extrapolation as a synthetic computer vision task that contains long-distance directional information. We demonstrate that pLSTMs generalize well to larger image sizes, whereas Transformers struggle to extrapolate. On established molecular graph and computer vision benchmarks, pLSTMs also show strong performance. Code and Datasets are available at: https://github.com/ml-jku/plstm_experiments.
Chinese Summary: 本研究提出了可并行化的线性源转换标记网络(pLSTMs),将多维线性RNNs扩展到有向无环图处理,通过并行计算解决长距离依赖问题,在需要方向信息的任务和泛化能力上优于Transformer模型。
English Summary: The study introduces parallelizable Linear Source Transition Mark networks (pLSTMs), which extend multi-dimensional linear RNNs to handle directed acyclic graphs with parallel processing and address long-range dependency issues, outperforming Transformers in tasks requiring directional information and generalization.
Authors:Yue Yao, Zelin Wen, Yan Tong, Xinyu Tian, Xuqing Li, Xiao Ma, Dongliang Xu, Tom Gedeon
Abstract:
Test-time scaling offers a promising way to improve the reasoning performance of vision-language large models (VLLMs) without additional training. In this paper, we explore a simple but effective approach for applying test-time scaling to radiology report generation. Specifically, we introduce a lightweight Thought Graph Traversal (TGT) framework that guides the model to reason through organ-specific findings in a medically coherent order. This framework integrates structured medical priors into the prompt, enabling deeper and more logical analysis with no changes to the underlying model. To further enhance reasoning depth, we apply a reasoning budget forcing strategy that adjusts the model's inference depth at test time by dynamically extending its generation process. This simple yet powerful combination allows a frozen radiology VLLM to self-correct and generate more accurate, consistent chest X-ray reports. Our method outperforms baseline prompting approaches on standard benchmarks, and also reveals dataset biases through traceable reasoning paths. Code and prompts are open-sourced for reproducibility at https://github.com/glerium/Thought-Graph-Traversal.
中文: 思维图遍历框架通过整合医学先验知识和动态推理预算,使冻结的视觉语言模型无需重新训练即可生成更准确的胸部X光报告。
English: The Thought Graph Traversal framework enhances radiology report generation by integrating medical priors and dynamic reasoning budgets, enabling frozen vision-language models to produce more accurate chest X-ray reports without retraining.
Authors:Wuzhenghong Wen, Su Pan, Yuwei Sun
Abstract:
Schema linking is a critical step in the Text-to-SQL task, aiming to accurately predict the table names and column names required for the SQL query based on the given question. However, current fine-tuning approaches for schema linking models employ a rote-learning paradigm, excessively optimizing for ground truth schema linking outcomes while compromising reasoning ability. This limitation arises because of the difficulty of acquiring high-quality reasoning samples for downstream tasks. To address this, we propose Schema-R1, a reasoning schema linking model trained using reinforcement learning. Specifically, Schema-R1 consists of three key steps: constructing small batches of high-quality reasoning samples, supervised fine-tuning for cold-start initialization, and rule-based reinforcement learning training. The final results demonstrate that our method effectively enhances the reasoning ability of the schema linking model, achieving a 10\% improvement in filter accuracy compared to the existing method. Our code is available at https://github.com/hongWin/Schema-R1/.
中文摘要:提出的Schema-R1模型通过强化学习改进Text-to-SQL任务中的模式链接,解决了现有方法死记硬背的局限,将筛选准确率提升了10%。
English Summary: The proposed Schema-R1 model uses reinforcement learning to enhance schema linking in Text-to-SQL tasks, achieving a 10% improvement in filter accuracy by addressing the rote-learning limitations of current methods.
Authors:Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
Abstract:
Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense
中文: 本研究提出了一种基于对比表征学习的大语言模型防御框架,通过三元组损失和对抗性负样本挖掘增强模型鲁棒性,在保持标准性能的同时有效抵御输入级和嵌入空间攻击。
English: This study introduces a contrastive representation learning framework for defending Large Language Models against adversarial attacks, utilizing triplet loss and adversarial mining to enhance robustness without sacrificing standard performance.
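The core objective reduces to a triplet loss over hidden representations; below is a minimal sketch with random tensors standing in for layer activations, omitting the paper's adversarial hard-negative mining loop.

```python
import torch
import torch.nn.functional as F

def crl_defense_loss(anchor, positive, hard_negative, margin=1.0):
    """Triplet loss over hidden representations: pull benign anchors toward
    benign positives and push them away from mined harmful negatives
    (a minimal stand-in for the paper's full objective)."""
    return F.triplet_margin_loss(anchor, positive, hard_negative, margin=margin)

# Toy usage with random "hidden states" of a language-model layer.
h_benign = torch.randn(16, 768)
h_benign_aug = h_benign + 0.1 * torch.randn(16, 768)
h_harmful = torch.randn(16, 768)
loss = crl_defense_loss(h_benign, h_benign_aug, h_harmful)
```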
Authors:Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, Yuxiao Dong
Abstract:
Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training, but it remains under-explored in on-policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. Our approach includes intermediate supervision and eliminates the need for separate reward model training. Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking. We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLMs. TreeRL is open-sourced at https://github.com/THUDM/TreeRL.
中文摘要:TreeRL提出了一种基于策略树搜索的强化学习框架,通过策略性分支和中间监督提升推理性能,无需单独训练奖励模型。
English Summary: TreeRL introduces an on-policy tree search reinforcement learning framework that enhances reasoning performance through strategic branching and intermediate supervision, eliminating the need for separate reward models.
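The uncertainty-guided branching heuristic is easy to illustrate: compute the per-step entropy of the next-token distribution and branch from the most uncertain step. The sketch below is our toy rendering of that idea, not the released implementation.

```python
import math

def token_entropies(stepwise_probs):
    """Per-step entropy of the model's next-token distribution;
    stepwise_probs is a list of probability lists, one per generated step."""
    return [-sum(p * math.log(p + 1e-12) for p in dist) for dist in stepwise_probs]

def pick_branch_step(stepwise_probs):
    """Branch from the highest-uncertainty intermediate step rather than
    at random, mirroring the cost-effective strategy described above."""
    ents = token_entropies(stepwise_probs)
    return max(range(len(ents)), key=ents.__getitem__)

# Toy: step 1 is near-deterministic, step 2 is uncertain -> branch there.
probs = [[0.98, 0.01, 0.01], [0.4, 0.35, 0.25], [0.9, 0.05, 0.05]]
print(pick_branch_step(probs))  # -> 1 (0-indexed second step)
```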
Authors:Zhangkai Ni, Yang Zhang, Wenhan Yang, Hanli Wang, Shiqi Wang, Sam Kwong
Abstract:
Major efforts in data-driven image super-resolution (SR) primarily focus on expanding the receptive field of the model to better capture contextual information. However, these methods are typically implemented by stacking deeper networks or leveraging transformer-based attention mechanisms, which consequently increases model complexity. In contrast, model-driven methods based on the unfolding paradigm show promise in improving performance while effectively maintaining model compactness through sophisticated module design. Based on these insights, we propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity, aiming to combine the strengths of both data-driven and model-driven approaches. Our model operates progressively following the unfolding paradigm. Each iteration consists of multiple Mixed-Scale Gating Modules (MSGM) and an Efficient Sparse Attention Module (ESAM). The former implements comprehensive constraints on features, including a structural similarity constraint, while the latter aims to achieve sparse activation. In addition, we design a Mixture-of-Experts-based Feature Selector (MoE-FS) that fully utilizes multi-level feature information by combining features from different steps. Extensive experiments validate the efficacy and efficiency of our unfolding-inspired network. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption. Our code will be available at: https://github.com/eezkni/SSIU
中文: 提出的结构相似性启发展开(SSIU)方法融合了数据驱动和模型驱动策略,通过专门模块设计实现了高效图像超分辨率,以更少参数和更低内存消耗达到了当前最优性能。
English: The proposed Structural Similarity-Inspired Unfolding (SSIU) method combines data-driven and model-driven approaches for efficient image super-resolution, achieving state-of-the-art performance with fewer parameters and lower memory consumption through specialized modules.
Authors:Maximilian Kreutner, Marlene Lutz, Markus Strohmaier
Abstract:
Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse, but have been found to consistently display a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups that the base model is not aligned with. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict positions of European groups on a diverse set of policies. We evaluate if predictions are stable towards counterfactual arguments, different persona prompts and generation methods. Finally, we find that we can simulate voting behavior of Members of the European Parliament reasonably well with a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at https://github.com/dess-mannheim/european_parliament_simulation.
中文摘要:研究表明,零样本角色提示能有效模拟个人投票决策并预测欧洲群体的政策立场,对欧洲议会议员投票行为的模拟加权F1分数达到约0.793。
English Summary: This study demonstrates that zero-shot persona prompting can effectively simulate individual voting decisions and predict policy positions of European groups, achieving a weighted F1 score of approximately 0.793 for simulating European Parliament members' voting behavior.
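The headline metric is a standard weighted F1 over vote labels; here is a toy computation with scikit-learn (the labels and data are invented for illustration):

```python
from sklearn.metrics import f1_score

# Toy roll-call: actual votes vs. votes predicted from persona prompts.
actual    = ["for", "against", "for", "abstain", "for", "against"]
predicted = ["for", "against", "for", "for",     "for", "abstain"]

# 'weighted' averages per-class F1 by class frequency, matching the
# evaluation style reported above (~0.793 on the real dataset).
print(round(f1_score(actual, predicted, average="weighted"), 3))
```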
Authors:Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang
Abstract:
Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness remains limited by the fixed context window of the transformer architecture, posing challenges for long-context modeling. Among these challenges, length generalization -- the ability to generalize to sequences longer than those seen during training -- is a classical and fundamental problem. In this work, we propose a fresh perspective on length generalization, shifting the focus from the conventional emphasis on input features such as positional encodings or data structures to the output distribution of the model. Specifically, through case studies on synthetic tasks, we highlight the critical role of \textbf{long-short alignment} -- the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify this phenomenon, uncovering a strong correlation between the metric and length generalization performance. Building on these findings, we develop a regularization term that promotes long-short alignment during training. Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment.
Chinese: 本研究提出了大语言模型中长度泛化的新视角,强调输出分布的长短对齐重要性,通过设计量化指标和正则化方法,有效提升了长上下文任务的性能表现。
English: This research introduces a novel perspective on length generalization in large language models by emphasizing the importance of long-short alignment in output distributions, proposing a metric and regularization method that significantly improves performance on long-context tasks.
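One way to operationalize such a metric, under our schematic reading (not necessarily the paper's exact definition), is a symmetric KL divergence between the output distributions a model produces for a short context and a lengthened version of the same context:

```python
import torch
import torch.nn.functional as F

def long_short_misalignment(logits_short, logits_long):
    """Symmetric KL between next-token distributions for a short context
    and a lengthened version of the same context (schematic)."""
    p = F.log_softmax(logits_short, dim=-1)
    q = F.log_softmax(logits_long, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")  # KL(p||q)
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")  # KL(q||p)
    return 0.5 * (kl_pq + kl_qp)

score = long_short_misalignment(torch.randn(4, 32000), torch.randn(4, 32000))
```

A regularizer along these lines would simply add the misalignment score, weighted by a coefficient, to the training loss.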
Authors:Muhammad Sarmad, Arnt-Børre Salberg, Michael Kampffmeyer
Abstract:
This paper presents DiffFuSR, a modular pipeline for super-resolving all 12 spectral bands of Sentinel-2 Level-2A imagery to a unified ground sampling distance (GSD) of 2.5 meters. The pipeline comprises two stages: (i) a diffusion-based super-resolution (SR) model trained on high-resolution RGB imagery from the NAIP and WorldStrat datasets, harmonized to simulate Sentinel-2 characteristics; and (ii) a learned fusion network that upscales the remaining multispectral bands using the super-resolved RGB image as a spatial prior. We introduce a robust degradation model and contrastive degradation encoder to support blind SR. Extensive evaluations of the proposed SR pipeline on the OpenSR benchmark demonstrate that the proposed method outperforms current SOTA baselines in terms of reflectance fidelity, spectral consistency, spatial alignment, and hallucination suppression. Furthermore, the fusion network significantly outperforms classical pansharpening approaches, enabling accurate enhancement of Sentinel-2's 20 m and 60 m bands. This study underscores the power of harmonized learning with generative priors and fusion strategies to create a modular framework for Sentinel-2 SR. Our code and models can be found at https://github.com/NorskRegnesentral/DiffFuSR.
中文:DiffFuSR是一个模块化流程,通过扩散模型和融合网络将哨兵2号影像超分辨率提升至2.5米,在保真度和光谱一致性上优于现有方法。
English: DiffFuSR is a modular pipeline that super-resolves Sentinel-2 imagery to 2.5 meters using a diffusion-based model and fusion network, outperforming existing methods in fidelity and spectral consistency.
Authors:Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao
Abstract:
Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports--compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The second evaluates a DRA's information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.
中文: 深度研究代理利用大语言模型自动化复杂研究任务,但缺乏系统性评估基准的问题通过DeepResearch Bench得以解决,该基准提供100个专家设计的任务和与人类判断高度一致的新型评估方法。
English: Deep Research Agents leverage LLMs to automate complex research tasks, but the lack of a comprehensive benchmark is addressed by DeepResearch Bench, which offers 100 expert-designed tasks and novel evaluation methods aligned with human judgment.
Authors:Dinh Viet Cuong, Hoang-Bao Le, An Pham Ngoc Nguyen, Liting Zhou, Cathal Gurrin
Abstract:
This paper addresses two main objectives. Firstly, we demonstrate the impressive performance of LLaVA-NeXT-Interleave on 22 datasets across three different tasks: Multi-Image Reasoning, Documents and Knowledge-Based Understanding, and Interactive Multi-Modal Communication. Secondly, we add the Dense Channel Integration (DCI) connector to LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks like VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding, such as MIT-States_PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for interleave tasks. The code is available at https://github.com/dinhvietcuong1996/icme25-inova.
Chinese: 该研究展示了LLaVA-NeXT-Interleave在三个任务的22个数据集上的优异表现,其中标准模型在视觉密集型任务中表现突出,而DCI增强版本在语义连贯性和结构化变化理解方面更具优势。
English: The study showcases LLaVA-NeXT-Interleave's strong performance across 22 datasets in three tasks, with the standard model excelling in vision-heavy tasks while the DCI-enhanced version performs better on semantic coherence and structured change understanding.
Authors:Víctor Gallego
Abstract:
Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets and fine-tuned models are released at https://github.com/vicgalle/configurable-preference-tuning
中文: 本文提出可配置偏好调优(CPT)框架,通过基于结构化准则生成合成偏好数据,使语言模型能够根据人类可解释的指令动态调整输出行为,突破了传统方法对静态偏好的依赖,无需重新训练即可实现细粒度控制。
English: This paper introduces Configurable Preference Tuning (CPT), a framework that enables language models to dynamically adapt their behavior using human-interpretable directives, overcoming the static preference limitations of methods like DPO by leveraging rubric-guided synthetic data for fine-grained control without retraining.
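The data-construction idea can be pictured as rubric-conditioned preference records; the schema below is hypothetical, invented purely for illustration of the structure, not the paper's actual format.

```python
# Hypothetical rubric-conditioned preference record for DPO-style tuning;
# all field names and contents are illustrative assumptions.
rubric = {"attribute": "writing_style", "level": "terse, formal"}
record = {
    "system": f"Respond in a {rubric['level']} style.",
    "prompt": "Explain what a hash table is.",
    "chosen": "A hash table maps keys to values via a hash function.",
    "rejected": "So, hash tables! They're basically super cool because...",
}
```

At inference time, only the system prompt changes; the fine-tuned model modulates its outputs accordingly without retraining.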
Authors:Libin Lan, Hongxing Li, Zunhui Xia, Yudong Zhang
Abstract:
Incomplete multi-modal medical image segmentation faces critical challenges from modality imbalance, including imbalanced modality missing rates and heterogeneous modality contributions. Due to their reliance on idealized assumptions of complete modality availability, existing methods fail to dynamically balance contributions and neglect the structural relationships between modalities, resulting in suboptimal performance in real-world clinical scenarios. To address these limitations, we propose a novel model, named Dynamic Modality-Aware Fusion Network (DMAF-Net). The DMAF-Net adopts three key ideas. First, it introduces a Dynamic Modality-Aware Fusion (DMAF) module to suppress missing-modality interference by combining transformer attention with adaptive masking and weighting modality contributions dynamically through attention maps. Second, it designs a synergistic Relation Distillation and Prototype Distillation framework to enforce global-local feature alignment via covariance consistency and masked graph attention, while ensuring semantic consistency through cross-modal class-specific prototype alignment. Third, it presents a Dynamic Training Monitoring (DTM) strategy to stabilize optimization under imbalanced missing rates by tracking distillation gaps in real-time, and to balance convergence speeds across modalities by adaptively reweighting losses and scaling gradients. Extensive experiments on BraTS2020 and MyoPS2020 demonstrate that DMAF-Net outperforms existing methods for incomplete multi-modal medical image segmentation. Our code is available at https://github.com/violet-42/DMAF-Net.
中文: 提出的动态模态感知融合网络(DMAF-Net)通过动态融合、双重蒸馏和自适应训练策略,解决了不完整多模态医学图像分割中的模态不平衡问题,在多个基准数据集上超越了现有方法。
English: The proposed Dynamic Modality-Aware Fusion Network (DMAF-Net) addresses modality imbalance in incomplete multi-modal medical image segmentation through dynamic fusion, dual distillation, and adaptive training strategies, outperforming existing methods on benchmark datasets.
Authors:Libin Lan, Hongxing Li, Zunhui Xia, Juan Zhou, Xiaofei Zhu, Yongmei Li, Yudong Zhang, Xin Luo
Abstract:
Learning medical visual representations directly from paired images and reports through multimodal self-supervised learning has emerged as a novel and efficient approach to digital diagnosis in recent years. However, existing models suffer from several severe limitations: 1) they neglect the selection of negative samples, resulting in the scarcity of hard negatives and the inclusion of false negatives; 2) they focus on global feature extraction while overlooking the fine-grained local details that are crucial for medical image recognition tasks; and 3) their contrastive learning primarily targets high-level features while ignoring the low-level details that are essential for accurate medical analysis. Motivated by these critical issues, this paper presents a Cross-Modal Cluster-Guided Negative Sampling (CM-CGNS) method with two-fold ideas. First, it extends the k-means clustering used for local text features in the single-modal domain to the multimodal domain through cross-modal attention. This improvement increases the number of negative samples and boosts the model representation capability. Second, it introduces a Cross-Modal Masked Image Reconstruction (CM-MIR) module that leverages local text-to-image features obtained via cross-modal attention to reconstruct masked local image regions. This module significantly strengthens the model's cross-modal information interaction capabilities and retains low-level image features essential for downstream tasks. By handling the aforementioned limitations well, the proposed CM-CGNS can learn effective and robust medical visual representations suitable for various recognition tasks. Extensive experimental results on classification, detection, and segmentation tasks across five downstream datasets show that our method outperforms state-of-the-art approaches on multiple metrics, verifying its superior performance.
中文: 本文提出跨模态聚类引导负采样方法,通过改进负样本选择和利用跨模态注意力保留细粒度细节,显著提升医学视觉表征学习效果,在多项下游任务中超越现有最佳方法。
English: This paper introduces a Cross-Modal Cluster-Guided Negative Sampling method that enhances medical visual representation learning by improving negative sample selection and preserving fine-grained details through cross-modal attention and masked image reconstruction, achieving superior performance across multiple medical tasks.
Authors:Yunhan Ren, Ruihuang Li, Lingbo Liu, Changwen Chen
Abstract:
Instance segmentation of prohibited items in security X-ray images is a critical yet challenging task. This is mainly caused by the significant appearance gap between prohibited items in X-ray images and natural objects, as well as the severe overlapping among objects in X-ray images. To address these issues, we propose an occlusion-aware instance segmentation pipeline designed to identify prohibited items in X-ray images. Specifically, to bridge the representation gap, we integrate the Segment Anything Model (SAM) into our pipeline, taking advantage of its rich priors and zero-shot generalization capabilities. To address the overlap between prohibited items, we design an occlusion-aware bilayer mask decoder module that explicitly models the occlusion relationships. To supervise occlusion estimation, we manually annotated occlusion areas of prohibited items in two large-scale X-ray image segmentation datasets, PIDray and PIXray. We then reorganized these additional annotations together with the original information as two occlusion-annotated datasets, PIDray-A and PIXray-A. Extensive experimental results on these occlusion-annotated datasets demonstrate the effectiveness of our proposed method. The datasets and codes are available at: https://github.com/Ryh1218/Occ
Chinese: 本文提出了一种遮挡感知的实例分割流程,结合SAM模型和双层掩码解码器,有效解决了X光图像中违禁物品识别因遮挡和外观差异带来的难题,并在新标注的数据集PIDray-A和PIXray-A上验证了其优越性能。
English: This paper introduces an occlusion-aware instance segmentation pipeline that leverages the Segment Anything Model (SAM) and a bilayer mask decoder to address the challenges of identifying prohibited items in X-ray images, validated on newly annotated datasets PIDray-A and PIXray-A.
Authors:Emre Kavak, Tom Nuno Wolf, Christian Wachinger
Abstract:
Dataset bias often leads deep learning models to exploit spurious correlations instead of task-relevant signals. We introduce the Standard Anti-Causal Model (SAM), a unifying causal framework that characterizes bias mechanisms and yields a conditional independence criterion for causal stability. Building on this theory, we propose DISCO$_m$ and sDISCO, efficient and scalable estimators of conditional distance correlation that enable independence regularization in black-box models. Across five diverse datasets, our methods consistently outperform or are competitive with existing bias mitigation approaches, while requiring fewer hyperparameters and scaling seamlessly to multi-bias scenarios. This work bridges causal theory and practical deep learning, providing both a principled foundation and effective tools for robust prediction. Source Code: https://github.com/***.
Chinese: 本文提出因果框架和高效估计器来解决深度学习中的数据集偏差问题,在多种数据集上以最少超参数实现了稳健性能。
English: This paper introduces a causal framework and efficient estimators to mitigate dataset bias in deep learning, achieving robust performance across diverse datasets with minimal hyperparameters.
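Since DISCO$_m$ and sDISCO build on distance correlation, a compact reference implementation of the standard (unconditional) statistic helps fix ideas; the conditional estimators in the paper extend this by conditioning on the bias variable. This NumPy version follows the textbook definition and is not the authors' code.

import numpy as np

def distance_correlation(x, y):
    # x, y: (n, d) samples. Build pairwise Euclidean distance matrices.
    a = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    b = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
    # Double-center each matrix: subtract row/column means, add grand mean.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = max((A * B).mean(), 0.0)                 # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x[:, :1] ** 2 + 0.1 * rng.normal(size=(200, 1))  # nonlinear dependence
print(distance_correlation(x, y))                    # well above 0 despite near-zero linear correlation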
Authors:Shashank Balla
Abstract:
The widespread adoption of outsourced neural network inference presents significant privacy challenges, as sensitive user data is processed on untrusted remote servers. Secure inference offers a privacy-preserving solution, but existing frameworks suffer from high computational overhead and communication costs, rendering them impractical for real-world deployment. We introduce SecONNds, a non-intrusive secure inference framework optimized for large ImageNet-scale Convolutional Neural Networks. SecONNds integrates a novel fully Boolean Goldreich-Micali-Wigderson (GMW) protocol for secure comparison -- addressing Yao's millionaires' problem -- using preprocessed Beaver's bit triples generated from Silent Random Oblivious Transfer. Our novel protocol achieves an online speedup of 17$\times$ in nonlinear operations compared to state-of-the-art solutions while reducing communication overhead. To further enhance performance, SecONNds employs Number Theoretic Transform (NTT) preprocessing and leverages GPU acceleration for homomorphic encryption operations, resulting in speedups of 1.6$\times$ on CPU and 2.2$\times$ on GPU for linear operations. We also present SecONNds-P, a bit-exact variant that ensures verifiable full-precision results in secure computation, matching the results of plaintext computations. Evaluated on a 37-bit quantized SqueezeNet model, SecONNds achieves an end-to-end inference time of 2.8 s on GPU and 3.6 s on CPU, with a total communication of just 420 MiB. SecONNds' efficiency and reduced computational load make it well-suited for deploying privacy-sensitive applications in resource-constrained environments. SecONNds is open source and can be accessed from: https://github.com/shashankballa/SecONNds.
Chinese: SecONNds提出了一种针对大型神经网络优化的高效安全推理框架,通过创新协议和GPU加速显著降低了计算和通信开销,同时确保隐私保护。
English: SecONNds introduces an efficient secure inference framework optimized for large neural networks, significantly reducing computational and communication overhead while ensuring privacy through novel protocols and GPU acceleration.
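The comparison protocol is built from Boolean shares and preprocessed bit triples; the kernel of that machinery is a single secure AND gate, sketched below for two parties with a trusted dealer standing in for the Silent Random Oblivious Transfer phase. This is a toy illustration of GMW with Beaver bit triples, not the SecONNds protocol itself.

import secrets

def share(bit):                       # XOR secret-share one bit between two parties
    s0 = secrets.randbits(1)
    return s0, s0 ^ bit

def beaver_and(x_sh, y_sh, triple):
    # triple: XOR shares of (a, b, c) with c = a & b, from a trusted dealer here
    (a0, a1), (b0, b1), (c0, c1) = triple
    (x0, x1), (y0, y1) = x_sh, y_sh
    d = (x0 ^ a0) ^ (x1 ^ a1)         # both parties open d = x ^ a
    e = (y0 ^ b0) ^ (y1 ^ b1)         # and e = y ^ b
    # x & y = d&e ^ d&b ^ e&a ^ c; only party 0 adds the public d&e term
    z0 = (d & e) ^ (d & b0) ^ (e & a0) ^ c0
    z1 = (d & b1) ^ (e & a1) ^ c1
    return z0, z1

a, b = secrets.randbits(1), secrets.randbits(1)       # dealer (offline) phase
triple = (share(a), share(b), share(a & b))
z0, z1 = beaver_and(share(1), share(1), triple)       # online phase: 1 & 1
assert z0 ^ z1 == 1                                   # reconstruct the result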
Authors:Xiaoyu Ma, Hao Chen, Yongjian Deng
Abstract:
Different modalities hold considerable gaps in optimization trajectories, including speeds and paths, which lead to modality laziness and modality clash when jointly training multimodal models, resulting in insufficient and imbalanced multimodal learning. Existing methods focus on enforcing the weak modality by adding modality-specific optimization objectives, aligning their optimization speeds, or decomposing multimodal learning to enhance unimodal learning. These methods fail to achieve both unimodal sufficiency and multimodal balance. In this paper, we, for the first time, address both concerns by proposing multimodal Data Remixing, which decouples multimodal data and filters hard samples for each modality to mitigate modality imbalance, and then reassembles batches to align gradient directions and avoid cross-modal interference, thus enhancing unimodal learning sufficiency. Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50%$\uparrow$ on CREMAD and 3.41%$\uparrow$ on Kinetic-Sounds, without training set expansion or additional computational overhead during inference. The source code is available at https://github.com/MatthewMaxy/Remix_ICML2025.
Chinese: 本文提出多模态数据重组方法,通过解耦和重组数据解决模态不平衡问题并增强单模态学习,在不增加计算成本的情况下显著提升了准确率。
English: This paper introduces multimodal Data Remixing, a method that decouples and reassembles data to address modality imbalance and enhance unimodal learning, achieving significant accuracy improvements without extra computational costs.
Authors:Akshay Jindal, Nabil Sadaka, Manu Mathew Thomas, Anton Sochenov, Anton Kaplanyan
Abstract:
While existing video and image quality datasets have extensively studied natural videos and traditional distortions, the perception of synthetic content and modern rendering artifacts remains underexplored. We present a novel video quality dataset focused on distortions introduced by advanced rendering techniques, including neural supersampling, novel-view synthesis, path tracing, neural denoising, frame interpolation, and variable rate shading. Our evaluations show that existing full-reference quality metrics perform sub-optimally on these distortions, with a maximum Pearson correlation of 0.78. Additionally, we find that the feature space of pre-trained 3D CNNs aligns strongly with human perception of visual quality. We propose CGVQM, a full-reference video quality metric that significantly outperforms existing metrics while generating both per-pixel error maps and global quality scores. Our dataset and metric implementation are available at https://github.com/IntelLabs/CGVQM.
Chinese: 本文针对现代渲染失真提出了一种新型视频质量数据集,并开发了CGVQM全参考质量评估方法,其性能显著优于现有指标。
English: This paper introduces a novel video quality dataset for modern rendering distortions and proposes CGVQM, a full-reference metric that surpasses existing methods in performance.
Authors:Zhaoyang Wang, Jie Li, Wen Lu, Lihuo He, Maoguo Gong, Xinbo Gao
Abstract:
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. As video frame rates continue to increase, the diminishing inter-frame differences further expose the limitations of traditional frame-to-frame information exploitation methods, which are inadequate for addressing current video super-resolution (VSR) demands. To overcome these challenges, we propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames. The proposed modular architecture is designed for seamless integration with existing VSR frameworks, ensuring strong adaptability and transferability across diverse applications. Experimental results demonstrate that our method achieves performance on par with, or surpassing, the current SOTA models, while significantly reducing inference time. By addressing key bottlenecks in CVSR, our work offers a practical and efficient pathway for advancing VSR technology. Our code will be publicly available at https://github.com/handsomewzy/FCA2.
Chinese: 我们提出的压缩驱动降维方法在性能上达到或超越了当前最先进的压缩视频超分辨率模型,同时显著减少了推理时间并增强了帧间时序信息提取能力。
English: Our proposed method introduces a compression-driven dimensionality reduction strategy that achieves performance comparable to or better than state-of-the-art compressed video super-resolution models while significantly reducing inference time and enhancing temporal information extraction.
Authors:Heng Fang, Hossein Azizpour
Abstract:
Climate change is leading to an increase in extreme weather events, causing significant environmental damage and loss of life. Early detection of such events is essential for improving disaster response. In this work, we propose SITS-Extreme, a novel framework that leverages satellite image time series to detect extreme events by incorporating multiple pre-disaster observations. This approach effectively filters out irrelevant changes while isolating disaster-relevant signals, enabling more accurate detection. Extensive experiments on both real-world and synthetic datasets validate the effectiveness of SITS-Extreme, demonstrating substantial improvements over widely used strong bi-temporal baselines. Additionally, we examine the impact of incorporating more timesteps, analyze the contribution of key components in our framework, and evaluate its performance across different disaster types, offering valuable insights into its scalability and applicability for large-scale disaster monitoring.
Chinese: SITS-Extreme框架通过整合多时相卫星影像与灾前观测数据,能有效过滤无关变化并提取灾害信号,在各类灾害检测实验中显著优于双时相基线方法,为大规模灾害监测提供了可扩展的解决方案。
English: The SITS-Extreme framework utilizes satellite image time series with multiple pre-disaster observations to accurately detect extreme weather events by filtering irrelevant changes and isolating disaster signals, demonstrating significant improvements over bi-temporal baselines in experiments across various disaster types.
Authors:Zhuguanyu Wu, Shihe Wang, Jiayi Zhang, Jiaxin Chen, Yunhong Wang
Abstract:
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years, as it avoids computationally intensive model retraining. Nevertheless, current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization. To address these shortcomings, we analyze the prevailing Hessian-guided quantization loss, and uncover certain limitations of conventional Hessian approximations. By following the block-wise reconstruction framework, we propose a novel PTQ method for ViTs, dubbed FIMA-Q. Specifically, we first establish the connection between KL divergence and FIM, which enables fast computation of the quantization loss during reconstruction. We further propose an efficient FIM approximation method, namely DPLR-FIM, by employing the diagonal plus low-rank principle, and formulate the ultimate quantization loss. Our extensive experiments, conducted across various vision tasks with representative ViT-based architectures on public datasets, demonstrate that our method substantially improves accuracy compared to the state-of-the-art approaches, especially in the case of low-bit quantization. The source code is available at https://github.com/ShiheWang/FIMA-Q.
Chinese: 本文提出FIMA-Q这一新型视觉Transformer后训练量化方法,通过快速Fisher信息矩阵近似和分块重建框架解决低比特量化精度损失问题,在多项视觉任务中显著超越现有最优方法。
English: This paper introduces FIMA-Q, a novel post-training quantization method for Vision Transformers that addresses accuracy degradation in low-bit settings by leveraging a fast Fisher Information Matrix approximation and block-wise reconstruction, achieving superior performance across various tasks.
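The diagonal-plus-low-rank principle is easy to state in code: approximate the Fisher information as diag(d) + U Uᵀ so that the quadratic quantization loss Δwᵀ F Δw can be evaluated in O(nr) time without forming an n × n matrix. How d and U are estimated below (from per-sample gradients, with the diagonal taken as a residual) is our illustrative choice, not necessarily FIMA-Q's.

import numpy as np

def dplr_fim(grads, rank=4):
    # grads: (m, n) per-sample gradients; empirical FIM ~= G^T G / m.
    m, n = grads.shape
    _, s, vt = np.linalg.svd(grads / np.sqrt(m), full_matrices=False)
    U = vt[:rank].T * s[:rank]                             # (n, rank): dominant curvature directions
    diag_full = (grads ** 2).mean(axis=0)                  # exact diagonal of the empirical FIM
    d = np.maximum(diag_full - (U ** 2).sum(axis=1), 0.0)  # residual diagonal
    return d, U

def quant_loss(delta_w, d, U):
    # delta_w^T (diag(d) + U U^T) delta_w, never materializing the full matrix
    return float(delta_w @ (d * delta_w) + np.sum((U.T @ delta_w) ** 2))

g = np.random.randn(64, 1000)
d, U = dplr_fim(g)
print(quant_loss(np.random.randn(1000) * 0.01, d, U))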
Authors:Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, Korbinian Moller, Roberto Brusnicki, Baha Zarrouki, Alessio Gambi, Jan Frederik Totz, Kai Storms, Steven Peters, Andrea Stocco, Bassam Alrifaee, Marco Pavone, Johannes Betz
Abstract:
For autonomous vehicles, safe navigation in complex environments depends on handling a broad range of diverse and rare driving scenarios. Simulation- and scenario-based testing have emerged as key approaches to the development and validation of autonomous driving systems. Traditional scenario generation relies on rule-based systems, knowledge-driven models, and data-driven synthesis, often producing limited diversity and unrealistic safety-critical cases. With the emergence of foundation models, which represent a new generation of pre-trained, general-purpose AI models, developers can process heterogeneous inputs (e.g., natural language, sensor data, HD maps, and control actions), enabling the synthesis and interpretation of complex driving scenarios. In this paper, we present a survey of the application of foundation models to scenario generation and scenario analysis in autonomous driving (as of May 2025). Our survey presents a unified taxonomy that includes large language models, vision-language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios. In addition, we review the methodologies, open-source datasets, simulation platforms, and benchmark challenges, and we examine the evaluation metrics tailored explicitly to scenario generation and analysis. Finally, the survey concludes by highlighting the open challenges and research questions, and outlining promising future research directions. All reviewed papers are listed in a continuously maintained repository, which contains supplementary materials and is available at https://github.com/TUM-AVS/FM-for-Scenario-Generation-Analysis.
Chinese: 本综述探讨了基础模型如何通过统一分类法和评估指标,生成和分析多样化、真实的驾驶场景,以克服传统方法的局限,从而提升自动驾驶系统的能力。
English: This survey explores how foundation models enhance autonomous driving by generating and analyzing diverse, realistic driving scenarios, addressing limitations of traditional methods through a unified taxonomy and evaluation metrics.
Authors:Xiao Xu, Libo Qin, Wanxiang Che, Min-Yen Kan
Abstract:
Two-Tower Vision--Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it \textit{(i)} suffers from ineffective layer-by-layer utilization of unimodal representations, \textit{(ii)} restricts the flexible exploitation of different levels of unimodal semantic knowledge, and \textit{(iii)} is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower.
Chinese: 提出的Manager插件通过自适应整合多层级单模态专家知识,在双塔视觉语言模型和多模态大语言模型中均实现了性能提升,并与多网格算法形成互补,增强了视觉细节的捕捉能力。
English: The proposed Manager plugin enhances Two-Tower Vision-Language Models by adaptively integrating multi-level unimodal expertise, achieving superior performance across VL tasks and MLLM architectures while complementing multi-grid algorithms for richer visual representation.
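A minimal sketch of the "manager" idea as we read it: a lightweight plugin that adaptively mixes the hidden states of all layers of a unimodal encoder before cross-modal fusion. The module name, the token-wise softmax gating, and all dimensions are our assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class LayerManager(nn.Module):
    """Aggregates L levels of unimodal features with adaptive, per-token weights."""
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_layers)        # predicts one mixing weight per layer

    def forward(self, layer_feats):                   # list of L tensors, each (B, T, D)
        stack = torch.stack(layer_feats, dim=-2)              # (B, T, L, D)
        w = torch.softmax(self.gate(stack.mean(-2)), dim=-1)  # (B, T, L) token-wise weights
        return (w.unsqueeze(-1) * stack).sum(-2)              # (B, T, D) fused representation

# Example: combine 12 ViT layers for a batch of 2, 196 tokens, dim 768
feats = [torch.randn(2, 196, 768) for _ in range(12)]
fused = LayerManager(12, 768)(feats)
print(fused.shape)  # torch.Size([2, 196, 768])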
Authors:Chenrui Cao, Liangcheng Song, Zenan Li, Xinyi Le, Xian Zhang, Hui Xue, Fan Yang
Abstract:
Recent advancements, such as DeepSeek-Prover-V2-671B and Kimina-Prover-Preview-72B, demonstrate a prevailing trend in leveraging reinforcement learning (RL)-based large-scale training for automated theorem proving. Surprisingly, we discover that even without any training, careful neuro-symbolic coordination of existing off-the-shelf reasoning models and tactic step provers can achieve comparable performance. This paper introduces \textbf{DSP+}, an improved version of the Draft, Sketch, and Prove framework, featuring a \emph{fine-grained and integrated} neuro-symbolic enhancement for each phase: (1) In the draft phase, we prompt reasoning models to generate concise natural-language subgoals to benefit the sketch phase, removing thinking tokens and references to human-written proofs; (2) In the sketch phase, subgoals are autoformalized with hypotheses to benefit the proving phase, and sketch lines containing syntactic errors are masked according to predefined rules; (3) In the proving phase, we tightly integrate symbolic search methods like Aesop with step provers to establish proofs for the sketch subgoals. Experimental results show that, without any additional model training or fine-tuning, DSP+ solves 80.7\%, 32.8\%, and 24 out of 644 problems from miniF2F, ProofNet, and PutnamBench, respectively, while requiring fewer budgets compared to state-of-the-arts. DSP+ proves \texttt{imo\_2019\_p1}, an IMO problem in miniF2F that is not solved by any prior work. Additionally, DSP+ generates proof patterns comprehensible by human experts, facilitating the identification of formalization errors; For example, eight wrongly formalized statements in miniF2F are discovered. Our results highlight the potential of classical reasoning patterns besides the RL-based training. All components will be open-sourced.
Chinese: 本文提出DSP+框架,通过精细整合神经符号方法优化草案、草图和证明阶段,无需额外训练即可实现顶尖的自动定理证明性能,并发现人类可理解的证明模式。
English: This paper introduces DSP+, an enhanced neuro-symbolic framework that achieves state-of-the-art automated theorem proving results without additional training by refining draft, sketch, and prove phases with integrated reasoning and symbolic methods.
Authors:Haotian Ni, Yake Wei, Hang Liu, Gong Chen, Chong Peng, Hao Lin, Di Hu
Abstract:
Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, aim to address this challenge by adaptively emphasizing modalities based on the characteristics of the input data. However, through extensive, carefully designed experiments, we surprisingly observed that the dynamic adaptability of widely-used self-attention models diminishes: the model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating the attention mechanism's dynamic properties. To revive adaptability, we propose a simple yet effective method, Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ; restoring the cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at https://github.com/GeWu-Lab/RollingQ_ICML2025.
Chinese: 研究发现多模态学习中的自注意力机制常无法动态适应不同模态质量,反而偏好单一模态形成自我强化偏差,但提出的滚动查询(RollingQ)方法通过轮换查询平衡注意力分配,成功恢复了模型的动态适应性。
English: The study reveals that self-attention mechanisms in multimodal learning often fail to dynamically adapt to varying modality qualities, instead favoring one modality and creating a self-reinforcing bias, but proposes Rolling Query (RollingQ) to restore adaptability by rotating queries to balance attention allocation.
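The imbalance the paper diagnoses is measurable with a few lines: in a fused token sequence, check how much attention mass lands on each modality's tokens. The sketch below computes exactly that diagnostic quantity (which RollingQ rebalances by rotating queries); the helper name and the roughly 50/50 expectation for balanced fusion are ours.

import torch

def modality_attention_mass(q, k, modality_ids):
    """q, k: (B, T, D); modality_ids: (T,) with 0/1 marking each token's modality.
    Returns the average attention mass allocated to each modality's tokens."""
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)  # (B, T, T)
    mass = attn.mean(dim=(0, 1))                  # average attention received per key token
    m0 = mass[modality_ids == 0].sum().item()
    m1 = mass[modality_ids == 1].sum().item()
    return m0, m1                                 # balanced fusion => roughly 0.5 / 0.5

B, T, D = 4, 10, 32
q, k = torch.randn(B, T, D), torch.randn(B, T, D)
ids = torch.tensor([0] * 5 + [1] * 5)
print(modality_attention_mass(q, k, ids))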
Authors:Abhishek Tyagi, Arjun Iyer, William H Renninger, Christopher Kanan, Yuhao Zhu
Abstract:
Recent advances in Dynamic Sparse Training (DST) have pushed the frontier of sparse neural network training in structured and unstructured contexts, matching dense-model performance while drastically reducing parameter counts to facilitate model scaling. However, unstructured sparsity often fails to translate into practical speedups on modern hardware. To address this shortcoming, we propose DynaDiag, a novel structured sparse-to-sparse DST method that performs on par with unstructured sparsity. DynaDiag enforces a diagonal sparsity pattern throughout training and preserves sparse computation in forward and backward passes. We further leverage the diagonal structure to accelerate computation via a custom CUDA kernel, rendering the method hardware-friendly. Empirical evaluations on diverse neural architectures demonstrate that our method maintains accuracy on par with unstructured counterparts while benefiting from tangible computational gains. Notably, with 90% sparse linear layers in ViTs, we observe up to a 3.13x speedup in online inference without sacrificing model performance and a 1.59x speedup in training on a GPU compared to equivalent unstructured layers. Our source code is available at https://github.com/horizon-research/DynaDiag/.
Chinese: DynaDiag提出了一种采用对角线稀疏模式的结构化稀疏训练方法,在保持与无结构化方法同等精度的同时,通过硬件友好的实现获得了显著的计算加速效果。
English: DynaDiag introduces a structured sparse training method using diagonal sparsity patterns that matches the accuracy of unstructured approaches while achieving significant computational speedups through hardware-friendly implementations.
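The structure is simple to emulate: keep only a fixed set of wrap-around diagonals of each weight matrix. A mask-based PyTorch sketch follows; it reproduces the pattern but not the speedups, which in the paper come from a custom CUDA kernel, and the wrap-around choice is our reading of "diagonal sparsity".

import torch
import torch.nn as nn

def diagonal_mask(rows: int, cols: int, num_diags: int) -> torch.Tensor:
    """Keep num_diags wrap-around diagonals; sparsity = 1 - num_diags/cols."""
    r = torch.arange(rows).unsqueeze(1)
    c = torch.arange(cols).unsqueeze(0)
    return (((c - r) % cols) < num_diags).float()

class DiagSparseLinear(nn.Module):
    def __init__(self, in_f, out_f, num_diags):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * in_f ** -0.5)
        self.register_buffer("mask", diagonal_mask(out_f, in_f, num_diags))

    def forward(self, x):
        # Dense matmul here for clarity; a structured kernel would exploit the diagonals.
        return x @ (self.weight * self.mask).t()

layer = DiagSparseLinear(512, 512, num_diags=51)   # ~90% sparse
print(layer(torch.randn(8, 512)).shape)            # torch.Size([8, 512])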
Authors:Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan, Ari Holtzman
Abstract:
Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only a 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents, since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models break down unexpectedly (AbsenceBench).
Chinese: 大型语言模型在检测文档中缺失信息方面表现不佳,新的AbsenceBench测试显示,Transformer注意力机制存在根本性局限,无法处理信息空白。
English: Large language models struggle to detect missing information in documents, as demonstrated by their poor performance on the new AbsenceBench test, which reveals a fundamental limitation in Transformer attention mechanisms that cannot process information gaps.
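Evaluating omission detection reduces to set comparison between the truly removed pieces and the model's predictions. A sketch with difflib standing in for the benchmark's own document diffing; the function names are ours, and treating "replace" opcodes as removals is a simplification.

import difflib

def removed_lines(original: str, edited: str):
    a, b = original.splitlines(), edited.splitlines()
    sm = difflib.SequenceMatcher(None, a, b)
    gone = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("delete", "replace"):   # lines present in the original but not the edit
            gone.extend(a[i1:i2])
    return set(gone)

def omission_f1(original: str, edited: str, predicted: set):
    truth = removed_lines(original, edited)
    if not truth or not predicted:
        return 0.0
    tp = len(truth & predicted)
    p, r = tp / len(predicted), tp / len(truth)
    return 2 * p * r / (p + r) if tp else 0.0

orig_doc = "one\ntwo\nthree\nfour"
edit_doc = "one\nthree\nfour"
print(omission_f1(orig_doc, edit_doc, {"two"}))   # 1.0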
Authors:Jie Zhu, Leye Wang
Abstract:
Since their introduction, text-to-image diffusion models have significantly influenced content creation due to their impressive generation capability. However, this capability depends on large-scale text-image datasets gathered from web platforms such as social media, posing substantial challenges in copyright compliance and personal privacy leakage. Though some efforts have been devoted to exploring approaches for auditing data provenance in text-to-image diffusion models, existing work either makes unrealistic assumptions, e.g., access to model internal knowledge such as intermediate results, or lacks reliable evaluation. To fill this gap, we propose a completely black-box auditing framework called Feature Semantic Consistency-based Auditing (FSCA). It utilizes two types of semantic connections within the text-to-image diffusion model for auditing, eliminating the need for access to internal knowledge. To demonstrate the effectiveness of our FSCA framework, we perform extensive experiments on the LAION-mi and COCO datasets and compare with eight state-of-the-art baseline approaches. The results show that FSCA surpasses previous baseline approaches across various metrics and different data distributions, showcasing its superiority. Moreover, we introduce a recall balance strategy and a threshold adjustment strategy, which collectively allow FSCA to reach a user-level accuracy of 90% in a real-world auditing scenario with only 10 samples/user, highlighting its strong auditing potential in real-world applications. Our code is made available at https://github.com/JiePKU/FSCA.
Chinese: 提出的FSCA框架通过利用语义一致性实现了文本到图像扩散模型的黑盒审计,无需访问模型内部知识,并在实际应用中展现出卓越的准确性。
English: The proposed FSCA framework enables black-box auditing of text-to-image diffusion models by leveraging semantic consistency, eliminating the need for internal model knowledge and achieving superior accuracy in real-world scenarios.
Authors:Jinhee Kim, Seoyeon Yoon, Taeho Lee, Joo Chan Lee, Kang Eun Jeon, Jong Hwan Ko
Abstract:
The deployment of deep neural networks on edge devices is a challenging task due to the increasing complexity of state-of-the-art models, requiring efforts to reduce model size and inference latency. Recent studies explore models operating at diverse quantization settings to find the optimal point that balances computational efficiency and accuracy. Truncation, an effective approach for achieving lower bit precision mapping, enables a single model to adapt to various hardware platforms with little to no cost. However, formulating a training scheme for deep neural networks to withstand the associated errors introduced by truncation remains a challenge, as the current quantization-aware training schemes are not designed for the truncation process. We propose TruncQuant, a novel truncation-ready training scheme allowing flexible bit precision through bit-shifting in runtime. We achieve this by aligning TruncQuant with the output of the truncation process, demonstrating strong robustness across bit-width settings, and offering an easily implementable training scheme within existing quantization-aware frameworks. Our code is released at https://github.com/a2jinhee/TruncQuant.
Chinese: TruncQuant是一种创新的训练方案,通过运行时位移使深度神经网络能灵活适应不同位精度,在多种硬件平台上均表现出强大鲁棒性,且易于集成到现有量化感知框架中。
English: TruncQuant is a novel training scheme that enables deep neural networks to adapt flexibly to various bit precisions through runtime bit-shifting, providing robust performance across different hardware platforms while being easily integrated into existing quantization-aware frameworks.
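Mechanically, truncation is an arithmetic right shift: store the weights once at the highest precision and shift down to the target bit-width at runtime. A sketch under that reading; TruncQuant's actual contribution is the training scheme that keeps accuracy intact after this shift, which is not reproduced here.

import numpy as np

def quantize(w, bits=8):
    # Symmetric quantization to signed integers
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(w / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q.astype(np.int32), scale

def truncate(q, scale, from_bits=8, to_bits=4):
    shift = from_bits - to_bits
    q_low = q >> shift                      # runtime bit-shift: no re-quantization pass needed
    return q_low, scale * (2 ** shift)      # rescale so dequantized values stay comparable

w = np.random.randn(5).astype(np.float32)
q8, s8 = quantize(w, 8)
q4, s4 = truncate(q8, s8)
print(w)
print(q4 * s4)                              # coarse 4-bit reconstruction of the same weights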
Authors:Manish Bhatt
Abstract:
The Bhatt Conjectures framework introduces rigorous, hierarchical benchmarks for evaluating AI reasoning and understanding, moving beyond pattern matching to assess representation invariance, robustness, and metacognitive self-awareness. The agentreasoning-sdk demonstrates practical implementation, revealing that current AI models struggle with complex reasoning tasks and highlighting the need for advanced evaluation protocols to distinguish genuine cognitive abilities from statistical inference.
https://github.com/mbhatt1/agentreasoning-sdk
Chinese: Bhatt猜想框架提出了评估AI推理能力的严格基准,而agentreasoning-sdk的实际应用揭示了当前模型在复杂任务中的不足,并强调需要更先进的评估方案来区分真实认知与统计推断。
English: The Bhatt Conjectures establish advanced benchmarks to assess AI's true reasoning capabilities beyond pattern recognition, while the agentreasoning-sdk implementation shows current models' limitations in complex tasks and the necessity for better evaluation methods.
Authors:Xiaoxin Lu, Ranran Haoran Zhang, Yusen Zhang, Rui Zhang
Abstract:
People get informed of daily task plans through diverse media involving both texts and images. However, most prior research focuses only on LLMs' capability for textual plan generation. The potential of large-scale models in providing text-image plans remains understudied. Generating high-quality text-image plans faces two main challenges: ensuring consistent alignment between the two modalities and keeping coherence among visual steps. To address these challenges, we propose a novel framework that generates and refines text-image plans step-by-step. At each iteration, our framework (1) drafts the next textual step based on the prediction history; (2) edits the last visual step to obtain the next one; (3) extracts PDDL-like visual information; and (4) refines the draft with the extracted visual information. The textual and visual steps produced in stages (4) and (2) then serve as inputs for the next iteration. Our approach offers a plug-and-play improvement to various backbone models, such as Mistral-7B, Gemini-1.5, and GPT-4o. To evaluate the effectiveness of our approach, we collect a new benchmark consisting of 1,100 tasks and their text-image pair solutions covering 11 daily topics. We also design and validate a new set of metrics to evaluate the multimodal consistency and coherence in text-image plans. Extensive experiment results show the effectiveness of our approach on a range of backbone models against competitive baselines. Our code and data are available at https://github.com/psunlpgroup/MPlanner.
Chinese: 本文提出了一种新颖框架,通过逐步生成和优化文本-图像计划来解决多模态对齐与连贯性挑战,并借助新基准和评估指标证明了该框架在不同骨干模型中的即插即用有效性。
English: This paper introduces a novel framework that generates and refines text-image plans step-by-step, addressing multimodal alignment and coherence challenges while demonstrating plug-and-play effectiveness across various backbone models through a new benchmark and metrics.
Authors:Xianlu Li, Nicolas Nadisic, Shaoguang Huang, Nikos Deligiannis, Aleksandra Pižurica
Abstract:
Subspace clustering has become widely adopted for the unsupervised analysis of hyperspectral images (HSIs). Recent model-aware deep subspace clustering methods often use a two-stage framework, involving the calculation of a self-representation matrix with O(n^2) complexity, followed by spectral clustering. However, these methods are computationally intensive, generally incorporate only local or only non-local spatial structure constraints, and their structural constraints fall short of effectively supervising the entire clustering process.
We propose a scalable, context-preserving deep clustering method based on basis representation, which jointly captures local and non-local structures for efficient HSI clustering. To preserve local structure (i.e., spatial continuity within subspaces), we introduce a spatial smoothness constraint that aligns clustering predictions with their spatially filtered versions. For non-local structure (i.e., spectral continuity), we employ a mini-cluster-based scheme that refines predictions at the group level, encouraging spectrally similar pixels to belong to the same subspace. Notably, these two constraints are jointly optimized to reinforce each other.
Specifically, our model is designed as a one-stage approach in which the structural constraints are applied to the entire clustering process. The time and space complexity of our method is O(n), making it applicable to large-scale HSI data. Experiments on real-world datasets show that our method outperforms state-of-the-art techniques. Our code is available at: https://github.com/lxlscut/SCDSC
Chinese: 该方法通过单阶段深度学习框架,以线性复杂度联合优化局部和非局部结构约束,在保持空间连续性和光谱相似性的同时显著提升了高光谱图像聚类性能。
English: The proposed one-stage deep clustering method efficiently captures both local and non-local structures in hyperspectral images with linear complexity, outperforming existing techniques through joint optimization of spatial and spectral constraints.
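The local-structure constraint has a compact form: penalize disagreement between the soft cluster predictions and a spatially filtered copy of themselves. A sketch using mean filtering as the spatial filter; the paper's exact filter, loss weighting, and whether the target is detached are our assumptions.

import torch
import torch.nn.functional as F

def spatial_smoothness_loss(probs):
    """probs: (1, K, H, W) soft cluster assignments over an HSI spatial grid."""
    smoothed = F.avg_pool2d(probs, kernel_size=3, stride=1, padding=1)  # local spatial filter
    # Pull each pixel's prediction toward its neighborhood consensus
    return F.mse_loss(probs, smoothed.detach())

probs = torch.softmax(torch.randn(1, 10, 64, 64), dim=1)
print(spatial_smoothness_loss(probs).item())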
Authors:Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, Jian Luan
Abstract:
Contrastive Language Audio Pretraining (CLAP) is a widely-used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP's advanced multilingual capabilities. Finally, multilingual sound and music understanding is evaluated across four languages. Checkpoints and Source: https://github.com/xiaomi-research/dasheng-glap.
Chinese: GLAP扩展了CLAP,增加了多语言和多领域能力,在音频文本检索中表现优异,并在语音及跨语言任务中显著超越现有方法。
English: GLAP extends CLAP by incorporating multilingual and multi-domain capabilities, achieving competitive performance in audio-text retrieval and excelling in speech and multilingual tasks.
Authors:Sadman Sadeed Omee, Lai Wei, Sourin Dey, Jianjun Hu
Abstract:
Crystalline materials can form different structural arrangements (i.e. polymorphs) with the same chemical composition, exhibiting distinct physical properties depending on how they were synthesized or the conditions under which they operate. For example, carbon can exist as graphite (soft, conductive) or diamond (hard, insulating). Computational methods that can predict these polymorphs are vital in materials science, as they help researchers understand stability relationships, guide synthesis efforts, and discover new materials with desired properties without extensive trial-and-error experimentation. However, effective crystal structure prediction (CSP) algorithms for inorganic polymorph structures remain limited. We propose ParetoCSP2, a multi-objective genetic algorithm for polymorphism CSP that incorporates an adaptive space group diversity control technique, preventing over-representation of any single space group in the population guided by a neural network interatomic potential. Using an improved population initialization method and performing iterative structure relaxation, ParetoCSP2 not only alleviates premature convergence but also achieves improved convergence speed. Our results show that ParetoCSP2 achieves excellent performance in polymorphism prediction, including a nearly perfect space group and structural similarity accuracy for formulas with two polymorphs but with the same number of unit cell atoms. Evaluated on a benchmark dataset, it outperforms baseline algorithms by factors of 2.46-8.62 for these accuracies and improves by 44.8\%-87.04\% across key performance metrics for regular CSP. Our source code is freely available at https://github.com/usccolumbia/ParetoCSP2.
Chinese: 该研究提出ParetoCSP2算法,通过自适应空间群多样性控制和神经网络势能指导,改进了无机多晶型物的晶体结构预测,在准确性和收敛速度上均显著优于基准算法。
English: The study introduces ParetoCSP2, a multi-objective genetic algorithm that enhances crystal structure prediction for inorganic polymorphs by incorporating adaptive space group diversity control and neural network potentials, achieving superior accuracy and convergence compared to baseline methods.
Authors:Kyung Rok Kim, Yansong Wang, Xiaocheng Li, Guanting Chen
Abstract:
With the recent rise of generative Artificial Intelligence (AI), the need to select high-quality datasets to improve machine learning models has garnered increasing attention. However, parts of this topic remain underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate datasets to minimize the population loss of our prediction model with high probability. Broadly speaking, we investigate when datasets from different sources can be effectively merged to enhance the predictive model's performance, and propose a practical algorithm with theoretical guarantees. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability. Numerical experiments demonstrate its effectiveness in both standard linear regression and broader machine learning applications. Code is available at https://github.com/kkrokii/collaborative_prediction.
Chinese: 本研究提出了一种实用算法,通过整合不同来源的数据集,在理论上保证以高概率降低总体损失,并在线性回归和更广泛的机器学习应用中通过数值实验验证了其有效性。
English: This research introduces a practical algorithm that merges datasets from various sources to minimize population loss with theoretical guarantees, validated through numerical experiments in linear regression and broader machine learning tasks.
Authors:Shijie Fang, Hang Yu, Qidi Fang, Reuben M. Aronson, Elaine S. Short
Abstract:
Learning from Demonstration (LfD) is a popular approach for robots to acquire new skills, but most LfD methods suffer from imperfections in human demonstrations. Prior work typically treats these suboptimalities as random noise. In this paper we study non-optimal behaviors in non-expert demonstrations and show that they are systematic, forming what we call demonstration sidetracks. Using a public space study with 40 participants performing a long-horizon robot task, we recreated the setup in simulation and annotated all demonstrations. We identify four types of sidetracks (Exploration, Mistake, Alignment, Pause) and one control pattern (one-dimension control). Sidetracks appear frequently across participants, and their temporal and spatial distribution is tied to task context. We also find that users' control patterns depend on the control interface. These insights point to the need for better models of suboptimal demonstrations to improve LfD algorithms and bridge the gap between lab training and real-world deployment. All demonstrations, infrastructure, and annotations are available at https://github.com/AABL-Lab/Human-Demonstration-Sidetracks.
Chinese: 本研究识别了人类机器人演示中系统性的非最优行为,称为"演示旁路",将其分为四种类型,并将其模式与任务情境和控制界面相关联,以改进演示学习算法。
English: This research identifies systematic non-optimal behaviors called "demonstration sidetracks" in human robot demonstrations, categorizing them into four types and linking their patterns to task context and control interfaces to improve Learning from Demonstration algorithms.
Authors:Mae Younes, Adnane Boukhayma
Abstract:
2D Gaussian Splatting (2DGS) has recently emerged as a promising method for novel view synthesis and surface reconstruction, offering better view-consistency and geometric accuracy than volumetric 3DGS. However, 2DGS suffers from severe aliasing artifacts when rendering at different sampling rates than those used during training, limiting its practical applications in scenarios requiring camera zoom or varying fields of view. We identify that these artifacts stem from two key limitations: the lack of frequency constraints in the representation and an ineffective screen-space clamping approach. To address these issues, we present AA-2DGS, an antialiased formulation of 2D Gaussian Splatting that maintains its geometric benefits while significantly enhancing rendering quality across different scales. Our method introduces a world space flat smoothing kernel that constrains the frequency content of 2D Gaussian primitives based on the maximal sampling frequency from training views, effectively eliminating high-frequency artifacts when zooming in. Additionally, we derive a novel object space Mip filter by leveraging an affine approximation of the ray-splat intersection mapping, which allows us to efficiently apply proper anti-aliasing directly in the local space of each splat.
Chinese: AA-2DGS通过引入世界空间平滑核和新型物体空间Mip滤波器,解决了二维高斯泼溅的混叠问题,在保持几何精度的同时显著提升了多尺度渲染质量。
English: AA-2DGS addresses aliasing issues in 2D Gaussian Splatting by introducing a world space smoothing kernel and a novel object space Mip filter, enhancing rendering quality across scales while preserving geometric accuracy.
Authors:Weibing Zheng, Laurah Turner, Jess Kropczynski, Murat Ozer, Tri Nguyen, Shane Halse
Abstract:
Clinical communication skills are critical in medical education, and practicing and assessing them at scale is challenging. Although LLM-powered clinical scenario simulations have shown promise in enhancing medical students' clinical practice, providing automated and scalable clinical evaluation that follows nuanced physician judgment is difficult. This paper combines fuzzy logic and Large Language Models (LLMs) and proposes LLM-as-a-Fuzzy-Judge to address the challenge of aligning the automated evaluation of medical students' clinical skills with subjective physician preferences. LLM-as-a-Fuzzy-Judge is an approach in which an LLM is fine-tuned to evaluate medical students' utterances within student-AI patient conversation scripts based on human annotations from four fuzzy sets, including Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction. The methodology of this paper starts with data collection from the LLM-powered medical education system and data annotation based on multidimensional fuzzy sets, followed by prompt engineering and supervised fine-tuning (SFT) of pre-trained LLMs using these human annotations. The results show that LLM-as-a-Fuzzy-Judge achieves over 80\% accuracy, with over 90\% on major criteria items, effectively leveraging fuzzy logic and LLMs to deliver interpretable, human-aligned assessment. This work suggests the viability of leveraging fuzzy logic and LLMs to align with human preferences, advances automated evaluation in medical education, and supports more robust assessment and judgment practices. The GitHub repository of this work is available at https://github.com/2sigmaEdTech/LLMAsAJudge
Chinese: 本文提出了一种结合模糊逻辑和大语言模型的LLM-as-a-Fuzzy-Judge方法,通过自动化评估医学生的临床沟通技能,实现了与医师主观判断相一致的、可解释的人工智能评价系统。
English: This paper introduces LLM-as-a-Fuzzy-Judge, a method combining fuzzy logic and large language models to provide automated, interpretable evaluation of medical students' clinical communication skills that aligns with nuanced physician judgments.
Authors:Hongyu Chen, Jiping Liu, Yong Wang, Jun Zhu, Dejun Feng, Yakun Xie
Abstract:
Unsupervised Domain Adaptation (UDA) has shown promise in effectively alleviating the performance degradation caused by domain gaps between source and target domains, and it can potentially be generalized to UAV object detection in adverse scenes. However, existing UDA studies are based on natural images or clear UAV imagery, and research focused on UAV imagery in adverse conditions is still in its infancy. Moreover, due to the unique perspective of UAVs and the interference from adverse conditions, these methods often fail to accurately align features and are influenced by limited or noisy pseudo-labels. To address this, we propose the first benchmark for UAV object detection in adverse scenes, the Statistical Feedback-Driven Threshold and Mask Adjustment Teacher-Student Framework (SF-TMAT). Specifically, SF-TMAT introduces a design called Dynamic Step Feedback Mask Adjustment Autoencoder (DSFMA), which dynamically adjusts the mask ratio and reconstructs feature maps by integrating training progress and loss feedback. This approach dynamically adjusts the learning focus at different training stages to meet the model's needs for learning features at varying levels of granularity. Additionally, we propose a unique Variance Feedback Smoothing Threshold (VFST) strategy, which statistically computes the mean confidence of each class and dynamically adjusts the selection threshold by incorporating a variance penalty term. This strategy improves the quality of pseudo-labels and uncovers potentially valid labels, thus mitigating domain bias. Extensive experiments demonstrate the superiority and generalization capability of the proposed SF-TMAT in UAV object detection under adverse scene conditions. The code is released at https://github.com/ChenHuyoo.
Chinese: 提出的SF-TMAT框架通过动态掩码调整和基于方差的阈值策略,有效提升恶劣场景下无人机目标检测的特征对齐与伪标签质量,实验证明其具有优越性能。
English: The proposed SF-TMAT framework introduces dynamic mask adjustment and variance-based thresholding to improve feature alignment and pseudo-label quality for UAV object detection in adverse conditions, demonstrating superior performance in experiments.
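A sketch of the statistical shape of VFST as the abstract describes it: one pseudo-label threshold per class, set from the class's mean confidence with a variance penalty and clipped to a safe range. The sign convention, the lambda weight, and the clipping bounds are our assumptions, not the paper's exact formula.

import numpy as np

def vfst_thresholds(confidences, labels, num_classes, lam=0.5, lo=0.25, hi=0.95):
    """confidences/labels: arrays over candidate pseudo-labels. One threshold per class."""
    th = np.full(num_classes, hi)                 # conservative default for unseen classes
    for c in range(num_classes):
        conf_c = confidences[labels == c]
        if len(conf_c) == 0:
            continue
        # High within-class variance lowers the threshold, admitting more candidates
        th[c] = conf_c.mean() - lam * conf_c.var()
    return np.clip(th, lo, hi)

conf = np.random.rand(1000)
lab = np.random.randint(0, 5, size=1000)
print(vfst_thresholds(conf, lab, num_classes=5))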
Authors:Ching Chang, Ming-Chih Lo, Wen-Chih Peng, Tien-Fu Chen
Abstract:
Multivariate time series data, collected across various fields such as manufacturing and wearable technology, exhibit states at multiple levels of granularity, from coarse-grained system behaviors to fine-grained, detailed events. Effectively segmenting and integrating states across these different granularities is crucial for tasks like predictive maintenance and performance optimization. However, existing time series segmentation methods face two key challenges: (1) the inability to handle multiple levels of granularity within a unified model, and (2) limited adaptability to new, evolving patterns in dynamic environments. To address these challenges, we propose PromptTSS, a novel framework for time series segmentation with multi-granularity states. PromptTSS uses a unified model with a prompting mechanism that leverages label and boundary information to guide segmentation, capturing both coarse- and fine-grained patterns while adapting dynamically to unseen patterns. Experiments show PromptTSS improves accuracy by 24.49% in multi-granularity segmentation, 17.88% in single-granularity segmentation, and up to 599.24% in transfer learning, demonstrating its adaptability to hierarchical states and evolving time series dynamics. Our code is available at https://github.com/blacksnail789521/PromptTSS.
Chinese: PromptTSS是一种新颖的时间序列分割框架,通过采用带有提示机制的统一模型,有效解决了现有方法无法处理多粒度状态和适应动态环境的问题,显著提升了分割精度和迁移学习能力。
English: PromptTSS is a novel framework that addresses the limitations of existing time series segmentation methods by using a unified model with a prompting mechanism to handle multi-granularity states and adapt to evolving patterns, significantly improving accuracy in segmentation and transfer learning tasks.
Authors:Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li
Abstract:
Human-machine interaction, particularly in prosthetic and robotic control, has seen progress with gesture recognition via surface electromyographic (sEMG) signals. However, classifying similar gestures that produce nearly identical muscle signals remains a challenge, often reducing classification accuracy. Traditional deep learning models for sEMG gesture recognition are large and computationally expensive, limiting their deployment on resource-constrained embedded systems. In this work, we propose WaveFormer, a lightweight transformer-based architecture tailored for sEMG gesture recognition. Our model integrates time-domain and frequency-domain features through a novel learnable wavelet transform, enhancing feature extraction. In particular, the WaveletConv module, a multi-level wavelet decomposition layer with depthwise separable convolution, ensures both efficiency and compactness. With just 3.1 million parameters, WaveFormer achieves 95% classification accuracy on the EPN612 dataset, outperforming larger models. Furthermore, when profiled on a laptop equipped with an Intel CPU, INT8 quantization achieves real-time deployment with a 6.75 ms inference latency.
Chinese: WaveFormer提出了一种基于Transformer的轻量级模型,通过可学习的小波变换融合时频特征,仅用310万参数就实现了95%的分类准确率,并能在嵌入式系统上实时部署。
English: WaveFormer introduces a lightweight transformer-based model that integrates time-frequency features through a learnable wavelet transform, achieving 95% accuracy with only 3.1 million parameters and enabling real-time deployment on embedded systems.
Authors:Simon Ghyselincks, Valeriia Okhmak, Stefano Zampini, George Turkiyyah, David Keyes, Eldad Haber
Abstract:
Visualizing the first few kilometers of the Earth's subsurface, a long-standing challenge gating a virtually inexhaustible list of important applications, is coming within reach through deep learning. Building on techniques of generative artificial intelligence applied to voxelated images, we demonstrate a method that extends surface geological data supplemented by boreholes to a three-dimensional subsurface region by training a neural network. The Earth's land area having been extensively mapped for geological features, the bottleneck of this or any related technique is the availability of data below the surface. We close this data gap in the development of subsurface deep learning by designing a synthetic data-generator process that mimics eons of geological activity such as sediment compaction, volcanic intrusion, and tectonic dynamics to produce a virtually limitless number of samples of the near lithosphere. A foundation model trained on such synthetic data is able to generate a 3D image of the subsurface from a previously unseen map of surface topography and geology, showing increasing fidelity with increasing access to borehole data, depicting such structures as layers, faults, folds, dikes, and sills. We illustrate the early promise of the combination of a synthetic lithospheric generator with a trained neural network model using generative flow matching. Ultimately, such models will be fine-tuned on data from applicable campaigns, such as mineral prospecting in a given region. Though useful in itself, a regionally fine-tuned model may be employed not as an end but as a means: as an AI-based regularizer in a more traditional inverse problem application, in which the objective function represents the mismatch of additional data with physical models with applications in resource exploration, hazard assessment, and geotechnical engineering.
Chinese: 通过生成式人工智能,利用地表数据和钻孔信息训练神经网络,结合模拟地质活动的合成数据,实现了对地球近地表三维结构的可视化建模。
English: Deep learning is enabling the visualization of the Earth's subsurface by using generative AI to create 3D models from surface data and boreholes, trained on synthetic geological data that mimics natural processes.
Authors:Linhao Yu, Xinguang Ji, Yahui Liu, Fanheng Kong, Chenxi Sun, Jingyuan Zhang, Hongzhi Zhang, V. W., Fuzheng Zhang, Deyi Xiong
Abstract:
Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to construct numerous and diverse descriptive sentences (\textit{i.e.}, key points) that thoroughly represent video content in an iterative way. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects' attributes, environment details, etc. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of MLLMs on the video captioning task. We evaluate more than 20 open- and closed-source MLLMs of varying sizes on MCTS-VCB. Results show that MCTS-VCB can effectively and comprehensively evaluate the video captioning capability, with Gemini-1.5-Pro achieving the highest F1 score of 71.2. Interestingly, we fine-tune InternVL2.5-8B with the AutoCaption-generated data, which helps the model achieve an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the effectiveness of AutoCaption. The code and data are available at https://github.com/tjunlp-lab/MCTS-VCB.
Chinese: AutoCaption是一个利用蒙特卡洛树搜索自动生成多样化视频描述的框架,它构建的MCTS-VCB基准能全面评估多模态大语言模型,并通过微调显著提升模型性能。
English: AutoCaption is an automatic framework using Monte Carlo Tree Search to generate diverse and detailed video captions, creating the MCTS-VCB benchmark that comprehensively evaluates Multimodal Large Language Models and enhances their performance through fine-tuning.
Authors:Sharvari Kamble
Abstract:
Sign Language Recognition (SLR) plays a crucial role in bridging the communication gap between the hearing-impaired community and society. This paper introduces SLRNet, a real-time webcam-based ASL recognition system using MediaPipe Holistic and Long Short-Term Memory (LSTM) networks. The model processes video streams to recognize both ASL alphabet letters and functional words. With a validation accuracy of 86.7%, SLRNet demonstrates the feasibility of inclusive, hardware-independent gesture recognition.
Chinese: SLRNet是一种基于网络摄像头的实时系统,结合MediaPipe Holistic和LSTM网络,以86.7%的准确率识别美国手语字母和功能词,实现了包容性的手势识别。
English: SLRNet is a real-time webcam-based system using MediaPipe Holistic and LSTM networks to recognize ASL alphabet letters and functional words with 86.7% accuracy, enabling inclusive gesture recognition.
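The described pipeline is small enough to sketch end to end: per-frame MediaPipe Holistic keypoints, an LSTM over the frame sequence, and a classification head. The 1662-dimensional input matches the commonly used flattened Holistic layout (pose, face, and two hands); all other dimensions are illustrative, not SLRNet's exact configuration.

import torch
import torch.nn as nn

class SLRNetSketch(nn.Module):
    def __init__(self, n_keypoint_feats=1662, hidden=128, n_classes=29):
        # 1662 = flattened MediaPipe Holistic landmarks: pose (33x4) + face (468x3) + 2 hands (21x3 each)
        super().__init__()
        self.lstm = nn.LSTM(n_keypoint_feats, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):               # x: (B, T, n_keypoint_feats), one keypoint vector per frame
        out, _ = self.lstm(x)
        return self.head(out[:, -1])    # classify from the last timestep's hidden state

model = SLRNetSketch()
logits = model(torch.randn(4, 30, 1662))   # batch of 30-frame clips
print(logits.shape)                        # torch.Size([4, 29])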
Authors:Changxin Ke, Rui Zhang, Shuo Wang, Li Ding, Guangli Li, Yuanbo Wen, Shuoming Zhang, Ruiyuan Xu, Jin Qin, Jiaming Guo, Chenxi Wang, Ling Li, Qi Guo, Yunji Chen
Abstract:
The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a demand for automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still fail to ensure functional equivalence in the translated code. In this paper, we propose a novel Mutual-Supervised Learning (MSL) framework for sequential-to-parallel code translation to address the functional equivalence issue. MSL consists of two models, a Translator and a Tester. Through an iterative loop consisting of Co-verify and Co-evolve steps, the Translator and the Tester mutually generate data for each other and improve collectively. The Tester generates unit tests to verify and filter functionally equivalent translated code, thereby evolving the Translator, while the Translator generates translated code as augmented input to evolve the Tester. Experimental results demonstrate that MuSL significantly enhances the performance of the base model: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91% and boosts Tester performance by 68.90%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at https://github.com/kcxain/musl.
Chinese: 本文提出了一种互监督学习(MSL)框架,通过翻译器与测试器之间的迭代协作确保功能等价性,显著提升了串行到并行代码转换的性能,并优于现有先进方法。
English: The paper introduces a Mutual-Supervised Learning (MSL) framework that enhances sequential-to-parallel code translation by ensuring functional equivalence through iterative collaboration between a Translator and a Tester, achieving significant performance improvements over existing methods.
Authors:Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, Zuozhu Liu
Abstract:
Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set 3D-RAD-T of 136,195 expert-aligned samples, showing that fine-tuning on this dataset could significantly enhance model performance. Our dataset and code, aiming to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding, are publicly available at https://github.com/Tang-xiaoxiao/M3D-RAD.
Chinese: 本文提出了3D-RAD这一基于CT扫描的大规模三维医学视觉问答数据集,包含六项多样化任务,揭示了现有视觉语言模型在复杂诊断推理中的局限性,并通过专家标注数据的微调实现了显著性能提升。
English: This paper introduces 3D-RAD, a large-scale dataset for 3D medical visual question answering using CT scans, featuring six diverse tasks that reveal the limitations of current vision-language models in complex diagnostic reasoning and demonstrate significant performance improvements through fine-tuning on expert-aligned data.
Authors:Namhoon Kim, Sara Fridovich-Keil
Abstract:
Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior remain poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for most tasks and signals, a simple regularized grid with interpolation trains faster and to higher quality than any INR with the same number of parameters. We also find limited settings where INRs outperform grids -- namely fitting signals with underlying lower-dimensional structure such as shape contours -- to guide future use of INRs towards the most advantageous applications. Code and synthetic signals used in our analysis are available at https://github.com/voilalab/INR-benchmark.
中文: 本研究评估了多种隐式神经表示和网格方法在不同任务中的表现,发现正则化网格在训练速度和质量上通常优于INRs,但在处理如形状轮廓等低维结构时,INRs展现出独特优势,为未来应用提供了指导。
English: This study evaluates various Implicit Neural Representations (INRs) and grid-based methods across multiple tasks, revealing that regularized grids generally outperform INRs in training speed and quality, except in cases involving lower-dimensional structures like shape contours, which highlight specific advantages for INRs.
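The paper's strongest baseline is simple enough to reproduce in spirit. Below is a minimal sketch, in our own code, of a regularized 1D grid with linear interpolation fit to a toy signal; the grid size, total-variation weight, and target signal are illustrative choices, not the paper's configuration.

```python
import torch

torch.manual_seed(0)
signal = lambda x: torch.sin(8 * x) + 0.3 * torch.sin(29 * x)  # toy 1D target

grid = torch.zeros(64, requires_grad=True)  # grid values are the only parameters
opt = torch.optim.Adam([grid], lr=1e-2)

def interp(grid, x):
    """Linearly interpolate a 1D grid at query points x in [0, 1]."""
    pos = x * (grid.numel() - 1)
    lo = pos.floor().long().clamp(max=grid.numel() - 2)
    w = pos - lo.float()
    return (1 - w) * grid[lo] + w * grid[lo + 1]

for step in range(2000):
    x = torch.rand(256)
    mse = ((interp(grid, x) - signal(x)) ** 2).mean()
    tv = (grid[1:] - grid[:-1]).abs().mean()  # smoothness regularizer
    loss = mse + 1e-3 * tv
    opt.zero_grad()
    loss.backward()
    opt.step()
```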
Authors:Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome
Abstract:
Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io
中文: 基础视觉编码器需要特征上采样以实现高分辨率任务,JAFAR被提出作为一种轻量级、灵活的上采样器,无需高分辨率监督即可提升空间分辨率,并在多种任务中优于现有方法。
English: Foundation Vision Encoders require feature upsampling for high-resolution tasks, and JAFAR is introduced as a lightweight, flexible upsampler that enhances spatial resolution without high-resolution supervision, outperforming existing methods across various tasks.
Authors:Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma
Abstract:
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.
中文摘要:UITron-Speech是首个端到端GUI智能体,通过处理语音指令和设备截图预测用户操作,利用合成数据集和混合模态训练策略突破文本限制,为人机交互提供更便捷的语音驱动解决方案。
English Summary: UITron-Speech is the first end-to-end GUI agent that processes speech instructions and screenshots to predict user actions, overcoming text-based limitations through synthesized datasets and novel training methods to enhance accessibility in human-computer interaction.
Authors:Hourun Zhu, Chengchao Shen
Abstract:
In spite of the strong performance achieved by LLMs, the costs of their deployment are unaffordable. For the compression of LLMs, gradient-based pruning methods present promising effectiveness. However, in these methods, the gradient computation with one-hot labels ignores the potential predictions on other words, thus missing key information for the generative capability of the original model. To address this issue, we introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model, thereby obtaining more accurate gradient information for pruning. Moreover, we find that, compared to attention modules, the predictions of LLMs are less sensitive to multilayer perceptron (MLP) modules, which account for more than $5 \times$ as many parameters (LLaMA3.2-1.2B). To this end, we focus on the pruning of MLP modules to significantly compress the LLM without obvious performance degradation. Experimental results on extensive zero-shot benchmarks demonstrate that our method significantly outperforms existing pruning methods. Furthermore, our method achieves very competitive performance among 1B-scale open source LLMs. The source code and trained weights are available at https://github.com/visresearch/SDMPrune.
中文: 本文在剪枝过程中引入自蒸馏损失以充分利用原始模型的预测来获得更准确的梯度信息,并专注于剪枝MLP模块,在1B规模的大模型中实现了优异的压缩效果和竞争力。
English: This paper introduces a self-distillation loss during pruning to better utilize the original model's predictions for accurate gradient computation and focuses on pruning MLP modules, achieving superior compression and competitive performance among 1B-scale LLMs.
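The central change is the loss from which pruning gradients are taken. A hedged sketch, assuming a standard KL-based distillation term (the paper's exact loss and temperature may differ):

```python
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence from the frozen original model's full next-token
    distribution to the pruned model's, replacing one-hot cross-entropy."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * t * t

# During pruning (not post-training), importance scores would be accumulated
# from gradients of this loss, e.g.:
#   loss = self_distillation_loss(pruned_model(x), original_model(x).detach())
#   loss.backward()   # weight.grad then scores MLP channels for removal
```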
Authors:Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee
Abstract:
While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector trained on our materials knowledge base and a re-ranking method prioritizing material concepts in token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average performance gain of $4\%$ and $2\%$ in the generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at https://github.com/yerimoh/MATTER
中文: 提出的MATTER标记化方法融合材料知识以防止碎片化并保持语义完整性,在生成和分类任务中分别实现了4%和2%的平均性能提升,优于现有方法。
English: The proposed MATTER tokenization method integrates materials knowledge to prevent fragmentation and preserve semantic integrity, outperforming existing approaches with average gains of 4% in generation and 2% in classification tasks.
Authors:Kun Zhang, Le Wu, Kui Yu, Guangyi Lv, Dacao Zhang
Abstract:
Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural languages. With the rapid development and wide-ranging applications (e.g., Agents, Embodied Intelligence), the robustness of LLMs has received increased attention. As the core brain of many AI applications, the robustness of LLMs requires that models should not only generate consistent content, but also ensure the correctness and stability of generated content when dealing with unexpected application scenarios (e.g., toxic prompts, limited noise domain data, out-of-distribution (OOD) applications, etc.). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide a comprehensive terminology of concepts and methods around this field and facilitate the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey paper. Then, based on the types of perturbed inputs, we organize this survey from the following perspectives: 1) Adversarial Robustness: tackling the problem that prompts are manipulated intentionally, such as noise prompts, long context, data attack, etc; 2) OOD Robustness: dealing with the unexpected real-world application scenarios, such as OOD detection, zero-shot transferring, hallucinations, etc; 3) Evaluation of Robustness: summarizing the new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field. Meanwhile, we also organize related works and provide an easy-to-search project (https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers) to support the community.
中文摘要:本综述论文系统梳理了大语言模型的鲁棒性研究,涵盖对抗性和分布外场景的应对策略,建立了相关术语体系和评估方法,并为该领域未来发展提供了研究方向与资源支持。
English Summary: This survey paper provides a comprehensive review of Large Language Models' robustness, covering adversarial and out-of-distribution scenarios while establishing formal definitions and evaluation methods to support future research.
Authors:Jaeho Lee, Atharv Chowdhary
Abstract:
Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a fact verification dataset. For each (evidence-backed) fact, we construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model's agreement and reasoning. The desired outcome is that the model asserts itself, maintaining consistent truth evaluation across both framings, rather than switching its evaluation to agree with the user. AssertBench isolates framing-induced variability from the model's underlying factual knowledge by stratifying results based on the model's accuracy on the same claims when presented neutrally. In doing so, this benchmark aims to measure an LLM's ability to "stick to its guns" when presented with contradictory user assertions about the same fact. The complete source code is available at https://github.com/achowd32/assert-bench.
中文摘要:AssertBench是一个新基准,用于评估当用户将同一证据支持的事实表述为正确或错误时,大型语言模型是否能保持一致性的事实判断,从而测试其不因迎合用户而改变评估的能力。
English Summary: AssertBench is a new benchmark that evaluates whether LLMs maintain consistent factual judgments when users frame the same evidence-backed facts as either correct or incorrect, testing their ability to resist switching evaluations to simply agree with users.
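The benchmark's construction reduces to a simple prompt template per fact. A sketch with our own (hypothetical) wording; only the two directional framings plus a neutral control are taken from the abstract:

```python
def framing_prompts(fact: str) -> dict:
    """Two directional framings of the same evidence-backed fact, plus a
    neutral control used to stratify by the model's baseline accuracy."""
    return {
        "user_claims_true": (
            f'I know for a fact that this statement is correct: "{fact}". '
            "Do you agree? Explain your reasoning."
        ),
        "user_claims_false": (
            f'I know for a fact that this statement is incorrect: "{fact}". '
            "Do you agree? Explain your reasoning."
        ),
        "neutral": f'Is the following statement true or false? "{fact}"',
    }

# A model "sticks to its guns" when its truth evaluation agrees across both
# framings and matches its answer to the neutral prompt.
```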
Authors:Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng
Abstract:
Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: https://github.com/HanzhiZhang-Ulrica/DAM.
中文摘要:本文提出了一种动态稀疏注意力机制,能在注意力图层面自适应分配掩码,在保持计算效率的同时与全注意力模型高度契合,实现了大规模语言模型的可扩展部署。
English Summary: This paper introduces a dynamic sparse attention mechanism that adaptively assigns attention masks at the map level, maintaining computational efficiency while achieving high alignment with full-attention models and enabling scalable deployment of large language models.
Authors:Haritz Puerto, Martin Gubri, Tommaso Green, Seong Joon Oh, Sangdoo Yun
Abstract:
Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is being shifted into Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested only for a limited breadth of application domains; we do not understand whether certain C-SEO methods would be effective for a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario where only one web document adopts a C-SEO method; in reality, multiple players are likely to competitively adopt the cutting-edge C-SEO techniques, drawing an analogy from the dynamics we have seen in SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and number of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among involved actors. Our experiments reveal that most current C-SEO methods are largely ineffective, contrary to reported results in the literature. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem. Our code and data are available at https://github.com/parameterlab/c-seo-bench and https://huggingface.co/datasets/parameterlab/c-seo-bench.
中文: 大型语言模型正在将搜索引擎转变为对话式搜索引擎,促使传统搜索引擎优化向对话式搜索引擎优化转变,但最新研究通过C-SEO Bench基准测试发现,现有方法在跨领域和竞争场景中效果有限,传统优化策略反而更具优势。
English: Large Language Models are evolving search engines into conversational systems, prompting the shift from traditional SEO to C-SEO, but current methods show limited effectiveness across domains and competitive scenarios, as revealed by the new C-SEO Bench benchmark.
Authors:Justin Asher
Abstract:
The expanding Lean 4 ecosystem poses challenges for navigating its vast libraries. This paper introduces LeanExplore, a search engine for Lean 4 declarations. LeanExplore enables users to semantically search for statements, both formally and informally, across select Lean 4 packages (including Batteries, Init, Lean, Mathlib, PhysLean, and Std). This search capability is powered by a hybrid ranking strategy, integrating scores from a multi-source semantic embedding model (capturing conceptual meaning from formal Lean code, docstrings, AI-generated informal translations, and declaration titles), BM25+ for keyword-based lexical relevance, and a PageRank-based score reflecting declaration importance and interconnectedness. The search engine is accessible via a dedicated website (https://www.leanexplore.com/) and a Python API (https://github.com/justincasher/lean-explore). Furthermore, the database can be downloaded, allowing users to self-host the service. LeanExplore integrates easily with LLMs via the model context protocol (MCP), enabling users to chat with an AI assistant about Lean declarations or utilize the search engine for building theorem-proving agents. This work details LeanExplore's architecture, data processing, functionalities, and its potential to enhance Lean 4 workflows and AI-driven mathematical research.
中文: LeanExplore 搜索引擎通过混合排名策略(结合概念嵌入、词法相关性和声明互连性)实现了对 Lean 4 库的语义搜索,解决了庞大生态系统的导航难题,支持网页访问、API调用及大语言模型集成。
English: The LeanExplore search engine addresses the challenge of navigating Lean 4's extensive libraries by enabling semantic searches through a hybrid ranking system that combines conceptual embeddings, lexical relevance, and declaration interconnectedness, accessible via web interface, API, and LLM integration.
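A plausible shape for the hybrid ranking, sketched in Python. The abstract does not give LeanExplore's actual weights or normalization, so every number below is an illustrative assumption:

```python
import numpy as np

def hybrid_score(semantic, bm25, pagerank, w=(0.5, 0.3, 0.2)):
    """Blend per-declaration scores from the embedding model, BM25+, and
    PageRank into one ranking score (min-max normalized per query)."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return w[0] * norm(semantic) + w[1] * norm(bm25) + w[2] * norm(pagerank)

# ranking = np.argsort(-hybrid_score(sem_scores, bm25_scores, pr_scores))
```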
Authors:Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu
Abstract:
We propose RedDebate, a novel multi-agent debate framework that leverages adversarial argumentation among Large Language Models (LLMs) to proactively identify and mitigate their own unsafe behaviours. Existing AI safety methods often depend heavily on costly human evaluations or isolated single-model assessment, both subject to scalability constraints and oversight risks. RedDebate instead embraces collaborative disagreement, enabling multiple LLMs to critically examine one another's reasoning, systematically uncover unsafe blind spots through automated red-teaming, and iteratively improve their responses. We further integrate distinct types of long-term memory that retain learned safety insights from debate interactions. Evaluating on established safety benchmarks such as HarmBench, we demonstrate the proposed method's effectiveness. Debate alone can reduce unsafe behaviours by 17.7%, and when combined with long-term memory modules, achieves reductions exceeding 23.5%. To our knowledge, RedDebate constitutes the first fully automated framework that combines multi-agent debates with red-teaming to progressively enhance AI safety without direct human intervention. (GitHub Repository: https://github.com/aliasad059/RedDebate)
Chinese: RedDebate是一种创新的多智能体辩论框架,通过大型语言模型之间的对抗性论证来自动识别和减少不安全行为,仅通过辩论即可降低17.7%的不安全行为,结合长期记忆模块后降低幅度超过23.5%。
English: RedDebate is an innovative multi-agent debate framework that uses adversarial argumentation among Large Language Models to automatically identify and mitigate unsafe behaviors, achieving a 17.7% reduction through debate alone and over 23.5% when combined with long-term memory modules.
Authors:Han Zhou, Qitong Xu, Yiheng Dong, Xin Yang
Abstract:
The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework.
Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image Understanding. Moreover, both humans and MLLMs face challenges in highly complex tasks like Puzzles and Spatial Imagination.
MANBench highlights the strengths and limitations of MLLMs, revealing that even advanced models fall short of achieving human-level performance across many domains. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities. The code and dataset are available at https://github.com/micdz/MANBench.
中文摘要:MANBench是一个双语基准测试,旨在严格评估多模态大语言模型(MLLMs)在多样化任务中的表现,结果表明尽管MLLMs在知识和图文理解等领域表现优异,但在深层跨模态推理和复杂任务方面仍未能达到人类水平。
English Summary: MANBench is a bilingual benchmark designed to rigorously evaluate Multimodal Large Language Models (MLLMs) across diverse tasks, revealing that while MLLMs excel in certain areas like Knowledge and Text-Image Understanding, they still fall short of human-level performance in deeper cross-modal reasoning and complex tasks.
Authors:Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, Sijia Liu
Abstract:
Large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens or textual segments that prompt self-evaluative reflection. We refer to these transition markers and reflective cues as "reflection tokens" (e.g., "wait", "but", "alternatively"). In this work, we treat reflection tokens as a "resource" and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand and manage this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a decoding strategy that dynamically modulates reflection token logits using a position-dependent triangular waveform. Experiments on MATH500, AIME2024/2025, and AMC2023 demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-8B), outperforming standard decoding and more recent approaches such as TIP (thought switching penalty) and S1. Codes are available at https://github.com/OPTML-Group/CyclicReflex.
Chinese: 大型推理模型利用反思标记指导多步推理,本研究提出CyclicReflex动态调度方法,通过优化标记分配策略,在多个数学基准测试中有效提升了模型的计算性能。
English: Large reasoning models use reflection tokens to guide multi-step reasoning, and this work introduces CyclicReflex, a dynamic scheduling method that optimizes their allocation to enhance computational performance across various benchmarks.
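The decoding rule is compact enough to sketch. Below, a triangular waveform biases the logits of reflection tokens as a function of decode position; the period, amplitude, and token list are illustrative assumptions, not the paper's tuned values:

```python
def triangular_wave(position: int, period: int = 256, amplitude: float = 2.0) -> float:
    """Triangle wave in [-amplitude, +amplitude] over the decode position."""
    phase = (position % period) / period       # in [0, 1)
    return amplitude * (4 * abs(phase - 0.5) - 1)

def modulate_logits(logits, position, reflection_token_ids):
    """Cyclically boost or suppress reflection tokens ("wait", "but", ...)."""
    bias = triangular_wave(position)
    for tok in reflection_token_ids:
        logits[tok] += bias
    return logits
```

Applied at every decoding step, this alternates between encouraging reflection (positive bias) and discouraging it (negative bias), steering generation between the over- and under-reflection regimes the paper identifies.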
Authors:Henrik Abgaryan, Tristan Cazenave, Ararat Harutyunyan
Abstract:
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet their direct application to NP-hard combinatorial problems (CPs) remains underexplored. In this work, we systematically investigate the reasoning abilities of LLMs on a variety of NP-hard combinatorial optimization tasks and introduce ACCORD: Autoregressive Constraint-satisfying generation for COmbinatorial optimization with Routing and Dynamic attention. ACCORD features a novel dataset representation and model architecture that leverage the autoregressive nature of LLMs to dynamically enforce feasibility constraints, coupled with attention-based routing to activate problem-specific LoRA modules. We also present the ACCORD-90k supervised dataset, covering six NP-hard combinatorial problems: TSP, VRP, Knapsack, FlowShop, JSSP, and BinPacking. Extensive experiments demonstrate that our ACCORD model, built on an 8B-parameter Llama backbone, consistently outperforms standard prompting and input-output methods, even when compared to much larger LLMs, such as gpt-4. Ablation studies further show that our output structure enhances solution feasibility. To the best of our knowledge, this is the first large-scale, end-to-end framework for exploring the applications of LLMs to a broad spectrum of combinatorial optimization problems. The codes are publicly available at https://github.com/starjob42/ACCORD
中文: 本文提出ACCORD框架,通过动态满足可行性约束和基于注意力的路由激活专用模块,显著提升大语言模型解决NP难组合优化问题的能力,实验表明其性能优于标准方法及更大规模模型。
English: This paper introduces ACCORD, a novel framework that enhances LLMs' ability to solve NP-hard combinatorial problems by dynamically enforcing feasibility constraints and using attention-based routing to activate specialized modules, demonstrating superior performance over standard methods even with smaller models.
Authors:Jiaqi Zhao, Weili Guan, Ming Li, Miao Zhang
Abstract:
Existing post-training quantization methods for large language models (LLMs) achieve remarkable success. However, the increasingly marginal performance gains suggest that existing quantization strategies are insufficient to support the development of more compressed models. To inspire new directions for future research, this paper introduces the concept of null space into LLMs quantization. We argue that the quantization error can be effectively alleviated by constraining the post-quantization weight perturbation to lie within the null space of input activations. To prove this idea, we propose a plug-and-play null space projection module for existing milestone PTQ baselines named Q2N. Specifically, we first design an efficient and accurate null space projection approximation method tailored to the characteristics of LLMs. Subsequently, we theoretically derive a closed-form solution for an equivalent vector of the obtained projection matrix, which satisfies practical inference conditions while avoiding additional memory overhead. Extensive experiments are conducted on various state-of-the-art LLMs (LLaMA3, DeepSeek, Qwen3) and baselines, demonstrating the effectiveness of both our Q2N and the perspective of null space optimization for LLMs quantization. We view this paper as a first step toward further alleviating quantization error based on the insights of null space, and hope it inspires future researchers to design more advanced quantization methods. Codes are available at https://github.com/zjq0455/q2n.
中文摘要:本文首次将零空间概念引入大语言模型量化领域,提出可即插即用的Q2N模块,通过将量化后权重扰动约束在输入激活的零空间内有效降低误差,并在多个前沿模型上通过实验验证了其有效性。
English Summary: This paper introduces the concept of null space to large language model quantization, proposing a plug-and-play Q2N module that reduces quantization error by constraining weight perturbations within input activations' null space, validated through extensive experiments on leading models.
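The null-space constraint can be illustrated with a generic SVD construction. The paper's approximation is tailored to LLMs and avoids materializing the projection matrix, so treat the following as a sketch of the idea only:

```python
import torch

def nullspace_projector(X, rank_tol=1e-5):
    """P such that dW @ P @ x ≈ 0 for x in the row space of the calibration
    activations X ([n_samples, d_in])."""
    _, S, Vh = torch.linalg.svd(X, full_matrices=True)
    rank = int((S > rank_tol * S.max()).sum())
    V_null = Vh[rank:].T                 # basis of the activation null space
    return V_null @ V_null.T             # [d_in, d_in] projection matrix

# Constraining the post-quantization perturbation to the null space:
#   dW = W_quantized - W
#   W_corrected = W + dW @ nullspace_projector(X_calibration)
```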
Authors:Linjie Li, Zhenyu Wu, Yang Ji
Abstract:
Class-incremental learning (CIL) requires deep learning models to continuously acquire new knowledge from streaming data while preserving previously learned information. Recently, CIL based on pre-trained models (PTMs) has achieved remarkable success. However, prompt-based approaches suffer from prompt overwriting, while adapter-based methods face challenges such as dimensional misalignment between tasks. While the idea of expert fusion in Mixture of Experts (MoE) can help address dimensional inconsistency, both expert and routing parameters are prone to being overwritten in dynamic environments, making MoE challenging to apply directly in CIL. To tackle these issues, we propose a mixture of task-specific experts (MoTE) framework that effectively mitigates the miscalibration caused by inconsistent output dimensions across tasks. Inspired by the weighted feature fusion and sparse activation mechanisms in MoE, we introduce task-aware expert filtering and reliable expert joint inference during the inference phase, mimicking the behavior of routing layers without inducing catastrophic forgetting. Extensive experiments demonstrate the superiority of our method without requiring an exemplar set. Furthermore, the number of tasks in MoTE scales linearly with the number of adapters. Building on this, we further explore the trade-off between adapter expansion and model performance and propose the Adapter-Limited MoTE. The code is available at https://github.com/Franklilinjie/MoTE.
Chinese: 本文提出了一种任务特定专家混合(MoTE)框架,通过任务感知的专家筛选和联合推理机制,有效解决了类增量学习中的提示覆盖和维度不匹配问题,无需样本集即可避免灾难性遗忘。
English: This paper introduces a Mixture of Task-specific Experts (MoTE) framework to address challenges in class-incremental learning, such as prompt overwriting and dimensional misalignment, by leveraging task-aware expert filtering and joint inference to prevent catastrophic forgetting without needing exemplars.
Authors:Zoher Kachwala, Danishjeet Singh, Danielle Yang, Filippo Menczer
Abstract:
As image generators produce increasingly realistic images, concerns about potential misuse continue to grow. Supervised detection relies on large, curated datasets and struggles to generalize across diverse generators. In this work, we investigate the use of pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. While off-the-shelf VLMs exhibit some task-specific reasoning and chain-of-thought prompting offers gains, we show that task-aligned prompting elicits more focused reasoning and significantly improves performance without fine-tuning. Specifically, prefixing the model's response with the phrase "Let's examine the style and the synthesis artifacts" -- a method we call zero-shot-s$^2$ -- boosts Macro F1 scores by 8%-29%. These gains are consistent for two widely used open-source models and across three recent, diverse datasets spanning human faces, objects, and animals with images generated by 16 different models -- demonstrating strong generalization. We further evaluate the approach across three additional model sizes and observe improvements in most dataset-model combinations -- suggesting robustness to model scale. Surprisingly, self-consistency, a behavior previously observed in language reasoning, where aggregating answers from diverse reasoning paths improves performance, also holds in this setting. Even here, zero-shot-s$^2$ scales better than chain-of-thought in most cases -- indicating that it elicits more useful diversity. Our findings show that task-aligned prompts elicit more focused reasoning and enhance latent capabilities in VLMs, like the detection of AI-generated images -- offering a simple, generalizable, and explainable alternative to supervised methods. Our code is publicly available on github: https://github.com/Zoher15/Zero-shot-s2.
中文: 本研究提出了零样本s²方法,通过任务对齐提示显著提升了视觉语言模型在无需微调的情况下检测AI生成图像的能力,并在多种数据集和模型上展现出强大的泛化性。
English: This study introduces zero-shot-s², a task-aligned prompting method that significantly enhances Vision-Language Models' ability to detect AI-generated images without fine-tuning, demonstrating strong generalization across diverse datasets and models.
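The method itself is a one-line intervention: seed the model's response with the cue quoted in the abstract. The scaffolding below is ours; only the prefix string is taken from the paper:

```python
S2_PREFIX = "Let's examine the style and the synthesis artifacts"

def zero_shot_s2_messages(question: str = "Is this image real or AI-generated?"):
    """Task-aligned prompting: seed the assistant turn with the s^2 cue so the
    VLM reasons about style and synthesis artifacts before answering."""
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": S2_PREFIX},  # model continues from here
    ]
```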
Authors:Chaitanya Ravuri, Saman Amarasinghe
Abstract:
Modern code-generation LLMs can already solve a large fraction of programming problems, yet they still hallucinate subtle bugs that make their outputs unsafe for autonomous deployment. We present functional clustering, a black-box wrapper that eliminates nearly all hallucination-induced errors while providing a tunable confidence score. The wrapper samples many candidate programs, executes each on a self-generated test suite, and clusters candidates whose I/O behavior is identical; the empirical mass of the largest cluster serves as an exact confidence estimate. A single scalar threshold on this estimate lets users trade coverage for reliability with exponential guarantees. On LiveCodeBench our verifier preserves baseline pass@1 on solvable tasks yet slashes the error rate of returned answers from ~65% to 2%, and drives it to 0% at a conservative threshold while still answering 15.6% of prompts. Manual audits show that the few residual mistakes stem from prompt misinterpretation, not random generation noise, narrowing future work to specification clarity. Because the method requires only sampling and sandbox execution, it applies unchanged to closed-source APIs and future models, offering a practical path toward dependable, autonomous code generation. Our code is available on Github (https://github.com/20ChaituR/functional-clustering).
中文摘要:功能聚类是一种黑盒包装方法,通过采样和测试候选程序,几乎消除了代码生成大语言模型中的幻觉错误,并提供可调置信度分数,使用户能够以指数级保证在覆盖率和可靠性之间进行权衡。
English Summary: Functional clustering is a black-box wrapper that eliminates nearly all hallucination-induced errors in code-generation LLMs by sampling and testing candidate programs, providing a tunable confidence score to trade coverage for reliability with exponential guarantees.
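The wrapper is easy to reproduce in outline. A hedged sketch, with a plain subprocess standing in for the paper's sandboxed execution:

```python
import subprocess, tempfile
from collections import defaultdict

def behavior_signature(program: str, test_inputs: list[str]) -> tuple:
    """Run a candidate on each test input; its I/O behavior is its signature."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    outputs = []
    for stdin in test_inputs:
        try:
            r = subprocess.run(["python", path], input=stdin, timeout=5,
                               capture_output=True, text=True)
            outputs.append(r.stdout)
        except subprocess.TimeoutExpired:
            outputs.append("<timeout>")
    return tuple(outputs)

def select_with_confidence(candidates: list[str], test_inputs: list[str]):
    """Cluster candidates by identical I/O; the largest cluster's empirical
    mass is the confidence estimate that the paper thresholds."""
    clusters = defaultdict(list)
    for prog in candidates:
        clusters[behavior_signature(prog, test_inputs)].append(prog)
    best = max(clusters.values(), key=len)
    return best[0], len(best) / len(candidates)

# answer, confidence = select_with_confidence(samples, self_generated_tests)
# return answer only if confidence >= threshold  (the coverage/reliability knob)
```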
Authors:Jorge Martinez-Gil
Abstract:
Detecting code clones is relevant to software maintenance and code refactoring. This challenge still presents unresolved cases, mainly when structural similarity does not reflect functional equivalence, though recent code models show promise. Therefore, this research aims to systematically measure the performance of several newly introduced small code models in classifying code pairs as clones or non-clones. The evaluation is based on five datasets: BigCloneBench, CodeJam, Karnalim, POJ104, and PoolC, as well as six code models: CodeBERT, GraphCodeBERT, Salesforce T5, UniXCoder, PLBART, and Polycoder. Most models performed well across standard metrics, including accuracy, precision, recall, and F1-score. However, a marginal fraction of clones remains challenging to detect, especially when the code looks similar but performs different operations. The source code that illustrates our approach is available at: https://github.com/jorge-martinez-gil/small-code-models
Chinese: 本研究评估了六种小型代码模型在五个数据集上的代码克隆检测性能,发现尽管大多数模型表现良好,但结构相似而功能不同的少量克隆仍难以准确识别。
English: This study evaluates the performance of six small code models in detecting code clones across five datasets, finding that while most models achieve strong results, a small subset of functionally different but structurally similar clones remains challenging to identify.
Authors:Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Zhuoyun Yu, Shuofei Qiao, Jintian Zhang, Da Zheng, Yuren Mao, Yunjun Gao, Huajun Chen, Ningyu Zhang
Abstract:
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science. Code is at https://github.com/innovatingAI/AutoMind.
中文摘要:AutoMind是一种自适应大型语言模型智能体框架,通过融合专家知识、策略性解决方案探索和动态编码,在自动化数据科学基准测试中展现出卓越性能。
English Summary: AutoMind is an adaptive LLM-agent framework that enhances automated data science by integrating expert knowledge, strategic solution exploration, and dynamic coding, achieving superior performance on benchmarks.
Authors:Julius Berner, Miguel Liu-Schiaffini, Jean Kossaifi, Valentin Duruisseaux, Boris Bonev, Kamyar Azizzadenesheli, Anima Anandkumar
Abstract:
A wide range of scientific problems, such as those described by continuous-time dynamical systems and partial differential equations (PDEs), are naturally formulated on function spaces. While function spaces are typically infinite-dimensional, deep learning has predominantly advanced through applications in computer vision and natural language processing that focus on mappings between finite-dimensional spaces. Such fundamental disparities in the nature of the data have limited neural networks from achieving a comparable level of success in scientific applications as seen in other fields. Neural operators are a principled way to generalize neural networks to mappings between function spaces, offering a pathway to replicate deep learning's transformative impact on scientific problems. For instance, neural operators can learn solution operators for entire classes of PDEs, e.g., physical systems with different boundary conditions, coefficient functions, and geometries. A key factor in deep learning's success has been the careful engineering of neural architectures through extensive empirical testing. Translating these neural architectures into neural operators allows operator learning to enjoy these same empirical optimizations. However, prior neural operator architectures have often been introduced as standalone models, not directly derived as extensions of existing neural network architectures. In this paper, we identify and distill the key principles for constructing practical implementations of mappings between infinite-dimensional function spaces. Using these principles, we propose a recipe for converting several popular neural architectures into neural operators with minimal modifications. This paper aims to guide practitioners through this process and details the steps to make neural operators work in practice. Our code can be found at https://github.com/neuraloperator/NNs-to-NOs
中文摘要:本文提出神经算子作为神经网络向无限维函数空间映射的原则性扩展,提供了一个实用框架,可将现有神经架构应用于科学计算领域,如求解偏微分方程。
English summary: This paper introduces neural operators as a principled extension of neural networks to handle mappings between infinite-dimensional function spaces, providing a practical framework to adapt existing neural architectures for scientific applications like solving partial differential equations.
Authors:Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, Shijie Xuyang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Abstract:
Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface $L(N,D)$, Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla's law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, improving upon Chinchilla's law by reducing extrapolation error by 433\%. This allows for the reliable evaluation of competing training strategies across all $(N,D)$ settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours. We are comprehensively open-sourcing all models, data, results, and logs at https://github.com/Farseer-Scaling-Law/Farseer to foster further research.
中文: Farseer提出了一种改进的扩展定律,显著提升了大型语言模型训练的预测准确性,能够从小规模实验可靠地推断大规模性能,相比Chinchilla定律将外推误差降低了433%。
English: Farseer introduces a refined scaling law that significantly improves predictive accuracy for large language model training, enabling reliable extrapolation from small-scale experiments to large-scale performance and reducing extrapolation error by 433% compared to Chinchilla's law.
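Farseer's refined functional form is not reproduced in the abstract, so the sketch below substitutes the classic Chinchilla-style surface purely to illustrate the workflow of fitting a loss surface $L(N,D)$ on small-scale runs and extrapolating; every constant is an illustrative assumption:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_surface(ND, E, A, alpha, B, beta):
    """Chinchilla-style stand-in: L(N, D) = E + A/N^alpha + B/D^beta."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# N_obs, D_obs: model sizes and token counts of small-scale runs; L_obs: losses
# params, _ = curve_fit(loss_surface, (N_obs, D_obs), L_obs,
#                       p0=[1.7, 400.0, 0.34, 410.0, 0.28], maxfev=20000)
# L_pred = loss_surface((N_prod, D_prod), *params)  # extrapolate to production
```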
Authors:Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
Abstract:
In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95\% and CUDA latency by 78\%, while maintaining 94\% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.
中文摘要:CDPruner提出了一种无需训练、与模型无关的视觉令牌剪枝方法,通过行列式点过程最大化条件多样性,在显著降低多模态大语言模型计算成本的同时保持了性能表现。
English Summary: CDPruner introduces a training-free, model-agnostic visual token pruning method that maximizes conditional diversity using determinantal point processes, significantly reducing computational costs while preserving performance across multimodal large language models.
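The selection step is standard greedy MAP inference for a DPP over a quality-weighted similarity kernel. In the sketch below (our construction, not the paper's exact kernel), `relevance` stands for positive instruction-conditioned scores, e.g. the similarity between each visual token and the instruction embedding:

```python
import torch

def greedy_dpp_select(features, relevance, k):
    """features: [n, d] visual tokens; relevance: [n] positive scores.
    Greedy MAP for a DPP with kernel  K = diag(r) @ S @ diag(r)."""
    f = torch.nn.functional.normalize(features, dim=-1)
    kernel = relevance[:, None] * (f @ f.T) * relevance[None, :]
    cis = torch.zeros(k, f.size(0))          # incremental Cholesky rows
    d = kernel.diagonal().clone()            # marginal gains
    selected = []
    for i in range(k):
        j = int(d.argmax())
        selected.append(j)
        e = (kernel[j] - cis[:i].T @ cis[:i, j]) / d[j].sqrt()
        cis[i] = e
        d = (d - e**2).clamp_min(1e-10)
        d[selected] = -float("inf")          # never reselect chosen tokens
    return sorted(selected)

# keep = greedy_dpp_select(visual_tokens, instruction_relevance, k=64)
```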
Authors:Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
Abstract:
In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning -- a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits -- low entity fidelity, weak relations, and clutter -- with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
中文: 本文提出知识图像生成作为新任务及MMMG基准,通过评估16个领先模型揭示了当前AI在多模态推理方面的显著不足。
English: This paper introduces knowledge image generation as a new task and the MMMG benchmark to evaluate multimodal reasoning in AI models, revealing significant gaps in current systems through comprehensive testing of 16 leading models.
Authors:Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng
Abstract:
Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
中文摘要:本研究提出了一个全面、专业标注的中文有害内容检测基准,涵盖六大类别并基于真实数据构建,同时通过知识增强基线方法,使较小模型能达到与顶尖大语言模型相媲美的性能。
English Summary: This study introduces a comprehensive, professionally annotated benchmark for Chinese harmful content detection, covering six categories and utilizing real-world data, along with a knowledge-augmented baseline that enhances smaller models' performance to match state-of-the-art LLMs.
Authors:Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng
Abstract:
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline that addresses these challenges through three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
中文: 本文提出SWE-Factory自动化流程,通过集成多智能体环境构建、标准化退出码评分和自动验证,高效创建大规模GitHub问题解决数据集,在多种编程语言中实现了高精度和成本效益。
English: This paper introduces SWE-Factory, an automated pipeline that efficiently constructs large-scale GitHub issue resolution datasets by integrating multi-agent environment setup, standardized exit-code grading, and automated validation, achieving high accuracy and cost-effectiveness across multiple programming languages.
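The exit-code-based grading and fail2pass check are simple to sketch; the test command, timeout, and patch-handling callables below are illustrative:

```python
import subprocess

def grade(repo_dir: str, test_cmd: list[str], timeout: int = 1800) -> str:
    """Grade by process exit status instead of parsing framework-specific logs."""
    try:
        r = subprocess.run(test_cmd, cwd=repo_dir, timeout=timeout,
                           capture_output=True)
    except subprocess.TimeoutExpired:
        return "fail"
    return "pass" if r.returncode == 0 else "fail"

def fail2pass(repo_dir, test_cmd, apply_patch, revert_patch):
    """A valid instance must fail before the gold patch and pass after it."""
    before = grade(repo_dir, test_cmd)
    apply_patch(repo_dir)
    after = grade(repo_dir, test_cmd)
    revert_patch(repo_dir)
    return before == "fail" and after == "pass"
```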
Authors:Boaz Lavon, Shahar Katz, Lior Wolf
Abstract:
We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance (EG-CFG), dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions. EG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions. Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming tasks. Our code is available at: https://github.com/boazlavon/eg_cfg
中文: 我们提出了一种新颖的神经代码生成方法——执行引导的无分类器引导(EG-CFG),通过在推理过程中动态整合实时执行反馈来引导模型生成可执行代码,在各类编程任务中实现了最先进的性能。
English: We introduce Execution-Guided Classifier-Free Guidance (EG-CFG), a novel neural code generation method that dynamically integrates real-time execution feedback during inference to guide the model toward producing executable code, achieving state-of-the-art performance across diverse coding tasks.
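The core loop is sketched below in our own code: per line, sample candidate completions, execute them against the test cases, and fold the observed execution signals back into the prompt. All names are hypothetical stand-ins; the paper's actual guidance additionally combines conditioned and unconditioned logits in a classifier-free-guidance style, which is omitted here.

```python
# Hedged sketch of execution-guided generation in the spirit of EG-CFG.
# `model` and `execute` are hypothetical interfaces; only the loop structure
# (line-level beam sampling -> execution -> prompt feedback) follows the abstract.

def generate_with_execution_feedback(model, prompt, tests, max_lines=50, beams=4):
    code, feedback = "", ""
    for _ in range(max_lines):
        candidates = model.sample_next_lines(prompt + feedback + code, n=beams)
        signals = [(line, execute(code + line, tests)) for line in candidates]
        best_line, best_signal = max(signals, key=lambda s: s[1].tests_passed)
        code += best_line
        # Refresh the execution signal at each line boundary, keeping it
        # consistent for all tokens within the line.
        feedback = f"\n# execution feedback: {best_signal.summary}\n"
        if best_signal.all_passed:
            break
    return code
```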
Authors:Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
Abstract:
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/duo
中文:Duo方法通过将高斯扩散技术迁移到均匀状态离散扩散模型中,采用课程学习策略加速训练,并引入离散一致性蒸馏实现快速少步生成,在部分基准测试中超越了自回归模型的性能。
English: The Duo method enhances uniform-state discrete diffusion models by transferring techniques from Gaussian diffusion, using curriculum learning to accelerate training and discrete consistency distillation to enable fast few-step generation, outperforming autoregressive models on some benchmarks.
Authors:Zhao Zhang, Yutao Cheng, Dexiang Hong, Maoke Yang, Gonglei Shi, Lei Ma, Hui Zhang, Jie Shao, Xinglong Wu
Abstract:
Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal. Commercial systems, like Canva Magic Design, rely on vast template libraries, which are impractical to replicate. In this paper, we introduce CreatiPoster, a framework that generates editable, multi-layer compositions from optional natural-language instructions or assets. A protocol model, an RGBA large multimodal model, first produces a JSON specification detailing every layer (text or asset) with precise layout, hierarchy, content and style, plus a concise background prompt. A conditional background model then synthesizes a coherent background conditioned on the rendered foreground layers. We construct a benchmark with automated metrics for graphic-design generation and show that CreatiPoster surpasses leading open-source approaches and proprietary commercial systems. To catalyze further research, we release a copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports diverse applications such as canvas editing, text overlay, responsive resizing, multilingual adaptation, and animated posters, advancing the democratization of AI-assisted graphic design. Project homepage: https://github.com/graphic-design-ai/creatiposter
中文摘要:CreatiPoster是一个通过AI生成可编辑多层图形设计的框架,它基于用户输入创建专业级排版,在自动布局和视觉质量上超越现有工具,并支持响应式调整与动画等多样化应用。
English Summary: CreatiPoster is an AI framework that generates editable, multi-layer graphic designs from user inputs, outperforming existing tools by producing professional-quality compositions while enabling diverse applications like responsive resizing and animation.
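To illustrate the protocol model's output, here is a hypothetical example of a layer-wise specification. The abstract confirms that the JSON carries layout, hierarchy, content, style, and a concise background prompt, but the field names below are our guesses:

```python
poster_spec = {
    "canvas": {"width": 1080, "height": 1440},
    "background_prompt": "soft gradient studio backdrop, warm tones",
    "layers": [  # rendered bottom-up by z-order
        {"type": "asset", "content": "user_logo.png",
         "bbox": [60, 60, 260, 160], "z": 2},
        {"type": "text", "content": "Summer Sale",
         "bbox": [120, 400, 960, 560], "z": 3,
         "style": {"font": "serif-bold", "size": 96, "color": "#FFFFFF"}},
    ],
}
```

The conditional background model would then synthesize a background image consistent with these rendered foreground layers.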
Authors:Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu
Abstract:
Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue by curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectories. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs): one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All codes and data will be released via https://github.com/NEUIR/ReCUT.
Chinese: ReCUT方法通过逐步探索生成多样化推理路径并训练分别优化准确性和简洁性的双模型,在保持或提升推理准确率的同时将推理链长度缩减30-50%。
English: The ReCUT method enhances LLM reasoning by generating diverse stepwise paths and training dual specialized models for accuracy and brevity, achieving 30-50% shorter reasoning chains without compromising accuracy.
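The final merging step described above is plain parameter interpolation. Below is a minimal PyTorch sketch, assuming both fine-tuned models share one architecture; the interpolation weight `lam` is an illustrative choice, not the paper's setting.

```python
import torch

def interpolate_models(model_acc, model_short, lam=0.5):
    """Blend an accuracy-tuned and a brevity-tuned model's parameters."""
    acc_sd, short_sd = model_acc.state_dict(), model_short.state_dict()
    merged = {}
    for name, w_acc in acc_sd.items():
        if w_acc.is_floating_point():
            merged[name] = lam * w_acc + (1.0 - lam) * short_sd[name]
        else:
            merged[name] = w_acc  # keep integer buffers (e.g., counters) as-is
    return merged

# Usage: model.load_state_dict(interpolate_models(m_acc, m_short, lam=0.5))
```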
Authors:Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou
Abstract:
Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository: https://github.com/yhy-2000/VideoDeepResearch.
中文: VideoExplorer提出了"以视频思考"的框架,通过迭代地提出子问题、定位相关片段并进行任务导向、时间上可扩展的视频理解来回答问题,并借助两阶段训练流程,在多个长视频理解与推理基准上显著超越现有基线。
English: VideoExplorer is a framework built on the principle of "thinking with video" that iteratively formulates sub-questions, grounds relevant moments, and performs task-oriented, temporally scalable understanding, surpassing existing baselines on long-video understanding and reasoning benchmarks.
Authors:Hang Zhang, Xiang Chen, Renjiu Hu, Rongguang Wang, Jinwei Zhang, Min Liu, Yaonan Wang, Gaolei Li, Xinxing Cheng, Jinming Duan
Abstract:
Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. Label supervision further enhances accuracy, enabling efficient and precise nonlinear alignment of unseen scans. However, images with sparse features amid large smooth regions, such as retinal vessels, introduce aperture and large-displacement challenges that unsupervised DIR methods struggle to address. This limitation occurs because neural networks predict deformation fields in a single forward pass, leaving fields unconstrained post-training and shifting the regularization burden entirely to network weights. To address these issues, we introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network's forward pass. By integrating a duality-based optimization layer with tailored interaction terms, SmoothProper efficiently propagates flow signals across spatial locations, enforces smoothness, and preserves structural consistency. It is model-agnostic, seamlessly integrates into existing registration frameworks with minimal parameter overhead, and eliminates regularizer hyperparameter tuning. Preliminary results on a retinal vessel dataset exhibiting aperture and large-displacement challenges demonstrate our method reduces registration error to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach to effectively address both challenges. The source code will be available at https://github.com/tinymilky/SmoothProper.
中文摘要:本文提出SmoothProper,一种即插即用的神经模块,通过强制平滑性和促进网络前向传播中的信息传递,有效解决了视网膜血管等特征稀疏图像中的孔径和大位移挑战,提升了无监督形变图像配准的性能。
English Summary: The paper introduces SmoothProper, a plug-and-play neural module that enhances unsupervised deformable image registration by enforcing smoothness and enabling effective message passing to address aperture and large-displacement challenges in feature-sparse images like retinal vessels.
Authors:Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens
Abstract:
Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previously updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at https://github.com/VRCMF/LangEdit.git.
中文摘要:LangEdit提出了一种零空间约束框架,通过将参数更新投影至正交子空间来隔离语言特定知识更新,在多种模型架构和任务中有效防止参数干扰并保持多语言泛化能力。
English Summary: LangEdit introduces a null-space constrained framework that isolates language-specific knowledge updates in LLMs by projecting parameter changes onto orthogonal subspaces, effectively preventing interference while maintaining multilingual generalization across diverse models and tasks.
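The core projection is standard linear algebra. Here is a minimal numpy sketch of projecting a new language's flattened parameter update onto the orthogonal complement of earlier updates; the rank threshold and flattening scheme are assumptions, not the paper's exact procedure.

```python
import numpy as np

def project_to_null_space(update, previous_updates, tol=1e-8):
    """Remove from `update` any component in the span of earlier updates."""
    if not previous_updates:
        return update
    A = np.stack(previous_updates, axis=1)         # (d, k) update matrix
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U[:, s > tol]                              # orthonormal basis of the span
    return update - U @ (U.T @ update)             # orthogonal projection

prev = [np.random.randn(16) for _ in range(2)]
safe = project_to_null_space(np.random.randn(16), prev)
print(all(abs(safe @ p) < 1e-6 for p in prev))     # True: no interference
```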
Authors:Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, Kai Chen
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks. However, their proficiency in iteratively optimizing complex solutions through learning from previous feedback remains insufficiently explored. To bridge this gap, we present OPT-BENCH, a comprehensive benchmark designed to evaluate LLM agents on large-scale search space optimization problems. OPT-BENCH includes 20 real-world machine learning tasks sourced from Kaggle and 10 classical NP problems, offering a diverse and challenging environment for assessing LLM agents on iterative reasoning and solution refinement. To enable rigorous evaluation, we introduce OPT-Agent, an end-to-end optimization framework that emulates human reasoning when tackling complex problems by generating, validating, and iteratively improving solutions through leveraging historical feedback. Through extensive experiments on 9 state-of-the-art LLMs from 6 model families, we analyze the effects of optimization iterations, temperature settings, and model architectures on solution quality and convergence. Our results demonstrate that incorporating historical context significantly enhances optimization performance across both ML and NP tasks. All datasets, code, and evaluation tools are open-sourced to promote further research in advancing LLM-driven optimization and iterative reasoning. Project page: https://github.com/OliverLeeXZ/OPT-BENCH.
中文: 本研究提出了OPT-BENCH基准来评估大语言模型在迭代优化任务中的表现,并证明利用历史反馈能显著提升其在机器学习和NP问题上的解决能力。
English: The study introduces OPT-BENCH, a benchmark for evaluating LLM agents on iterative optimization tasks, and demonstrates that leveraging historical feedback significantly improves their performance across machine learning and NP problems.
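A minimal sketch of the generate-validate-refine loop that OPT-Agent emulates follows; `llm` and `validate` are hypothetical stand-ins for the framework's components, not its released API.

```python
import random

def optimize(task, llm, validate, n_iters=5):
    history = []                      # (solution, feedback) pairs fed back as context
    best, best_score = None, float("-inf")
    for _ in range(n_iters):
        solution = llm(task=task, history=history)
        score, feedback = validate(solution)
        history.append((solution, feedback))
        if score > best_score:
            best, best_score = solution, score
    return best, best_score

# Toy usage: maximize -(x - 3)^2 with a random-proposal stand-in "LLM".
llm = lambda task, history: random.uniform(0, 10)
validate = lambda x: (-(x - 3) ** 2, f"tried {x:.2f}")
print(optimize("toy task", llm, validate, n_iters=20))
```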
Authors:Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han
Abstract:
The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on the corpus's topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines as judged by LLMs.
中文: TaxoAdapt是一种新颖框架,通过迭代式层次分类动态调整大语言模型生成的分类体系,使其适应科学文献的多维特性,在保持粒度和连贯性方面显著优于现有方法。
English: TaxoAdapt is a novel framework that dynamically adapts LLM-generated taxonomies to scientific corpora across multiple dimensions, achieving superior granularity and coherence through iterative hierarchical classification.
Authors:Hong Huang, Weixiang Sun, Zhijian Wu, Jingwen Niu, Donghuan Lu, Xian Wu, Yefeng Zheng
Abstract:
Recently, rapid advancements in vision-language models, such as CLIP, have led to significant progress in zero-/few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based ZFSAD methods commonly assume prior knowledge of categories and rely on carefully crafted prompts tailored to specific scenarios. While such meticulously designed text prompts effectively capture semantic information in the textual space, they fall short of distinguishing normal and anomalous instances within the joint embedding space. Moreover, these ZFSAD methods are predominantly explored in industrial scenarios, with few efforts devoted to medical tasks. To this end, we propose an innovative framework for ZFSAD tasks in the medical domain, denoted as IQE-CLIP. We reveal that query embeddings, which incorporate both textual and instance-aware visual information, are better indicators for abnormalities. Specifically, we first introduce class-based prompting tokens and learnable prompting tokens for better adaptation of CLIP to the medical domain. Then, we design an instance-aware query module (IQM) to extract region-level contextual information from both text prompts and visual features, enabling the generation of query embeddings that are more sensitive to anomalies. Extensive experiments conducted on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance on both zero-shot and few-shot tasks. We release our code and data at https://github.com/hongh0/IQE-CLIP/.
Chinese: 本文提出IQE-CLIP框架,通过融合文本和视觉信息生成查询嵌入,显著提升了医学影像中的零样本/少样本异常检测性能,在六个医学数据集上实现了最优表现。
English: This paper introduces IQE-CLIP, a novel framework that enhances zero-/few-shot anomaly detection in medical imaging by generating query embeddings that integrate textual and visual information, achieving state-of-the-art results across six medical datasets.
Authors:Priyanka Kargupta, Runchu Tian, Jiawei Han
Abstract:
Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely "true" or "false" -- as is frequently the case with scientific and political claims. However, a claim (e.g., "vaccine A is better than vaccine B") can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., "how many biomedical papers believe vaccine A is more transportable than B?"). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.
中文摘要:ClaimSpect框架通过将复杂主张分解为层级化的方面与子方面,从语料库中检索相关视角来全面分析不同观点,有效处理科学和政治领域中难以简单判断真伪的声明。
English Summary: ClaimSpect is a framework that deconstructs nuanced claims into hierarchical aspects and sub-aspects, enabling comprehensive analysis by retrieving relevant perspectives from a corpus to represent diverse viewpoints accurately.
Authors:Marco Spinaci, Marek Polewczyk, Maximilian Schambach, Sam Thelin
Abstract:
Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/contexttab
Chinese: ConTextTab是一种新颖的表格原生上下文学习框架,通过专用嵌入和大规模真实数据训练整合语义理解与对齐,在多个基准测试中表现优异,并在CARTE基准上创下新标准。
English: ConTextTab is a novel table-native in-context learning framework that integrates semantic understanding and alignment through specialized embeddings and large-scale real-world data training, achieving competitive performance across benchmarks and setting new standards on the CARTE benchmark.
Authors:Igor Urbanik, Paweł Gajewski
Abstract:
Continual learning poses a fundamental challenge for neural systems, which often suffer from catastrophic forgetting when exposed to sequential tasks. Self-Organizing Maps (SOMs), despite their interpretability and efficiency, are not immune to this issue. In this paper, we introduce Saturation Self-Organizing Maps (SatSOM), an extension of SOMs designed to improve knowledge retention in continual learning scenarios. SatSOM incorporates a novel saturation mechanism that gradually reduces the learning rate and neighborhood radius of neurons as they accumulate information. This effectively freezes well-trained neurons and redirects learning to underutilized areas of the map.
Chinese: 本文提出饱和自组织映射(SatSOM),通过引入饱和机制逐步降低已学习神经元的参数更新,有效缓解持续学习中的灾难性遗忘问题。
English: This paper introduces Saturation Self-Organizing Maps (SatSOM), which enhance continual learning by incorporating a saturation mechanism that reduces learning parameters for experienced neurons, thus mitigating catastrophic forgetting.
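A minimal numpy sketch of the saturation idea described above: each neuron accumulates a saturation value that scales down its learning rate and neighborhood radius, effectively freezing well-trained units. The decay constant and Gaussian neighborhood kernel are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
grid, dim = (10, 10), 3
W = rng.random((*grid, dim))        # SOM codebook vectors
sat = np.zeros(grid)                # per-neuron saturation level

def som_step(x, lr0=0.5, radius0=3.0, decay=0.05):
    d = np.linalg.norm(W - x, axis=-1)
    bmu = np.unravel_index(d.argmin(), grid)       # best-matching unit
    ii, jj = np.indices(grid)
    grid_d2 = (ii - bmu[0]) ** 2 + (jj - bmu[1]) ** 2
    lr = lr0 / (1.0 + sat)                         # saturation shrinks the lr
    radius = radius0 / (1.0 + sat)                 # ...and the neighborhood
    h = np.exp(-grid_d2 / (2.0 * radius ** 2))
    W += (lr * h)[..., None] * (x - W)             # standard SOM update
    sat += decay * h                               # winners saturate first

for x in rng.random((200, dim)):
    som_step(x)
```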
Authors:Xi Chen, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane
Abstract:
Medical images are usually collected from multiple domains, leading to domain shifts that impair the performance of medical image segmentation models. Domain Generalization (DG) aims to address this issue by training a robust model with strong generalizability. Recently, numerous domain randomization-based DG methods have been proposed. However, these methods suffer from the following limitations: 1) constrained efficiency of domain randomization due to their exclusive dependence on image style perturbation, and 2) neglect of the adverse effects of over-augmented images on model training. To address these issues, we propose a novel domain randomization-based DG method, called content style augmentation (ConStyX), for generalizable medical image segmentation. Specifically, ConStyX 1) augments the content and style of training data, allowing the augmented training data to better cover a wider range of data domains, and 2) leverages well-augmented features while mitigating the negative effects of over-augmented features during model training. Extensive experiments across multiple domains demonstrate that our ConStyX achieves superior generalization performance. The code is available at https://github.com/jwxsp1/ConStyX.
中文:提出的ConStyX方法通过同时增强训练数据的内容和风格特征,并减轻过度增强的负面影响,在跨域医学图像分割中实现了优越的泛化性能。
English: The proposed ConStyX method enhances domain generalization for medical image segmentation by augmenting both content and style of training data while mitigating negative effects from over-augmentation, achieving superior performance across multiple domains.
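ConStyX's exact content/style operators are not spelled out in the abstract, but the style half of such augmentation is commonly implemented by perturbing per-instance feature statistics. A hedged PyTorch sketch in the spirit of MixStyle follows; it illustrates the general operator, not the paper's method.

```python
import torch

def mix_style(feat, alpha=0.3, eps=1e-6):
    """feat: (B, C, H, W). Mix per-instance channel statistics across
    the batch to simulate unseen styles/domains."""
    B = feat.size(0)
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sig = feat.std(dim=(2, 3), keepdim=True) + eps
    normed = (feat - mu) / sig
    perm = torch.randperm(B)
    lam = torch.distributions.Beta(alpha, alpha).sample((B, 1, 1, 1))
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return normed * sig_mix + mu_mix

out = mix_style(torch.randn(8, 64, 32, 32))  # style-perturbed features
```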
Authors:Marzieh Oghbaie, Teresa Araújo, Hrvoje Bogunović
Abstract:
Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical.
Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent only using image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales.
Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. Github page: https://github.com/marziehoghbaie/PiPViT
中文摘要:本文提出PiPViT模型,通过视觉转换器学习与临床相关的原型来近似病变范围,在视网膜OCT分类中实现优异性能的同时,提供可解释的诊断依据。
English Summary: The paper introduces PiPViT, an interpretable prototypical model using vision transformers to learn clinically relevant prototypes that approximate lesion extent from image-level labels, achieving competitive performance in retinal OCT classification while providing transparent diagnostic explanations.
Authors:Alexander Lobashev, Dmitry Guskov, Maria Larchenko, Mikhail Tamm
Abstract:
This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions. Our source code is available at https://github.com/alobashev/hessian-geometry-of-diffusion-models.
Chinese: 本文提出了一种通过重构Fisher信息度量来分析生成模型潜在空间几何结构的新方法,揭示了扩散模型中相变的分形结构,并在热力学量重构方面优于现有基准。
English: This paper introduces a method to reconstruct the Fisher information metric for analyzing latent space geometry in generative models, revealing phase transitions and fractal structures in diffusion models while outperforming baselines in thermodynamic quantity reconstruction.
Authors:Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal
Abstract:
This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.
中文: 本文提出了一种结合检索增强提示与大语言模型的系统,用于识别数学推理中的辅导错误,通过结构化提示和可解释预测超越了所有基线方法。
English: This paper introduces a system for identifying tutoring mistakes in mathematical reasoning by combining retrieval-augmented prompting with large language models, which outperforms baseline methods through structured prompts and interpretable predictions.
Authors:Sergio Burdisso, Esaú Villatoro-Tello, Petr Motlicek
Abstract:
The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today's fast-evolving research landscape.
中文:SDialog是一个模块化的Python工具包,它利用指令调优的大语言模型生成真实可控的合成对话,支持多智能体模拟和场景驱动的工作流程,以推动对话AI研究并确保可复现性。
English: SDialog is a modular Python toolkit that uses instruction-tuned LLMs to generate realistic and controllable synthetic dialogues, supporting multi-agent simulations and scenario-driven workflows to advance conversational AI research and ensure reproducibility.
Authors:Reza Karbasi, Masoud Rahimi, Abdol-Hossein Vahabie, Hadi Moradi
Abstract:
This paper addresses the persistent challenge of accurately digitizing paper-based electrocardiogram (ECG) recordings, with a particular focus on robustly handling single leads compromised by signal overlaps, a common yet under-addressed issue in existing methodologies. We propose a two-stage pipeline designed to overcome this limitation. The first stage employs a U-Net based segmentation network, trained on a dataset enriched with overlapping signals and fortified with custom data augmentations, to accurately isolate the primary ECG trace. The subsequent stage converts this refined binary mask into a time-series signal using established digitization techniques, enhanced by an adaptive grid detection module for improved versatility across different ECG formats and scales. Our experimental results demonstrate the efficacy of our approach. The U-Net architecture achieves an IoU of 0.87 for the fine-grained segmentation task. Crucially, our proposed digitization method yields superior performance compared to a well-established baseline technique across both non-overlapping and challenging overlapping ECG samples. For non-overlapping signals, our method achieved a Mean Squared Error (MSE) of 0.0010 and a Pearson Correlation Coefficient (rho) of 0.9644, compared to 0.0015 and 0.9366, respectively, for the baseline. On samples with signal overlap, our method achieved an MSE of 0.0029 and a rho of 0.9641, significantly improving upon the baseline's 0.0178 and 0.8676. This work demonstrates an effective strategy to significantly enhance digitization accuracy, especially in the presence of signal overlaps, thereby laying a strong foundation for the reliable conversion of analog ECG records into analyzable digital data for contemporary research and clinical applications. The implementation is publicly available at this GitHub repository: https://github.com/masoudrahimi39/ECG-code.
Chinese: 本文提出了一种采用U-Net分割和自适应数字化技术的两阶段流程,能精准将纸质心电图转换为数字信号,相比基线方法在处理信号重叠情况时表现出显著优势。
English: This paper introduces a two-stage pipeline using U-Net segmentation and adaptive digitization to accurately convert paper ECG recordings into digital signals, significantly improving performance especially for overlapping signals compared to baseline methods.
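The second stage (mask to time series) can be illustrated with a simple column-wise reduction; the grid calibration constant below is an assumed placeholder, not the paper's adaptive grid detection module.

```python
import numpy as np

def mask_to_signal(mask, mv_per_px=0.01, baseline_row=None):
    """mask: (H, W) boolean array, True on the isolated ECG trace."""
    H, W = mask.shape
    baseline = H / 2.0 if baseline_row is None else baseline_row
    rows = np.arange(H)
    signal = np.full(W, np.nan)
    for col in range(W):
        active = mask[:, col]
        if active.any():
            # Mean trace row in this column, measured from the baseline.
            signal[col] = (baseline - rows[active].mean()) * mv_per_px
    idx = np.arange(W)
    ok = ~np.isnan(signal)
    return np.interp(idx, idx[ok], signal[ok])   # fill empty columns
```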
Authors:Suin Lee, Dae-Shik Kim
Abstract:
We present TexTailor, a novel method for generating consistent object textures from textual descriptions. Existing text-to-texture synthesis approaches utilize depth-aware diffusion models to progressively generate images and synthesize textures across predefined multiple viewpoints. However, these approaches lead to a gradual shift in texture properties across viewpoints due to (1) insufficient integration of previously synthesized textures at each viewpoint during the diffusion process and (2) the autoregressive nature of the texture synthesis process. Moreover, the predefined selection of camera positions, which does not account for the object's geometry, limits the effective use of texture information synthesized from different viewpoints, ultimately degrading overall texture consistency. In TexTailor, we address these issues by (1) applying a resampling scheme that repeatedly integrates information from previously synthesized textures within the diffusion process, and (2) fine-tuning a depth-aware diffusion model on these resampled textures. During this process, we observed that using only a few training images restricts the model's original ability to generate high-fidelity images aligned with the conditioning, and therefore propose a performance preservation loss to mitigate this issue. Additionally, we improve the synthesis of view-consistent textures by adaptively adjusting camera positions based on the object's geometry. Experiments on a subset of the Objaverse dataset and the ShapeNet car dataset demonstrate that TexTailor outperforms state-of-the-art methods in synthesizing view-consistent textures. The source code for TexTailor is available at https://github.com/Adios42/Textailor
中文: TexTailor提出了一种新方法,通过引入重采样方案和自适应相机调整,从文本描述生成一致的对象纹理,在视角一致性方面优于现有技术。
English: TexTailor introduces a novel method to generate consistent object textures from text by integrating a resampling scheme and adaptive camera adjustments, outperforming existing approaches in view-consistent texture synthesis.
Authors:Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu
Abstract:
Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Moreover, they mostly adopt a customized retrieval strategy and struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
Chinese Summary: MSTAR提出了一种无需边界框的场景文本检索方法,通过动态捕捉多粒度文本表征并协调多样化查询,在消除标注成本的同时实现了卓越的检索性能。
English Summary: MSTAR introduces a box-free scene text retrieval method that dynamically captures multi-grained text representations and harmonizes diverse queries, achieving superior performance without costly bounding box annotations.
Authors:Xinyuan Liu, Hang Xu, Yike Ma, Yucheng Zhang, Feng Dai
Abstract:
Recent advances in remote sensing technology have driven rapid growth in imagery, accelerating the development of oriented object detection, which nonetheless remains hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and other benchmarks demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming the SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.
Chinese: 提出的SSP框架通过结合规则驱动和数据驱动的方法,改进了密集遥感场景中的定向目标检测,实现了卓越的样本分配和边界框提取,以45.78% mAP达到最先进性能。
English: The proposed SSP framework enhances oriented object detection in dense remote sensing scenes by combining rule-driven and data-driven approaches for superior sample assignment and box extraction, achieving state-of-the-art performance with 45.78% mAP.
Authors:Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
Abstract:
Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety mechanisms. Guardrails--external defense mechanisms that monitor and control LLM interaction--have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, explore their universality across attack types, and provide insights into optimizing defense combinations. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.
中文: 本文对大型语言模型的越狱防护机制进行了全面分析,提出了新的分类体系与评估框架,在评估防护效果的同时识别现有方法的优势与局限,为未来优化防护组合提供了结构化基础。
English: This paper presents a comprehensive analysis of jailbreak guardrails for Large Language Models, introducing a novel taxonomy and evaluation framework to assess their effectiveness while identifying strengths, limitations, and optimization strategies for future development.
Authors:Chengxu Zuo, Jiawei Huang, Xiao Jiang, Yuan Yao, Xiangren Shi, Rui Cao, Xinyu Yi, Feng Xu, Shihui Guo, Yipeng Qin
Abstract:
In this paper, we propose a novel dynamic calibration method for sparse inertial motion capture systems, which is the first to break the restrictive absolute static assumption in IMU calibration, i.e., that the coordinate drift $R_{G'G}$ and measurement offset $R_{BS}$ remain constant during the entire motion, thereby significantly expanding their application scenarios. Specifically, we achieve real-time estimation of $R_{G'G}$ and $R_{BS}$ under two relaxed assumptions: i) the matrices change negligibly in a short time window; ii) the human movements/IMU readings are diverse in such a time window. Intuitively, the first assumption reduces the number of candidate matrices, and the second assumption provides diverse constraints, which greatly reduces the solution space and allows for accurate estimation of $R_{G'G}$ and $R_{BS}$ from a short history of IMU readings in real time. To achieve this, we created synthetic datasets of paired $R_{G'G}$, $R_{BS}$ matrices and IMU readings, and learned their mappings using a Transformer-based model. We also designed a calibration trigger based on the diversity of IMU readings to ensure that assumption ii) is met before applying our method. To our knowledge, we are the first to achieve implicit IMU calibration (i.e., seamlessly putting IMUs into use without the need for an explicit calibration process), as well as the first to enable long-term and accurate motion capture using sparse IMUs. The code and dataset are available at https://github.com/ZuoCX1996/TIC.
中文摘要:本文提出了一种创新的稀疏惯性动作捕捉系统动态校准方法,首次突破了IMU校准中绝对静态假设的限制,通过两个宽松假设实时估计坐标漂移和测量偏移,实现了无需显式校准流程的隐式校准及稀疏IMU的长期精准动作捕捉。
English Summary: This paper introduces a novel dynamic calibration method for sparse inertial motion capture systems, enabling real-time estimation of coordinate drift and measurement offset without requiring static assumptions, thereby achieving the first implicit calibration and long-term accurate motion capture with sparse IMUs.
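Assumption ii) above suggests a simple gate before calibration is attempted. A minimal sketch of a diversity-based trigger follows; the statistic and threshold are illustrative assumptions, not the paper's trigger design.

```python
import numpy as np

def calibration_ready(imu_window, min_std=0.15):
    """imu_window: (T, D) recent IMU readings; trigger calibration only
    when every channel shows enough variation to constrain the solution."""
    return bool(np.all(imu_window.std(axis=0) > min_std))

window = np.random.default_rng(0).normal(scale=0.3, size=(120, 6))
print(calibration_ready(window))  # True for this diverse synthetic window
```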
Authors:Muskan Dosi, Chiranjeev Chiranjeev, Kartik Thakral, Mayank Vatsa, Richa Singh
Abstract:
Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce HyperSphereDiff to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty. We demonstrate both theoretically and empirically that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: {https://github.com/IAB-IITJ/Harmonizing-Geometry-and-Uncertainty-Diffusion-with-Hyperspheres/}
中文摘要:HyperSphereDiff 通过引入定向噪声使扩散模型与超球面数据几何对齐,有效保留角度特征并提升多个数据集的生成精度。
English Summary: HyperSphereDiff introduces directional noise to align diffusion models with hyperspherical data geometry, preserving angular patterns and improving generative accuracy across multiple datasets.
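One simple way to realize directional noise on the unit hypersphere is to perturb in the tangent space at a point and re-project, which preserves angular geometry. Here is a hedged numpy sketch of that idea; it is an illustration of the geometry, not HyperSphereDiff's exact forward process.

```python
import numpy as np

rng = np.random.default_rng(0)

def directional_noise_step(x, sigma=0.1):
    """x: unit vector. Return a noisy unit vector near x on the sphere."""
    eps = rng.standard_normal(x.shape)
    eps -= (eps @ x) * x               # project noise onto the tangent space
    y = x + sigma * eps
    return y / np.linalg.norm(y)       # re-project onto the hypersphere

x = rng.standard_normal(8)
x /= np.linalg.norm(x)
print(np.linalg.norm(directional_noise_step(x)))  # 1.0: stays on-sphere
```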
Authors:Yutong Zhou, Masahiro Ryo
Abstract:
Explaining why a species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of a multimodal AI assistant, backed by recommended ecological modeling practice, for describing species habitat in human-understandable language. Our code is available at: https://github.com/Yutong-Zhou-cv/BioX.
中文: 本研究提出一个端到端的视觉因果框架,将物种图像转化为可解释的栖息地偏好因果分析,通过多模态人工智能与生态建模相结合,生成人类可读的解释。
English: This study introduces an end-to-end visual-to-causal framework that transforms species images into interpretable causal insights about habitat preferences, integrating multimodal AI with ecological modeling to generate human-readable explanations.
Authors:Jing He, Yiqing Wang, Lingling Li, Kexin Zhang, Puhua Chen
Abstract:
This report presents ContextRefine-CLIP (CR-CLIP), an efficient model for visual-textual multi-instance retrieval tasks. The approach is based on the dual-encoder AVION, on which we introduce a cross-modal attention flow module to achieve bidirectional dynamic interaction and refinement between visual and textual features, generating more context-aware joint representations. For the soft-label relevance matrices provided in tasks such as EPIC-KITCHENS-100, CR-CLIP can work with the Symmetric Multi-Similarity Loss to achieve more accurate semantic alignment and optimization using the refined features. Without using ensemble learning, the CR-CLIP model achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard, significantly outperforming the baseline model and fully validating its effectiveness in cross-modal retrieval. The code will be released open-source at https://github.com/delCayr/ContextRefine-Clip
Chinese: 本报告提出CR-CLIP模型,通过引入跨模态注意力流实现视觉与文本特征的双向动态交互优化,在EPIC-KITCHENS-100等任务中无需集成学习即显著超越基线,验证了其在跨模态检索中的高效性。
English: This report introduces CR-CLIP, an efficient model that enhances cross-modal retrieval by integrating a cross-modal attention flow for dynamic interaction between visual and textual features, achieving superior performance on benchmarks like EPIC-KITCHENS-100 without ensemble learning.
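The bidirectional refinement described above can be sketched with two cross-attention blocks. Dimensions and the residual layout below are illustrative assumptions, not the released module.

```python
import torch
import torch.nn as nn

class CrossModalFlow(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        txt_ref, _ = self.v2t(query=txt, key=vis, value=vis)  # text reads vision
        vis_ref, _ = self.t2v(query=vis, key=txt, value=txt)  # vision reads text
        return vis + vis_ref, txt + txt_ref                   # residual refinement

vis, txt = torch.randn(2, 49, 512), torch.randn(2, 16, 512)
vis_out, txt_out = CrossModalFlow()(vis, txt)
```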
Authors:Junhang Cheng, Fang Liu, Chengru Wu, Li Zhang
Abstract:
While Large Language Models (LLMs) have significantly advanced code generation efficiency, they face inherent challenges in balancing performance and inference costs across diverse programming tasks. Dynamically selecting the optimal LLM based on task difficulty and resource constraints offers a promising approach to achieve an optimal balance between efficiency and performance. However, existing model selection methods are resource-intensive and often neglect cost efficiency. Moreover, these approaches rely on human-annotated difficulty labels that are frequently inaccessible in real-world settings and may not align with the LLM's own assessment of task difficulty. In this paper, we introduce AdaptiveLLM, a framework that dynamically selects optimal LLMs for a given coding task by automatically assessing task difficulty. Our framework first estimates task difficulty using Chain-of-Thought lengths generated by a reasoning model, clusters these into three difficulty levels via k-means, and fine-tunes CodeBERT to embed difficulty-aware features. A trained XGBoost classifier then selects the best model for each problem, optimizing the performance-cost trade-off. Experimental results show that AdaptiveLLM achieves a 7.86% improvement in pass@1 score while reducing resource consumption by 88.9% compared to the baseline method ComplexityNet. Compared to a single model, AdaptiveLLM demonstrates an approximately 15% accuracy improvement at the same level of cost. Moreover, difficulty assessment using CoT provides more reliable selection criteria than human evaluation. Our replication package is available at https://github.com/cjhCoder7/AdaptiveLLM.
中文: AdaptiveLLM框架通过思维链长度自动评估编程任务难度并动态选择最优大语言模型,相比现有方法在显著提升准确率的同时大幅降低了资源消耗。
English: AdaptiveLLM is a framework that dynamically selects the best LLM for coding tasks by automatically assessing difficulty through Chain-of-Thought lengths and clustering, achieving significant improvements in accuracy while drastically reducing resource costs compared to existing methods.
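The difficulty-estimation step described above reduces to one-dimensional clustering. A minimal scikit-learn sketch on synthetic Chain-of-Thought lengths follows; the CodeBERT and XGBoost stages are omitted, and the lengths are made-up data.

```python
import numpy as np
from sklearn.cluster import KMeans

cot_lengths = np.array([120, 140, 130, 900, 950, 2400, 2600, 2500])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(cot_lengths.reshape(-1, 1))
# Relabel clusters by centroid so that 0 = easy, 1 = medium, 2 = hard.
order = np.argsort(km.cluster_centers_.ravel())
difficulty = np.array([int(np.where(order == c)[0][0]) for c in km.labels_])
print(difficulty)  # e.g. [0 0 0 1 1 2 2 2]
```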
Authors:Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin, Huabin Liu
Abstract:
Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. The project is released on https://github.com/LiamZhao326/CogStream.
中文摘要:本文提出了CogStream这一具有挑战性的流媒体视频推理任务,要求模型通过筛选相关历史上下文来高效回答问题,并开发了CogReasoner基准模型,利用视觉压缩和对话检索技术,在降低计算负担的同时提升了推理准确性。
English Summary: This paper introduces CogStream, a challenging task for streaming video reasoning that requires models to identify relevant historical context to answer questions efficiently, and presents a baseline model, CogReasoner, which uses visual compression and dialogue retrieval to reduce computational load while improving accuracy.
Authors:Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa
Abstract:
Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model's reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
Chinese: 本研究将针对表格的科学声明验证重新定义为解释任务,通过构建包含人工标注单元格级依据的数据集,证明尽管融入表格对齐能提升验证性能,但多数大语言模型虽能正确预测却无法产生可信的推理过程。
English: This research reframes scientific claim verification against tables as an explanation task by creating a dataset with human-annotated cell-level rationales, demonstrating that while incorporating table alignment improves verification, most large language models fail to produce faithful reasoning despite correct predictions.
Authors:Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen
Abstract:
Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.
中文摘要:提出的因果推理蒸馏器(CIDer)框架通过自蒸馏和因果推理模块,解决了多模态情感识别中模态缺失和分布外数据的双重挑战,以更少参数和更快训练实现了鲁棒性能。
English Summary: The proposed Causal Inference Distiller (CIDer) framework addresses simultaneous modality missing and Out-Of-Distribution challenges in Multimodal Emotion Recognition through self-distillation and causal inference modules, achieving robust performance with fewer parameters and faster training.
Authors:Oğuzhan Canpolat, Ataberk Olgun, David Novo, Oğuz Ergin, Onur Mutlu
Abstract:
DRAM is a critical component of modern computing systems. Recent works propose numerous techniques (that we call DRAM techniques) to enhance DRAM-based computing systems' throughput, reliability, and computing capabilities (e.g., in-DRAM bulk data copy). Evaluating the system-wide benefits of DRAM techniques is challenging as they often require modifications across multiple layers of the computing stack. Prior works propose FPGA-based platforms for rapid end-to-end evaluation of DRAM techniques on real DRAM chips. Unfortunately, existing platforms fall short in two major aspects: (1) they require deep expertise in hardware description languages, limiting accessibility; and (2) they are not designed to accurately model modern computing systems.
We introduce EasyDRAM, an FPGA-based framework for rapid and accurate end-to-end evaluation of DRAM techniques on real DRAM chips. EasyDRAM overcomes the main drawbacks of prior FPGA-based platforms with two key ideas. First, EasyDRAM removes the need for hardware description language expertise by enabling developers to implement DRAM techniques using a high-level language (C++). At runtime, EasyDRAM executes the software-defined memory system design in a programmable memory controller. Second, EasyDRAM tackles a fundamental challenge in accurately modeling modern systems: real processors typically operate at higher clock frequencies than DRAM, a disparity that is difficult to replicate on FPGA platforms. EasyDRAM addresses this challenge by decoupling the processor-DRAM interface and advancing the system state using a novel technique we call time scaling, which faithfully captures the timing behavior of the modeled system.
We believe and hope that EasyDRAM will enable innovative ideas in memory system design to rapidly come to fruition. To aid future research, the EasyDRAM implementation is open-sourced at https://github.com/CMU-SAFARI/EasyDRAM.
中文:EasyDRAM是一种基于FPGA的框架,通过高级编程和时间缩放技术实现对DRAM技术的快速精准评估,有效解决了现有平台在可访问性和系统建模方面的不足。
English: EasyDRAM is an FPGA-based framework that enables rapid and accurate evaluation of DRAM techniques using high-level programming and time scaling, overcoming accessibility and modeling limitations of prior platforms.
Authors:Kaiyuan Zhang, Siyuan Cheng, Hanxi Guo, Yuetian Chen, Zian Su, Shengwei An, Yuntao Du, Charles Fleming, Ashish Kundu, Xiangyu Zhang, Ninghui Li
Abstract:
Large language models (LLMs) have achieved remarkable success and are widely adopted for diverse applications. However, fine-tuning these models often involves private or sensitive information, raising critical privacy concerns. In this work, we conduct the first comprehensive study evaluating the vulnerability of fine-tuned LLMs to membership inference attacks (MIAs). Our empirical analysis demonstrates that MIAs exploit the loss reduction during fine-tuning, making them highly effective in revealing membership information. These findings motivate the development of our defense. We propose SOFT (Selective data Obfuscation in LLM Fine-Tuning), a novel defense technique that mitigates privacy leakage by leveraging influential data selection with an adjustable parameter to balance utility preservation and privacy protection. Our extensive experiments span six diverse domains and multiple LLM architectures and scales. Results show that SOFT effectively reduces privacy risks while maintaining competitive model performance, offering a practical and scalable solution to safeguard sensitive information in fine-tuned LLMs.
中文摘要:本研究提出SOFT技术,通过选择性混淆数据来减轻大型语言模型微调中的隐私风险,在保持模型性能的同时有效抵御成员推理攻击。
English Summary: This study introduces SOFT, a defense technique that mitigates privacy risks in fine-tuned large language models by selectively obfuscating data, effectively balancing utility and protection against membership inference attacks.
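The membership signal the study exploits can be stated in two lines: fine-tuning lowers loss on training members more than on non-members, so the base-vs-fine-tuned loss gap acts as a membership score. A minimal sketch follows; the threshold and loss values are illustrative assumptions.

```python
def membership_score(loss_base: float, loss_finetuned: float) -> float:
    """Larger base-vs-fine-tuned loss gap => more member-like."""
    return loss_base - loss_finetuned

def predict_member(loss_base, loss_finetuned, threshold=0.5):
    return membership_score(loss_base, loss_finetuned) > threshold

print(predict_member(loss_base=3.2, loss_finetuned=1.1))  # True: member-like
print(predict_member(loss_base=3.2, loss_finetuned=3.0))  # False
```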
Authors:Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt
Abstract:
This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied with a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
中文: TempVS基准测试评估多模态大语言模型在图像序列中的时序推理能力,揭示了与人类能力间的显著差距,并为未来研究提供了方向性见解。
English: The TempVS benchmark evaluates Multimodal Large Language Models' temporal reasoning in image sequences, revealing significant performance gaps compared to humans while offering insights for future research.
Authors:Jintao Liang, Gang Su, Huifeng Lin, You Wu, Rui Zhao, Ziyue Li
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to overcome the knowledge limitations of Large Language Models (LLMs) by integrating external retrieval with language generation. While early RAG systems based on static pipelines have shown effectiveness in well-structured tasks, they struggle in real-world scenarios requiring complex reasoning, dynamic retrieval, and multi-modal integration. To address these challenges, the field has shifted toward Reasoning Agentic RAG, a paradigm that embeds decision-making and adaptive tool use directly into the retrieval process. In this paper, we present a comprehensive review of Reasoning Agentic RAG methods, categorizing them into two primary systems: predefined reasoning, which follows fixed modular pipelines to boost reasoning, and agentic reasoning, where the model autonomously orchestrates tool interaction during inference. We analyze representative techniques under both paradigms, covering architectural design, reasoning strategies, and tool coordination. Finally, we discuss key research challenges and propose future directions to advance the flexibility, robustness, and applicability of reasoning agentic RAG systems. Our collection of the relevant research is organized at https://github.com/ByebyeMonica/Reasoning-Agentic-RAG.
中文摘要:推理智能RAG通过嵌入自主决策和动态工具使用,克服了传统静态检索增强生成在复杂现实场景中的局限性,推动了自适应推理系统的发展。
English Summary: Reasoning Agentic RAG enhances traditional retrieval-augmented generation by integrating autonomous decision-making and dynamic tool use to address complex real-world reasoning challenges.
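The agentic-reasoning paradigm the survey describes reduces to a loop in which the model itself chooses between retrieving and answering. A minimal sketch, with `llm` and `retriever` as stand-in callables rather than any surveyed system's API:

```python
def agentic_rag(question, llm, retriever, max_steps=5):
    """Agentic reasoning loop: the model autonomously decides when to call the retriever.
    `llm(prompt)` returns text; `retriever(query)` returns a list of passages."""
    context = []
    for _ in range(max_steps):
        prompt = (
            "Question: " + question + "\n"
            "Evidence so far:\n" + "\n".join(context) + "\n"
            "Reply with either 'SEARCH: <query>' or 'ANSWER: <final answer>'."
        )
        action = llm(prompt).strip()
        if action.startswith("SEARCH:"):
            query = action[len("SEARCH:"):].strip()
            context.extend(retriever(query))            # dynamic retrieval mid-reasoning
        elif action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
    return llm("Answer directly: " + question)          # fall back once the budget is spent
```

Predefined-reasoning systems, by contrast, would fix the SEARCH/ANSWER schedule in the pipeline rather than letting the model choose it.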
Authors:Jiaqi Lv, Xufeng He, Yanchen Liu, Xu Dai, Aocheng Shen, Yinghao Li, Jiachen Hao, Jianrong Ding, Yang Hu, Shouyi Yin
Abstract:
The rapid growth of deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, significantly alleviating computational bottlenecks. Meanwhile, owing to entrenched user programming habits and the high performance of GPUs, the CUDA ecosystem has established a dominant position in the field of parallel software. This dominance requires other hardware platforms to support CUDA-based software with performance portability. However, translating CUDA code to other platforms poses significant challenges due to differences in parallel programming paradigms and hardware architectures. Existing approaches rely on language extensions, domain-specific languages (DSLs), or compilers but face limitations in workload coverage and generalizability. Moreover, these methods often incur substantial development costs. Recently, LLMs have demonstrated extraordinary potential in various vertical domains, especially in code-related tasks. However, the performance of existing LLMs in CUDA transpilation, particularly for high-performance code, remains suboptimal. To address these challenges, we propose a novel framework for generating high-performance CUDA and corresponding platform code pairs, leveraging AI compiler and automatic optimization technology. We further enhance the framework with a graph-based data augmentation method and introduce HPCTransEval, a benchmark for evaluating LLM performance on CUDA transpilation. We conduct experiments using CUDA-to-CPU transpilation as a case study on leading LLMs. The speedup ratio of the CPU operators improves by an average of 43.8\%, highlighting the potential of LLMs to address compatibility challenges within the CUDA ecosystem. Our code is available at https://github.com/PJLAB-CHIP/HPCTransCompile.
中文摘要:深度学习快速发展加剧了计算需求,尽管NVIDIA的CUDA生态系统在并行计算领域占据主导地位却给其他硬件平台带来兼容性挑战,为此提出的新型AI增强框架显著提升了代码转译性能。
English Summary: The rapid expansion of deep learning has intensified computational demands, with NVIDIA's CUDA ecosystem dominating parallel computing despite creating compatibility challenges for other hardware platforms, leading to the proposal of a novel AI-enhanced framework that significantly improves transpilation performance.
Authors:Yuanyi Song, Pumeng Lyu, Ben Fei, Fenghua Ling, Wanli Ouyang, Lei Bai
Abstract:
Accurate reconstruction of the ocean is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the increasing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and in local regions, struggling with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based SST reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The mean squared error (MSE) reaches 0.049 at guidance points, 0.680 on reconstruction, and 0.633 overall, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at https://github.com/norsheep/ReconMOST.
中文: 本文提出ReconMOST,一种数据驱动的引导扩散模型,利用历史模拟数据和稀疏观测数据重建全球多层海水温度,能有效处理超过92.5%的数据缺失,同时保持高精度和鲁棒性。
English: This paper introduces ReconMOST, a data-driven guided diffusion model that reconstructs global, multi-layer sea temperatures by leveraging historical simulation data and sparse observational inputs, effectively handling over 92.5% missing data with high accuracy and robustness.
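The guidance mechanism can be illustrated with a single correction step: sparse observations pull the model's clean estimate toward measured values, while the learned prior fills unobserved regions. A toy sketch assuming a simple residual-style update; ReconMOST's exact rule may differ.

```python
import torch

def observation_guided_correction(x0_pred, obs, obs_mask, weight=1.0):
    """Nudge the diffusion model's clean estimate toward sparse in-situ observations
    where they exist (obs_mask == 1); elsewhere the learned prior fills the field.
    Illustrative guidance rule only."""
    return x0_pred + weight * obs_mask * (obs - x0_pred)

# Toy usage: a 4-layer 8x8 temperature field with ~7.5% of points observed.
x0_pred = torch.randn(1, 4, 8, 8)
obs = torch.randn(1, 4, 8, 8)
obs_mask = (torch.rand(1, 4, 8, 8) < 0.075).float()
corrected = observation_guided_correction(x0_pred, obs, obs_mask)
```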
Authors:Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin
Abstract:
Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at https://github.com/HCPLab-SYSU/DART.
中文: DART提出了一种动态自适应区域分词器,通过生成内容感知的不同尺寸图像块,智能地将更高密度的令牌分配给信息丰富区域,从而在提升性能的同时显著加快推理速度,为下一代高效视觉模型奠定基础。
English: DART introduces a dynamic adaptive region tokenizer that creates content-aware patches of varying sizes, enabling more efficient and capable vision models by intelligently allocating token density to information-rich regions while improving both performance and inference speed.
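The core idea of score-driven, quantile-based patch allocation is easiest to see in one dimension: boundaries placed at equal-mass quantiles of a score density automatically become denser where scores are high. A non-differentiable 1D sketch of that principle; DART itself uses piecewise differentiable quantile operations over 2D images.

```python
import numpy as np

def score_based_boundaries(scores, num_patches):
    """Place patch boundaries at equal-mass quantiles of a region-score density,
    so high-score (information-rich) areas receive more, smaller patches."""
    density = scores / scores.sum()
    cdf = np.cumsum(density)
    targets = np.linspace(0, 1, num_patches + 1)[1:-1]   # interior quantile levels
    return np.searchsorted(cdf, targets)                 # interior boundary indices

# A salient middle region gets finer patches than the flat surroundings.
scores = np.concatenate([np.ones(50), 10 * np.ones(20), np.ones(50)])
print(score_based_boundaries(scores, num_patches=8))
```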
Authors:Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin
Abstract:
The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models like Vision Transformer (ViT) and Vision Mamba (Vim) represent a fundamental performance bottleneck, creating a trade-off between capturing fine-grained detail and suffering from redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating a higher token density to information-rich regions. The impact of this approach is profound: it unlocks a more intelligent scaling paradigm, where a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed by efficiently capturing high-resolution details in key regions. Furthermore, the principle of adaptive tokenization proves its generality with clear benefits in dense prediction and spatiotemporal video tasks. We argue that by resolving the tokenizer bottleneck at its source, adaptive tokenization is a key component for building the next generation of more efficient and capable foundation models for multimodal AI, robotics, and content generation. Code is available at https://github.com/HCPLab-SYSU/DART.
中文: DART提出了一种动态自适应区域分词器,通过生成内容感知的不同尺寸图像块,智能地将更高密度的令牌分配给信息丰富区域,从而在提升性能的同时显著加快推理速度,为下一代高效视觉模型奠定基础。
English: DART introduces a dynamic adaptive region tokenizer that creates content-aware patches of varying sizes, enabling more efficient and capable vision models by intelligently allocating token density to information-rich regions while improving both performance and inference speed.
Authors:Yuhang Chen, Zhen Tan, Tianlong Chen
Abstract:
Reward Models (RMs), vital for large model alignment, are underexplored for complex embodied tasks like Embodied Question Answering (EQA), where nuanced evaluation of agents' spatial, temporal, and logical understanding is critical yet not considered by generic approaches. We introduce EQA-RM, a novel generative multimodal reward model specifically architected for EQA, trained via our innovative Contrastive Group Relative Policy Optimization (C-GRPO) strategy to learn fine-grained behavioral distinctions. The generative nature of EQA-RM provides interpretable, structured reward feedback (beyond simple scalars), uniquely enabling test-time scaling to dynamically adjust evaluation granularity, from concise scores to detailed critiques of reasoning and grounding, at inference without retraining. Concurrently, we introduce EQARewardBench, a new benchmark built on OpenEQA for standardized EQA reward model assessment. Demonstrating high sample efficiency, EQA-RM (fine-tuning Qwen2-VL-2B-Instruct) achieves 61.9\% accuracy on EQARewardBench with only 700 samples, outperforming strong proprietary baselines, including Gemini-2.5-Flash, GPT-4o, Claude-3.5-Haiku, and open-sourced state-of-the-art models such as RoVRM and VisualPRM. The code and dataset can be found at https://github.com/UNITES-Lab/EQA-RM.
中文: 本文提出了针对具身问答任务的生成式多模态奖励模型EQA-RM,该模型能提供可解释的反馈并以少量样本实现优于主流模型的性能,同时建立了标准化评估新基准。
English: This paper introduces EQA-RM, a generative multimodal reward model for Embodied Question Answering, which provides interpretable feedback and outperforms leading models with high sample efficiency, alongside a new benchmark for standardized evaluation.
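The abstract does not spell out C-GRPO's contrastive construction, but its GRPO backbone scores each sampled response relative to its own group rather than against a learned value baseline. A sketch of that group-relative normalization only; the contrastive component is not shown.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled response's reward within its
    group (responses to the same prompt/episode), removing the need for a learned
    value baseline. C-GRPO adds a contrastive objective on top of this."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.2, 0.9, 0.4, 0.7]])   # 4 sampled critiques for one episode
print(group_relative_advantages(rewards))
```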
Authors:Xiaohan Yu, Pu Jian, Chong Chen
Abstract:
Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
中文: TableRAG提出了一种基于SQL的框架,通过保留表格结构和实现复杂推理,有效解决了现有RAG方法在处理异构文档时的局限性,在公共数据集和新开发的HeteQA基准测试中均达到了最优性能。
English: TableRAG introduces an SQL-based framework that overcomes the limitations of existing RAG methods in handling heterogeneous documents by preserving tabular structures and enabling complex reasoning, achieving state-of-the-art performance on both public datasets and the newly developed HeteQA benchmark.
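The four-step iteration reads naturally as a control loop. A skeleton sketch with `llm` and `text_retriever` as placeholder callables and SQLite standing in for the table store; this is not the released TableRAG API.

```python
import sqlite3

def table_rag(question, llm, text_retriever, db_path, max_rounds=3):
    """Skeleton of the described loop: decompose -> retrieve text -> run SQL -> compose."""
    conn = sqlite3.connect(db_path)
    notes = []
    for _ in range(max_rounds):
        sub_q = llm("Decompose, given progress so far:\n" + "\n".join(notes) +
                    "\nQuestion: " + question)                 # 1. query decomposition
        passages = text_retriever(sub_q)                       # 2. text retrieval
        sql = llm("Write one SQLite query answering: " + sub_q)
        rows = conn.execute(sql).fetchall()                    # 3. SQL programming + execution
        notes.append(llm(f"Combine into an intermediate answer:\n"
                         f"sub-question: {sub_q}\npassages: {passages}\nrows: {rows}"))
        if llm("Is the question fully answered? yes/no\n" + "\n".join(notes)) == "yes":
            break
    return notes[-1]                                           # 4. compositional answer
```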
Authors:Xiaohan Yu, Pu Jian, Chong Chen
Abstract:
Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, an SQL-based framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
中文: TableRAG提出了一种基于SQL的框架,通过保留表格结构和实现复杂推理,有效解决了现有RAG方法在处理异构文档时的局限性,在公共数据集和新开发的HeteQA基准测试中均达到了最优性能。
English: TableRAG introduces an SQL-based framework that overcomes the limitations of existing RAG methods in handling heterogeneous documents by preserving tabular structures and enabling complex reasoning, achieving state-of-the-art performance on both public datasets and the newly developed HeteQA benchmark.
Authors:Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuhan Lyu
Abstract:
Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently limited capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer's ability to extract global context information. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications to verify the excellent generalization capability of FSATFusion. Finally, an object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at https://github.com/Lmmh058/FSATFusion.
中文: 提出的FSATFusion网络通过引入频空注意力Transformer模块,克服了卷积神经网络在全局上下文捕捉上的局限性,在红外与可见光图像融合任务中展现出卓越的融合性能和泛化能力。
English: The proposed FSATFusion network overcomes the limitations of convolutional neural networks in capturing global context by incorporating a frequency-spatial attention Transformer module, demonstrating superior fusion performance and generalization capability in infrared and visible image fusion tasks.
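One way to picture a frequency-spatial attention mechanism is a channel gate driven by spectral energy combined with a learned spatial map. The minimal PyTorch block below illustrates that pairing under our own simplifying assumptions; it is not FSATFusion's FSAM.

```python
import torch
import torch.nn as nn

class FreqSpatialAttention(nn.Module):
    """Minimal frequency-spatial attention: gate channels by their spectral energy,
    then reweight locations with a learned spatial map. Illustrative only."""
    def __init__(self, channels):
        super().__init__()
        self.freq_gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):                                # x: (B, C, H, W)
        amp = torch.fft.rfft2(x).abs()                   # frequency magnitudes
        gate = self.freq_gate(amp.mean(dim=(2, 3)))      # (B, C) channel gate
        x = x * gate[:, :, None, None]                   # frequency-informed channel attention
        return x * self.spatial(x)                       # spatial attention

print(FreqSpatialAttention(32)(torch.randn(2, 32, 64, 64)).shape)
```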
Authors:Yanhui Li, Dongxia Wang, Zhu Sun, Haonan Zhang, Huizhong Guo
Abstract:
Recently, Graph Neural Networks (GNNs) have become the dominant approach for Knowledge Graph-aware Recommender Systems (KGRSs) due to their proven effectiveness. Building upon GNN-based KGRSs, Self-Supervised Learning (SSL) has been incorporated to address the sparsity issue, leading to longer training time. However, through extensive experiments, we reveal that: (1) compared to other KGRSs, existing GNN-based KGRSs fail to keep their superior performance under sparse interactions, even with SSL; (2) more complex models tend to perform worse in sparse interaction scenarios, and complex mechanisms, such as attention, can be detrimental as they often increase learning difficulty. Inspired by these findings, we propose LightKG, a simple yet powerful GNN-based KGRS to address sparsity issues. LightKG includes a simplified GNN layer that encodes directed relations as scalar pairs rather than dense embeddings and employs a linear aggregation framework, greatly reducing the complexity of GNNs. Additionally, LightKG incorporates an efficient contrastive layer to implement SSL. It directly minimizes node similarity in the original graph, avoiding the time-consuming subgraph generation and comparison required in previous SSL methods. Experiments on four benchmark datasets show that LightKG outperforms 12 competitive KGRSs in both sparse and dense scenarios while significantly reducing training time. Specifically, it surpasses the best baselines by an average of 5.8\% in recommendation accuracy and saves 84.3\% of training time compared to KGRSs with SSL. Our code is available at https://github.com/1371149/LightKG.
中文: 图神经网络在知识图谱感知推荐系统中应用广泛,但面临稀疏交互和复杂性增加的挑战,因此提出了LightKG这一简化模型,既提升了推荐准确性又显著缩短了训练时间。
English: Graph Neural Networks (GNNs) are widely used in Knowledge Graph-aware Recommender Systems but struggle with sparse interactions and increased complexity, prompting the development of LightKG, a simplified GNN model that enhances performance and reduces training time.
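The abstract's two simplifications, scalar pairs per directed relation and purely linear aggregation, can be sketched as a message-passing layer. The shapes and the mean-aggregation choice below are assumptions read off the abstract, not LightKG's actual layer.

```python
import torch
import torch.nn as nn

class ScalarRelationLayer(nn.Module):
    """Sketch of a LightKG-style layer: each directed relation gets a scalar pair
    (one weight per direction) instead of a dense embedding, and neighbors are
    aggregated linearly (no attention, no nonlinearity)."""
    def __init__(self, num_relations):
        super().__init__()
        self.rel_scale = nn.Parameter(torch.ones(num_relations, 2))   # (forward, backward)

    def forward(self, node_emb, edge_index, edge_type):
        src, dst = edge_index                                # each (E,)
        out = torch.zeros_like(node_emb)
        w_fwd = self.rel_scale[edge_type, 0].unsqueeze(-1)
        w_bwd = self.rel_scale[edge_type, 1].unsqueeze(-1)
        out.index_add_(0, dst, w_fwd * node_emb[src])        # forward direction
        out.index_add_(0, src, w_bwd * node_emb[dst])        # backward direction
        deg = torch.bincount(torch.cat([src, dst]), minlength=node_emb.size(0))
        return out / deg.clamp(min=1).unsqueeze(-1)          # simple linear mean aggregation

emb = torch.randn(5, 16)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
out = ScalarRelationLayer(num_relations=2)(emb, edge_index, torch.tensor([0, 1, 0]))
```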
Authors:Zhanwei Zhang, Kaiyuan Liu, Junjie Liu, Wenxiao Wang, Binbin Lin, Liang Xie, Chen Shen, Deng Cai
Abstract:
Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption $\sim$221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at https://github.com/Zhanwei-Z/GeoCAD.
Chinese: GeoCAD提出了一种用户友好的局部几何可控CAD生成方法,通过互补标注策略和大语言模型,依据几何指令修改CAD模型的特定局部部件,有效解决了以往方法在文本引导和局部聚焦方面的不足。
English: GeoCAD introduces a user-friendly method for local geometry-controllable CAD generation that uses complementary captioning and LLMs to modify specific parts of CAD models according to geometric instructions, overcoming previous limitations in text guidance and local focus.
Authors:Aaryam Sharma
Abstract:
Air pollution has become a significant health risk in developing countries. While governments routinely publish air-quality index (AQI) data to track pollution, these values fail to capture the local reality, as sensors are often very sparse. In this paper, we address this gap by predicting AQI in 1 km^2 neighborhoods, using the AirDelhi dataset as an example. Using spatio-temporal GNNs, we surpass existing work by 71.654 MSE, a 79% reduction, even on unseen coordinates. We also uncover new insights about AQI, such as the existence of strong repetitive short-term patterns and changing spatial relations. The code is available on GitHub.
Chinese: 本文通过使用时空图神经网络精确预测1平方公里区域的空气质量指数,解决了传感器稀疏的局限,将误差降低了79%,并揭示了污染数据中的新规律。
English: This paper tackles the limitations of sparse air-quality sensors by using spatio-temporal graph neural networks to accurately predict AQI in 1 km² areas, achieving a 79% reduction in error and uncovering new patterns in pollution data.
Authors:Cameron Angliss, Jiaxun Cui, Jiaheng Hu, Arrasy Rahman, Peter Stone
Abstract:
Developing AI agents that can robustly adapt to dramatically different strategic landscapes without retraining is a central challenge for multi-agent learning. Pokémon Video Game Championships (VGC) is a domain with an extraordinarily large space of possible team configurations of approximately $10^{139}$ - far larger than those of Dota or Starcraft. The highly discrete, combinatorial nature of team building in Pokémon VGC causes optimal strategies to shift dramatically depending on both the team being piloted and the opponent's team, making generalization uniquely challenging. To advance research on this problem, we introduce VGC-Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies human-play datasets and a range of baselines - from large-language-model agents and behavior cloning to reinforcement learning and empirical game-theoretic methods such as self-play, fictitious play, and double oracle. In the restricted setting where an agent is trained and evaluated on a single-team configuration, our methods are able to win against a professional VGC competitor. We extensively evaluated all baseline methods over progressively larger team sets and find that even the best-performing algorithm in the single-team setting struggles at scaling up as team size grows. Thus, policy generalization across diverse team strategies remains an open challenge for the community. Our code is open sourced at https://github.com/cameronangliss/VGC-Bench.
中文摘要:VGC-Bench作为新基准被提出,旨在解决AI智能体在宝可梦VGC巨大策略空间中泛化适应的核心难题,现有方法在单队伍配置中表现良好但难以扩展到多队伍场景。
English Summary: VGC-Bench is introduced as a new benchmark to address the challenge of developing AI agents that can generalize across Pokémon VGC's vast strategic landscape, where current methods succeed in single-team scenarios but fail to scale effectively.
Authors:Cheng Wang, Siqi Chen, Donghua Mi, Yang Chen, Yudong Zhang, Yinsheng Li
Abstract:
Recent advances in medical imaging have established deep learning-based segmentation as the predominant approach, though it typically requires large amounts of manually annotated data. However, obtaining annotations for intracranial hemorrhage (ICH) remains particularly challenging due to the tedious and costly labeling process. Semi-supervised learning (SSL) has emerged as a promising solution to address the scarcity of labeled data, especially in volumetric medical image segmentation. Unlike conventional SSL methods that primarily focus on high-confidence pseudo-labels or consistency regularization, we propose SWDL-Net, a novel SSL framework that exploits the complementary advantages of Laplacian pyramid and deep convolutional upsampling. The Laplacian pyramid excels at edge sharpening, while deep convolutions enhance detail precision through flexible feature mapping. Our framework achieves superior segmentation of lesion details and boundaries through a difference learning mechanism that effectively integrates these complementary approaches. Extensive experiments on a 271-case ICH dataset and public benchmarks demonstrate that SWDL-Net outperforms current state-of-the-art methods in scenarios with only 2% labeled data. Additional evaluations on the publicly available Brain Hemorrhage Segmentation Dataset (BHSD) with 5% labeled data further confirm the superiority of our approach. Code and data have been released at https://github.com/SIAT-CT-LAB/SWDL.
中文摘要:SWDL-Net是一种新型半监督学习框架,通过结合拉普拉斯金字塔边缘锐化和深度卷积上采样,在仅使用少量标注数据的情况下实现了优越的颅内出血分割效果。
English Summary: SWDL-Net is a novel semi-supervised learning framework that combines Laplacian pyramid edge sharpening with deep convolutional upsampling to achieve superior intracranial hemorrhage segmentation using minimal labeled data.
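The Laplacian-pyramid half of the difference-learning pair is the classic construction: each level stores the high-frequency residual between a scale and its blurred upsampling. A sketch using 2D slices for brevity; SWDL-Net operates on volumetric data and pairs this with learned deep convolutional upsampling.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x, levels=3):
    """Decompose an image batch into band-pass (edge) levels plus a low-res base.
    Reconstruction is repeated upsample-and-add, so edges stay sharply localized."""
    bands, cur = [], x
    for _ in range(levels):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear", align_corners=False)
        bands.append(cur - up)        # high-frequency residual: edges and fine detail
        cur = down
    return bands, cur

bands, base = laplacian_pyramid(torch.randn(1, 1, 64, 64))
print([b.shape for b in bands], base.shape)
```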
Authors:Paul Janson, Benjamin Therien, Quentin Anthony, Xiaolong Huang, Abhinav Moudgil, Eugene Belilovsky
Abstract:
Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from (Metz et al., 2022a), delivering substantial speedups -- from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we find that learned optimizers can substantially benefit. Our code is available at https://github.com/Belilovsky-Lab/pylo
中文: PyLO库通过基于PyTorch的解决方案,使学习优化器在实际应用中更易用且高效,显著提升训练速度并兼容现有优化工具。
English: The PyLO library introduces a PyTorch-based solution to make learned optimizers accessible and practical for real-world applications, offering significant speed improvements and compatibility with existing optimization tools.
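Integration with standard PyTorch tooling is the key usability point here. The sketch below shows the general pattern of wrapping a learned update rule as a `torch.optim.Optimizer` so that learning rate schedulers and weight decay compose with it; it is emphatically not PyLO's actual interface, just the shape of the idea.

```python
import torch

class LearnedOptimizer(torch.optim.Optimizer):
    """Generic wrapper for a learned update rule. `update_fn(grad, state)` stands in
    for a meta-trained network that maps gradients (and per-parameter state) to
    updates. NOT PyLO's real API -- only the integration pattern."""
    def __init__(self, params, update_fn, lr=1.0, weight_decay=0.0):
        super().__init__(params, dict(lr=lr, weight_decay=weight_decay))
        self.update_fn = update_fn

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad + group["weight_decay"] * p
                p.add_(self.update_fn(grad, self.state[p]), alpha=-group["lr"])

# Exposing `lr` lets existing schedulers (e.g., cosine decay) modulate the learned rule:
#   opt = LearnedOptimizer(model.parameters(), lambda g, s: g.sign(), lr=1e-3)
#   sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000)
```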
Authors:Hossein A. Rahmani, Varsha Ramineni, Nick Craswell, Bhaskar Mitra, Emine Yilmaz
Abstract:
Test collections are crucial for evaluating Information Retrieval (IR) systems. Creating a diverse set of user queries for these collections can be challenging, and obtaining relevance judgments, which indicate how well retrieved documents match a query, is often costly and resource-intensive. Recently, generating synthetic datasets using Large Language Models (LLMs) has gained attention in various applications. While previous work has used LLMs to generate synthetic queries or documents to improve ranking models, using LLMs to create synthetic test collections is still relatively unexplored. Previous work~\cite{rahmani2024synthetic} showed that synthetic test collections have the potential to be used for system evaluation; however, more analysis is needed to validate this claim. In this paper, we thoroughly investigate the reliability of synthetic test collections constructed using LLMs, where LLMs are used to generate synthetic queries, labels, or both. In particular, we examine the potential biases that might occur when such test collections are used for evaluation. We first empirically show the presence of such bias in evaluation results and analyse the effects it might have on system evaluation. We further validate the presence of such bias using a linear mixed-effects model. Our analysis shows that while the bias present in evaluation results obtained using synthetic test collections can be significant, e.g., when computing absolute system performance, its effect may not be as significant when comparing relative system performance. Codes and data are available at: https://github.com/rahmanidashti/BiasSyntheticData.
中文: 本文深入研究了由大型语言模型生成的合成测试集的可靠性,发现其在系统绝对性能评估中存在显著偏差,但对相对性能比较影响较小。
English: This paper thoroughly investigates the reliability of synthetic test collections generated by Large Language Models (LLMs), revealing significant biases in absolute system performance evaluation but less impact on relative comparisons.
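The linear mixed-effects analysis can be reproduced in spirit with statsmodels: treat each system as a random intercept and test the fixed effect of a synthetic-judgment flag on evaluation scores. The schema and data below are invented for illustration; they are not the paper's dataset.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative schema: one row per (system, collection) evaluation score, with a
# flag for whether judgments came from the synthetic or the human test collection.
df = pd.DataFrame({
    "score":     [0.41, 0.38, 0.52, 0.47, 0.33, 0.36, 0.58, 0.55],
    "synthetic": [0,    1,    0,    1,    0,    1,    0,    1],
    "system":    ["A", "A", "B", "B", "C", "C", "D", "D"],
})

# Random intercept per system; the fixed effect of `synthetic` estimates the bias
# that synthetic judgments introduce into absolute scores.
model = smf.mixedlm("score ~ synthetic", df, groups=df["system"]).fit()
print(model.summary())
```

If the `synthetic` coefficient is significant but system rankings are unchanged, one recovers the paper's conclusion: bias matters for absolute scores more than for relative comparisons.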
Authors:Hossein A. Rahmani, Varsha Ramineni, Emine Yilmaz, Nick Craswell, Bhaskar Mitra
Abstract:
Test collections are crucial for evaluating Information Retrieval (IR) systems. Creating a diverse set of user queries for these collections can be challenging, and obtaining relevance judgments, which indicate how well retrieved documents match a query, is often costly and resource-intensive. Recently, generating synthetic datasets using Large Language Models (LLMs) has gained attention in various applications. While previous work has used LLMs to generate synthetic queries or documents to improve ranking models, using LLMs to create synthetic test collections is still relatively unexplored. Previous work~\cite{rahmani2024synthetic} showed that synthetic test collections have the potential to be used for system evaluation; however, more analysis is needed to validate this claim. In this paper, we thoroughly investigate the reliability of synthetic test collections constructed using LLMs, where LLMs are used to generate synthetic queries, labels, or both. In particular, we examine the potential biases that might occur when such test collections are used for evaluation. We first empirically show the presence of such bias in evaluation results and analyse the effects it might have on system evaluation. We further validate the presence of such bias using a linear mixed-effects model. Our analysis shows that while the bias present in evaluation results obtained using synthetic test collections can be significant, e.g., when computing absolute system performance, its effect may not be as significant when comparing relative system performance. Codes and data are available at: https://github.com/rahmanidashti/BiasSyntheticData.
中文: 本文深入研究了由大型语言模型生成的合成测试集的可靠性,发现其在系统绝对性能评估中存在显著偏差,但对相对性能比较影响较小。
English: This paper thoroughly investigates the reliability of synthetic test collections generated by Large Language Models (LLMs), revealing significant biases in absolute system performance evaluation but less impact on relative comparisons.
Authors:Han Wang, Di Wu, Lin Cheng, Shengping Gong, Xu Huang
Abstract:
Infinite-time nonlinear optimal regulation control is widely utilized in aerospace engineering as a systematic method for synthesizing stable controllers. However, conventional methods often rely on a linearization hypothesis, while recent learning-based approaches rarely consider stability guarantees. This paper proposes a learning-based framework to learn a stable optimal controller for nonlinear optimal regulation problems. First, leveraging the equivalence between the Pontryagin Maximum Principle (PMP) and the Hamilton-Jacobi-Bellman (HJB) equation, we improve the backward generation of optimal examples (BGOE) method for infinite-time optimal regulation problems. A state-transition-matrix-guided data generation method is then proposed to efficiently generate a complete dataset that covers the desired state space. Finally, we incorporate the Lyapunov stability condition into the learning framework, ensuring the stability of the learned optimal policy by jointly learning the optimal value function and control policy. Simulations on three nonlinear optimal regulation problems show that the learned optimal policy achieves near-optimal regulation control. The code is provided at https://github.com/wong-han/PaperNORC
中文: 本文提出了一种基于学习的框架,通过结合李雅普诺夫稳定性条件,为非线性最优调节问题学习稳定的控制器,并在仿真中实现了接近最优的控制效果。
English: This paper introduces a learning-based framework that integrates Lyapunov stability conditions to develop a stable optimal controller for nonlinear regulation, utilizing improved data generation and demonstrating near-optimal performance in simulations.
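Incorporating a Lyapunov condition into learning typically means penalizing positive values of the derivative V̇(x) along the closed-loop dynamics. A hedged sketch of such a penalty, treating the learned value function as the Lyapunov candidate; the paper's exact joint loss is not given in the abstract.

```python
import torch

def lyapunov_penalty(value_net, policy_net, dynamics, states):
    """Penalize violations of the decrease condition V_dot(x) < 0 under the
    closed-loop dynamics f(x, pi(x)). `value_net`, `policy_net`, and `dynamics`
    are stand-in callables; the hinge margin is illustrative."""
    states = states.requires_grad_(True)
    V = value_net(states).sum()
    dV_dx = torch.autograd.grad(V, states, create_graph=True)[0]     # (B, n)
    x_dot = dynamics(states, policy_net(states))                     # (B, n)
    V_dot = (dV_dx * x_dot).sum(dim=-1)                              # directional derivative
    return torch.relu(V_dot + 1e-3).mean()   # hinge: only positive V_dot is penalized

# total_loss = optimal_control_loss + lambda_stab * lyapunov_penalty(...)
```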
Authors:Eunkyu Park, Minyeong Kim, Gunhee Kim
Abstract:
Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: https://github.com/dbsltm/cvpr25_halloc.
Chinese: 为解决视觉语言模型中的幻觉问题,研究者提出了HalLoc数据集和基线模型,通过概率性检测和低开销集成,有效提升模型在实际应用中的可靠性。
English: Hallucinations in vision-language models undermine reliability, so the authors introduce HalLoc, a dataset and baseline model for efficient, probabilistic detection that integrates seamlessly to enhance trustworthiness in real-world applications.
Authors:Emerson P. Grabke, Masoom A. Haider, Babak Taati
Abstract:
Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, non-medical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM framework centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.071. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74%. Training a classifier solely on our method's synthetic images achieved comparable performance to training on real images alone. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric framework enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility. Code from this study will be available at https://github.com/grabkeem/CCELLA
中文摘要:本研究提出CCELLA这一新型双头调节方法,通过结合自由文本临床报告和放射学分类来增强医学影像的潜在扩散模型,在有限数据和最少人工标注条件下显著提升了合成图像质量及下游分类器性能。
English Summary: The study introduces CCELLA, a novel dual-head conditioning approach that enhances latent diffusion models for medical imaging by utilizing free-text clinical reports and radiology classifications, significantly improving synthetic image quality and downstream classifier performance with limited data and minimal human annotation.
Authors:Yuhui Ding, Thomas Hofmann
Abstract:
Equivariant diffusion models have achieved impressive performance in 3D molecule generation. These models incorporate Euclidean symmetries of 3D molecules by utilizing an SE(3)-equivariant denoising network. However, specialized equivariant architectures limit the scalability and efficiency of diffusion models. In this paper, we propose an approach that relaxes such equivariance constraints. Specifically, our approach learns a sample-dependent SO(3) transformation for each molecule to construct an aligned latent space. A non-equivariant diffusion model is then trained over the aligned representations. Experimental results demonstrate that our approach performs significantly better than previously reported non-equivariant models. It yields sample quality comparable to state-of-the-art equivariant diffusion models and offers improved training and sampling efficiency. Our code is available at https://github.com/skeletondyh/RADM
中文摘要:我们的方法通过放宽严格的等变性约束,利用样本相关的分子对齐构建潜在空间,并采用非等变扩散模型,在保持顶尖生成质量的同时显著提升了3D分子生成的训练与采样效率。
English Summary: Our method enhances 3D molecule generation by relaxing rigid equivariance constraints, achieving state-of-the-art quality with improved efficiency through sample-dependent molecular alignment and non-equivariant diffusion modeling.
Authors:Fiona Ryan, Josef Sivic, Fabian Caba Heilbron, Judy Hoffman, James M. Rehg, Bryan Russell
Abstract:
Personalized vision-language retrieval seeks to recognize new concepts (e.g. "my dog Fido") from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaptation of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries - DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal retrievals.
Chinese: 本文提出了一种个性化视觉语言检索方法,通过对语言编码器进行正则化低秩参数调整来适应双编码器模型,有效识别个人概念同时保留通用知识,并在基准测试中取得了最先进的性能。
English: This paper introduces a method for personalized vision-language retrieval by adapting a dual encoder model through regularized low-rank parameter adjustments in the language encoder, effectively recognizing personal concepts while preserving general knowledge and achieving state-of-the-art results on benchmarks.
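The two mechanisms the paper highlights, low-rank adaptation of the final language-encoder layer and composing concepts by parameter addition, fit in a few lines. A sketch with illustrative shapes; the regularizer the paper applies is only indicated in a comment.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Low-rank delta on a frozen projection W: W' = W + B @ A. One adapter is
    learned per personal concept; the paper regularizes this adaptation (an L2
    penalty on delta() could stand in here)."""
    def __init__(self, dim_out, dim_in, rank=4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(dim_out, rank))   # zero init: start at identity

    def delta(self):
        return self.B @ self.A

def combined_projection(W_frozen, adapters):
    """Compose several learned concepts by adding their low-rank deltas,
    following the paper's parameter-addition finding."""
    return W_frozen + sum(a.delta() for a in adapters)

W = torch.randn(512, 512)
fido, bike = LowRankAdapter(512, 512), LowRankAdapter(512, 512)
W_personal = combined_projection(W, [fido, bike])
```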
Authors:Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias
Abstract:
As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency.
In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at https://github.com/billpsomas/efficient-probing.
中文摘要:本研究提出了高效探测(EP)方法,通过减少冗余参数实现参数高效的注意力探测,在多个基准测试中优于线性探测及现有方法,并揭示了其新兴特性以拓展应用前景。
English Summary: This study introduces Efficient Probing (EP), a parameter-efficient attentive probing method that outperforms linear probing and existing approaches across multiple benchmarks while offering insights into its emergent properties for broader applications.
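A multi-query cross-attention probe, stripped to its essentials, is a handful of learned queries attending over frozen patch tokens with no extra key/value projections. The sketch below follows that description; the specific EP design choices are assumptions, not the released module.

```python
import torch
import torch.nn as nn

class EfficientProbe(nn.Module):
    """Sketch of a multi-query cross-attention probe: learned queries attend
    directly over frozen patch tokens (no key/value projections), and the pooled
    outputs feed a linear classifier."""
    def __init__(self, dim, num_classes, num_queries=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * dim ** -0.5)
        self.head = nn.Linear(num_queries * dim, num_classes)

    def forward(self, tokens):                   # tokens: (B, N, D) from a frozen backbone
        attn = torch.einsum("qd,bnd->bqn", self.queries, tokens).softmax(dim=-1)
        pooled = torch.einsum("bqn,bnd->bqd", attn, tokens)   # attention-weighted pooling
        return self.head(pooled.flatten(1))

logits = EfficientProbe(dim=768, num_classes=1000)(torch.randn(8, 196, 768))
```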
Authors:Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias
Abstract:
As fine-tuning becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol. Yet, the standard linear probing fails to adequately reflect the potential of models whose pre-training optimizes representations of patch tokens rather than an explicit global representation. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy vs. parameter efficiency trade-off. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance. Building on this, we propose efficient probing (EP), a simple yet effective multi-query cross-attention mechanism that eliminates redundant projections and reduces the number of trainable parameters. Despite its simplicity, EP outperforms linear probing and prior attentive probing approaches across seven benchmarks, generalizes well to diverse pre-training paradigms, and delivers strong low-shot and layer-wise gains. Beyond evaluation, our analysis uncovers emerging properties of EP, such as complementary attention maps, which open new directions for leveraging probing beyond protocol design. Code available at https://github.com/billpsomas/efficient-probing.
中文摘要:本研究提出了高效探测(EP)方法,通过减少冗余参数实现参数高效的注意力探测,在多个基准测试中优于线性探测及现有方法,并揭示了其新兴特性以拓展应用前景。
English Summary: This study introduces Efficient Probing (EP), a parameter-efficient attentive probing method that outperforms linear probing and existing approaches across multiple benchmarks while offering insights into its emergent properties for broader applications.
Authors:Yael Frischholz, Devis Tuia, Michael Lehning
Abstract:
Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont's SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided with a sufficiently long temporal context, the model matches the performance of albedo-informed models, highlighting the model's ability to internally learn and exploit latent surface reflectance dynamics. Our geospatial analysis shows this effect is most pronounced in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at https://github.com/frischwood/HeMu-dev.git
中文: 该研究提出的基于注意力的模拟器通过从卫星图像序列中学习晴空反射率动态,有效反演地表太阳辐射,无需人工特征且在复杂山区表现尤为突出。
English: The proposed attention-based emulator effectively retrieves surface solar radiation by learning clear-sky reflectance dynamics from satellite imagery, eliminating manual features and excelling particularly in complex mountainous terrain.
Authors:Minye Shao, Zeyu Wang, Haoran Duan, Yawen Huang, Bing Zhai, Shizheng Wang, Yang Long, Yefeng Zheng
Abstract:
Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. However, current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient consideration of MRI-specific tumor features such as complex textures and directional variations. To address this, we propose the Harmonized Frequency Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a frequency-domain perspective. To comprehensively characterize tumor regions, we develop a Frequency Domain Decomposition (FDD) module that separates MRI images into low-frequency components, capturing smooth tumor contours, and high-frequency components, highlighting detailed textures and directional edges. To further enhance sensitivity to tumor boundaries, we introduce an Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical high-frequency details using dynamically updated convolution kernels. To effectively fuse tumor features across multiple scales, we design a Frequency Domain Cross-Attention (FDCA) integrating semantic, positional, and slice-specific information. We further validate and interpret frequency-domain improvements through visualization, theoretical reasoning, and experimental analyses. Extensive experiments on four public datasets demonstrate that HFF-Net achieves an average relative improvement of 4.48\% (ranging from 2.39\% to 7.72\%) in the mean Dice scores across the three major subregions, and an average relative improvement of 7.33\% (ranging from 5.96\% to 8.64\%) in the segmentation of contrast-enhancing tumor regions, while maintaining favorable computational efficiency and clinical applicability. Code: https://github.com/VinyehShaw/HFF.
中文: 提出的谐波频率融合网络(HFF-Net)通过频域分析磁共振图像来解决脑肿瘤分割难题,在保持计算效率的同时显著提升了对比增强区域的分割精度。
English: The proposed Harmonized Frequency Fusion Network (HFF-Net) addresses brain tumor segmentation challenges by analyzing MRI images in the frequency domain, achieving significant improvements in accuracy for contrast-enhancing regions while maintaining computational efficiency.
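A frequency-domain decomposition in the spirit of FDD can be sketched with a radial mask in Fourier space: the low-pass part keeps smooth contours, and the residual keeps textures and directional edges. The cutoff and mask shape below are our assumptions, not the learned FDD module.

```python
import torch

def frequency_decompose(x, cutoff=0.15):
    """Split an image batch into low-frequency (smooth contours) and high-frequency
    (textures, edges) parts via a radial mask in Fourier space; the two parts sum
    back to the input exactly."""
    H, W = x.shape[-2:]
    fy = torch.fft.fftfreq(H).abs().view(-1, 1)
    fx = torch.fft.fftfreq(W).abs().view(1, -1)
    low_mask = ((fy ** 2 + fx ** 2).sqrt() <= cutoff).to(x.dtype)
    low = torch.fft.ifft2(torch.fft.fft2(x) * low_mask).real
    return low, x - low

low, high = frequency_decompose(torch.randn(1, 1, 128, 128))
```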
Authors:Gabriel Orlanski, Nicholas Roberts, Aws Albarghouthi, Frederic Sala
Abstract:
The standard paradigm for solving coding tasks via large language models (LLMs) is to generate-then-rank programs, where the latter step uses a verifier in the ranking process. The growing consensus is that a comprehensive verifier (e.g., a full test suite) should be prioritized over an outcome reward model (ORM) whenever possible, with little consideration given to the trade-offs involved. We aim to challenge this assumption by systematically exploring the tradeoff between speed and accuracy. We find that ORMs play a crucial role in scaling verification through trading accuracy for speed, even when a comprehensive verifier is available. Their value becomes especially apparent when used in a generate-prune-then-rank approach, where a faster but less accurate verifier removes incorrect solutions prior to ranking -- leading to a system that is 11.65x faster while only being 8.33% less accurate than the full test suite. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions. These findings enable the design of scalable and accurate program ranking systems.
Chinese Summary: 本研究通过系统分析速度与准确性的权衡,挑战了编程任务中优先使用全面验证器的共识,证明结果奖励模型在生成-修剪-排序方法中能实现11.65倍的速度提升且仅损失8.33%准确率,为设计可扩展的排序系统提供了新思路。
English Summary: The study challenges the preference for comprehensive verifiers in coding tasks by demonstrating that outcome reward models (ORMs) offer a valuable speed-accuracy trade-off, enabling an 11.65x faster system with only 8.33% accuracy loss when used in a generate-prune-then-rank approach.
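The generate-prune-then-rank pipeline is simple to state end to end. A sketch with `generate`, `fast_verifier`, and `orm` as stand-in callables for the sampler, the cheap-but-imperfect verifier, and the outcome reward model.

```python
def generate_prune_then_rank(prompt, generate, fast_verifier, orm, n=32, keep=8):
    """Sample many candidate programs, prune with a fast verifier (e.g., a few
    cheap tests), then rank the survivors with an outcome reward model. The ORM
    trades a little accuracy for a large speedup over running the full suite."""
    candidates = [generate(prompt) for _ in range(n)]
    survivors = [c for c in candidates if fast_verifier(c)]     # cheap pruning pass
    if not survivors:                                           # pruning can be too eager
        survivors = candidates
    ranked = sorted(survivors, key=orm, reverse=True)[:keep]    # ORM scores the rest
    return ranked[0]
```

The pruning step is what filters out the "incorrect but highly ranked" solutions the paper identifies as the failure mode of ranking alone.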
Authors:Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang
Abstract:
Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model's evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model's learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.
中文摘要:Omni-DPO是一种新颖的双视角优化框架,通过基于数据固有质量和模型学习动态的自适应加权机制改进直接偏好优化方法,在多个基准测试中实现了卓越性能。
English Summary: Omni-DPO is a novel dual-perspective optimization framework that enhances Direct Preference Optimization by adaptively weighting preference pairs based on both inherent data quality and the model's learning dynamics, achieving superior performance across various benchmarks.
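A dual-perspective weighting can be grafted onto the standard DPO objective as two multiplicative factors: one static (pair quality) and one dynamic (how well the model already separates the pair). Both factors below are our illustrative reading of the abstract, not Omni-DPO's formulas.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                      quality_w, beta=0.1):
    """DPO loss with per-pair weights: `quality_w` encodes inherent pair quality;
    a dynamics weight derived from the current margin down-weights pairs the
    model already separates well, focusing training on still-hard pairs."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    dynamics_w = (1 - torch.sigmoid(margin)).detach()
    return (quality_w * dynamics_w * (-F.logsigmoid(margin))).mean()

loss = weighted_dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                         torch.tensor([-13.0]), torch.tensor([-14.0]),
                         quality_w=torch.tensor([0.9]))
```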
Authors:Javad Rajabi, Soroush Mehraban, Seyedmorteza Sadat, Babak Taati
Abstract:
Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We further analyze the guidance term provided by TPG and show that its effect on sampling more closely resembles CFG compared to existing training-free guidance techniques. Extensive experiments on SDXL and Stable Diffusion 2.1 show that TPG achieves nearly a 2$\times$ improvement in FID for unconditional generation over the SDXL baseline, while closely matching CFG in prompt alignment. These results establish TPG as a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models. The code is available at https://github.com/TaatiTeam/Token-Perturbation-Guidance
中文摘要:无分类器引导(CFG)虽能提升扩散模型的生成质量和对齐效果,但需特定训练且仅适用于条件生成;而提出的令牌扰动引导(TPG)通过直接扰动中间令牌表示,实现了无需训练、条件无关的通用引导方法,在无条件生成中显著优化FID指标并保持与提示的高度匹配。
English Summary: Classifier-free guidance (CFG) improves diffusion models but requires specific training and is limited to conditional generation, whereas the proposed Token Perturbation Guidance (TPG) offers a training-free, condition-agnostic approach that enhances generation quality and alignment by perturbing token representations, achieving significant improvements in FID and prompt adherence.
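The two ingredients, a norm-preserving perturbation of intermediate tokens and a CFG-style combination of the resulting predictions, can be sketched directly. Where the shuffle is applied and the guidance scale are assumptions; the combination rule mirrors CFG as the abstract suggests.

```python
import torch

def shuffle_tokens(tokens):
    """Norm-preserving perturbation: permuting tokens along the sequence changes
    their arrangement but not their norms. Applied to intermediate representations
    inside the denoising network to produce a degraded prediction."""
    perm = torch.randperm(tokens.size(1), device=tokens.device)
    return tokens[:, perm, :]

def tpg_combine(eps_normal, eps_perturbed, scale=3.0):
    """CFG-style extrapolation away from the perturbed (degraded) prediction,
    analogous to extrapolating away from the unconditional branch in CFG."""
    return eps_perturbed + scale * (eps_normal - eps_perturbed)
```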
Authors:Yiming Dou, Wonseok Oh, Yuqing Luo, Antonio Loquercio, Andrew Owens
Abstract:
We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio. At test time, a user can query the model for other actions, parameterized as sequences of hand poses, to estimate their corresponding sounds. In our experiments, we find that our generated sounds accurately convey material properties and actions, and that they are often indistinguishable to human observers from real sounds. Project page: https://www.yimingdou.com/hearing_hands/
中文摘要:本研究通过训练校正流模型学习动作-声音配对数据,开发出能够根据手部运动轨迹生成逼真物体交互声音的系统,其合成效果在材质传达和听觉真实性方面接近真实录音。
English Summary: This research develops a method to generate realistic sounds of hand-object interactions in 3D scenes by training a rectified flow model on action-sound pairs, enabling users to predict audio for various hand movements with high perceptual accuracy.
Authors:Jiaxiang Tang, Ruijie Lu, Zhaoshuo Li, Zekun Hao, Xuan Li, Fangyin Wei, Shuran Song, Gang Zeng, Ming-Yu Liu, Tsung-Yi Lin
Abstract:
Recent progress in 3D object generation has greatly improved both the quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.
中文: 本文提出了一种端到端的部件级三维物体生成框架,通过双体积打包策略从单张图像生成具有语义意义且可交错组合的完整部件,在质量和多样性上超越了现有方法。
English: This paper introduces an end-to-end framework for part-level 3D object generation that produces high-quality objects with semantically meaningful, interleaved parts from a single image using a dual volume packing strategy, outperforming previous methods in quality and flexibility.
Authors:Sushant Gautam, Michael A. Riegler, Pål Halvorsen
Abstract:
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: https://github.com/Simula/Kvasir-VQA-x1 and https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1
中文: Kvasir-VQA-x1是一个大规模胃肠道内窥镜数据集,通过15.9万个问题-答案对和模拟影像伪影的视觉增强,提升了临床复杂性和视觉多样性,旨在推动临床多模态AI系统的发展。
English: Kvasir-VQA-x1 is a large-scale gastrointestinal endoscopy dataset that expands clinical complexity and visual diversity through 159,549 question-answer pairs and augmented imaging artifacts, aiming to advance multimodal AI for clinical decision support.
Authors:Benjamin Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck
Abstract:
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.
中文摘要:本研究提出了一个视觉对话数据集,要求模型识别相关视频片段并利用外部知识回答视觉信息中未包含的问题,同时通过基线评估揭示了该任务未来的挑战。
English Summary: This study introduces a dataset for visually grounded dialogue tasks requiring models to identify relevant video segments and incorporate external knowledge to answer questions not present in the visual content, with baseline evaluations highlighting future challenges.
Authors:Ziyi Wang, Yanran Zhang, Jie Zhou, Jiwen Lu
Abstract:
The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model's focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones. Code is available at https://github.com/wangzy22/UniPre3D.
中文: UniPre3D首次提出适用于任意尺度和架构点云数据的统一预训练方法,通过高斯基元预测、可微分渲染结合二维特征融合,在物体和场景级任务中均展现出普适有效性。
English: UniPre3D introduces the first unified pre-training method for point clouds of any scale and architecture, using Gaussian primitives and differentiable rendering with 2D feature integration to achieve universal effectiveness across object- and scene-level tasks.
Authors:Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen, Xi Ye
Abstract:
Recent work has identified retrieval heads, a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHead (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHead by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRetriever, an efficient and effective retriever that uses the accumulated attention mass of QRHead as retrieval scores. We use QRRetriever for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On the multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRetriever as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHead with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
Chinese: 本文提出QRHead——一种通过查询聚焦注意力评分来提升长上下文信息检索性能的改进注意力头,以及QRRetriever——利用该机制的高效检索器,在推理任务中实现显著性能提升并具备强大的零样本重排序能力。
English: This paper introduces QRHead, an enhanced attention head that improves information retrieval from long contexts by using query-focused attention scoring, and QRRetriever, an efficient retriever leveraging these heads to achieve significant performance gains in reasoning tasks and strong zero-shot re-ranking results.
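The retrieval-scoring idea, accumulating query-to-context attention mass over a selected set of heads and ranking chunks by it, can be sketched in a few lines. The head indices, shapes, and chunking below are illustrative assumptions, not the released implementation.

```python
import numpy as np

# Illustrative sketch (not the released implementation): rank context chunks
# by the attention mass that a chosen set of heads assigns them, measured
# from the query tokens.
def qr_retrieval_scores(attn, query_idx, chunk_spans, qr_heads):
    """attn: [n_heads, tgt_len, src_len] attention weights from one pass."""
    mass = attn[qr_heads][:, query_idx, :].sum(axis=(0, 1))  # [src_len]
    return [float(mass[s:e].sum()) for (s, e) in chunk_spans]

rng = np.random.default_rng(0)
attn = rng.random((16, 32, 512))
attn /= attn.sum(-1, keepdims=True)          # rows behave like softmax output
query_idx = np.arange(24, 32)                # query occupies the last 8 positions
chunks = [(i, i + 64) for i in range(0, 512, 64)]
scores = qr_retrieval_scores(attn, query_idx, chunks, qr_heads=[1, 5, 9])
print(int(np.argmax(scores)))                # index of the top-scoring chunk
```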
Authors:Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at https://github.com/THU-KEG/VerIF.
中文: 本文提出了VerIF验证方法,结合基于规则和大型语言模型的验证技术,显著提升了指令跟随的强化学习效果,在保持通用能力的同时实现了最优性能与良好泛化能力。
English: This paper introduces VerIF, a verification method combining rule-based and LLM-based approaches to enhance reinforcement learning for instruction following, achieving state-of-the-art performance and generalizability without compromising general capabilities.
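A hedged sketch of the hybrid verification signal: hard, checkable constraints are verified by code, while soft constraints are delegated to a reasoning-model judge. The `llm_judge` stub below is a placeholder for a real model call (the paper mentions QwQ-32B), and the constraint schema is invented for illustration.

```python
import re

# Hedged sketch of hybrid verification: code checks hard constraints,
# a reasoning-model judge handles soft ones. Illustrative, not VerIF itself.
def rule_checks(response, constraints):
    ok = True
    if "max_words" in constraints:
        ok &= len(response.split()) <= constraints["max_words"]
    if "must_match" in constraints:
        ok &= re.search(constraints["must_match"], response) is not None
    return ok

def llm_judge(response, instruction):
    # Placeholder: in practice this would query a large reasoning model
    # and parse a yes/no verdict from its output.
    return True

def verify(response, instruction, constraints):
    # Reward is 1 only if both verifiers accept, usable as an RLVR signal.
    return float(rule_checks(response, constraints) and llm_judge(response, instruction))

print(verify("A short answer.", "Reply in under 10 words.", {"max_words": 10}))
```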
Authors:Jianhan Qi, Yuheng Jia, Hui Liu, Junhui Hou
Abstract:
Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at https://github.com/jhqi/SSGCO-EGAEL.
中文: 本研究提出结构-光谱图卷积算子和证据引导的自适应边学习模块,通过优化光谱信息利用和图结构,显著提升了高光谱图像聚类的准确性,在多个数据集上实现了明显的精度提升。
English: This study introduces a structural-spectral graph convolutional operator and an evidence-guided adaptive edge learning module to enhance hyperspectral image clustering by better utilizing spectral information and refining graph structures, achieving significant accuracy improvements across multiple datasets.
Authors:Yan Zhang, Li Deng, Lixin Duan, Sami Azam
Abstract:
Metric learning has attracted extensive interest for its ability to provide personalized recommendations based on the importance of observed user-item interactions. Current metric learning methods aim to push negative items away from the corresponding users and positive items by an absolute geometrical distance margin. However, items may come from imbalanced categories with different intra-class variations. Thus, the absolute distance margin may not be ideal for estimating the difference between user preferences over imbalanced items. To this end, we propose a new method, named discrete scale-invariant metric learning (DSIML), by adding binary constraints to users and items, which maps users and items into binary codes of a shared Hamming subspace to speed up the online recommendation. Specifically, we firstly propose a scale-invariant margin based on angles at the negative item points in the shared Hamming subspace. Then, we derive a scale-invariant triple hinge loss based on the margin. To capture more preference difference information, we integrate a pairwise ranking loss into the scale-invariant loss in the proposed model. Due to the difficulty of directly optimizing the mixed integer optimization problem formulated with \textit{log-sum-exp} functions, we seek to optimize its variational quadratic upper bound and learn hash codes with an alternating optimization strategy. Experiments on benchmark datasets clearly show that our proposed method is superior to competitive metric learning and hashing-based baselines for recommender systems. The implementation code is available at https://github.com/AnonyFeb/dsml.
中文摘要:本文提出DSIML方法,通过二进制约束将用户和物品映射到共享汉明子空间,采用基于角度的尺度不变边界来处理类别不平衡问题,显著提升了推荐系统的性能与效率。
English Summary: The paper introduces DSIML, a discrete scale-invariant metric learning method that uses binary constraints and angle-based margins in a shared Hamming subspace to improve recommendation accuracy for imbalanced item categories while accelerating online performance.
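One plausible reading of the angle-based margin is sketched below: the margin constrains the angle at the negative item between its directions to the user and to the positive item, a quantity that is unchanged when the embedding space is rescaled. This is an illustrative interpretation, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a scale-invariant hinge loss: the margin is placed on the
# cosine of the angle at the negative item, not on an absolute distance.
def angle_margin_loss(user, pos, neg, cos_margin=0.5):
    u_dir = F.normalize(user - neg, dim=-1)
    p_dir = F.normalize(pos - neg, dim=-1)
    cos_angle = (u_dir * p_dir).sum(-1)
    # a wide angle at the negative means it sits too close to the user/positive
    # pair, so we penalize cos_angle falling below the margin
    return torch.clamp(cos_margin - cos_angle, min=0).mean()

user, pos, neg = (torch.randn(32, 16) for _ in range(3))
print(angle_margin_loss(user, pos, neg).item())
```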
Authors:Siyu Chen, Ting Han, Chengzheng Fu, Changshe Zhang, Chaolei Wang, Jinhe Su, Guorong Cai, Meiliu Wu
Abstract:
Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS). OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real-world scenarios such as autonomous driving in adverse conditions. We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time. Vireo builds upon the frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain-invariant structural features. To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Prompts, which align geometric features with language cues and progressively refine VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE) for enhancing gradient flow for faster convergence and stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH), which fuses refined structural and semantic features for robust prediction. Comprehensive evaluation on these components demonstrates the effectiveness of our designs. Our proposed Vireo achieves the state-of-the-art performance and surpasses existing methods by a large margin in both domain generalization and open-vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Code is available at https://github.com/anonymouse-9c53tp182bvz/Vireo.
中文: Vireo是一个创新的单阶段框架,首次将开放词汇语义分割与领域泛化相结合,通过冻结的视觉基础模型和深度几何特征实现了在未知类别和领域中领先的鲁棒性表现。
English: Vireo is a novel single-stage framework that unifies open-vocabulary semantic segmentation with domain generalization, leveraging frozen visual foundation models and depth-based geometry to achieve state-of-the-art robustness across unseen categories and domains.
Authors:Panagiotis Kaliosis, John Pavlopoulos
Abstract:
Handwritten text recognition aims to convert visual input into machine-readable text, and it remains challenging due to the evolving and context-dependent nature of handwriting. Character sets change over time, and character frequency distributions shift across historical periods or regions, often causing models trained on broad, heterogeneous corpora to underperform on specific subsets. To tackle this, we propose a novel loss function that incorporates the Wasserstein distance between the character frequency distribution of the predicted text and a target distribution empirically derived from training data. By penalizing divergence from expected distributions, our approach enhances both accuracy and robustness under temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that character distribution alignment can also improve existing models at inference time without requiring retraining by integrating it as a scoring function in a guided decoding scheme. Experimental results across multiple datasets and architectures confirm the effectiveness of our method in boosting generalization and performance. We open source our code at https://github.com/pkaliosis/fada.
中文摘要:本文提出了一种利用Wasserstein距离对齐字符频率分布的新型损失函数,无需重新训练模型即可提升手写文本识别的准确性和鲁棒性,有效应对时间和上下文变化。
English Summary: This paper introduces a novel loss function using Wasserstein distance to align character frequency distributions, improving handwritten text recognition accuracy and robustness against temporal and contextual shifts without requiring model retraining.
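The scoring function is easy to sketch for the 1D discrete case: compare a candidate transcription's character-frequency distribution to a target distribution via the Wasserstein distance. The fixed a-z ordering of the support below is one simple instantiation; the paper's ground metric may differ.

```python
from collections import Counter
import numpy as np

# Minimal sketch of the scoring idea: compare the character-frequency
# distribution of a candidate transcription against a target distribution
# estimated from training data (illustrative vocabulary and target).
VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

def char_dist(text):
    counts = Counter(c for c in text.lower() if c in VOCAB)
    total = sum(counts.values()) or 1
    return np.array([counts[c] / total for c in VOCAB])

def wasserstein_1d(p, q):
    # For 1D distributions on a common ordered support, W1 is the
    # L1 distance between the two CDFs.
    return np.abs(np.cumsum(p - q)).sum()

target = char_dist("the quick brown fox jumps over the lazy dog " * 50)
for candidate in ["the old mill", "zzzz qqqq xxxx"]:
    # Lower distance => candidate better matches expected character statistics;
    # in guided decoding this score would be blended with the model likelihood.
    print(candidate, round(wasserstein_1d(char_dist(candidate), target), 4))
```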
Authors:Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Abstract:
Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4\% and 8\% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30\% fewer tokens for the 32B model and 50\% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.
中文:CoRT是一种后训练框架,通过提示工程集成代码解释器,有效提升大型推理模型在数学推理中的性能,显著减少计算资源消耗并提高准确率。
English: CoRT is a post-training framework that enhances Large Reasoning Models' efficiency in mathematical reasoning by integrating Code Interpreters through Hint-Engineering, achieving significant performance improvements and token reduction.
Authors:Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Abstract:
AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97\% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
中文:ComfyUI-R1是首个通过两阶段训练框架实现自动化工作流生成的大型推理模型,在格式有效性和结构准确性方面显著超越了当前最先进的闭源模型。
English: ComfyUI-R1 is the first large reasoning model that automates AI workflow generation through a two-stage training framework, achieving superior performance in format validity and structural accuracy compared to leading closed-source models.
Authors:Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Guilin Li, Bo Wang, Linghe Kong, Lichao Sun, Weiran Huang
Abstract:
Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations: distractor concatenation, dominance-preserving mixup, and random rotation, that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, we achieve competitive performance among open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual perturbation. Through comprehensive ablation studies, we analyze the effectiveness of different perturbation strategies, revealing that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at https://github.com/YutingLi0606/Vision-Matters.
中文摘要:当前多模态大语言模型在推理过程中常未能有效整合视觉信息,为此我们提出了一个简单的视觉扰动框架,无需算法修改即可增强感知鲁棒性,显著提升数学推理性能。
English Summary: Current multimodal large language models often fail to effectively integrate visual information during reasoning, prompting the development of a simple visual perturbation framework that enhances perceptual robustness and improves mathematical reasoning performance without requiring algorithmic modifications.
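The three perturbations are simple image-space operations; a dependency-light sketch on HWC uint8 arrays follows (parameter choices such as the mixup weight are illustrative, not the paper's settings).

```python
import numpy as np

# Hedged sketch of the three perturbations on HWC uint8 images.
def distractor_concat(img, distractor):
    # Paste an unrelated image next to the original along the width axis.
    return np.concatenate([img, distractor], axis=1)

def dominance_preserving_mixup(img, other, alpha=0.8):
    # Blend with a weight > 0.5 so the original image stays dominant.
    return (alpha * img + (1 - alpha) * other).astype(np.uint8)

def random_rotation(img, rng):
    # Right-angle rotation keeps the array rectangular and dependency-free.
    return np.rot90(img, k=rng.integers(1, 4), axes=(0, 1))

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)
other = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)
print(distractor_concat(img, other).shape,
      dominance_preserving_mixup(img, other).shape,
      random_rotation(img, rng).shape)
```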
Authors:Ye Zhang, Yu Zhou, Yifeng Wang, Jun Xiao, Ziyue Wang, Yongbing Zhang, Jianxu Chen
Abstract:
Cell instance segmentation is critical to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection-based, contour-based, and distance mapping-based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. In this paper, we propose a novel cell instance segmentation method inspired by the four-color theorem. By conceptualizing cells as countries and tissues as oceans, we introduce a four-color encoding scheme that ensures adjacent instances receive distinct labels. This reformulation transforms instance segmentation into a constrained semantic segmentation problem with only four predicted classes, substantially simplifying the instance differentiation process. To solve the training instability caused by the non-uniqueness of four-color encoding, we design an asymptotic training strategy and encoding transformation method. Extensive experiments on various modes demonstrate our approach achieves state-of-the-art performance. The code is available at https://github.com/zhangye-zoe/FCIS.
中文: 本文提出了一种基于四色定理的新型细胞实例分割方法,将任务转化为简化的四类语义分割问题,并通过渐进式训练策略实现了最先进的性能。
English: This paper introduces a novel cell instance segmentation method based on the four-color theorem, transforming the task into a simplified four-class semantic segmentation problem and achieving state-of-the-art performance through an asymptotic training strategy.
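The encoding itself can be illustrated with a greedy graph coloring over instance adjacency (label 0 is the background "ocean"). Greedy coloring is a simplification of the paper's scheme and may occasionally need a fallback on dense adjacency graphs.

```python
import numpy as np

# Illustrative sketch of the four-color encoding: build an adjacency graph
# over instance labels and greedily assign one of four colors so that
# touching instances always receive distinct labels.
def four_color_encode(inst):
    labels = [int(l) for l in np.unique(inst) if l != 0]
    adj = {l: set() for l in labels}
    # 4-neighbour adjacency between different nonzero instances
    for a, b in ((inst[:-1, :], inst[1:, :]), (inst[:, :-1], inst[:, 1:])):
        mask = (a != b) & (a != 0) & (b != 0)
        for x, y in zip(a[mask], b[mask]):
            adj[int(x)].add(int(y)); adj[int(y)].add(int(x))
    color = {}
    for l in labels:
        used = {color[n] for n in adj[l] if n in color}
        color[l] = min(c for c in range(1, 5) if c not in used)
    out = np.zeros_like(inst)
    for l, c in color.items():
        out[inst == l] = c
    return out   # instance map re-encoded as a 4-class semantic map

inst = np.array([[1, 1, 2], [1, 3, 2], [3, 3, 2]])
print(four_color_encode(inst))
```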
Authors:Xulin Ma, Jiankai Tang, Zhang Jiang, Songqin Cheng, Yuanchun Shi, Dong LI, Xin Liu, Daniel McDuff, Xiaojing Liu, Yuntao Wang
Abstract:
Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.
中文摘要:LADH数据集通过融合RGB与红外面部视频,结合多任务学习,在长期高原日常护理场景中提升了非接触式生理监测的鲁棒性,实现了心率估计4.99 BPM的平均绝对误差。
English Summary: The LADH dataset, featuring synchronized RGB and infrared facial videos from daily care scenarios, enhances remote physiological monitoring by combining multi-modal inputs and multi-task learning to achieve robust heart rate estimation with a 4.99 BPM MAE.
Authors:Changwei Wu, Yifei Chen, Yuxin Du, Jinying Zong, Jie Dong, Mingxuan Liu, Yong Peng, Jin Fan, Feiwei Qin, Changmiao Wang
Abstract:
Early diagnosis of Alzheimer's Disease (AD), especially at the mild cognitive impairment (MCI) stage, is vital yet hindered by subjective assessments and the high cost of multimodal imaging modalities. Although deep learning methods offer automated alternatives, their energy inefficiency and computational demands limit real-world deployment, particularly in resource-constrained settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are inherently well-suited for modeling the sparse, event-driven patterns of neural degeneration in AD, offering a promising foundation for interpretable and low-power medical diagnostics. However, existing SNNs often suffer from weak expressiveness and unstable training, which restrict their effectiveness in complex medical tasks. To address these limitations, we propose FasterSNN, a hybrid neural architecture that integrates biologically inspired LIF neurons with region-adaptive convolution and multi-scale spiking attention. This design enables sparse, efficient processing of 3D MRI while preserving diagnostic accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its potential for practical AD screening. Our source code is available at https://github.com/wuchangw/FasterSNN.
中文:提出的FasterSNN架构整合了LIF神经元、区域自适应卷积和多尺度脉冲注意力机制,能够利用3D核磁共振数据实现高效稳定的阿尔茨海默症筛查,在保持诊断准确性的同时显著提升了计算效率。
English: The proposed FasterSNN architecture combines LIF neurons with region-adaptive convolution and multi-scale attention to enable efficient and stable Alzheimer's Disease screening using 3D MRI data, achieving competitive performance with improved computational efficiency.
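For reference, the leaky integrate-and-fire (LIF) dynamics named in the abstract reduce to a few lines; the constants here are illustrative.

```python
import numpy as np

# Minimal leaky integrate-and-fire (LIF) neuron: the membrane potential
# leaks toward the input current and emits a spike on crossing threshold.
def lif_simulate(current, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    v, spikes = 0.0, []
    for i in current:
        v += dt / tau * (-v + i)       # leaky integration toward the input
        fired = v >= v_th
        spikes.append(int(fired))
        if fired:
            v = v_reset                # hard reset after a spike
    return np.array(spikes)

rng = np.random.default_rng(0)
print(lif_simulate(rng.uniform(0.0, 2.5, size=50)).sum(), "spikes in 50 steps")
```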
Authors:Haoyi Song, Ruihan Ji, Naichen Shi, Fan Lai, Raed Al Kontar
Abstract:
Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic foundation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/Uncertainty-Quantification-for-LLMs.
中文摘要:本文通过将输入-输出对建模为双马尔可夫链,提出了基于逆模型的概率框架和Inv-熵不确定性度量方法,实验证明其在语义不确定性量化方面优于现有方法。
English Summary: This paper introduces a probabilistic framework for uncertainty quantification in large language models by modeling input-output pairs as dual Markov chains and proposing Inv-Entropy as a novel uncertainty measure, with experiments showing its superiority over existing methods.
Authors:Maik Dannecker, Vasiliki Sideri-Lampretsa, Sophie Starck, Angeline Mihailov, Mathieu Milh, Nadine Girard, Guillaume Auzias, Daniel Rueckert
Abstract:
Magnetic resonance imaging of fetal and neonatal brains reveals rapid neurodevelopment marked by substantial anatomical changes unfolding within days. Studying this critical stage of the developing human brain, therefore, requires accurate brain models, referred to as atlases, of high spatial and temporal resolution. To meet these demands, established traditional atlases and recently proposed deep learning-based methods rely on large and comprehensive datasets. This poses a major challenge for studying brains in the presence of pathologies for which data remains scarce. We address this limitation with CINeMA (Conditional Implicit Neural Multi-Modal Atlas), a novel framework for creating high-resolution, spatio-temporal, multimodal brain atlases, suitable for low-data settings. Unlike established methods, CINeMA operates in latent space, avoiding compute-intensive image registration and reducing atlas construction times from days to minutes. Furthermore, it enables flexible conditioning on anatomical features including GA, birth age, and pathologies like ventriculomegaly (VM) and agenesis of the corpus callosum (ACC). CINeMA supports downstream tasks such as tissue segmentation and age prediction whereas its generative properties enable synthetic data creation and anatomically informed data augmentation. Surpassing state-of-the-art methods in accuracy, efficiency, and versatility, CINeMA represents a powerful tool for advancing brain research. We release the code and atlases at https://github.com/m-dannecker/CINeMA.
Chinese: CINeMA是一种新型框架,通过在潜在空间操作,能在低数据环境下快速构建高分辨率多模态脑图谱,支持对解剖特征的灵活条件设置,并实现分割和数据增强等下游任务。
English: CINeMA is a novel framework that rapidly constructs high-resolution, multimodal brain atlases in low-data settings by operating in latent space, enabling flexible conditioning on anatomical features and supporting downstream tasks like segmentation and data augmentation.
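The central idea, an atlas that is queried rather than stored as a grid, can be sketched as a conditional implicit network: an MLP maps a spatial coordinate plus condition variables (e.g., gestational age, pathology flags) to an intensity. The architecture and conditioning below are illustrative assumptions, not the released model.

```python
import torch
import torch.nn as nn

# Hedged sketch of a conditional implicit neural atlas: intensities are
# produced on demand from (coordinate, condition) pairs.
class ConditionalINR(nn.Module):
    def __init__(self, cond_dim=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz, cond):
        return self.net(torch.cat([xyz, cond], dim=-1))

atlas = ConditionalINR()
xyz = torch.rand(1024, 3) * 2 - 1                        # normalized coordinates
cond = torch.tensor([[30.0, 0.0, 1.0]]).expand(1024, 3)  # e.g., GA=30w, no VM, ACC
print(atlas(xyz, cond).shape)   # per-point intensity predictions
```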
Authors:Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen
Abstract:
Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions across 33 hours of video data, together with textual descriptions for this new task. Benchmarking existing action segmentation methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The dataset and code are available at https://github.com/KPeng9510/HopaDIFF.
中文: 本文开创性地提出文本参照引导的多人物动作分割方法,构建了首个RHAS133数据集,并设计了HopaDIFF框架——通过交叉输入门控注意力机制和傅里叶条件调控实现整体-局部感知的扩散模型,在多样化评估设置中取得了最优性能。
English: This paper introduces a novel textual reference-guided approach for multi-person action segmentation, presenting the RHAS133 dataset and proposing HopaDIFF—a holistic-partial aware diffusion framework that achieves state-of-the-art performance by enhancing long-range reasoning and fine-grained control through cross-input gate attention and Fourier conditioning.
Authors:Tianjun Yao, Haoxuan Li, Zhiqiang Shen, Pan Li, Tongliang Liu, Kun Zhang
Abstract:
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports the downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by $2.66\%-20.34\%$, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Codes are available at: https://github.com/tianyao-aka/RAPL.
中文: 大语言模型因知识过时和幻觉问题影响可靠性,而RAPL框架通过创新的图检索方法增强结构化推理和泛化能力,有效提升了知识图谱问答的性能。
English: Large Language Models face reliability issues due to outdated knowledge and hallucinations, which the proposed RAPL framework addresses through a novel graph retrieval approach that enhances structured reasoning and generalizability in knowledge graph question answering.
Authors:Yanzhao Shi, Xiaodan Zhang, Junzhong Ji, Haoning Jiang, Chengxin Zheng, Yinong Wang, Liangqiong Qu
Abstract:
Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues by effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage alignment with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient multimodal projector that condenses high-resolution 3D spatial regions into a compact set of informative visual tokens via centroid-based compression. By assigning spatial packers with dual-3D vision encoders, HSENet can seamlessly perceive and transfer hybrid visual representations to LLM's semantic space, facilitating accurate diagnostic text generation. Experimental results demonstrate that our method achieves state-of-the-art performance in 3D language-visual retrieval (39.85% of R@100, +5.96% gain), 3D medical report generation (24.01% of BLEU-4, +8.01% gain), and 3D visual question answering (73.60% of Major Class Accuracy, +1.99% gain), confirming its effectiveness. Our code is available at https://github.com/YanzhaoShi/HSENet.
中文: HSENet提出了一种混合空间编码框架,通过双3D视觉编码器和基于质心的压缩技术提升对三维CT影像的多模态理解能力,在医学视觉语言任务中实现了最优性能。
English: HSENet introduces a hybrid spatial encoding framework that leverages dual-3D vision encoders and centroid-based compression to enhance multimodal understanding of 3D CT scans, achieving state-of-the-art performance in medical vision-language tasks.
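The Spatial Packer's centroid-based compression can be approximated by clustering token features and keeping the cluster means as the compact token set; the short k-means loop below is a stand-in sketch, not the released module.

```python
import numpy as np

# Hedged sketch of centroid-based token packing: many spatial tokens are
# compressed into a few by clustering features and keeping cluster means.
def pack_tokens(tokens, k=32, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = tokens[rng.choice(len(tokens), k, replace=False)]
    for _ in range(iters):
        d = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members):
                centroids[j] = members.mean(0)
    return centroids  # [k, dim]: the compact set of visual tokens

tokens = np.random.default_rng(1).normal(size=(1024, 32)).astype(np.float32)
print(pack_tokens(tokens).shape)   # (32, 32): 32x fewer tokens
```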
Authors:Giacomo Rosin, Muhammad Rameez Ur Rahman, Sebastiano Vascon
Abstract:
Human trajectory forecasting is crucial in applications such as autonomous driving, robotics and surveillance. Accurate forecasting requires models to consider various factors, including social interactions, multi-modal predictions, pedestrian intention and environmental context. While existing methods account for these factors, they often overlook the impact of the environment, which leads to collisions with obstacles. This paper introduces ECAM (Environmental Collision Avoidance Module), a contrastive learning-based module that enhances collision avoidance with the environment. The proposed module can be integrated into existing trajectory forecasting models, improving their ability to generate collision-free predictions. We evaluate our method on the ETH/UCY dataset and quantitatively and qualitatively demonstrate its collision avoidance capabilities. Our experiments show that state-of-the-art methods significantly reduce the collision rate (by 40-50%) when integrated with the proposed module. The code is available at https://github.com/CVML-CFU/ECAM.
Chinese Summary: 本文提出的ECAM模块通过对比学习增强轨迹预测模型的环境避障能力,在ETH/UCY数据集上实现碰撞率降低40-50%。
English Summary: This paper introduces ECAM, a contrastive learning module that enhances trajectory forecasting models' environmental collision avoidance, reducing collision rates by 40-50% on the ETH/UCY dataset.
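A collision-rate evaluation of the kind reported above is straightforward to sketch against a binary occupancy map; the grid resolution and data below are illustrative.

```python
import numpy as np

# Simple environment-collision check: count predicted trajectories that ever
# step into an obstacle cell of a binary occupancy map.
def collision_rate(trajs, occupancy, resolution=0.1):
    hits = 0
    for traj in trajs:                              # traj: [T, 2] world coords
        ij = np.floor(traj / resolution).astype(int)
        ij = np.clip(ij, 0, np.array(occupancy.shape) - 1)
        hits += occupancy[ij[:, 0], ij[:, 1]].any()
    return hits / len(trajs)

occ = np.zeros((100, 100), dtype=bool); occ[40:60, 40:60] = True  # one obstacle
rng = np.random.default_rng(0)
trajs = rng.uniform(0, 10, size=(50, 12, 2))        # 50 predictions, 12 steps
print(f"collision rate: {collision_rate(trajs, occ):.2f}")
```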
Authors:Lipei Xie, Yingxin Li, Huiping Zhuang
Abstract:
Embodied foundation models are crucial for enabling Artificial Intelligence (AI) to interact with the physical world; they integrate multi-modal inputs, such as proprioception, vision and language, to understand human intentions and generate actions to control robots. While these models demonstrate strong generalization and few-shot learning capabilities, they face significant challenges in continually acquiring new skills without forgetting previously learned skills, a problem known as catastrophic forgetting. To address this issue, we propose the Analytic Task Scheduler (ATS), a novel framework for continual learning in embodied foundation models. ATS consists of a task-specific model library, where each model is fine-tuned independently on a single task, and an analytic scheduler trained using recursive least squares (RLS) to learn the mapping between language instructions and task-specific models. This architecture enables accurate task recognition and dynamic model selection while fundamentally avoiding parameter interference across tasks. The scheduler updates its parameters incrementally using only statistics (autocorrelation and cross-correlation matrices), enabling forgetting-resistant learning without the need to revisit historical data. We validate ATS on a real-world robot platform (RM65B), demonstrating superior resistance to forgetting and strong adaptability to task variations. The results highlight ATS as an effective, scalable, and deployable solution for continual learning in embodied foundation models operating in complex, dynamic environments. Our code will be available at https://github.com/MIAA-Embodied-AI/AnalyticTaskScheduler
Chinese: 分析任务调度器(ATS)框架通过任务特定模型库和分析调度器,根据语言指令动态选择模型,解决了具身基础模型中的灾难性遗忘问题,实现了无需历史数据的持续学习,并在真实机器人平台上展现出优异性能。
English: The Analytic Task Scheduler (ATS) framework addresses catastrophic forgetting in embodied foundation models by using a task-specific model library and an analytic scheduler to dynamically select models based on language instructions, enabling continual learning without revisiting historical data and demonstrating strong performance on a real robot platform.
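The scheduler's statistics-only update is the interesting part: a linear map from instruction embeddings to task labels is maintained in closed form from autocorrelation and cross-correlation matrices, so new tasks arrive without replaying old data. A hedged sketch follows (a ridge-regularized, batch-incremental form of the RLS idea; all names are illustrative):

```python
import numpy as np

# Hedged sketch of an analytic scheduler: the least-squares map from
# instruction embeddings x to task one-hot targets y is recovered from
# accumulated statistics, so updates never revisit historical data.
class AnalyticScheduler:
    def __init__(self, dim, n_tasks, ridge=1e-3):
        self.R = ridge * np.eye(dim)        # autocorrelation (regularized)
        self.P = np.zeros((dim, n_tasks))   # cross-correlation

    def update(self, X, Y):
        # X: [n, dim] embeddings; Y: [n, n_tasks] one-hot task labels
        self.R += X.T @ X
        self.P += X.T @ Y

    def predict(self, x):
        W = np.linalg.solve(self.R, self.P)  # closed-form least-squares weights
        return int(np.argmax(x @ W))         # index of the selected task model

rng = np.random.default_rng(0)
sched = AnalyticScheduler(dim=16, n_tasks=3)
for task in range(3):                        # tasks arrive strictly in sequence
    X = rng.normal(task, 1.0, size=(64, 16))
    Y = np.eye(3)[np.full(64, task)]
    sched.update(X, Y)
print(sched.predict(rng.normal(2, 1.0, size=16)))  # likely routes to task 2
```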
Authors:Mingxiao Li, Mang Ning, Marie-Francine Moens
Abstract:
Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this by either fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach proposes a zigzag sampling mechanism that alternates between asymmetric prompts to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion.
中文: 本文提出了一种无需训练、结合非对称提示和视觉共享的Z字形采样方法,显著提升了视觉故事生成中的主题一致性,效果优于现有技术。
English: This paper introduces a training-free Zigzag Sampling method with asymmetric prompts and visual sharing to significantly improve subject consistency in visual story generation, outperforming previous approaches.
Authors:Songze Li, Mingxuan Zhang, Kang Wei, Shouling Ji
Abstract:
Deep reinforcement learning (DRL) has achieved remarkable success in a wide range of sequential decision-making domains, including robotics, healthcare, smart grids, and finance. Recent research demonstrates that attackers can efficiently exploit system vulnerabilities during the training phase to execute backdoor attacks, producing malicious actions when specific trigger patterns are present in the state observations. However, most existing backdoor attacks rely primarily on simplistic and heuristic trigger configurations, overlooking the potential efficacy of trigger optimization. To address this gap, we introduce TooBadRL (Trigger Optimization to Boost Effectiveness of Backdoor Attacks on DRL), the first framework to systematically optimize DRL backdoor triggers along three critical axes, i.e., temporal, spatial, and magnitude. Specifically, we first introduce a performance-aware adaptive freezing mechanism for injection timing. Then, we formulate dimension selection as a cooperative game, utilizing Shapley value analysis to identify the most influential state variable for the injection dimension. Furthermore, we propose a gradient-based adversarial procedure to optimize the injection magnitude under environment constraints. Evaluations on three mainstream DRL algorithms and nine benchmark tasks show that TooBadRL significantly improves attack success rates, while ensuring minimal degradation of normal task performance. These results highlight the previously underappreciated importance of principled trigger optimization in DRL backdoor attacks. The source code of TooBadRL can be found at https://github.com/S3IC-Lab/TooBadRL.
Chinese: 深度强化学习面临优化后门攻击的严重威胁,TooBadRL框架通过系统优化触发器的时间、空间和强度维度,显著提升攻击成功率,同时确保正常任务性能影响最小。
English: Deep reinforcement learning faces significant threats from optimized backdoor attacks, as demonstrated by the TooBadRL framework, which systematically enhances trigger effectiveness across temporal, spatial, and magnitude dimensions to achieve high attack success with minimal performance impact.
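The dimension-selection step can be illustrated with a standard Monte Carlo Shapley estimator: each state dimension's value is its average marginal contribution to a scoring function over random coalitions. The toy objective below stands in for the attacker's objective.

```python
import numpy as np

# Monte Carlo Shapley estimation over state dimensions (illustrative):
# phi[i] approximates dimension i's average marginal contribution to f.
def shapley_values(f, state, baseline, n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    d = len(state)
    phi = np.zeros(d)
    for _ in range(n_samples):
        perm = rng.permutation(d)
        x = baseline.copy()
        prev = f(x)
        for i in perm:
            x[i] = state[i]              # add dimension i to the coalition
            cur = f(x)
            phi[i] += cur - prev         # marginal contribution
            prev = cur
    return phi / n_samples

f = lambda s: 3.0 * s[0] + 0.5 * s[2]    # toy objective: dim 0 dominates
state, baseline = np.ones(4), np.zeros(4)
print(shapley_values(f, state, baseline).round(3))  # ~[3.0, 0.0, 0.5, 0.0]
```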
Authors:Ligao Deng, Yupeng Deng, Yu Meng, Jingbo Chen, Zhihao Xi, Diyou Liu, Qifeng Chu
Abstract:
Road networks are crucial for mapping, autonomous driving, and disaster response. While manual annotation is costly, deep learning offers efficient extraction. Current methods fall into three categories: postprocessing-based (prone to errors), globally parallel (fast but liable to miss nodes), and locally iterative (accurate but slow). We propose GLD-Road, a two-stage model combining global efficiency and local precision. First, it detects road nodes and connects them via a Connect Module. Then, it iteratively refines broken roads using local searches, drastically reducing computation. Experiments show GLD-Road outperforms state-of-the-art methods, improving APLS by 1.9% (City-Scale) and 0.67% (SpaceNet3). It also reduces retrieval time by 40% vs. Sat2Graph (global) and 92% vs. RNGDet++ (local). The experimental results are available at https://github.com/ucas-dlg/GLD-Road.
中文: GLD-Road是一个两阶段模型,通过全局检测节点并连接,再局部迭代修复断裂道路,高效提取道路网络,在精度和计算效率上均显著优于现有方法。
English: GLD-Road is a two-stage model that efficiently extracts road networks by globally detecting nodes and connecting them, then locally refining broken roads, achieving superior accuracy and significantly reducing computation time compared to existing methods.
Authors:Beomsik Cho, Jaehyung Kim
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to successfully utilize visual information, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs by up to $2\times$.
中文: ReVisiT是一种新颖的解码方法,通过在文本生成过程中动态参考视觉标记来增强大型视觉语言模型的视觉基础能力,无需额外训练即可显著减少幻觉并降低计算成本。
English: ReVisiT is a novel decoding method that enhances visual grounding in Large Vision-Language Models by dynamically referencing vision tokens during text generation, significantly reducing hallucinations and computational costs without requiring additional training.
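The decoding step admits a compact sketch: project vision-token hidden states through the LM head, pick the vision token whose induced distribution is closest in KL to the current text distribution, and nudge the output logits toward it. The shapes and the mixing rule below are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of vision-token-referenced decoding: select the vision token
# whose induced vocabulary distribution best matches the text distribution,
# then refine the next-token logits toward it.
def revisit_step(text_logits, vision_hidden, lm_head, alpha=0.3):
    vision_logits = lm_head(vision_hidden)                 # [n_vis, vocab]
    p_text = F.log_softmax(text_logits, dim=-1)
    p_vis = F.log_softmax(vision_logits, dim=-1)
    # KL(p_vis || p_text) per vision token; smallest = most relevant
    kl = (p_vis.exp() * (p_vis - p_text)).sum(-1)
    best = kl.argmin()
    return (1 - alpha) * text_logits + alpha * vision_logits[best]

vocab, dim, n_vis = 32000, 512, 16
lm_head = torch.nn.Linear(dim, vocab, bias=False)
refined = revisit_step(torch.randn(vocab), torch.randn(n_vis, dim), lm_head)
print(refined.shape)   # refined next-token logits
```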
Authors:Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Deli Zhao, Wenbing Huang, Tingyang Xu, Qifeng Bai, Yu Rong
Abstract:
Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. When scaled to ReasonMed-14B, it remains highly competitive, underscoring consistent scaling potential. The codes and datasets are available at https://github.com/YuSun-Work/ReasonMed.
中文: ReasonMed作为迄今最大的医学推理数据集,通过多智能体生成与验证流程构建了37万高质量样本,其训练的模型在医学问答中表现卓越,如ReasonMed-7B在PubMedQA上较优模型提升超4.6%。
English: ReasonMed, the largest medical reasoning dataset with 370k high-quality examples, enhances LLM training by integrating detailed reasoning with concise summaries, enabling models like ReasonMed-7B to surpass previous benchmarks by over 4% on medical QA tasks.
Authors:Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu
Abstract:
Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration such as evaluation batch size, GPU count, and GPU version can introduce significant difference in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
中文: 大语言模型在不同硬件配置下因浮点运算的非结合性导致结果可复现性脆弱,推理模型准确率波动高达9%,为此开发了LayerCast轻量推理框架以平衡计算稳定性与内存效率。
English: Large Language Models exhibit fragile reproducibility due to floating-point arithmetic variations under different hardware configurations, with reasoning models showing up to 9% accuracy fluctuations, prompting the development of LayerCast for stable inference.
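The LayerCast recipe as described, 16-bit storage with FP32 compute, can be demonstrated with a small wrapper layer; this is an illustrative sketch, not the released pipeline.

```python
import torch
import torch.nn as nn

# Hedged sketch of the LayerCast idea: weights live in bfloat16 to save
# memory but are upcast to FP32 at compute time, shrinking the effect of
# non-associative low-precision accumulation on reproducibility.
class CastLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        # stored in bfloat16 (half the memory of FP32)
        self.weight = nn.Parameter(torch.randn(out_f, in_f, dtype=torch.bfloat16))

    def forward(self, x):
        # compute fully in FP32 for numerical stability
        return x.float() @ self.weight.float().t()

layer = CastLinear(1024, 1024)
x = torch.randn(4, 1024, dtype=torch.bfloat16)
print(layer(x).dtype)   # torch.float32
```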
Authors:Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon
Abstract:
This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we originally proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including MSD + MED, MSD + MRD, and MPD + MED + MRD, using objective metrics (FAD, SSIM, PLCC, MCD) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
中文: 本文介绍了BemaGANv2这一改进的GAN声码器,其生成器采用带Snake激活函数的AMP模块,判别器结合新型MED与MRD技术,能实现高质量音频生成,并通过完整评估和开源代码确保可复现性。
English: This paper introduces BemaGANv2, an enhanced GAN vocoder featuring AMP modules with Snake activation in the generator and a novel MED discriminator combined with MRD for superior high-fidelity audio generation, supported by comprehensive evaluations and open-source implementation.
Authors:Tianxiang Hao, Lixian Zhang, Yingjia Zhang, Mengxuan Chen, Jinxiao Zhang, Haohuan Fu
Abstract:
Historical satellite imagery, such as mid-20$^{th}$ century Keyhole data, offers rare insights into understanding early urban development and long-term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and the absence of annotations have long hindered semantic segmentation on such historical RS imagery. To bridge this gap and enhance understanding of urban development, we introduce $\textbf{Urban1960SatBench}$, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing segmentation datasets, along with a benchmark framework for unsupervised segmentation tasks, $\textbf{Urban1960SatUSM}$. First, $\textbf{Urban1960SatBench}$ serves as a novel, expertly annotated semantic segmentation dataset built on mid-20$^{th}$ century Keyhole imagery, covering 1,240 km$^2$ and key urban classes (buildings, roads, farmland, water). As the earliest segmentation dataset of its kind, it provides a pioneering benchmark for historical urban understanding. Second, $\textbf{Urban1960SatUSM}$ (Unsupervised Segmentation Model) is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and focal-confidence loss based on a self-supervised learning architecture, which generates robust pseudo-labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Experiments show Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on Urban1960SatSeg for segmenting historical urban scenes, showing promise in paving the way for quantitative studies of long-term urban change using modern computer vision. Our benchmark and supplementary material are available at https://github.com/Tianxiang-Hao/Urban1960SatSeg.
中文:该研究提出了Urban1960SatBench——目前最早的带标注历史卫星图像分割数据集,以及Urban1960SatUSM无监督框架,通过置信度感知机制有效提升了退化城市场景的分割效果。
English: The study introduces Urban1960SatBench, the earliest annotated historical satellite imagery segmentation dataset, and Urban1960SatUSM, an unsupervised framework that enhances segmentation of degraded urban scenes through confidence-aware mechanisms.
Authors:Amirreza Khoshbakht, Erchan Aptoula
Abstract:
Open-set domain generalization (OSDG) for hyperspectral image classification presents significant challenges due to the presence of unknown classes in target domains and the need for models to generalize across multiple unseen domains without target-specific adaptation. Existing domain adaptation methods assume access to target domain data during training and fail to address the fundamental issue of domain shift when unknown classes are present, leading to negative transfer and reduced classification performance. To address these limitations, we propose a novel open-set domain generalization framework that combines four key components: Spectrum-Invariant Frequency Disentanglement (SIFD) for domain-agnostic feature extraction, Dual-Channel Residual Network (DCRN) for robust spectral-spatial feature learning, Evidential Deep Learning (EDL) for uncertainty quantification, and Spectral-Spatial Uncertainty Disentanglement (SSUD) for reliable open-set classification. The SIFD module extracts domain-invariant spectral features in the frequency domain through attention-weighted frequency analysis and domain-agnostic regularization, while DCRN captures complementary spectral and spatial information via parallel pathways with adaptive fusion. EDL provides principled uncertainty estimation using Dirichlet distributions, enabling the SSUD module to make reliable open-set decisions through uncertainty-aware pathway weighting and adaptive rejection thresholding. Experimental results on three cross-scene hyperspectral classification tasks show that our approach achieves performance comparable to state-of-the-art domain adaptation methods while requiring no access to the target domain during training. The implementation will be made available at https://github.com/amir-khb/SSUDOSDG upon acceptance.
中文摘要:本文提出了一种新的高光谱图像开集域泛化框架,通过结合频谱不变频率解耦、双通道残差网络、证据深度学习和谱空不确定性解耦四大核心组件,在无需目标域训练数据的情况下实现了优异的跨域分类性能。
English Summary: This paper introduces a novel open-set domain generalization framework for hyperspectral image classification, integrating frequency disentanglement, dual-channel networks, evidential learning, and uncertainty disentanglement to achieve robust cross-domain performance without requiring target domain data during training.
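The EDL component has a standard closed form for Dirichlet-based uncertainty. Below is a minimal sketch of that standard formulation; the paper's exact evidence head and rejection thresholds may differ.

```python
import torch
import torch.nn.functional as F

def edl_uncertainty(logits):
    """Standard evidential deep learning quantities:
    evidence e >= 0, Dirichlet concentration alpha = e + 1, S = sum(alpha);
    expected class probability p_k = alpha_k / S, uncertainty mass u = K / S."""
    evidence = F.softplus(logits)            # non-negative evidence
    alpha = evidence + 1.0                   # Dirichlet parameters
    S = alpha.sum(dim=-1, keepdim=True)
    prob = alpha / S                         # expected probabilities
    K = logits.shape[-1]
    u = K / S.squeeze(-1)                    # total uncertainty in (0, 1]
    return prob, u

logits = torch.randn(4, 6)                   # e.g. 6 known land-cover classes
prob, u = edl_uncertainty(logits)
reject = u > 0.7                             # assumed threshold: flag as unknown class
```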
Authors:Prameshwar Thiyagarajan, Vaishnavi Parimi, Shamant Sai, Soumil Garg, Zhangir Meirbek, Nitin Yarlagadda, Kevin Zhu, Chris Kim
Abstract:
Theory of Mind (ToM), the ability to understand the mental states of oneself and others, remains a challenging area for large language models (LLMs), which often fail to predict human mental states accurately. In this paper, we introduce UniToMBench, a unified benchmark that combines the strengths of SimToM and TOMBENCH to systematically improve and assess ToM capabilities in LLMs by integrating multi-interaction task designs and evolving story scenarios. Supported by a custom dataset of over 1,000 hand-written scenarios, UniToMBench combines perspective-taking techniques with diverse evaluation metrics to better stimulate social cognition in LLMs. Through evaluation, we observe that while models like GPT-4o and GPT-4o Mini show consistently high accuracy in tasks involving emotional and belief-related scenarios, with results usually above 80%, there is significant variability in their performance across knowledge-based tasks. These results highlight both the strengths and limitations of current LLMs in ToM-related tasks, underscoring the value of UniToMBench as a comprehensive tool for future development. Our code is publicly available here: https://github.com/Shamant/unifiedtombenchmark.
Chinese: UniToMBench作为一个统一基准,通过整合多交互任务和动态情景来提升和评估大语言模型的心理理论能力,结果表明尽管GPT-4o等模型在情感和信念任务中表现出色,但在知识型任务中的表现存在显著差异。
English: UniToMBench is a unified benchmark designed to enhance and evaluate Theory of Mind capabilities in large language models by integrating multi-interaction tasks and evolving scenarios, revealing that while models like GPT-4o excel in emotional and belief-based tasks, their performance varies significantly in knowledge-based contexts.
Authors:Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, Shouling Ji
Abstract:
Large Language Models (LLMs) have demonstrated remarkable intelligence across various tasks, which has inspired the development and widespread adoption of LLM-as-a-Judge systems for automated model testing, such as red teaming and benchmarking. However, these systems are susceptible to adversarial attacks that can manipulate evaluation outcomes, raising concerns about their robustness and, consequently, their trustworthiness. Existing evaluation methods adopted by LLM-based judges are often piecemeal and lack a unified framework for comprehensive assessment. Furthermore, prompt template and model selections for improving judge robustness have been rarely explored, and their performance in real-world settings remains largely unverified. To address these gaps, we introduce RobustJudge, a fully automated and scalable framework designed to systematically evaluate the robustness of LLM-as-a-Judge systems. RobustJudge investigates the impact of attack methods and defense strategies (RQ1), explores the influence of prompt template and model selection (RQ2), and assesses the robustness of real-world LLM-as-a-Judge applications (RQ3). Our main findings are: (1) LLM-as-a-Judge systems are still vulnerable to a range of adversarial attacks, including Combined Attack and PAIR, while defense mechanisms such as Re-tokenization and LLM-based Detectors offer improved protection; (2) Robustness is highly sensitive to the choice of prompt template and judge models. Our proposed prompt template optimization method can improve robustness, and JudgeLM-13B demonstrates strong performance as a robust open-source judge; (3) Applying RobustJudge to Alibaba's PAI platform reveals previously unreported vulnerabilities. The source code of RobustJudge is provided at https://github.com/S3IC-Lab/RobustJudge.
中文摘要:大型语言模型作为评判系统易受对抗性攻击影响,为此提出的RobustJudge自动化框架系统评估了其在不同攻击方法、防御策略及实际应用中的鲁棒性。
English Summary: Large Language Models (LLM)-as-a-Judge systems face vulnerabilities to adversarial attacks, prompting the development of RobustJudge, an automated framework that evaluates their robustness across attack methods, defense strategies, and real-world applications.
Authors:Xinya Liu, Jianghao Wu, Tao Lu, Shaoting Zhang, Guotai Wang
Abstract:
Domain Adaptation (DA) is crucial for robust deployment of medical image segmentation models when applied to new clinical centers with significant domain shifts. Source-Free Domain Adaptation (SFDA) is appealing as it can deal with privacy concerns and access constraints on source-domain data during adaptation to target-domain data. However, SFDA faces challenges such as insufficient supervision in the target domain with unlabeled images. In this work, we propose a Segment Anything Model (SAM)-guided Reliable Pseudo-Labels method for SFDA (SRPL-SFDA) with three key components: 1) Test-Time Tri-branch Intensity Enhancement (T3IE) that not only improves quality of raw pseudo-labels in the target domain, but also leads to SAM-compatible inputs with three channels to better leverage SAM's zero-shot inference ability for refining the pseudo-labels; 2) A reliable pseudo-label selection module that rejects low-quality pseudo-labels based on Consistency of Multiple SAM Outputs (CMSO) under input perturbations with T3IE; and 3) A reliability-aware training procedure in the unlabeled target domain where reliable pseudo-labels are used for supervision and unreliable parts are regularized by entropy minimization. Experiments conducted on two multi-domain medical image segmentation datasets for fetal brain and the prostate respectively demonstrate that: 1) SRPL-SFDA effectively enhances pseudo-label quality in the unlabeled target domain, and improves SFDA performance by leveraging the reliability-aware training; 2) SRPL-SFDA outperformed state-of-the-art SFDA methods, and its performance is close to that of supervised training in the target domain. The code of this work is available online: https://github.com/HiLab-git/SRPL-SFDA.
中文: 本文提出SRPL-SFDA方法,通过SAM引导的可靠伪标签系统,结合测试时增强和可靠性感知训练,在无源域适应的医学图像分割中取得了接近监督学习的性能。
English: This paper introduces SRPL-SFDA, a source-free domain adaptation method that enhances medical image segmentation by using SAM-guided reliable pseudo-labels through test-time enhancement and reliability-aware training, achieving performance close to supervised training.
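A hedged sketch of what a CMSO-style reliability check could look like, treating the perturbed SAM outputs simply as binary masks; the actual selection rule and perturbation scheme in the paper may differ.

```python
import numpy as np

def cmso_reliability(sam_masks, agree_thresh=0.9):
    """Consistency of Multiple SAM Outputs (simplified): run SAM under several
    input perturbations and keep a pseudo-label only if the per-pixel
    agreement across runs is high enough."""
    stack = np.stack(sam_masks).astype(np.float32)   # (R, H, W) binary masks
    vote = stack.mean(0)                             # per-pixel agreement ratio
    # A pixel is "consistent" if (nearly) all runs agree on 0 or on 1.
    consistency = float(((vote > 0.95) | (vote < 0.05)).mean())
    return consistency >= agree_thresh

masks = [np.random.rand(64, 64) > 0.5 for _ in range(4)]   # stand-in SAM outputs
keep = cmso_reliability(masks)   # False here: random masks rarely agree
```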
Authors:Kaiyu Guo, Zijian Wang, Tan Pan, Brian C. Lovell, Mahsa Baktashmotlagh
Abstract:
Out-of-Distribution (OOD) detection is essential for the trustworthiness of AI systems. Methods using prior information (i.e., subspace-based methods) have shown effective performance by extracting information geometry to detect OOD data with a more appropriate distance metric. However, these methods fail to address the geometry distorted by ill-distributed samples, due to the limitation of statically extracting information geometry from the training distribution. In this paper, we argue that the influence of ill-distributed samples can be corrected by dynamically adjusting the prior geometry in response to new data. Based on this insight, we propose a novel approach that dynamically updates the prior covariance matrix using real-time input features, refining its information. Specifically, we reduce the covariance along the direction of real-time input features and constrain adjustments to the residual space, thus preserving essential data characteristics and avoiding effects on unintended directions in the principal space. We evaluate our method on two pre-trained models for the CIFAR dataset and five pre-trained models for ImageNet-1k, including the self-supervised DINO model. Extensive experiments demonstrate that our approach significantly enhances OOD detection across various models. The code is released at https://github.com/workerbcd/ooddcc.
Chinese: 本文提出了一种动态方法,通过实时调整先验几何以纠正分布不良样本导致的失真,显著提升了多种模型的分布外检测性能,其核心在于对协方差矩阵进行实时优化。
English: This paper introduces a dynamic method that adjusts the prior geometry to correct distortions from ill-distributed samples, significantly improving Out-of-Distribution detection across multiple models by refining the covariance matrix in real-time.
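As a rough illustration of the idea, not the paper's exact residual-space update, one can shrink a Gaussian prior's variance along the direction of an incoming feature before Mahalanobis scoring; the single-Gaussian prior and shrink factor below are assumptions.

```python
import numpy as np

# Static prior geometry from training features (assumption: one Gaussian).
train = np.random.randn(5000, 64)
mu = train.mean(0)
cov = np.cov(train, rowvar=False) + 1e-3 * np.eye(64)

def dynamic_update(cov, f, mu, shrink=0.5):
    """Reduce variance along the incoming feature's direction, a simplified
    stand-in for the paper's real-time covariance refinement."""
    u = (f - mu) / (np.linalg.norm(f - mu) + 1e-12)
    var_u = float(u @ cov @ u)                   # current variance along u
    return cov - shrink * var_u * np.outer(u, u) # rank-1 downdate

def mahalanobis_score(f, mu, cov):
    d = f - mu
    return float(d @ np.linalg.solve(cov, d))    # larger => more OOD-like

x = np.random.randn(64)                          # incoming test feature
score = mahalanobis_score(x, mu, dynamic_update(cov, x, mu))
```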
Authors:Haiyang Yu, Yuchao Lin, Xuan Zhang, Xiaofeng Qian, Shuiwang Ji
Abstract:
We consider the task of predicting Hamiltonian matrices to accelerate electronic structure calculations, which plays an important role in physics, chemistry, and materials science. Motivated by the inherent relationship between the off-diagonal blocks of the Hamiltonian matrix and the SO(2) local frame, we propose a novel and efficient network, called QHNetV2, that achieves global SO(3) equivariance without the costly SO(3) Clebsch-Gordan tensor products. This is achieved by introducing a set of new efficient and powerful SO(2)-equivariant operations and performing all off-diagonal feature updates and message passing within SO(2) local frames, thereby eliminating the need for SO(3) tensor products. Moreover, a continuous SO(2) tensor product is performed within the SO(2) local frame at each node to fuse node features, mimicking the symmetric contraction operation. Extensive experiments on the large QH9 and MD17 datasets demonstrate that our model achieves superior performance across a wide range of molecular structures and trajectories, highlighting its strong generalization capability. The proposed SO(2) operations on SO(2) local frames offer a promising direction for scalable and symmetry-aware learning of electronic structures. Our code will be released as part of the AIRS library https://github.com/divelab/AIRS.
中文: QHNetV2通过引入高效的SO(2)等变操作在局部坐标系中实现全局SO(3)等变性,在预测哈密顿矩阵的任务中展现了卓越性能和泛化能力,同时避免了昂贵的SO(3)张量积计算。
English: QHNetV2 introduces efficient SO(2)-equivariant operations within local frames to achieve global SO(3) equivariance for predicting Hamiltonian matrices, demonstrating superior performance and generalization on molecular datasets while eliminating costly SO(3) tensor products.
Authors:Jialong Zuo, Yongtai Deng, Mengdan Tan, Rui Jin, Dongyue Wu, Nong Sang, Liang Pan, Changxin Gao
Abstract:
In real-world scenarios, person re-identification (ReID) aims to identify a person-of-interest via a descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. The dataset also offers notable diversity, such as varied painting perspectives and rich textual information, and can serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model via a proposed unified encoding and multi-expert routing mechanism. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of candidate models have been evaluated and compared on it, and our proposed ReID5o model achieves the best performance. The dataset and code will be made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.
中文摘要:本研究提出了OM-ReID这一新型多模态行人重识别方法,通过构建ORBench数据集和ReID5o框架,实现了任意模态组合的协同融合与跨模态对齐,为多模态检索提供了有效解决方案。
English Summary: This study introduces OM-ReID, a novel approach for person re-identification using diverse multi-modal queries, supported by the newly created ORBench dataset and the ReID5o framework that enables effective cross-modal fusion and alignment.
Authors:Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, Shiyin Lu, Qifeng Chen
Abstract:
The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. In addition, it introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO's superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon at https://github.com/AIDC-AI/LPO.
Chinese: 该研究提出位置偏好优化(LPO)方法,通过信息熵和动态奖励机制利用位置数据提升自主代理与图形用户界面的交互精度,在离线和在线评估中均取得了最先进的性能表现。
English: The study introduces Location Preference Optimization (LPO), a novel method that enhances autonomous agents' interaction precision with Graphical User Interfaces by utilizing locational data through information entropy and dynamic rewards, achieving state-of-the-art results in both offline and online evaluations.
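The two ingredients, information-rich zone selection and a distance-based reward, can be illustrated as follows. The entropy estimator and the exponential reward shape are assumptions for illustration, not the paper's exact formulas.

```python
import numpy as np

def patch_entropy(gray_patch):
    """Shannon entropy of a grayscale patch; high-entropy zones are
    information-rich and thus preferred candidates for interaction positions."""
    hist, _ = np.histogram(gray_patch, bins=32, range=(0, 255))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log(p)).sum())

def location_reward(pred_xy, gt_xy, screen_diag, tau=0.1):
    """Dynamic location reward (assumed exponential form): 1 at the target,
    decaying smoothly with the normalized physical distance."""
    d = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy)) / screen_diag
    return float(np.exp(-d / tau))

diag = np.hypot(1920, 1080)
r = location_reward((412, 305), (400, 300), diag)   # near-miss: reward close to 1
```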
Authors:Zeran Ke, Bin Tan, Xianwei Zheng, Yujun Shen, Tianfu Wu, Nan Xue
Abstract:
This paper studies the problem of Line Segment Detection (LSD) for the characterization of line geometry in images, with the aim of learning a domain-agnostic, robust LSD model that works well on any natural image. Focusing on scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to obtain a high-performing and efficient LSD learner, dubbed ScaleLSD, for the curation of line geometry at scale from over 10M unlabeled real-world images. ScaleLSD detects substantially more line segments in natural images than even the pioneering non-deep LSD approach, yielding a more complete and accurate geometric characterization of images using line segments. Experimentally, ScaleLSD is comprehensively evaluated under zero-shot protocols on detection performance, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, with excellent performance obtained throughout. Based on this thorough evaluation, ScaleLSD is the first deep approach to outperform the pioneering non-deep LSD in all aspects we tested, significantly expanding and reinforcing the versatility of line geometry in images. Code and models are available at https://github.com/ant-research/scalelsd.
中文: 本文提出了ScaleLSD模型,通过自监督学习从超过1000万张未标注图像中提取线段几何特征,在线段检测、三维几何估计等任务中全面超越了传统非深度方法。
English: This paper introduces ScaleLSD, a scalable self-supervised learning model for line segment detection that outperforms traditional non-deep methods across various tasks, including 3D geometry estimation and line matching, using over 10 million unlabeled images.
Authors:Hongguang Zhu, Yunchao Wei, Mengyu Wang, Siyu Jiao, Yan Fang, Jiannan Huang, Yao Zhao
Abstract:
Diffusion models (DMs) have achieved significant progress in text-to-image generation. However, the inevitable inclusion of sensitive information during pre-training poses safety risks, such as unsafe content generation and copyright infringement. Concept erasing finetunes weights to unlearn undesirable concepts, and has emerged as a promising solution. However, existing methods treat an unsafe concept as a fixed word and repeatedly erase it, trapping DMs in a ``word concept abyss'', which prevents generalized concept-related erasing. To escape this abyss, we introduce semantic-augment erasing, which transforms concept word erasure into concept domain erasure via cyclic self-check and self-erasure. It efficiently explores and unlearns the boundary representation of the concept domain through semantic spatial relationships between original and training DMs, without requiring additional preprocessed data. Meanwhile, to mitigate the retention degradation of irrelevant concepts while erasing unsafe concepts, we further propose the global-local collaborative retention mechanism that combines global semantic relationship alignment with local predicted noise preservation, effectively expanding the retentive receptive field for irrelevant concepts. We name our method SAGE, and extensive experiments demonstrate the comprehensive superiority of SAGE compared with other methods in the safe generation of DMs. The code and weights will be open-sourced at https://github.com/KevinLight831/SAGE.
中文: SAGE通过语义增强擦除将单词概念消除转化为领域概念消除,并结合全局-局部协同保留机制,在不影响无关概念的前提下,显著提升了扩散模型的安全生成能力。
English: SAGE introduces semantic-augment erasing to transform word-level concept removal into domain-level erasure and employs a global-local retention mechanism, achieving superior safety in diffusion model generation without compromising unrelated concepts.
Authors:Yitong Zhang, Jia Li, Liyi Cai, Ge Li
Abstract:
Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries that exploit the visual modality. Existing alignment approaches typically fail to resist malicious queries while effectively preserving utility on benign ones. To address these challenges, we propose Deep Aligned Visual Safety Prompt (DAVSP), which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments across five benchmarks on two representative LVLMs demonstrate that DAVSP effectively resists malicious queries while preserving benign input utility. Furthermore, DAVSP exhibits strong cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential components, jointly contributing to its overall effectiveness. The code is publicly available at https://github.com/zhangyitonggg/DAVSP.
中文摘要:提出的深度对齐视觉安全提示(DAVSP)通过可训练的视觉边界扩展和激活空间对齐训练,有效增强大视觉语言模型抵御恶意查询的能力,同时保持正常输入的实用性。
English Summary: The proposed Deep Aligned Visual Safety Prompt (DAVSP) enhances Large Vision-Language Models' resistance to malicious visual queries while maintaining performance on benign inputs through a trainable visual padding mechanism and activation-space alignment training.
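The Visual Safety Prompt itself is easy to picture: a learnable border around an otherwise untouched image. A minimal sketch, with assumed pad and image sizes; the paper's actual parameterization and training objective are more involved.

```python
import torch
import torch.nn as nn

class VisualSafetyPrompt(nn.Module):
    """Trainable padding frame around the input image (simplified sketch):
    interior pixels pass through untouched, only the border is optimized."""
    def __init__(self, pad=16, size=224):
        super().__init__()
        self.pad = pad
        self.border = nn.Parameter(torch.zeros(3, size + 2 * pad, size + 2 * pad))

    def forward(self, img):                       # img: (B, 3, size, size)
        canvas = self.border.expand(img.size(0), -1, -1, -1).clone()
        canvas[:, :, self.pad:-self.pad, self.pad:-self.pad] = img
        return canvas                             # (B, 3, size+2*pad, size+2*pad)

prompt = VisualSafetyPrompt()
padded = prompt(torch.rand(2, 3, 224, 224))       # (2, 3, 256, 256)
```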
Authors:Xuemei Cao, Hanlin Gu, Xin Yang, Bingjun Wei, Haoyang Liang, Xiangkun Wang, Tianrui Li
Abstract:
Continual Learning (CL) primarily aims to retain knowledge to prevent catastrophic forgetting and transfer knowledge to facilitate learning new tasks. Unlike traditional methods, we propose a novel perspective: CL not only needs to prevent forgetting, but also requires intentional forgetting. This arises from existing CL methods ignoring biases in real-world data, leading the model to learn spurious correlations that transfer and amplify across tasks. From feature extraction and prediction results, we find that data biases simultaneously reduce CL's ability to retain and transfer knowledge. To address this, we propose ErrorEraser, a universal plugin that removes erroneous memories caused by biases in CL, enhancing performance in both new and old tasks. ErrorEraser consists of two modules: Error Identification and Error Erasure. The former learns the probability density distribution of task data in the feature space without prior knowledge, enabling accurate identification of potentially biased samples. The latter ensures only erroneous knowledge is erased by shifting the decision space of representative outlier samples. Additionally, an incremental feature distribution learning strategy is designed to reduce the resource overhead during error identification in downstream tasks. Extensive experimental results show that ErrorEraser significantly mitigates the negative impact of data biases, achieving higher accuracy and lower forgetting rates across three types of CL methods. The code is available at https://github.com/diadai/ErrorEraser.
中文摘要:本文提出ErrorEraser这一持续学习通用插件,通过误差识别和擦除双模块解决数据偏差问题,无需先验知识即可有效消除错误记忆,显著提升新旧任务性能。
English Summary: This paper introduces ErrorEraser, a universal plugin for continual learning that addresses data bias issues by identifying and erasing erroneous memories through two specialized modules, significantly improving performance across tasks.
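The Error Identification step amounts to density estimation in feature space followed by low-density flagging. A simplified stand-in using a Gaussian mixture; the paper's actual density estimator and selection rule may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

feats = np.random.randn(2000, 32)                  # stand-in task features
gm = GaussianMixture(n_components=5, random_state=0).fit(feats)
logp = gm.score_samples(feats)                     # per-sample log-density
suspect = logp < np.quantile(logp, 0.05)           # low-density outliers: candidates
                                                   # for erroneous-memory erasure
```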
Authors:Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu
Abstract:
Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.
中文: RePO通过回放策略利用离策略样本进行更高效的策略优化,在适度增加计算成本的同时,相比GRPO实现了显著的性能提升。
English: RePO introduces replay strategies to utilize off-policy samples for more efficient policy optimization in LLMs, achieving significant performance gains over GRPO with a moderate increase in computational cost.
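The mechanics reduce to a per-prompt replay buffer whose samples are mixed with fresh rollouts before computing group-relative advantages. A minimal sketch with assumed group sizes and a simple FIFO eviction policy:

```python
import random
from collections import defaultdict

class ReplayBuffer:
    """Per-prompt store of past (response, reward) pairs (simplified sketch)."""
    def __init__(self, capacity=64):
        self.data = defaultdict(list)
        self.capacity = capacity

    def add(self, prompt, response, reward):
        buf = self.data[prompt]
        buf.append((response, reward))
        if len(buf) > self.capacity:
            buf.pop(0)                       # FIFO eviction

    def sample(self, prompt, k):
        buf = self.data[prompt]
        return random.sample(buf, min(k, len(buf)))

def group_advantages(rewards):
    """GRPO-style group-relative advantage over the mixed on/off-policy group."""
    m = sum(rewards) / len(rewards)
    s = (sum((r - m) ** 2 for r in rewards) / len(rewards)) ** 0.5 + 1e-6
    return [(r - m) / s for r in rewards]

buf = ReplayBuffer()
buf.add("prove 1+1=2", "resp_a", 1.0)
# Mix 8 fresh on-policy rollouts (placeholders here) with up to 8 replayed samples.
mixed = [("fresh", 0.0)] * 8 + buf.sample("prove 1+1=2", 8)
adv = group_advantages([r for _, r in mixed])
```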
Authors:Tong Wang, Guanzhou Chen, Xiaodong Zhang, Chenxi Liu, Jiaqi Wang, Xiaoliang Tan, Wenchao Guo, Qingyuan Yang, Kaiqi Zhang
Abstract:
Remote sensing image interpretation plays a critical role in environmental monitoring, urban planning, and disaster assessment. However, acquiring high-quality labeled data is often costly and time-consuming. To address this challenge, we propose a multi-modal self-supervised learning framework that leverages high-resolution RGB images, multi-spectral data, and digital surface models (DSM) for pre-training. By designing an information-aware adaptive masking strategy, cross-modal masking mechanism, and multi-task self-supervised objectives, the framework effectively captures both the correlations across different modalities and the unique feature structures within each modality. We evaluated the proposed method on multiple downstream tasks, covering typical remote sensing applications such as scene classification, semantic segmentation, change detection, object detection, and depth estimation. Experiments are conducted on 15 remote sensing datasets, encompassing 26 tasks. The results demonstrate that the proposed method outperforms existing pretraining approaches in most tasks. Specifically, on the Potsdam and Vaihingen semantic segmentation tasks, our method achieved mIoU scores of 78.30\% and 76.50\%, using only 50\% of the training set. For the US3D depth estimation task, the RMSE is reduced to 0.182, and for the binary change detection task on the SECOND dataset, our method achieved an mIoU score of 47.51\%, surpassing the second-best method, CS-MAE, by 3 percentage points. Our pretraining code, checkpoints, and HR-Pairs dataset can be found at https://github.com/CVEO/MSSDF.
中文摘要:本研究提出了一种多模态自监督学习框架,通过创新的掩码策略有效利用RGB、多光谱和DSM数据,在26个遥感任务中超越现有预训练方法,展现出卓越性能。
English Summary: This study introduces a multi-modal self-supervised learning framework that effectively utilizes RGB, multi-spectral, and DSM data through innovative masking strategies, achieving superior performance across 26 remote sensing tasks compared to existing methods.
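The information-aware adaptive masking idea can be sketched as sampling mask positions with probability proportional to a per-patch information score. The score and sampling scheme below are assumptions for illustration, not the paper's exact strategy.

```python
import numpy as np

def adaptive_mask(patch_scores, ratio=0.6):
    """Information-aware masking (sketch): patches with higher (positive)
    information scores, e.g. entropy, are masked with higher probability,
    focusing reconstruction on informative regions."""
    p = patch_scores / patch_scores.sum()
    n_mask = int(ratio * len(patch_scores))
    idx = np.random.choice(len(patch_scores), size=n_mask, replace=False, p=p)
    mask = np.zeros(len(patch_scores), dtype=bool)
    mask[idx] = True
    return mask

scores = np.random.rand(196) + 1e-6        # one score per patch of a 14x14 grid
mask = adaptive_mask(scores)               # True = masked for reconstruction
```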
Authors:Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella
Abstract:
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving the overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that DSLA-Serve yields 2.3x faster inference than Llama2-7B and 3.0x faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA's dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear attentions. Code is available at https://github.com/utnslab/DSLA-Serve.
中文摘要:DSLA-Serve通过双状态线性注意力机制解决传统线性注意力对近期标记的过度关注问题,并采用自适应蒸馏框架在保持任务性能的同时,实现比同类模型快2.3-3倍的推理速度。
English Summary: DSLA-Serve introduces dual-state linear attention to mitigate linear attention's recency bias and an adaptive distillation framework that accelerates inference by 2.3-3x over comparable models while maintaining task performance.
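The dual-state idea can be sketched as two linear-attention recurrences, one undecayed for history and one decayed for recency, whose outputs are mixed. The feature map, scalar decay, and fixed mixing weight below are illustrative assumptions; the paper's layer is more elaborate.

```python
import torch

def dual_state_linear_attention(q, k, v, gamma=0.9, mix=0.5):
    """Minimal recurrent sketch of dual-state linear attention.
    q, k, v: (T, d). State A accumulates all history; state B decays with
    `gamma`, emphasizing recent tokens."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0   # positive feature map
    T, d = q.shape
    A = torch.zeros(d, d); zA = torch.zeros(d)         # long-term state
    B = torch.zeros(d, d); zB = torch.zeros(d)         # recency state
    outs = []
    for t in range(T):
        qt, kt, vt = phi(q[t]), phi(k[t]), v[t]
        A += torch.outer(kt, vt); zA += kt             # undecayed accumulation
        B = gamma * B + torch.outer(kt, vt)            # exponentially decayed
        zB = gamma * zB + kt
        oA = (A.T @ qt) / (zA @ qt + 1e-6)
        oB = (B.T @ qt) / (zB @ qt + 1e-6)
        outs.append(mix * oA + (1 - mix) * oB)         # blend the two views
    return torch.stack(outs)

out = dual_state_linear_attention(torch.randn(16, 32),
                                  torch.randn(16, 32),
                                  torch.randn(16, 32))
```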
Authors:Mojtaba Nafez, Amirhossein Koochakian, Arad Maleki, Jafar Habibi, Mohammad Hossein Rohban
Abstract:
Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields that demand high reliability, such as medical imaging and industrial monitoring. However, current AD and AL approaches are often susceptible to adversarial attacks due to limitations in training data, which typically include only normal, unlabeled samples. This study introduces PatchGuard, an adversarially robust AD and AL method that incorporates pseudo anomalies with localization masks within a Vision Transformer (ViT)-based architecture to address these vulnerabilities. We begin by examining the essential properties of pseudo anomalies, and follow it by providing theoretical insights into the attention mechanisms required to enhance the adversarial robustness of AD and AL systems. We then present our approach, which leverages Foreground-Aware Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware methods. Our method incorporates these crafted pseudo-anomaly samples into a ViT-based framework, with adversarial training guided by a novel loss function designed to improve model robustness, as supported by our theoretical analysis. Experimental results on well-established industrial and medical datasets demonstrate that PatchGuard significantly outperforms previous methods in adversarial settings, achieving performance gains of $53.2\%$ in AD and $68.5\%$ in AL, while also maintaining competitive accuracy in non-adversarial settings. The code repository is available at https://github.com/rohban-lab/PatchGuard.
Chinese: 本研究提出了PatchGuard,一种利用伪异常和视觉Transformer框架的鲁棒异常检测与定位方法,显著提升了对抗攻击的防御能力,在对抗性和标准环境下均实现了显著性能提升。
English: This study introduces PatchGuard, a robust method for anomaly detection and localization that uses pseudo anomalies and a Vision Transformer framework to significantly enhance resistance against adversarial attacks, achieving notable performance improvements in both adversarial and standard settings.
Authors:Boyu Jiang, Liang Shi, Zhengzhi Lin, Loren Stowe, Feng Guo
Abstract:
The performance of perception systems in autonomous driving systems (ADS) is strongly influenced by object distance, scene dynamics, and environmental conditions such as weather. AI-based perception outputs are inherently stochastic, with variability driven by these external factors, while traditional evaluation metrics remain static and event-independent, failing to capture fluctuations in confidence over time. In this work, we introduce the Perception Characteristics Distance (PCD) -- a novel evaluation metric that quantifies the farthest distance at which an object can be reliably detected, incorporating uncertainty in model outputs. To support this, we present the SensorRainFall dataset, collected on the Virginia Smart Road using a sensor-equipped vehicle (cameras, radar, LiDAR) under controlled daylight-clear and daylight-rain scenarios, with precise ground-truth distances to the target objects. Statistical analysis reveals the presence of change points in the variance of detection confidence score with distance. By averaging the PCD values across a range of detection quality thresholds and probabilistic thresholds, we compute the mean PCD (mPCD), which captures the overall perception characteristics of a system with respect to detection distance. Applying state-of-the-art perception models shows that mPCD captures meaningful reliability differences under varying weather conditions -- differences that static metrics overlook. PCD provides a principled, distribution-aware measure of perception performance, supporting safer and more robust ADS operation, while the SensorRainFall dataset offers a valuable benchmark for evaluation. The SensorRainFall dataset is publicly available at https://www.kaggle.com/datasets/datadrivenwheels/sensorrainfall, and the evaluation code is open-sourced at https://github.com/datadrivenwheels/PCD_Python.
中文摘要:本文提出感知特征距离(PCD)这一新型评估指标,通过量化可靠检测距离并整合模型不确定性,结合SensorRainFall数据集验证了PCD能有效捕捉天气条件引起的可靠性差异,而传统静态指标无法实现这种动态评估。
English Summary: This paper introduces the Perception Characteristics Distance (PCD), a novel metric that quantifies reliable object detection range while incorporating model uncertainty, and presents the SensorRainFall dataset to demonstrate how PCD effectively captures weather-induced reliability variations that static metrics miss.
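Under one plausible reading of the definition, PCD is the farthest distance bin where detections still clear a quality threshold with sufficient probability, and mPCD averages this over grids of both thresholds. The binning and grids below are assumptions for illustration.

```python
import numpy as np

def pcd(dists, confs, conf_t, prob_t, bin_w=5.0):
    """Farthest distance bin in which detections still clear the quality
    threshold `conf_t` with probability >= `prob_t` (simplified reading)."""
    dists, confs = np.asarray(dists), np.asarray(confs)
    farthest = 0.0
    for lo in np.arange(0.0, dists.max(), bin_w):
        m = (dists >= lo) & (dists < lo + bin_w)
        if m.any() and (confs[m] >= conf_t).mean() >= prob_t:
            farthest = lo + bin_w
    return farthest

def mean_pcd(dists, confs, conf_grid=(0.3, 0.5, 0.7), prob_grid=(0.5, 0.8, 0.9)):
    """mPCD: average PCD over detection-quality and probabilistic thresholds."""
    return float(np.mean([pcd(dists, confs, c, p)
                          for c in conf_grid for p in prob_grid]))

d = np.random.uniform(0, 120, 500)                       # target distances (m)
c = np.clip(1.0 - d / 150 + np.random.normal(0, 0.1, 500), 0, 1)  # confidence decays with range
print(mean_pcd(d, c))
```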
Authors:Val Andrei Fajardo, David B. Emerson, Amandeep Singh, Veronica Chatrath, Marcelo Lotif, Ravi Theja, Alex Cheung, Izuki Matsuba
Abstract:
Retrieval-augmented generation (RAG) systems have been shown to be effective in addressing many of the drawbacks of relying solely on the parametric memory of large language models. Recent work has demonstrated that RAG systems can be improved via fine-tuning of their retriever and generator models. In this work, we introduce FedRAG, a framework for fine-tuning RAG systems across centralized and federated architectures. FedRAG supports state-of-the-art fine-tuning methods, offering a simple and intuitive interface and a seamless conversion from centralized to federated training tasks. FedRAG is also deeply integrated with the modern RAG ecosystem, filling a critical gap in available tools.
中文: FedRAG是一个框架,支持在集中式和联邦式架构下对检索增强生成系统进行微调,集成了先进方法与现代RAG工具,填补了现有工具的空白。
English: FedRAG is a framework that enables fine-tuning of retrieval-augmented generation systems across both centralized and federated architectures, integrating advanced methods and modern RAG tools to bridge existing gaps.
Authors:Emirhan Bilgiç, Neslihan Serap Şengör, Namık Berk Yalabık, Yavuz Selim İşler, Aykut Görkem Gelen, Rahmi Elibol
Abstract:
This study examines the integration of Contrastive Predictive Coding (CPC) with Spiking Neural Networks (SNN). While CPC learns the predictive structure of data to generate meaningful representations, SNN mimics the computational processes of biological neural systems over time. In this study, the goal is to develop a predictive coding model with greater biological plausibility by processing inputs and outputs in a spike-based system. The proposed model was tested on the MNIST dataset and achieved a high classification rate in distinguishing positive sequential samples from non-sequential negative samples. The study demonstrates that CPC can be effectively combined with SNN, showing that an SNN trained for classification tasks can also function as an encoding mechanism. Project codes and detailed results can be accessed on our GitHub page: https://github.com/vnd-ogrenme/ongorusel-kodlama/tree/main/CPC_SNN
中文: 本研究成功将对比预测编码与脉冲神经网络相结合,开发出具有更高生物可信度的模型,在MNIST数据集上实现高分类精度的同时,证明了脉冲神经网络在编码和分类任务中的双重功能。
English: This study successfully combines Contrastive Predictive Coding with Spiking Neural Networks to create a biologically plausible model that achieves high classification accuracy on the MNIST dataset while demonstrating dual functionality in both encoding and classification tasks.
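CPC's training signal is the InfoNCE objective. A minimal batch-wise sketch of that standard objective; the paper's spiking encoder and sequence handling are omitted here.

```python
import torch
import torch.nn.functional as F

def info_nce(context, future, W):
    """InfoNCE: score each context against every candidate future in the
    batch; the matching (diagonal) pair is the positive.
    context: (B, d_c), future: (B, d_z), W: (d_c, d_z) bilinear predictor."""
    scores = (context @ W) @ future.T            # (B, B) similarity matrix
    labels = torch.arange(context.size(0))       # positives on the diagonal
    return F.cross_entropy(scores, labels)

B, dc, dz = 32, 128, 64
loss = info_nce(torch.randn(B, dc), torch.randn(B, dz), torch.randn(dc, dz))
```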
Authors:Yilin Zhuang, Karthik Duraisamy
Abstract:
Accurate probabilistic weather forecasting demands both high accuracy and efficient uncertainty quantification, challenges that overburden both ensemble numerical weather prediction (NWP) and recent machine-learning methods. We introduce LaDCast, the first global latent-diffusion framework for medium-range ensemble forecasting, which generates hourly ensemble forecasts entirely in a learned latent space. An autoencoder compresses high-dimensional ERA5 reanalysis fields into a compact representation, and a transformer-based diffusion model produces sequential latent updates with arbitrary hour initialization. The model incorporates Geometric Rotary Position Embedding (GeoRoPE) to account for the Earth's spherical geometry, a dual-stream attention mechanism for efficient conditioning, and sinusoidal temporal embeddings to capture seasonal patterns. LaDCast achieves deterministic and probabilistic skill close to that of the European Centre for Medium-Range Weather Forecasts ensemble (IFS-ENS), without any explicit perturbations. Notably, LaDCast demonstrates superior performance in tracking rare extreme events such as cyclones, capturing their trajectories more accurately than established models. By operating in latent space, LaDCast reduces storage and compute by orders of magnitude, demonstrating a practical path toward forecasting at kilometer-scale resolution in real time. We open-source our code and models and provide the training and evaluation pipelines at: https://github.com/tonyzyl/ladcast.
中文摘要:LaDCast首次提出全球潜在扩散框架用于中期集合天气预报,在保持与主流模型相当准确度的同时大幅降低计算成本,并在追踪极端天气事件方面展现出卓越性能。
English Summary: LaDCast introduces the first global latent-diffusion framework for medium-range ensemble weather forecasting, achieving accuracy comparable to leading models while significantly reducing computational costs and demonstrating superior performance in tracking extreme weather events.
Authors:Haoyuan Cai, Zhenghao Peng, Bolei Zhou
Abstract:
Interactive Imitation Learning (IIL) allows agents to acquire desired behaviors through human interventions, but current methods impose high cognitive demands on human supervisors. We propose the Adaptive Intervention Mechanism (AIM), a novel robot-gated IIL algorithm that learns an adaptive criterion for requesting human demonstrations. AIM utilizes a proxy Q-function to mimic the human intervention rule and adjusts intervention requests based on the alignment between agent and human actions. By assigning high Q-values when the agent deviates from the expert and decreasing these values as the agent becomes proficient, the proxy Q-function enables the agent to assess the real-time alignment with the expert and request assistance when needed. Our expert-in-the-loop experiments reveal that AIM significantly reduces expert monitoring efforts in both continuous and discrete control tasks. Compared to the uncertainty-based baseline Thrifty-DAgger, our method achieves a 40% improvement in terms of human take-over cost and learning efficiency. Furthermore, AIM effectively identifies safety-critical states for expert assistance, thereby collecting higher-quality expert demonstrations and reducing overall expert data and environment interactions needed. Code and demo video are available at https://github.com/metadriverse/AIM.
中文: 提出的自适应干预机制(AIM)通过代理Q函数自适应地请求专家演示,显著减少了交互式模仿学习中的人工监督需求,与基线方法相比效率提升40%,并能更有效地识别安全关键状态。
English: The proposed Adaptive Intervention Mechanism (AIM) significantly reduces human supervision in Interactive Imitation Learning by using a proxy Q-function to adaptively request expert demonstrations, achieving 40% improvement in efficiency and better identification of safety-critical states compared to baseline methods.
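The gating rule reduces to thresholding a learned proxy Q-value on the current state-action pair. A minimal sketch with an assumed MLP, input sizes, and threshold; training of the proxy Q-function is omitted.

```python
import torch

def should_request_expert(proxy_q, state, agent_action, threshold=0.5):
    """Robot-gated intervention rule (simplified): the proxy Q-function is
    trained to be high where the agent's action deviates from the expert's,
    so a high value triggers a demonstration request."""
    with torch.no_grad():
        q = proxy_q(torch.cat([state, agent_action], dim=-1))
    return bool(q.item() > threshold)

# The proxy could be any small network mapping (state, action) -> scalar:
proxy_q = torch.nn.Sequential(
    torch.nn.Linear(10 + 2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
ask = should_request_expert(proxy_q, torch.randn(10), torch.randn(2))
```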
Authors:Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian
Abstract:
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
中文: 本文提出LLM作为定性评判者的方法,通过生成自然语言生成系统中常见问题的结构化报告为开发者提供改进见解,并在多个数据集评估中验证了其识别具体问题和生成类人工报告的能力。
English: This paper introduces LLM-as-a-qualitative-judge, an approach that generates structured reports of common issues in natural language generation systems to provide developers with actionable insights, demonstrating its effectiveness through evaluations on multiple datasets.
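The clustering step can be sketched as a single greedy pass over issue embeddings: each issue joins the most similar existing cluster or starts a new one. The similarity threshold and first-member representatives are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def cumulative_cluster(issue_embs, sim_thresh=0.8):
    """One-pass cumulative clustering (sketch): each discovered issue joins
    the most similar existing cluster (represented here by its first member's
    embedding) or opens a new cluster."""
    reps, assign = [], []
    for e in issue_embs:
        e = e / np.linalg.norm(e)
        sims = [float(r @ e) for r in reps]
        if sims and max(sims) >= sim_thresh:
            assign.append(int(np.argmax(sims)))
        else:
            reps.append(e)
            assign.append(len(reps) - 1)
    return assign

labels = cumulative_cluster(np.random.randn(10, 384))   # e.g. sentence embeddings
```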
Authors:Míriam Barrabés, Daniel Mas Montserrat, Kapal Dev, Alexander G. Ioannidis
Abstract:
Feature shifts between data sources are present in many applications involving healthcare, biomedical, socioeconomic, financial, survey, and multi-sensor data, among others, where unharmonized heterogeneous data sources, noisy data measurements, or inconsistent processing and standardization pipelines can lead to erroneous features. Localizing shifted features is important to address the underlying cause of the shift and correct or filter the data to avoid degrading downstream analysis. While many techniques can detect distribution shifts, localizing the features originating them is still challenging, with current solutions being either inaccurate or not scalable to large and high-dimensional datasets. In this work, we introduce the Feature Shift Localization Network (FSL-Net), a neural network that can localize feature shifts in large and high-dimensional datasets in a fast and accurate manner. The network, trained with a large number of datasets, learns to extract the statistical properties of the datasets and can localize feature shifts from previously unseen datasets and shifts without the need for re-training. The code and ready-to-use trained model are available at https://github.com/AI-sandbox/FSL-Net.
Chinese: FSL-Net 是一种神经网络,能够无需重新训练即可快速准确地定位大型高维数据集中的特征偏移。
English: FSL-Net is a neural network that accurately and efficiently localizes feature shifts in large, high-dimensional datasets without requiring retraining for new data.
Authors:Chaoyang Zhou, Shunyu Liu, Zengmao Wang, Di Wang, Rong-Cheng Tu, Bo Du, Dacheng Tao
Abstract:
Reward models are critical for improving large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) or inference-time verification. Current reward modeling typically relies on scores of overall responses to learn the outcome rewards for the responses. However, since the response-level scores are coarse-grained supervision signals, the reward model struggles to identify the specific components within a response trajectory that truly correlate with the scores, leading to poor generalization on unseen responses. In this paper, we propose to leverage generation probabilities to establish reward consistency between processes in the response trajectory, which allows the response-level supervisory signal to propagate across processes, thereby providing additional fine-grained signals for reward learning. Building on analysis under the Bayesian framework, we develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We apply the proposed regularization to the advanced outcome reward model, improving its performance on RewardBench. Besides, we show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results. Our code is provided at https://github.com/chaoyang101/ICRM.
中文: 本文提出轨迹内一致性正则化方法,利用生成概率在响应过程中建立奖励一致性,从而提升奖励模型在基准测试中的表现,并优化策略对齐和推理时验证效果。
English: This paper introduces intra-trajectory consistency regularization, which uses generation probabilities to align rewards across response processes, enhancing reward model performance on benchmarks and improving policy alignment and inference-time verification.
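A hedged sketch of the regularizer's shape: adjacent steps whose continuation has high generation probability are pushed toward more consistent rewards. The squared-difference form and weighting below are assumptions; the paper derives its exact form under a Bayesian analysis.

```python
import torch

def intra_trajectory_consistency(step_rewards, next_token_probs):
    """Penalize reward jumps between adjacent steps, weighted so that
    high-probability continuations must keep more consistent rewards.
    step_rewards: (T,) per-step rewards; next_token_probs: (T-1,)."""
    diffs = (step_rewards[1:] - step_rewards[:-1]) ** 2
    return (next_token_probs * diffs).mean()

r = torch.randn(12, requires_grad=True)   # per-process rewards along a trajectory
p = torch.rand(11)                        # next-token generation probabilities
loss = intra_trajectory_consistency(r, p)
loss.backward()
```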
Authors:Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang
Abstract:
We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.
中文: FlagEvalMM是一个开源框架,通过将模型推理与评估解耦并采用加速工具,能高效评估多模态模型在各种视觉语言任务上的表现,为研究提供准确性能分析。
English: FlagEvalMM is an open-source framework that efficiently evaluates multimodal models across diverse vision-language tasks by decoupling inference from evaluation and utilizing acceleration tools for enhanced performance insights.
Authors:Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
Abstract:
Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
中文: 现有的大规模视觉语言模型常因仅关注文本而忽略细粒度视觉信息,但提出的自回归语义视觉重建(ASVR)方法通过联合学习视觉与文本模态,显著提升了多模态理解能力,在多个基准测试中取得明显性能增益。
English: Current large vision-language models often neglect fine-grained visual details by focusing only on text, but the proposed Autoregressive Semantic Visual Reconstruction (ASVR) method enhances multimodal understanding by jointly learning visual and textual modalities, achieving significant performance improvements across benchmarks.
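The training objective can be pictured as two autoregressive cross-entropy terms, one over text tokens and one over discrete semantic visual tokens. The weighting and vocabulary sizes below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def asvr_style_loss(text_logits, text_targets, vis_logits, vis_targets, w=1.0):
    """Joint autoregressive objective (sketch): next-token prediction on text
    plus next-token prediction on discrete *semantic* visual tokens, rather
    than raw pixels. Shapes: logits (B, T, V), targets (B, T)."""
    lt = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    lv = F.cross_entropy(vis_logits.flatten(0, 1), vis_targets.flatten())
    return lt + w * lv

loss = asvr_style_loss(torch.randn(2, 8, 32000), torch.randint(0, 32000, (2, 8)),
                       torch.randn(2, 16, 8192), torch.randint(0, 8192, (2, 16)))
```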
Authors:Haozhen Zhang, Tao Feng, Jiaxuan You
Abstract:
The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost, opening a pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
中文: 本文提出Router-R1强化学习框架,通过将多模型路由构建为序列决策过程,动态调用并整合不同大语言模型的优势,在多个基准测试中实现了性能与成本的最优平衡。
English: This paper introduces Router-R1, a reinforcement learning framework that enhances multi-LLM routing by dynamically selecting and aggregating models through sequential decision-making, achieving superior performance and cost efficiency across diverse benchmarks.
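The three-part rule-based reward is straightforward to sketch. The tag format, weights, and pricing below are illustrative assumptions, not the paper's constants.

```python
import re

def router_reward(response, answer, gold, calls, prices,
                  w_fmt=0.2, w_out=1.0, w_cost=0.1):
    """Rule-based reward (illustrative): format check plus final-outcome match
    minus a price-weighted cost of the models invoked along the way."""
    fmt_ok = 1.0 if re.search(r"<think>.*?</think>", response, re.S) else 0.0
    outcome = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    cost = sum(prices.get(m, 0.0) for m in calls)
    return w_fmt * fmt_ok + w_out * outcome - w_cost * cost

r = router_reward("<think>needs a math-strong model</think>", "42", "42",
                  calls=["small-llm", "math-llm"],
                  prices={"small-llm": 0.1, "math-llm": 0.5})
```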
Authors:Daniel Shao, Richard J. Chen, Andrew H. Song, Joel Runevic, Ming Y. Lu, Tong Ding, Faisal Mahmood
Abstract:
Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images. However, MIL often struggles with small, weakly supervised clinical datasets. In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the transferability of MIL models remains poorly understood. In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on different organs than the target task, consistently outperform models trained from scratch. Moreover, pretraining on pancancer datasets enables strong generalization across organs and tasks, outperforming slide foundation models while using substantially less pretraining data. These findings highlight the robust adaptability of MIL models and demonstrate the benefits of leveraging transfer learning to boost performance in CPath. Lastly, we provide a resource which standardizes the implementation of MIL models and collection of pretrained model weights on popular CPath tasks, available at https://github.com/mahmoodlab/MIL-Lab
Chinese: 本研究表明,在计算病理学中,经过预训练的多示例学习模型即使应用于不同器官也始终优于从头训练的模型,且全癌种预训练能以比基础模型更少的数据实现强大的泛化能力。
English: This study demonstrates that pretrained Multiple Instance Learning (MIL) models in computational pathology consistently outperform models trained from scratch, even across different organs, and that pancancer pretraining enables strong generalization with less data than foundation models.
Authors:Lei Zhang, Jiaxi Yang, Min Yang, Jian Yang, Mouxiang Chen, Jiajun Zhang, Zeyu Cui, Binyuan Hui, Junyang Lin
Abstract:
We introduce **SWE-Flow**, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, **SWE-Flow** automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of **SWE-Flow** is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step *development schedule*. At each step, **SWE-Flow** produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the **SWE-Flow-Eval** benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images at [Github](https://github.com/Hambaobao/SWE-Flow).
中文摘要:SWE-Flow是一种基于测试驱动开发的新型数据合成框架,通过单元测试自动生成可验证的开发步骤,创建的SWE-Flow-Eval基准显著提升了AI模型在代码生成任务中的表现。
English Summary: SWE-Flow is a novel TDD-based data synthesis framework that automatically generates verifiable development steps from unit tests, creating the SWE-Flow-Eval benchmark which significantly improves AI coding performance when used for fine-tuning.
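A toy illustration of the schedule that a Runtime Dependency Graph induces; the dependency dict is hypothetical, and Python's standard `graphlib` stands in for SWE-Flow's own RDG construction:

```python
from graphlib import TopologicalSorter

# Hypothetical runtime dependencies: each function maps to the set of
# functions it depends on (callees that must exist before it can run).
rdg = {
    "test_parse":    {"parse"},
    "parse":         {"tokenize"},
    "test_tokenize": {"tokenize"},
    "tokenize":      set(),
}

# A topological order over the RDG yields an incremental development
# schedule: implement each unit only after its dependencies exist.
schedule = list(TopologicalSorter(rdg).static_order())
print(schedule)  # e.g. ['tokenize', 'parse', 'test_tokenize', 'test_parse']
```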
Authors:Theo Zhang, Madurya Suresh, Anne S. Warlaumont, Kasia Hitczenko, Alejandrina Cristia, Margaret Cychosz
Abstract:
Speech technology systems struggle with many downstream tasks for child speech due to small training corpora and the difficulties that child speech poses. We apply a novel dataset, SpeechMaturity, to state-of-the-art transformer models to address a fundamental classification task: identifying child vocalizations. Unlike previous corpora, our dataset captures maximally ecologically-valid child vocalizations across an unprecedented sample, comprising children acquiring 25+ languages in the U.S., Bolivia, Vanuatu, Papua New Guinea, Solomon Islands, and France. The dataset contains 242,004 labeled vocalizations, orders of magnitude larger than previous corpora. Models were trained to distinguish between cry, laughter, mature speech (consonant+vowel), and immature speech (just consonant or vowel). Models trained on the dataset outperformed state-of-the-art models trained on previous datasets, achieved classification accuracy comparable to humans, and were robust across rural and urban settings.
中文: 研究人员开发了一个名为SpeechMaturity的大型生态有效数据集,用于训练Transformer模型精确分类儿童发声,实现了与人类相当的分类准确率,并在不同环境中表现出强鲁棒性。
English: Researchers developed a large, ecologically-valid dataset called SpeechMaturity to train transformer models that accurately classify child vocalizations, achieving human-comparable accuracy and robustness across diverse settings.
Authors:Fabian Immel, Jan-Hendrik Pauls, Richard Fehler, Frank Bieder, Jonas Merkert, Christoph Stiller
Abstract:
Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored the use of standard definition (SD) maps as prior, which are significantly easier to maintain. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) w.r.t. map construction without priors and up to +3.2 mAP (+20%) w.r.t. previous approaches that already use SD map priors. Code is available at https://github.com/immel-f/SDTagNet
中文: SDTagNet是一种在线高精地图构建方法,通过充分利用广泛可用的标准地图,结合折线数据和文本注释以丰富语义信息,并采用点级编码器和正交标识符统一整合所有地图元素,从而显著提升远距离检测精度。
English: SDTagNet is an online HD map construction method that enhances far-range detection accuracy by fully utilizing widely available SD maps, incorporating both polyline data and textual annotations for enriched semantic information, and integrating all map elements uniformly through a point-level encoder and orthogonal identifiers.
Authors:Chenyu Lian, Hong-Yu Zhou, Dongyun Liang, Jing Qin, Liansheng Wang
Abstract:
Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.
中文: ALTA方法通过适配预训练的视觉模型,以极低的计算成本实现了医学视觉与语言的高效对齐,在检索和零样本分类任务中表现卓越。
English: The proposed ALTA method efficiently aligns medical vision and language by adapting a pretrained vision model from masked record modeling, achieving superior performance in retrieval and zero-shot classification with minimal computational resources.
Authors:Victoria Hankemeier, Malte Schilling
Abstract:
Developments in Deep Learning have significantly improved time series forecasting by enabling more accurate modeling of complex temporal dependencies inherent in sequential data. The effectiveness of such models is often demonstrated on limited sets of specific real-world data. Although this allows for comparative analysis, it still does not demonstrate how specific data characteristics align with the architectural strengths of individual models. Our research aims to uncover clear connections between time series characteristics and particular models. We introduce a novel dataset generated using Gaussian Processes, specifically designed to exhibit distinct, known characteristics for targeted evaluation of model adaptability. Furthermore, we present TimeFlex, a new model that incorporates a modular architecture tailored to handle diverse temporal dynamics, including trends and periodic patterns. This model is compared to current state-of-the-art models, offering a deeper understanding of how models perform under varied time series conditions.
Chinese: 深度学习进展通过建模复杂依赖提升了时间序列预测能力,但现有评估未能将数据特性与模型优势关联,为此引入高斯过程生成的数据集和TimeFlex模型,以针对性评估模型适应性。
English: Deep learning advancements have enhanced time series forecasting by modeling complex dependencies, yet current evaluations fail to link data traits with model strengths, prompting the introduction of a Gaussian Processes dataset and TimeFlex model for targeted adaptability assessments.
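A minimal sketch of sampling time series with known, controllable characteristics from a Gaussian Process, in the spirit of the dataset above; the kernels and hyperparameters are illustrative choices:

```python
import numpy as np

def rbf(t, length=5.0):
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def periodic(t, period=20.0, length=1.0):
    d = np.abs(t[:, None] - t[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length ** 2)

rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)

# Compose kernels to control which characteristics the sample exhibits:
# a smooth trend (RBF) plus a seasonal component (periodic).
K = rbf(t) + 0.5 * periodic(t)
K += 1e-6 * np.eye(len(t))          # jitter for numerical stability
series = rng.multivariate_normal(np.zeros(len(t)), K)
print(series.shape)                 # (200,)
```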
Authors:Hongjie Zhu, Xiwei Liu, Rundong Xue, Zeyu Zhang, Yong Xu, Daji Ergu, Ying Cai, Yang Zhao
Abstract:
In the era of information explosion, efficiently leveraging large-scale unlabeled data while minimizing the reliance on high-quality pixel-level annotations remains a critical challenge in the field of medical imaging. Semi-supervised learning (SSL) enhances the utilization of unlabeled data by facilitating knowledge transfer, significantly improving the performance of fully supervised models and emerging as a highly promising research direction in medical image analysis. Inspired by the ability of Vision Foundation Models (e.g., SAM-2) to provide rich prior knowledge, we propose SSS (Semi-Supervised SAM-2), a novel approach that leverages SAM-2's robust feature extraction capabilities to uncover latent knowledge in unlabeled medical images, thus effectively enhancing feature support for fully supervised medical image segmentation. Specifically, building upon the single-stream "weak-to-strong" consistency regularization framework, this paper introduces a Discriminative Feature Enhancement (DFE) mechanism to further explore the feature discrepancies introduced by various data augmentation strategies across multiple views. By leveraging feature similarity and dissimilarity across multi-scale augmentation techniques, the method reconstructs and models the features, thereby effectively optimizing the salient regions. Furthermore, a prompt generator is developed that integrates Physical Constraints with a Sliding Window (PCSW) mechanism to generate input prompts for unlabeled data, fulfilling SAM-2's requirement for additional prompts. Extensive experiments demonstrate the superiority of the proposed method for semi-supervised medical image segmentation on two multi-label datasets, i.e., ACDC and BHSD. Notably, SSS achieves an average Dice score of 53.15 on BHSD, surpassing the previous state-of-the-art method by +3.65 Dice. Code will be available at https://github.com/AIGeeksGroup/SSS.
中文: 提出的SSS方法利用SAM-2的特征提取能力,通过引入判别性特征增强机制优化半监督医学图像分割,在BHSD数据集上以53.15的Dice分数实现了最先进的性能。
English: The proposed SSS method leverages SAM-2's feature extraction capabilities and introduces a Discriminative Feature Enhancement mechanism to optimize semi-supervised medical image segmentation, achieving state-of-the-art performance with a 53.15 Dice score on the BHSD dataset.
Authors:Hang Ye, Gaoxiang Duan, Haoran Zeng, Yangxin Zhu, Lingxue Meng, Xiaoying Zheng, Yongxin Zhu
Abstract:
Multivariate long-term and efficient time series forecasting is a key requirement for a variety of practical applications, and there are complex interleaving time dynamics in time series data that require decomposition modeling. Traditional time series decomposition methods are rigid, relying on fixed rules, and are insufficient for mining the potential information of the series and adapting to the dynamic characteristics of complex series. On the other hand, the Transformer-based models for time series forecasting struggle to effectively model long sequences and intricate dynamic relationships due to their high computational complexity. To overcome these limitations, we introduce KARMA, with an Adaptive Time Channel Decomposition module (ATCD) to dynamically extract trend and seasonal components. It further integrates a Hybrid Frequency-Time Decomposition module (HFTD) to further decompose the series into frequency-domain and time-domain components. These components are coupled with multi-scale Mamba-based KarmaBlock to efficiently process global and local information in a coordinated manner. Experiments on eight real-world datasets from diverse domains demonstrate that KARMA significantly outperforms mainstream baseline methods in both predictive accuracy and computational efficiency. Code and full results are available at this repository: https://github.com/yedadasd/KARMA
中文摘要:KARMA通过自适应时间通道分解和混合频时分解模块动态提取时序成分,结合多尺度Mamba模块协同处理信息,在多个真实数据集上显著超越了主流基线方法的预测精度与计算效率。
English Summary: KARMA introduces adaptive decomposition modules and Mamba-based blocks to dynamically extract time series components, significantly outperforming existing methods in both accuracy and efficiency across multiple real-world datasets.
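For intuition, a fixed-rule decomposition in the style of KARMA's pipeline; the learned ATCD/HFTD modules are replaced here by a moving-average trend and an FFT low-pass split, which is only a stand-in:

```python
import numpy as np

def decompose(x: np.ndarray, window: int = 24, keep: int = 8):
    """Split a series into trend, frequency-domain, and time-domain parts."""
    kernel = np.ones(window) / window
    trend = np.convolve(x, kernel, mode="same")   # moving-average trend
    seasonal = x - trend
    spec = np.fft.rfft(seasonal)
    low = spec.copy()
    low[keep:] = 0                                # keep only lowest frequencies
    freq_part = np.fft.irfft(low, n=len(x))       # 'frequency-domain' component
    time_part = seasonal - freq_part              # residual time-domain component
    return trend, freq_part, time_part

x = np.sin(np.linspace(0, 20, 240)) + 0.1 * np.random.default_rng(1).standard_normal(240)
for part in decompose(x):
    print(part.shape)
```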
Authors:Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, Jinsong Su
Abstract:
Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM's parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model's parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model's parametric knowledge, which undermines the model's internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/DeepLearnXMU/Faithful-RAG
中文: 本文提出FaithfulRAG创新框架,通过显式建模参数化知识与检索上下文间的差异,在生成响应前进行自我推理以整合冲突事实,从而解决检索增强大语言模型中的知识冲突问题。
English: This paper introduces FaithfulRAG, a novel framework that resolves knowledge conflicts in retrieval-augmented LLMs by explicitly modeling discrepancies between parametric knowledge and retrieved context, allowing self-reasoning to integrate conflicting facts before response generation.
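A sketch of the control flow implied by the abstract: elicit parametric facts, surface fact-level conflicts with the context, then self-think before answering. The prompts and the `llm` stub are placeholders, not FaithfulRAG's actual templates:

```python
def llm(prompt: str) -> str:
    """Stand-in for an LLM call; replace with a real client."""
    return "[model output]"

def faithful_answer(question: str, context: str) -> str:
    # Step 1 (assumption): elicit the model's own parametric facts.
    self_facts = llm(f"List the facts you already know about: {question}")
    # Step 2 (assumption): contrast them with the retrieved context at the
    # fact level to surface explicit conflicts.
    conflicts = llm(
        "Compare these facts with the context and list any conflicts.\n"
        f"Facts: {self_facts}\nContext: {context}"
    )
    # Step 3: a self-thinking pass reasons over the conflicts before answering.
    return llm(
        "Think about which source is more reliable for each conflict, then "
        f"answer faithfully.\nQuestion: {question}\nContext: {context}\n"
        f"Conflicts: {conflicts}"
    )

print(faithful_answer("Who discovered penicillin?", "[retrieved passage]"))
```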
Authors:Andrew Shin
Abstract:
While large language models (LLMs) have achieved remarkable performance in various tasks including mathematical reasoning, their development typically demands prohibitive computational resources. Recent advancements have reduced costs for training capable models, yet even these approaches rely on high-end hardware clusters. In this paper, we demonstrate that a single average gaming GPU can train a solid mathematical reasoning model, by integrating reinforcement learning and memory optimization techniques. Specifically, we train a 1.5B parameter mathematical reasoning model on RTX 3080 Ti of 16GB memory that achieves comparable or better performance on mathematical reasoning benchmarks than models several times larger, in resource-constrained environments. Our results challenge the paradigm that state-of-the-art mathematical reasoning necessitates massive infrastructure, democratizing access to high-performance AI research. https://github.com/shinandrew/YouronMath.
Chinese: 该研究通过强化学习和内存优化技术,在单个游戏显卡上成功训练出性能优异的15亿参数数学推理模型,打破了高性能AI研究必须依赖庞大计算资源的传统模式。
English: This research demonstrates that a single gaming GPU can train a competitive 1.5B-parameter mathematical reasoning model using reinforcement learning and memory optimization, challenging the need for massive computational infrastructure.
Authors:Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
Abstract:
We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. Code: https://github.com/ananthu-aniraj/ifam
Chinese: 本文提出了一种基于注意力的两阶段框架,首先识别任务相关的图像区域,然后使用二进制注意力掩码聚焦分析这些区域,显著提高了对虚假关联和分布外背景的鲁棒性。
English: This paper presents a two-stage attention-based framework that first identifies task-relevant image regions and then uses binary attention masks to focus analysis on these areas, significantly enhancing robustness against spurious correlations and out-of-distribution backgrounds.
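A minimal sketch of the stage-2 mechanism: a binary mask restricts which tokens can be attended to, so filtered-out regions cannot influence the prediction. The single-head attention and mask handling are simplified assumptions:

```python
import torch

def masked_self_attention(x: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention restricted to stage-1 selected tokens.

    x:    (B, N, D) patch embeddings of the full image
    keep: (B, N) binary mask from stage 1; only kept tokens can be attended to.
    """
    attn_mask = keep.bool()[:, None, :]                      # (B, 1, N) over keys
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5    # (B, N, N)
    scores = scores.masked_fill(~attn_mask, float("-inf"))   # block masked keys
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(2, 16, 32)
keep = (torch.rand(2, 16) > 0.5).float()
keep[:, 0] = 1.0          # guarantee at least one attendable token per sample
print(masked_self_attention(x, keep).shape)                  # torch.Size([2, 16, 32])
```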
Authors:Jiajun Li, Yue Ma, Xinyu Zhang, Qingyan Wei, Songhua Liu, Linfeng Zhang
Abstract:
Recent studies on Visual Autoregressive (VAR) models have highlighted that high-frequency components, or later steps, in the generation process contribute disproportionately to inference latency. However, the underlying computational redundancy involved in these steps has yet to be thoroughly investigated. In this paper, we conduct an in-depth analysis of the VAR inference process and identify two primary sources of inefficiency: step redundancy and unconditional branch redundancy. To address step redundancy, we propose an automatic step-skipping strategy that selectively omits unnecessary generation steps to improve efficiency. For unconditional branch redundancy, we observe that the information gap between the conditional and unconditional branches is minimal. Leveraging this insight, we introduce unconditional branch replacement, a technique that bypasses the unconditional branch to reduce computational cost. Notably, we observe that the effectiveness of acceleration strategies varies significantly across different samples. Motivated by this, we propose SkipVAR, a sample-adaptive framework that leverages frequency information to dynamically select the most suitable acceleration strategy for each instance. To evaluate the role of high-frequency information, we introduce high-variation benchmark datasets that test model sensitivity to fine details. Extensive experiments show SkipVAR achieves over 0.88 average SSIM with up to 1.81x overall acceleration and 2.62x speedup on the GenEval benchmark, maintaining model quality. These results confirm the effectiveness of frequency-aware, training-free adaptive acceleration for scalable autoregressive image generation. Our code is available at https://github.com/fakerone-li/SkipVAR and has been publicly released.
中文: 针对视觉自回归模型中的步骤冗余和无条件分支冗余问题,研究者提出了SkipVAR框架,通过动态选择加速策略,在保持图像质量的同时实现了显著的速度提升。
English: Recent research on Visual Autoregressive models identifies step and unconditional branch redundancies as key inefficiencies, leading to the development of SkipVAR, a sample-adaptive framework that dynamically applies acceleration strategies to achieve significant speed improvements while preserving image quality.
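An illustrative rule for the sample-adaptive choice that SkipVAR makes; the spectral statistic and threshold below are assumptions used only to show the control flow:

```python
import numpy as np

def choose_strategy(latent: np.ndarray, hf_threshold: float = 0.25) -> str:
    """Pick an acceleration strategy from a sample's frequency content.

    Samples with little high-frequency energy tolerate skipping the later,
    high-frequency generation steps; the rest only drop the unconditional
    branch. The statistic and threshold are illustrative, not SkipVAR's.
    """
    spec = np.abs(np.fft.fftshift(np.fft.fft2(latent)))
    h, w = spec.shape
    lo = spec[h // 2 - h // 8: h // 2 + h // 8,
              w // 2 - w // 8: w // 2 + w // 8].sum()        # central = low freq
    hf_ratio = 1.0 - lo / spec.sum()
    return "skip_steps" if hf_ratio < hf_threshold else "replace_uncond_branch"

print(choose_strategy(np.random.default_rng(0).random((64, 64))))
```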
Authors:Chongyi Zheng, Seohong Park, Sergey Levine, Benjamin Eysenbach
Abstract:
Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across time is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). Compared with alternative methods for pre-training, our experiments on $36$ state-based and $4$ image-based benchmark tasks demonstrate that the proposed method achieves $1.8 \times$ median improvement in returns and increases success rates by $36\%$. Website: https://chongyi-zheng.github.io/infom Code: https://github.com/chongyi-zheng/infom
Chinese: 大规模预训练通过基于意图的流匹配模型预测强化学习中具有长期依赖性的未来状态占用,相比现有方法显著提升了任务成功率与回报中位数。
English: Large-scale pre-training in reinforcement learning addresses long-term action dependencies by modeling future state occupancy with intention-aware flow matching, achieving significant performance gains over existing methods.
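A minimal sketch of the flow-matching objective for an intention-conditioned occupancy model, assuming straight-path conditional flow matching between current and future states; the tiny network and latent handling are illustrative:

```python
import torch

def flow_matching_loss(model, s0: torch.Tensor, sT: torch.Tensor, z: torch.Tensor):
    """Conditional flow matching toward future-state samples.

    s0: (B, D) current states; sT: (B, D) temporally distant future states;
    z:  (B, K) latent intention, inferred elsewhere in InFOM's full method.
    Trains `model` to predict the straight-path velocity between the pair.
    """
    t = torch.rand(s0.shape[0], 1)            # random interpolation times
    x_t = (1 - t) * s0 + t * sT               # point on the straight path
    v_target = sT - s0                        # constant target velocity
    v_pred = model(x_t, t, z)
    return ((v_pred - v_target) ** 2).mean()

class TinyField(torch.nn.Module):
    def __init__(self, d=4, k=2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d + 1 + k, 64), torch.nn.ReLU(), torch.nn.Linear(64, d))
    def forward(self, x, t, z):
        return self.net(torch.cat([x, t, z], dim=-1))

model = TinyField()
loss = flow_matching_loss(model, torch.randn(8, 4), torch.randn(8, 4), torch.randn(8, 2))
print(loss.item())
```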
Authors:José Morano, Botond Fazekas, Emese Sükei, Ronald Fecso, Taha Emre, Markus Gumpinger, Georg Faustmann, Marzieh Oghbaie, Ursula Schmidt-Erfurth, Hrvoje Bogunović
Abstract:
Artificial intelligence (AI) has become a fundamental tool for assisting clinicians in analyzing ophthalmic images, such as optical coherence tomography (OCT). However, developing AI models often requires extensive annotation, and existing models tend to underperform on independent, unseen data. Foundation models (FMs), large AI models trained on vast unlabeled datasets, have shown promise in overcoming these challenges. Nonetheless, available FMs for ophthalmology lack extensive validation, especially for segmentation tasks, and focus on a single imaging modality. In this context, we propose MIRAGE, a novel multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO) images. Additionally, we propose a new evaluation benchmark with OCT/SLO classification and segmentation tasks. The comparison with general and specialized FMs and segmentation methods shows the superiority of MIRAGE in both types of tasks, highlighting its suitability as a basis for the development of robust AI systems for retinal OCT image analysis. Both MIRAGE and the evaluation benchmark are publicly available: https://github.com/j-morano/MIRAGE.
Chinese: MIRAGE是一种新型多模态基础模型,用于分析OCT和SLO图像,在分类和分割任务中相比现有方法展现出卓越性能。
English: MIRAGE is a novel multimodal foundation model designed for analyzing OCT and SLO images, demonstrating superior performance in classification and segmentation tasks compared to existing methods.
Authors:Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
Abstract:
We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
Chinese: SeerAttention-R 是一种专为推理模型长解码设计的稀疏注意力框架,在保持接近无损精度的同时,通过优化内核在 H100 GPU 上实现了高达 9 倍的加速。
English: SeerAttention-R is a sparse attention framework designed for efficient long decoding in reasoning models, maintaining near-lossless accuracy with high sparsity and achieving up to 9x speedup on H100 GPUs through optimized kernels.
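A sketch of block-level gating for sparse decoding: pool the KV cache into blocks, score them against the current query, and attend only within the top-scoring blocks. The mean-pooled scoring stands in for SeerAttention-R's learned, self-distilled gate:

```python
import torch

def topk_block_mask(q, k, block=64, keep=4):
    """Select key blocks per decoding query via pooled scores.

    q: (B, H, 1, D) current decoding query; k: (B, H, N, D) cached keys.
    Returns a (B, H, N) boolean mask; only tokens in selected blocks attend.
    """
    B, H, N, D = k.shape
    nblk = N // block
    k_pooled = k[:, :, : nblk * block].reshape(B, H, nblk, block, D).mean(3)
    scores = (q @ k_pooled.transpose(-2, -1)).squeeze(-2)   # (B, H, nblk)
    top = scores.topk(min(keep, nblk), dim=-1).indices
    mask = torch.zeros(B, H, nblk, dtype=torch.bool)
    mask.scatter_(-1, top, True)
    return mask.repeat_interleave(block, dim=-1)            # expand to tokens

q = torch.randn(1, 2, 1, 16)
k = torch.randn(1, 2, 512, 16)
print(topk_block_mask(q, k).sum(-1))   # 4 blocks * 64 tokens kept per head
```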
Authors:Leqi Shen, Guoqiang Gong, Tianxiang Hao, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Jungong Han, Guiguang Ding
Abstract:
The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
Chinese: 本文提出DiscoVLA方法,通过融合图像-视频特征和利用图像级对齐知识,同时解决CLIP模型在视频-文本检索中的视觉、语言和对齐三大差异,在MSRVTT数据集上以50.5%的R@1分数实现了最优性能。
English: This paper introduces DiscoVLA, a method that simultaneously addresses vision, language, and alignment discrepancies in adapting CLIP for video-text retrieval by integrating image-video features and leveraging image-level alignment knowledge, achieving state-of-the-art performance with a 50.5% R@1 score on MSRVTT.
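A sketch of what Image-to-Video Alignment Distillation could look like: the student's video-text similarity distribution is pulled toward the teacher's image-text distribution via KL divergence. The shapes and temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def alignment_distillation(img_feats, vid_feats, txt_feats, tau=0.05):
    """Distill image-level alignment into video-level alignment.

    img_feats: (B, D) frame-aggregated image-level features (teacher side)
    vid_feats: (B, D) video-level features (student side)
    txt_feats: (B, D) text features; row i pairs with video i.
    """
    img = F.normalize(img_feats, dim=-1)
    vid = F.normalize(vid_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    teacher = (img @ txt.T / tau).softmax(-1).detach()   # stop-grad on teacher
    student = (vid @ txt.T / tau).log_softmax(-1)
    return F.kl_div(student, teacher, reduction="batchmean")

loss = alignment_distillation(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64))
print(loss.item())
```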
Authors:Shiqin Tang, Shujian Yu
Abstract:
Extracting meaningful latent representations from high-dimensional sequential data is a crucial challenge in machine learning, with applications spanning natural science and engineering. We introduce InfoDPCCA, a dynamic probabilistic Canonical Correlation Analysis (CCA) framework designed to model two interdependent sequences of observations. InfoDPCCA leverages a novel information-theoretic objective to extract a shared latent representation that captures the mutual structure between the data streams and balances representation compression and predictive sufficiency while also learning separate latent components that encode information specific to each sequence. Unlike prior dynamic CCA models, such as DPCCA, our approach explicitly enforces the shared latent space to encode only the mutual information between the sequences, improving interpretability and robustness. We further introduce a two-step training scheme to bridge the gap between information-theoretic representation learning and generative modeling, along with a residual connection mechanism to enhance training stability. Through experiments on synthetic and medical fMRI data, we demonstrate that InfoDPCCA excels as a tool for representation learning. Code of InfoDPCCA is available at https://github.com/marcusstang/InfoDPCCA.
中文: InfoDPCCA是一种动态概率典型相关分析框架,通过信息论目标提取共享和序列特定的潜在表示,在建模相互依赖的序列数据时提升了可解释性和鲁棒性。
English: InfoDPCCA is a dynamic probabilistic CCA framework that uses an information-theoretic objective to extract shared and sequence-specific latent representations, improving interpretability and robustness in modeling interdependent sequential data.
Authors:Zike Wu, Qi Yan, Xuanyu Yi, Lele Wang, Renjie Liao
Abstract:
Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams is crucial for numerous real-world applications. However, existing methods struggle to jointly address three key challenges: 1) processing uncalibrated inputs in real time, 2) accurately modeling dynamic scene evolution, and 3) maintaining long-term stability and computational efficiency. To this end, we introduce StreamSplat, the first fully feed-forward framework that transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner, capable of recovering scene dynamics from temporally local observations. We propose two key technical innovations: a probabilistic sampling mechanism in the static encoder for 3DGS position prediction, and a bidirectional deformation field in the dynamic decoder that enables robust and efficient dynamic modeling. Extensive experiments on static and dynamic benchmarks demonstrate that StreamSplat consistently outperforms prior works in both reconstruction quality and dynamic scene modeling, while uniquely supporting online reconstruction of arbitrarily long video streams. Code and models are available at https://github.com/nickwzk/StreamSplat.
中文: StreamSplat是一种开创性的前馈框架,能将未校准视频流在线转换为动态3D高斯溅射表示,在实时重建、动态建模和计算效率方面表现卓越。
English: StreamSplat is a pioneering feed-forward framework that converts uncalibrated video streams into dynamic 3D Gaussian Splatting representations online, excelling in real-time reconstruction, dynamic modeling, and computational efficiency.
Authors:Jingguo Qu, Xinyang Han, Tonghuan Xiao, Jia Ai, Juan Wu, Tong Zhao, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying
Abstract:
Medical ultrasonography is an essential imaging technique for examining superficial organs and tissues, including lymph nodes, breast, and thyroid. It employs high-frequency ultrasound waves to generate detailed images of the internal structures of the human body. However, manually contouring regions of interest in these images is a labor-intensive task that demands expertise and often results in inconsistent interpretations among individuals. Vision-language foundation models, which have excelled in various computer vision applications, present new opportunities for enhancing ultrasound image analysis. Yet, their performance is hindered by the significant differences between natural and medical imaging domains. This research seeks to overcome these challenges by developing domain adaptation methods for vision-language foundation models. In this study, we explore the fine-tuning pipeline for vision-language foundation models by utilizing a large language model as a text refiner, together with specially designed adaptation strategies and task-driven heads. Our approach has been extensively evaluated on six ultrasound datasets and two tasks: segmentation and classification. The experimental results show that our method can effectively improve the performance of vision-language foundation models for ultrasound image analysis, and outperform the existing state-of-the-art vision-language and pure foundation models. The source code of this study is available at https://github.com/jinggqu/NextGen-UIA.
中文: 本研究开发了视觉语言基础模型的领域自适应方法,通过使用大语言模型作为文本优化器和任务驱动头来提升超声图像分析性能,在多个数据集的分割和分类任务中均超越了现有最优模型。
English: This study develops domain adaptation methods for vision-language foundation models, utilizing large language models as text refiners and task-driven heads to enhance ultrasound image analysis, achieving superior performance in segmentation and classification tasks across multiple datasets.
Authors:Zhiyuan Ma, Ruixun Liu, Sixian Liu, Jianjun Li, Bowen Zhou
Abstract:
Recently, the rectified flow (RF) has emerged as the new state-of-the-art among flow-based diffusion models due to its high efficiency advantage in straight path sampling, especially with the amazing images generated by a series of RF models such as Flux 1.0 and SD 3.0. Although a straight-line connection between the noisy and natural data distributions is intuitive, fast, and easy to optimize, it still inevitably leads to: 1) Diversity concerns, which arise since straight-line paths only cover a fairly restricted sampling space. 2) Multi-scale noise modeling concerns, since the straight line flow only needs to optimize the constant velocity field $\bm v$ between the two distributions $\bm{\pi}_0$ and $\bm{\pi}_1$. In this work, we present Discretized-RF, a new family of rectified flow (also called momentum flow models since they refer to the previous velocity component and the random velocity component in each diffusion step), which discretizes the straight path into a series of variable velocity field sub-paths (namely ``momentum fields'') to expand the search space, especially when close to the distribution $p_\text{noise}$. Different from the previous case where noise is directly superimposed on $\bm x$, we introduce noise on the velocity $\bm v$ of the sub-path to change its direction in order to improve the diversity and multi-scale noise modeling abilities. Experimental results on several representative datasets demonstrate that learning momentum flow matching by sampling random velocity fields will produce trajectories that are both diverse and efficient, and can consistently generate high-quality and diverse results. Code is available at https://github.com/liuruixun/momentum-fm.
中文:Discretized-RF通过将直线采样路径分解为可变速度子路径并在速度分量上引入噪声,提升了整流流模型的多样性和多尺度噪声建模能力,同时保持了高效性。
English: Discretized-RF enhances rectified flow models by breaking the straight sampling path into variable-velocity sub-paths with noise introduced on velocity components, improving both diversity and multi-scale noise modeling while maintaining efficiency.
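A conceptual sketch of sampling with noise injected on the velocity rather than on $\bm x$, with momentum carried across sub-paths; the schedules and coefficients are illustrative, not the paper's:

```python
import torch

@torch.no_grad()
def momentum_flow_sample(model, x, steps=8, beta=0.9, sigma=0.1):
    """Euler sampling with momentum and velocity-space noise.

    Instead of following one constant-velocity straight line, each sub-path
    perturbs the velocity direction (noise on v, not on x) and mixes in the
    previous velocity. The noise decays as sampling leaves p_noise.
    """
    v_prev = torch.zeros_like(x)
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i / steps)
        v = model(x, t)
        v = beta * v_prev + (1 - beta) * v                     # momentum term
        v = v + sigma * (1 - i / steps) * torch.randn_like(v)  # noise on velocity
        x = x + v / steps                                      # Euler sub-path step
        v_prev = v
    return x

net = torch.nn.Sequential(torch.nn.Linear(5, 32), torch.nn.Tanh(), torch.nn.Linear(32, 4))
model = lambda x, t: net(torch.cat([x, t], dim=-1))
print(momentum_flow_sample(model, torch.randn(8, 4)).shape)
```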
Authors:Yuni Susanti, Michael Färber
Abstract:
Inferring causal relationships between variable pairs is crucial for understanding multivariate interactions in complex systems. Knowledge-based causal discovery -- which involves inferring causal relationships by reasoning over the metadata of variables (e.g., names or textual context) -- offers a compelling alternative to traditional methods that rely on observational data. However, existing methods using Large Language Models (LLMs) often produce unstable and inconsistent results, compromising their reliability for causal inference. To address this, we introduce a novel approach that integrates Knowledge Graphs (KGs) with LLMs to enhance knowledge-based causal discovery. Our approach identifies informative metapath-based subgraphs within KGs and further refines the selection of these subgraphs using Learning-to-Rank-based models. The top-ranked subgraphs are then incorporated into zero-shot prompts, improving the effectiveness of LLMs in inferring the causal relationship. Extensive experiments on biomedical and open-domain datasets demonstrate that our method outperforms most baselines by up to 44.4 points in F1 scores, evaluated across diverse LLMs and KGs. Our code and datasets are available on GitHub: https://github.com/susantiyuni/path-to-causality
中文摘要:本研究提出了一种将知识图谱与大型语言模型相结合的新方法,通过元路径子图排序和零样本提示增强基于知识的因果发现,在多个数据集上相比基线方法F1分数最高提升44.4分。
English Summary: This study introduces a novel approach that integrates Knowledge Graphs with Large Language Models to enhance knowledge-based causal discovery, achieving up to 44.4-point F1 score improvements over baselines through metapath subgraph ranking and zero-shot prompting.
Authors:Boyang Sun, Yu Yao, Xinshuai Dong, Zongfang Liu, Tongliang Liu, Yumou Qiu, Kun Zhang
Abstract:
In many real-world scenarios, variables of interest are often represented as discretized values due to measurement limitations. Applying Conditional Independence (CI) tests directly to such discretized data, however, can lead to incorrect conclusions. To address this, recent advancements have sought to infer the correct CI relationship between the latent variables through binarizing observed data. However, this process inevitably results in a loss of information, which degrades the test's performance. Motivated by this, this paper introduces a sample-efficient CI test that does not rely on the binarization process. We find that the independence relationships of latent continuous variables can be established by addressing an over-identifying restriction problem with the Generalized Method of Moments (GMM). Based on this insight, we derive an appropriate test statistic and establish its asymptotic distribution correctly reflecting CI by leveraging nodewise regression. Theoretical findings and empirical results across various datasets demonstrate the superiority and effectiveness of our proposed test. Our code implementation is provided at https://github.com/boyangaaaaa/DCT
中文摘要:本文提出了一种样本高效的独立性检验方法,通过广义矩估计解决过度识别问题,利用节点回归建立潜变量关系,避免了数据二值化造成的信息损失。
English Summary: This paper introduces a sample-efficient conditional independence test that avoids information loss from data binarization by using Generalized Method of Moments to establish latent variable relationships through nodewise regression.
Authors:Kongcheng Zhang, Qi Yao, Shunyu Liu, Yingjie Wang, Baisheng Lai, Jieping Ye, Mingli Song, Dacheng Tao
Abstract:
Recent advances of Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits the broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https://github.com/sastpg/CoVo.
中文摘要:提出的CoVo框架通过利用推理轨迹的一致性,使大型语言模型能够进行自我奖励的强化学习,无需外部监督即可达到与监督方法相当的性能。
English Summary: The proposed CoVo framework enables self-rewarding reinforcement learning for large language models by leveraging consistency in reasoning trajectories, achieving performance comparable to supervised methods without external supervision.
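A toy version of an intrinsic reward built from consistency and volatility of intermediate reasoning states; the scalar aggregation below is a simple stand-in for CoVo's vector-space strategy and curiosity bonus:

```python
import numpy as np

def covo_reward(step_probs: np.ndarray, own: int, alpha: float = 0.5) -> float:
    """Intrinsic reward from trajectory consistency and volatility.

    step_probs: (T, C) model likelihood of each of C candidate answers at
    each of T intermediate reasoning states; `own` indexes the trajectory's
    final answer. Correct trajectories tend to stay high on their own answer
    (consistency) and drift little toward other candidates (low volatility).
    """
    consistency = step_probs[:, own].mean()
    others = np.delete(step_probs, own, axis=1)
    volatility = np.abs(np.diff(others, axis=0)).mean() if len(step_probs) > 1 else 0.0
    return float(alpha * consistency - (1 - alpha) * volatility)

probs = np.array([[0.5, 0.3, 0.2], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05]])
print(covo_reward(probs, own=0))
```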
Authors:Michael Färber, David Lamprecht, Yuni Susanti
Abstract:
Graph Neural Networks (GNNs) have substantially advanced the field of recommender systems. However, despite the creation of more than a thousand knowledge graphs (KGs) under the W3C standard RDF, their rich semantic information has not yet been fully leveraged in GNN-based recommender systems. To address this gap, we propose a comprehensive integration of RDF KGs with GNNs that utilizes both the topological information from RDF object properties and the content information from RDF datatype properties. Our main focus is an in-depth evaluation of various GNNs, analyzing how different semantic feature initializations and types of graph structure heterogeneity influence their performance in recommendation tasks. Through experiments across multiple recommendation scenarios involving multi-million-node RDF graphs, we demonstrate that harnessing the semantic richness of RDF KGs significantly improves recommender systems and lays the groundwork for GNN-based recommender systems for the Linked Open Data cloud. The code and data are available on our GitHub repository: https://github.com/davidlamprecht/rdf-gnn-recommendation
中文: 本研究提出将RDF知识图谱与图神经网络相结合,充分利用其拓扑结构和内容信息,并通过大规模实验证明,利用RDF语义丰富性能显著提升推荐系统的性能。
English: This study proposes integrating RDF knowledge graphs with Graph Neural Networks to leverage both topological and content information, demonstrating through large-scale experiments that utilizing RDF semantic richness significantly enhances recommendation performance.
Authors:Yuhang Wang, Jun Li, Zhijian Wu, Jifeng Shen, Jianhua Xu, Wankou Yang
Abstract:
Within the family of convolutional neural networks, InceptionNeXt has shown excellent competitiveness in image classification and a number of downstream tasks. Built on parallel one-dimensional strip convolutions, however, it suffers from a limited ability to capture spatial dependencies along different dimensions and fails to fully exploit spatial modeling in local neighborhoods. Besides, inherent locality constraints of convolution operations are detrimental to effective global context modeling. To overcome these limitations, we propose a novel backbone architecture termed InceptionMamba in this study. More specifically, the traditional one-dimensional strip convolutions are replaced by orthogonal band convolutions in our InceptionMamba to achieve cohesive spatial modeling. Furthermore, global contextual modeling can be achieved via a bottleneck Mamba module, facilitating enhanced cross-channel information fusion and enlarged receptive field. Extensive evaluations on classification and various downstream tasks demonstrate that the proposed InceptionMamba achieves state-of-the-art performance with superior parameter and computational efficiency. The source code will be available at https://github.com/Wake1021/InceptionMamba.
中文: InceptionMamba架构采用正交带状卷积替代传统一维条带卷积,并结合瓶颈Mamba模块强化空间与全局上下文建模,在图像分类及下游任务中以卓越的参数量与计算效率实现最优性能。
English: The proposed InceptionMamba architecture replaces traditional one-dimensional strip convolutions with orthogonal band convolutions and integrates a bottleneck Mamba module to enhance spatial and global context modeling, achieving state-of-the-art performance in image classification and downstream tasks with superior efficiency.
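One possible reading of the orthogonal band convolutions, sketched as widened depthwise strips (k x m and m x k kernels); the exact band width and fusion used by InceptionMamba may differ:

```python
import torch

class OrthogonalBandConv(torch.nn.Module):
    """Two orthogonal band convolutions in place of 1-D strips (a sketch).

    Strip convolutions use 1xk and kx1 kernels; widening them into kxm and
    mxk bands (m > 1, an assumption) lets each branch also see a small
    neighborhood across its orthogonal dimension.
    """
    def __init__(self, channels: int, k: int = 11, m: int = 3):
        super().__init__()
        self.h_band = torch.nn.Conv2d(channels, channels, (m, k),
                                      padding=(m // 2, k // 2), groups=channels)
        self.v_band = torch.nn.Conv2d(channels, channels, (k, m),
                                      padding=(k // 2, m // 2), groups=channels)

    def forward(self, x):
        return self.h_band(x) + self.v_band(x)   # fuse the two orientations

x = torch.randn(1, 32, 56, 56)
print(OrthogonalBandConv(32)(x).shape)           # torch.Size([1, 32, 56, 56])
```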
Authors:Mohammadreza Salehi, Shashanka Venkataramanan, Ioana Simion, Efstratios Gavves, Cees G. M. Snoek, Yuki M Asano
Abstract:
Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. Integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve state-of-the-art by 1% to 6% on six image and video datasets and four evaluation benchmarks. The implementation is publicly available at our GitHub repository: https://github.com/SMSD75/MoSiC/tree/main
中文: 本文提出了一种运动引导的自监督学习框架,通过聚类密集点轨迹来学习时空一致的视频表征,在动态场景中增强了鲁棒性,并在多个基准测试中以1%至6%的优势超越了现有最优方法。
English: This paper introduces a motion-guided self-supervised learning framework that clusters dense point tracks to achieve spatiotemporally consistent video representations, improving robustness in dynamic scenes and outperforming state-of-the-art methods by 1% to 6% across multiple benchmarks.
Authors:Milica Škipina, Nikola Jovišić, Nicola Dall'Asen, Vanja Švenda, Anil Osman Tur, Slobodan Ilić, Elisa Ricci, Dubravko Ćulibrk
Abstract:
Mammography is the gold standard for the detection and diagnosis of breast cancer. This procedure can be significantly enhanced with Artificial Intelligence (AI)-based software, which assists radiologists in identifying abnormalities. However, training AI systems requires large and diverse datasets, which are often difficult to obtain due to privacy and ethical constraints. To address this issue, the paper introduces MAMmography ensemBle mOdel (MAMBO), a novel patch-based diffusion approach designed to generate full-resolution mammograms. Diffusion models have shown breakthrough results in realistic image generation, yet few studies have focused on mammograms, and none have successfully generated high-resolution outputs required to capture fine-grained features of small lesions. To achieve this, MAMBO integrates separate diffusion models to capture both local and global (image-level) contexts. The contextual information is then fed into the final model, significantly aiding the noise removal process. This design enables MAMBO to generate highly realistic mammograms of up to 3840x3840 pixels. Importantly, this approach can be used to enhance the training of classification models and extended to anomaly segmentation. Experiments, comprising both numerical evaluation and radiologist validation, assess MAMBO's capabilities in image generation, super-resolution, and anomaly segmentation, highlighting its potential to enhance mammography analysis for more accurate diagnoses and earlier lesion detection. The source code used in this study is publicly available at: https://github.com/iai-rs/mambo.
中文: 本文提出MAMBO模型,通过基于分块的扩散方法生成高分辨率乳腺X光图像,以解决人工智能训练数据短缺问题,实验验证其在提升诊断精度和早期病灶检测方面的潜力。
English: The paper introduces MAMBO, a patch-based diffusion model that generates high-resolution mammograms to overcome data limitations in AI training, with experiments demonstrating its effectiveness in enhancing diagnostic accuracy and early lesion detection.
Authors:Mahesh Godavarti
Abstract:
Transformers have demonstrated remarkable success in sequence modeling, yet effectively incorporating positional information remains a challenging and active area of research. In this paper, we introduce JoFormer, a journey-based Transformer architecture grounded in a recently proposed non-commutative algebra for composing transformations across positions. JoFormer represents relative positions through learnable directional transforms that are sequentially composed along the input, thereby extending and generalizing existing approaches based on relative position representations. We derive the JoFormer attention mechanism from first principles and show that it subsumes standard methods such as rotary transformations as special cases. To evaluate its effectiveness, we compare JoFormer to the RoFormer baseline on the Tiny Shakespeare character-level language modeling task. Our results demonstrate that JoFormer consistently achieves lower perplexity and faster convergence, highlighting the advantages of its more expressive, journey-based treatment of position. Notably, the per-token JoFormer is still a primitive, conceptual variant with layer-independent angles, yet it already demonstrates strong performance, underscoring its promise as a proof of concept for more expressive architectures. We conclude by discussing how JoFormer offers a principled approach to integrating positional structure into Transformer architectures. The code used in this work is available at https://github.com/mahesh-godavarti/joformer.
中文: JoFormer是一种基于可学习方向变换的新型Transformer架构,通过相对位置建模在语言建模任务中实现了比RoFormer更低的困惑度和更快的收敛速度。
English: JoFormer is a novel Transformer architecture that uses learnable directional transforms to model relative positions, achieving lower perplexity and faster convergence than RoFormer in language modeling tasks.
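A conceptual sketch of journey-based positions: each position's transform is the composition (here, a cumulative sum of learnable angles) of all steps before it, which reduces to rotary embeddings when every step shares the same angle. This is an illustrative reading, not JoFormer's full mechanism:

```python
import torch

class JourneyRotary(torch.nn.Module):
    """Positions as composed learnable 2D rotations (a conceptual sketch)."""
    def __init__(self, max_len: int, dim: int):
        super().__init__()
        assert dim % 2 == 0
        # One learnable step angle per position and per rotation plane.
        self.step_angles = torch.nn.Parameter(0.01 * torch.randn(max_len, dim // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, N, D)
        B, N, D = x.shape
        theta = self.step_angles[:N].cumsum(0)                # journey up to each position
        cos, sin = theta.cos(), theta.sin()                   # (N, D/2)
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin                  # rotate each 2D plane
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

rot = JourneyRotary(max_len=128, dim=32)
q = rot(torch.randn(2, 16, 32))   # apply to queries and keys before attention
print(q.shape)
```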
Authors:Mingyu Zheng, Zhifan Feng, Jia Wang, Lanrui Wang, Zheng Lin, Yang Hao, Weiping Wang
Abstract:
Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in table understanding ability of the target LLM and blindly pursue the increase of data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of the newly identified weakness data, which eventually serve as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62% (49.07% to 60.69%) with 27K GPT-4o synthetic data and outperforms state-of-the-art data synthesis baselines which use more training data. The code and data are available at https://github.com/SpursGoZmy/TableDreamer
中文: TableDreamer提出了一种渐进式、弱点引导的数据合成框架,通过生成多样化数据并迭代修正大语言模型在表格理解中的不足,以少量合成数据显著提升了模型性能。
English: TableDreamer introduces a progressive, weakness-guided framework to enhance table instruction tuning by synthesizing diverse data and iteratively addressing LLM weaknesses, significantly improving accuracy with minimal synthetic data.
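The progressive, weakness-guided loop reads as follows in outline; every helper here is a placeholder stub, shown only to make the control flow concrete:

```python
def synthesize_seed():
    """Stub: diverse seed tables and instructions."""
    return [(f"table_{i}", f"question_{i}") for i in range(4)]

def evaluate(model, data):
    """Stub: return the items the target LLM currently gets wrong."""
    return data[: len(data) // 2]

def expand(weak_items):
    """Stub: synthesize new instructions around identified weaknesses."""
    return [(t, q + " (harder variant)") for t, q in weak_items]

def finetune(model, data):
    """Stub: fine-tune the target LLM on the accumulated data."""
    return model

model, data = "llama3.1-8b-instruct", synthesize_seed()
for _ in range(3):                      # iterative exploration of the input space
    weak = evaluate(model, data)        # newly identified weakness data
    data += expand(weak)                # guide synthesis toward weaknesses
model = finetune(model, data)           # final training data -> target LLM
print(len(data))                        # 13 with these stubs
```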
Authors:Simon Roschmann, Quentin Bouniot, Vasilii Feofanov, Ievgen Redko, Zeynep Akata
Abstract:
Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal a new direction for reusing vision representations in a non-visual domain. Code is available at https://github.com/ExplainableML/TiViT.
中文: TiViT框架将时间序列转换为图像以利用预训练的视觉变换器,实现了最先进的分类性能,并揭示了在非视觉领域重用视觉表示的新方向。
English: The TiViT framework transforms time series into images to utilize pretrained Vision Transformers, achieving state-of-the-art classification performance and revealing a novel approach for reusing visual representations in non-visual domains.
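One simple way to realize the series-to-image conversion, shown under the assumption that segments are stacked as image rows and upscaled to ViT resolution; TiViT's exact 2D patching may differ:

```python
import numpy as np

def series_to_image(x: np.ndarray, rows: int = 16, size: int = 224) -> np.ndarray:
    """Render a 1-D series as a 2-D grayscale 'image' for a frozen ViT.

    Slice the series into `rows` segments, stack them as image rows, then
    upscale to the ViT input resolution and replicate to 3 channels.
    """
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)             # normalize to [0, 1]
    seg = len(x) // rows
    grid = x[: rows * seg].reshape(rows, seg)                  # (rows, seg)
    img = np.kron(grid, np.ones((size // rows, size // seg)))  # nearest-neighbor upscale
    img = img[:size, :size]
    return np.repeat(img[None], 3, axis=0)                     # (3, size, size)

x = np.sin(np.linspace(0, 12 * np.pi, 512))
print(series_to_image(x).shape)   # (3, 224, 224) -> feed to a frozen OpenCLIP ViT
```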
Authors:Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee
Abstract:
Existing graph benchmarks assume non-spatial, simple edges, collapsing physically distinct paths into a single link. We introduce HSG-12M, the first large-scale dataset of $\textbf{spatial multigraphs}$: graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. HSG-12M contains 11.6 million static and 5.1 million dynamic $\textit{Hamiltonian spectral graphs}$ across 1401 characteristic-polynomial classes, derived from 177 TB of spectral potential data. Each graph encodes the full geometry of a 1-D crystal's energy spectrum on the complex plane, producing diverse, physics-grounded topologies that transcend conventional node-coordinate datasets. To enable future extensions, we release $\texttt{Poly2Graph}$: a high-performance, open-source pipeline that maps arbitrary 1-D crystal Hamiltonians to spectral graphs. Benchmarks with popular GNNs expose new challenges in learning from multi-edge geometry at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for geometry-aware graph learning and new opportunities of data-driven scientific discovery in condensed matter physics and beyond.
中文摘要:本文提出了HSG-12M,这是首个大规模空间多重图数据集,通过保留节点间多条几何路径并基于光谱势数据构建,为几何感知的图学习及科学发现奠定了基础。
English Summary: This paper introduces HSG-12M, the first large-scale spatial multigraph dataset that preserves multiple geometrically distinct paths between nodes, derived from spectral potential data to enable geometry-aware graph learning and scientific discovery.
Authors:Kiran Purohit, V Venktesh, Sourangshu Bhattacharya, Avishek Anand
Abstract:
The in-context learning paradigm with LLMs has been instrumental in advancing a wide range of natural language processing tasks. The selection of few-shot examples (exemplars / demonstration samples) is essential for constructing effective prompts under context-length budget constraints. In this paper, we formulate the exemplar selection task as a top-m best arms identification problem. A key challenge in this setup is the exponentially large number of arms that need to be evaluated to identify the m best arms. We propose CASE (Challenger Arm Sampling for Exemplar selection), a novel sample-efficient selective exploration strategy that maintains a shortlist of "challenger" arms, which are current candidates for the top-m arms. In each iteration, only one of the arms from this shortlist or the current top-m set is pulled, thereby reducing sample complexity and, consequently, the number of LLM evaluations. Furthermore, we model the scores of exemplar subsets (arms) using a parameterized linear scoring function, leading to a stochastic linear bandit setting. CASE achieves remarkable efficiency gains of up to 7x speedup in runtime while requiring 7x fewer LLM calls (87% reduction) without sacrificing performance compared to state-of-the-art exemplar selection methods. We release our code and data at https://github.com/kiranpurohit/CASE
中文: 本文提出CASE方法,将示例选择建模为最优臂识别问题,通过高效采样策略在保持性能的同时,将运行速度提升7倍并减少87%的大语言模型调用次数。
English: This paper introduces CASE, a sample-efficient strategy that formulates exemplar selection as a top-m best arms identification problem, achieving up to 7x faster runtime and 87% fewer LLM calls while maintaining performance compared to existing methods.
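A toy version of challenger-based top-m identification conveys the idea on simulated Bernoulli arms. This simplified sketch omits CASE's linear scoring function and its exact challenger selection rule:

```python
import random

def case_style_topm(arm_means, m=2, shortlist_size=3, iters=2000):
    """Toy top-m identification with a challenger shortlist (simplified CASE).
    Each round pulls a single arm drawn from the current top-m set or its
    closest challengers, instead of sweeping all arms."""
    n = len(arm_means)
    counts, means = [1] * n, [1.0] * n            # optimistic initialization
    for _ in range(iters):
        ranked = sorted(range(n), key=lambda a: means[a], reverse=True)
        top_m, challengers = ranked[:m], ranked[m:m + shortlist_size]
        arm = min(top_m + challengers, key=lambda a: counts[a])  # least-pulled
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]        # running mean
    return sorted(range(n), key=lambda a: means[a], reverse=True)[:m]

print(case_style_topm([0.9, 0.8, 0.4, 0.3, 0.2]))  # usually [0, 1]
```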
Authors:Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou
Abstract:
This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with this fine-grained matching, regardless of training sources or model size. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code, and models are publicly released at https://github.com/lxucs/CapRetrieval.
中文: 本研究针对文本编码器在识别细粒度实体和事件方面的局限性,通过引入CapRetrieval数据集并提出数据生成策略,使小型编码器性能超越大型模型,同时揭示了嵌入语义中的粒度困境。
English: This study addresses the limitation of text encoders in recognizing fine-grained entities and events by introducing the CapRetrieval dataset and proposing data generation strategies that enable a small encoder to outperform larger models, while also identifying the granularity dilemma in embedding semantics.
Authors:Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang
Abstract:
Transformer models exhibit excellent scaling properties, where performance improves with model capacity. However, large-scale model parameters lead to unaffordable computing and memory costs. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules take up the majority of model parameters. To this end, we focus on the recoverability of the compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers with only negligible performance degradation. Specifically, we conduct a Gram-Schmidt weight pruning strategy to eliminate redundant neurons of the MLP hidden layer, while preserving weight diversity for better performance recovery during distillation. Compared to the model trained from scratch, our pruned model requires only 0.06\% of the LAION-2B data (used for the training of large vision transformers), without labels (ImageNet-1K), to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves a more than 57.0\% parameter and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5\% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR.
中文: 提出的多样性引导多层感知机缩减方法通过剪枝冗余神经元并保持权重多样性,以近乎无损的方式将大型视觉变换器的参数和计算量减少超过57%。
English: The proposed Diversity-Guided MLP Reduction (DGMR) method significantly compresses large vision transformers by pruning redundant MLP neurons while preserving weight diversity, achieving over 57% parameter and FLOPs reduction with negligible performance loss.
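A simplified reading of Gram-Schmidt-style diversity pruning is sketched below: repeatedly keep the hidden neuron whose weight vector has the largest residual after projecting out the span of the neurons already kept. This is illustrative, not the authors' implementation:

```python
import numpy as np

def diversity_prune(W, keep):
    """Keep `keep` rows of W (hidden-neuron weight vectors) chosen for
    diversity via Gram-Schmidt: repeatedly take the row with the largest
    residual norm after projecting out the rows already selected."""
    residual = W.astype(np.float64).copy()
    selected = []
    for _ in range(keep):
        norms = np.linalg.norm(residual, axis=1)
        if selected:
            norms[selected] = -1.0                     # never re-pick a row
        i = int(np.argmax(norms))
        selected.append(i)
        q = residual[i] / (np.linalg.norm(residual[i]) + 1e-12)
        residual -= np.outer(residual @ q, q)          # project q out of all rows
    return sorted(selected)

W = np.random.randn(512, 768)  # e.g. the input weights of an MLP hidden layer
print(diversity_prune(W, keep=220)[:10])
```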
Authors:Sunny Gupta, Nikita Jangid, Shounak Das, Amit Sethi
Abstract:
Domain Generalization (DG) seeks to train models that perform reliably on unseen target domains without access to target data during training. While recent progress in smoothing the loss landscape has improved generalization, existing methods often falter under long-tailed class distributions and conflicting optimization objectives. We introduce FedTAIL, a federated domain generalization framework that explicitly addresses these challenges through sharpness-guided, gradient-aligned optimization. Our method incorporates a gradient coherence regularizer to mitigate conflicts between classification and adversarial objectives, leading to more stable convergence. To combat class imbalance, we perform class-wise sharpness minimization and propose a curvature-aware dynamic weighting scheme that adaptively emphasizes underrepresented tail classes. Furthermore, we enhance conditional distribution alignment by integrating sharpness-aware perturbations into entropy regularization, improving robustness under domain shift. FedTAIL unifies optimization harmonization, class-aware regularization, and conditional alignment into a scalable, federated-compatible framework. Extensive evaluations across standard domain generalization benchmarks demonstrate that FedTAIL achieves state-of-the-art performance, particularly in the presence of domain shifts and label imbalance, validating its effectiveness in both centralized and federated settings. Code: https://github.com/sunnyinAI/FedTail
中文摘要:FedTAIL作为一种联邦领域泛化框架,通过锐度感知优化解决了类别不平衡与目标冲突问题,在领域偏移和标签不平衡场景下实现了最先进的性能表现。
English Summary: FedTAIL is a federated domain generalization framework that tackles class imbalance and conflicting objectives through sharpness-aware optimization, achieving state-of-the-art performance under domain shifts and label imbalance.
Authors:Jiale Dong, Hao Wu, Zihao Wang, Wenqi Lou, Zhendong Zheng, Lei Gong, Chao Wang, Xuehai Zhou
Abstract:
Vision Transformers (ViTs) exhibit superior performance in computer vision tasks but face deployment challenges on resource-constrained devices due to high computational/memory demands. While Mixture-of-Experts Vision Transformers (MoE-ViTs) mitigate this through a scalable architecture with sub-linear computational growth, their hardware implementation on FPGAs remains constrained by resource limitations. This paper proposes a novel accelerator for efficiently implementing quantized MoE models on FPGAs through two key innovations: (1) A dual-stage quantization scheme combining precision-preserving complex quantizers with hardware-friendly simplified quantizers via scale reparameterization, with only 0.28 $\%$ accuracy loss compared to full precision; (2) A resource-aware accelerator architecture featuring latency-optimized streaming attention kernels and reusable linear operators, effectively balancing performance and resource consumption. Experimental results demonstrate that our accelerator achieves nearly 155 frames per second, a 5.35$\times$ improvement in throughput, and over $80\%$ energy reduction compared to state-of-the-art (SOTA) FPGA MoE accelerators, while maintaining $<1\%$ accuracy loss across vision benchmarks. Our implementation is available at https://github.com/DJ000011/CoQMoE.
中文: 本文提出了一种新型FPGA加速器,通过双阶段量化方案和资源感知架构,在保持精度的同时显著提升了量化专家混合视觉变换器的吞吐量和能效。
English: This paper introduces a novel FPGA accelerator for quantized Mixture-of-Experts Vision Transformers that achieves high throughput and energy efficiency through dual-stage quantization and resource-aware architecture, with minimal accuracy loss.
Authors:Xiao Wei, Xiaobao Wang, Ning Zhuang, Chenyang Wang, Longbiao Wang, Jianwu Dang
Abstract:
Intent detection aims to identify user intents from natural language inputs, where supervised methods rely heavily on labeled in-domain (IND) data and struggle with out-of-domain (OOD) intents, limiting their practical applicability. Generalized Intent Discovery (GID) addresses this by leveraging unlabeled OOD data to discover new intents without additional annotation. However, existing methods focus solely on clustering unsupervised data while neglecting domain adaptation. Therefore, we propose a consistency-driven prototype-prompting framework for GID from the perspective of integrating old and new knowledge, which includes a prototype-prompting framework for transferring old knowledge from external sources, and a hierarchical consistency constraint for learning new knowledge from target domains. Extensive experiments show that our method significantly outperforms all baselines and achieves state-of-the-art results, strongly demonstrating its effectiveness and generalization. Our source code is publicly available at https://github.com/smileix/cpp.
Chinese: 针对广义意图发现提出的基于一致性的原型提示框架,通过原型提示传递外部知识和分层一致性约束学习目标领域知识,实现了无需额外标注的新意图发现,并取得了最先进的性能表现。
English: The proposed consistency-driven prototype-prompting framework for Generalized Intent Discovery effectively integrates old and new knowledge through prototype prompting and hierarchical consistency constraints, achieving state-of-the-art performance in discovering new intents without additional annotation.
Authors:Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan
Abstract:
Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or harmless data can compromise safeguards. In this paper, building on the concept of alignment direction -- defined by the weight difference between aligned and unaligned models -- we observe that perturbations along this direction preserve model safety. In contrast, perturbations along directions orthogonal to this alignment are strongly linked to harmful directions, rapidly degrading safety and framing the parameter space as a narrow safety basin. Based on this insight, we propose a methodology for safety fine-tuning called AsFT (Anchoring Safety in Fine-Tuning), which integrates a regularization term into the training objective. This term uses the alignment direction as an anchor to suppress updates in harmful directions, ensuring that fine-tuning is constrained within the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60 percent, improving model performance by 3.44 percent, and maintaining robust performance across various experimental settings. Code is available at https://github.com/PKU-YuanGroup/AsFT
Chinese: 本研究提出AsFT安全微调方法,以对齐方向为锚点将更新约束在狭窄安全区域内,实验表明该方法能有效减少7.60%有害行为并提升3.44%模型性能。
English: The study introduces AsFT, a safety fine-tuning method that uses the alignment direction as an anchor to restrict updates within a narrow safety basin, effectively reducing harmful behavior by 7.60% and improving model performance by 3.44% in experiments.
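The anchoring idea lends itself to a compact regularizer: penalize the component of the weight update orthogonal to the alignment direction. A minimal PyTorch sketch follows, with `base_state` and `aligned_dir` as assumed per-parameter dictionaries (hypothetical names, not the released API):

```python
import torch

def asft_penalty(model, base_state, aligned_dir, lam=1.0):
    """Penalize the component of the weight update that is orthogonal to the
    alignment direction, keeping fine-tuning inside the safety basin.
    `base_state` and `aligned_dir` are assumed dicts mapping parameter names
    to the pre-fine-tuning weights and the aligned-minus-unaligned direction."""
    penalty = 0.0
    for name, p in model.named_parameters():
        delta = (p - base_state[name]).flatten()       # update so far
        d = aligned_dir[name].flatten()
        d = d / (d.norm() + 1e-12)
        orth = delta - (delta @ d) * d                 # off-basin component
        penalty = penalty + orth.pow(2).sum()
    return lam * penalty

# During training: loss = task_loss + asft_penalty(model, base_state, aligned_dir)
```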
Authors:Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Abstract:
Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of loss landscape. However, it comes at the expense of high computational burden. In this work, we analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and pre-activation gradients. Based on empirical observations on their eigenspectra, we propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and explicitly integrate attention scores into the preconditioning. We also study the convergence property of MAC on nonlinear neural networks and provide two conditions under which it converges to global minima. Our extensive evaluations on various network architectures and datasets show that the proposed method outperforms KFAC and other state-of-the-art methods in terms of accuracy, end-to-end training time, and memory usage. Code is available at https://github.com/hseung88/mac.
中文: 提出的MAC方法高效近似KFAC中Fisher信息矩阵的Kronecker因子,在多种神经网络上实现了更高的精度、更快的训练速度和更低的内存消耗,并且是首个将Kronecker分解应用于Transformer注意力层的算法。
English: The proposed MAC method efficiently approximates the Kronecker factors in KFAC's Fisher information matrix, achieving superior accuracy, faster training, and lower memory usage across various neural networks while being the first to apply Kronecker factorization to transformer attention layers.
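For context, the baseline KFAC step that MAC approximates preconditions a layer's gradient with two Kronecker factors. A minimal NumPy sketch of that baseline follows (MAC's eigenspectrum-based approximations and attention-score integration are not reproduced here):

```python
import numpy as np

def kfac_precondition(grad_W, acts, grads_pre, damping=1e-3):
    """Baseline KFAC step: approximate the layer's Fisher by the Kronecker
    product of S = E[g g^T] (pre-activation gradients) and A = E[a a^T]
    (activations), then precondition as S^{-1} grad_W A^{-1}.
    grad_W: (out, in); acts: (batch, in); grads_pre: (batch, out)."""
    n = acts.shape[0]
    A = acts.T @ acts / n + damping * np.eye(acts.shape[1])
    S = grads_pre.T @ grads_pre / n + damping * np.eye(grads_pre.shape[1])
    return np.linalg.solve(S, grad_W) @ np.linalg.inv(A)

step = kfac_precondition(np.random.randn(10, 20),
                         np.random.randn(32, 20), np.random.randn(32, 10))
print(step.shape)  # (10, 20)
```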
Authors:Ge Zhu, Yutong Wen, Zhiyao Duan
Abstract:
Diffusion models have emerged as powerful deep generative techniques, producing high-quality and diverse samples in applications in various domains including audio. These models have many different design choices suitable for different applications, however, existing reviews lack in-depth discussions of these design choices. The audio diffusion model literature also lacks principled guidance for the implementation of these design choices and their comparisons for different applications. This survey provides a comprehensive review of diffusion model design with an emphasis on design principles for quality improvement and conditioning for audio applications. We adopt the score modeling perspective as a unifying framework that accommodates various interpretations, including recent approaches like flow matching. We systematically examine the training and sampling procedures of diffusion models, and audio applications through different conditioning mechanisms. To address the lack of audio diffusion model codebases and to promote reproducible research and rapid prototyping, we introduce an open-source codebase at https://github.com/gzhu06/AudioDiffuser that implements our reviewed framework for various audio applications. We demonstrate its capabilities through three case studies: audio generation, speech enhancement, and text-to-speech synthesis, with benchmark evaluations on standard datasets.
中文摘要:本综述全面探讨了扩散模型在音频应用中的设计原则与条件机制,并推出了开源代码库以促进可复现研究,通过案例研究展示了其在音频生成、语音增强和文本转语音等任务中的性能。
English Summary: This survey comprehensively reviews diffusion models with a focus on audio applications, analyzing design choices for quality improvement and conditioning while introducing an open-source codebase to support reproducible research.
Authors:Zengjue Chen, Runliang Niu, He Kong, Qi Wang
Abstract:
Recent advances in Vision-Language-Action (VLA) models have demonstrated strong generalization capabilities across diverse scenes, tasks, and robotic platforms when pretrained on large-scale datasets. However, these models still require task-specific fine-tuning in novel environments, a process that relies almost exclusively on supervised fine-tuning (SFT) using static trajectory datasets. Such approaches neither allow the robot to interact with the environment nor leverage feedback from live execution. Also, their success is critically dependent on the size and quality of the collected trajectories. Reinforcement learning (RL) offers a promising alternative by enabling closed-loop interaction and aligning learned policies directly with task objectives. In this work, we draw inspiration from the ideas of GRPO and propose the Trajectory-wise Group Relative Policy Optimization (TGRPO) method. By fusing step-level and trajectory-level advantage signals, this method improves GRPO's group-level advantage estimation, thereby making the algorithm more suitable for online reinforcement learning training of VLAs. Experimental results on ten manipulation tasks from the libero-object benchmark demonstrate that TGRPO consistently outperforms various baseline methods, generating more robust and efficient policies across multiple tested scenarios. Our source code is available at: https://github.com/hahans/TGRPO
中文: TGRPO方法通过融合步级和轨迹级优势信号,改进了视觉-语言-动作模型的在线强化学习,在多项操作任务中展现出超越基准方法的更强鲁棒性和效率。
English: The TGRPO method enhances Vision-Language-Action models by integrating step-level and trajectory-level advantages, outperforming baselines in robotic manipulation tasks through improved online reinforcement learning.
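One plausible way to fuse the two advantage signals is a convex combination of group-normalized step rewards and group-normalized returns, as sketched below; the mixing weight `alpha` and the exact normalization are assumptions for illustration, not the paper's specification:

```python
import numpy as np

def tgrpo_advantages(step_rewards, alpha=0.5):
    """Fuse step-level and trajectory-level group-relative advantages
    (a simplified reading of TGRPO).
    step_rewards: (group, T) rewards for a group of rollouts of length T."""
    R = np.asarray(step_rewards, dtype=float)
    # Trajectory-level: normalize each rollout's return against the group.
    returns = R.sum(axis=1)
    traj_adv = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Step-level: normalize each timestep's reward across the group.
    step_adv = (R - R.mean(axis=0)) / (R.std(axis=0) + 1e-8)
    return alpha * step_adv + (1 - alpha) * traj_adv[:, None]

adv = tgrpo_advantages(np.random.rand(8, 16))
print(adv.shape)  # (8, 16)
```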
Authors:Jiaxiang Liu, Boxuan Xing, Chenhao Yuan, Chenxiang Zhang, Di Wu, Xiusheng Huang, Haida Yu, Chuhan Lang, Pengfei Cao, Jun Zhao, Kang Liu
Abstract:
As large language models (LLMs) continue to advance, there is a growing urgency to enhance the interpretability of their internal knowledge mechanisms. Consequently, many interpretation methods have emerged, aiming to unravel the knowledge mechanisms of LLMs from various perspectives. However, current interpretation methods differ in input data formats and interpretation outputs. The tools integrating these methods are only capable of supporting tasks with specific inputs, significantly constraining their practical applications. To address these challenges, we present an open-source Knowledge Mechanisms Revealer&Interpreter (Know-MRI) designed to analyze the knowledge mechanisms within LLMs systematically. Specifically, we have developed an extensible core module that can automatically match different input data with interpretation methods and consolidate the interpretation outputs. It enables users to freely choose appropriate interpretation methods based on the inputs, making it easier to comprehensively diagnose the model's internal knowledge mechanisms from multiple perspectives. Our code is available at https://github.com/nlpkeg/Know-MRI. We also provide a demonstration video at https://youtu.be/NVWZABJ43Bs.
Chinese: 针对当前大语言模型解释方法存在的局限性,我们开发了开源工具Know-MRI,其可扩展核心能自动匹配输入与解释方法并整合输出结果,从而支持从多角度全面诊断模型的内部知识机制。
English: To address the limitations of current interpretation methods for large language models, we introduce Know-MRI, an open-source tool with an extensible core that automatically matches inputs with interpretation methods and integrates outputs for comprehensive analysis of internal knowledge mechanisms.
Authors:Utkarsh Pratiush, Austin Houston, Kamyar Barakati, Aditya Raghavan, Dasol Yoon, Harikrishnan KP, Zhaslan Baraissov, Desheng Ma, Samuel S. Welborn, Mikolaj Jakowski, Shawn-Patrick Barhorst, Alexander J. Pattison, Panayotis Manganaris, Sita Sirisha Madugula, Sai Venkata Gayathri Ayyagari, Vishal Kennedy, Ralph Bulanadi, Michelle Wang, Kieran J. Pang, Ian Addison-Smith, Willy Menacho, Horacio V. Guzman, Alexander Kiefer, Nicholas Furth, Nikola L. Kolev, Mikhail Petrov, Viktoriia Liu, Sergey Ilyev, Srikar Rairao, Tommaso Rodani, Ivan Pinto-Huguet, Xuli Chen, Josep Cruañes, Marta Torrens, Jovan Pomar, Fanzhi Su, Pawan Vedanti, Zhiheng Lyu, Xingzhi Wang, Lehan Yao, Amir Taqieddin, Forrest Laskowski, Xiangyu Yin, Yu-Tsun Shao, Benjamin Fein-Ashley, Yi Jiang, Vineet Kumar, Himanshu Mishra, Yogesh Paul, Adib Bazgir, Rama chandra Praneeth Madugula, Yuwen Zhang, Pravan Omprakash, Jian Huang, Eric Montufar-Morales, Vivek Chawla, Harshit Sethi, Jie Huang, Lauri Kurki, Grace Guinan, Addison Salvador, Arman Ter-Petrosyan, Madeline Van Winkle, Steven R. Spurgeon, Ganesh Narasimha, Zijie Wu, Richard Liu, Yongtao Liu, Boris Slautin, Andrew R Lupini, Rama Vasudevan, Gerd Duscher, Sergei V. Kalinin
Abstract:
Microscopy is a primary source of information on materials structure and functionality at nanometer and atomic scales. The data generated is often well-structured, enriched with metadata and sample histories, though not always consistent in detail or format. The adoption of Data Management Plans (DMPs) by major funding agencies promotes preservation and access. However, deriving insights remains difficult due to the lack of standardized code ecosystems, benchmarks, and integration strategies. As a result, data usage is inefficient and analysis time is extensive. In addition to post-acquisition analysis, new APIs from major microscope manufacturers enable real-time, ML-based analytics for automated decision-making and ML-agent-controlled microscope operation. Yet, a gap remains between the ML and microscopy communities, limiting the impact of these methods on physics, materials discovery, and optimization. Hackathons help bridge this divide by fostering collaboration between ML researchers and microscopy experts. They encourage the development of novel solutions that apply ML to microscopy, while preparing a future workforce for instrumentation, materials science, and applied ML. This hackathon produced benchmark datasets and digital twins of microscopes to support community growth and standardized workflows. All related code is available at GitHub: https://github.com/KalininGroup/Mic-hackathon-2024-codes-publication/tree/1.0.0.1
中文: 显微镜技术产生丰富但格式不一的数据,缺乏标准化工具和集成策略阻碍了高效分析,而黑客松活动通过连接机器学习与显微学领域,推动了基准数据集和数字孪生等解决方案的开发。
English: Microscopy generates rich but inconsistently formatted data, where the lack of standardized tools and integration hinders efficient analysis, though hackathons bridge ML and microscopy communities to develop solutions like benchmarks and digital twins.
Authors:Weiya Li, Junjie Chen, Bei Li, Boyang Liu, Zichen Wen, Nuanqiao Shan, Xiaoqian Liu, Anping Liu, Huajie Liu, Hu Song, Linfeng Zhang
Abstract:
Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi-agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi-agent framework called TACTIC, which stands for Translation Agents with Cognitive-Theoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory-grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high-quality translation. Experimental results on diverse language pairs from the FLORES-200 and WMT24 benchmarks show that our method consistently achieves state-of-the-art performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at https://github.com/weiyali126/TACTIC.
中文: TACTIC框架提出了一种基于认知理论的多智能体翻译系统,通过模拟人类译者的策略分工协作,在多项语言基准测试中实现了最优性能表现。
English: The TACTIC framework introduces a cognitively inspired multi-agent system that simulates human translation strategies, achieving state-of-the-art performance by integrating specialized agents for drafting, refinement, and evaluation across diverse language benchmarks.
Authors:Leheng Sheng, An Zhang, Zijian Wu, Weixiang Zhao, Changshuo Shen, Yi Zhang, Xiang Wang, Tat-Seng Chua
Abstract:
Recent studies empirically reveal that large reasoning models (LRMs) can automatically allocate more reasoning strengths (i.e., the number of reasoning tokens) for harder problems, exhibiting difficulty-awareness for better task performance. While this automatic reasoning strength allocation phenomenon has been widely observed, its underlying mechanism remains largely unexplored. To this end, we provide explanations for this phenomenon from the perspective of model activations. We find evidence that LRMs pre-plan the reasoning strengths in their activations even before generation, with this reasoning strength causally controlled by the magnitude of a pre-allocated directional vector. Specifically, we show that the number of reasoning tokens is predictable solely based on the question activations using linear probes, indicating that LRMs estimate the required reasoning strength in advance. We then uncover that LRMs encode this reasoning strength through a pre-allocated directional vector embedded in the activations of the model, where the vector's magnitude modulates the reasoning strength. Subtracting this vector can lead to reduced reasoning token number and performance, while adding this vector can lead to increased reasoning token number and even improved performance. We further reveal that this direction vector consistently yields positive reasoning length prediction, and it modifies the logits of end-of-reasoning token to affect the reasoning length. Finally, we demonstrate two potential applications of our findings: overthinking behavior detection and enabling efficient reasoning on simple problems. Our work provides new insights into the internal mechanisms of reasoning in LRMs and offers practical tools for controlling their reasoning behaviors. Our code is available at https://github.com/AlphaLab-USTC/LRM-plans-CoT.
中文摘要:大型推理模型通过激活中的定向向量预先规划推理令牌数量,该向量可被操控以调整推理长度和性能,为理解其内部机制及实际应用提供了新视角。
English Summary: Large reasoning models pre-plan the number of reasoning tokens through a directional vector in their activations, which can be manipulated to control reasoning length and performance, offering insights into their internal mechanisms and practical applications.
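The causal intervention described above amounts to adding a scaled directional vector to hidden activations. A minimal illustrative sketch follows (the probe that estimates `direction` is omitted, and this is not the released code):

```python
import torch

def steer_reasoning_strength(hidden, direction, alpha):
    """Add (or, with negative alpha, subtract) a reasoning-strength direction
    to a layer's activations. `direction` would be estimated, e.g., from a
    linear probe on question activations; here it is simply given."""
    d = direction / (direction.norm() + 1e-12)
    return hidden + alpha * d            # broadcasts over batch and sequence

h = torch.randn(2, 7, 4096)              # (batch, seq, hidden)
print(steer_reasoning_strength(h, torch.randn(4096), alpha=3.0).shape)
```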
Authors:Edoardo Cetin, Tianyu Zhao, Yujin Tang
Abstract:
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
中文摘要:本文提出强化学习教师(RLT)框架,通过训练模型生成针对下游学生的详细解释来规避强化学习的探索难题,相比传统方法能以更小模型实现更优性能,并保持跨任务的有效性。
English Summary: The paper introduces Reinforcement-Learned Teachers (RLTs), a framework that trains reasoning models to generate detailed explanations for distillation, overcoming exploration challenges in RL while achieving superior performance with smaller models compared to traditional methods.
Authors:Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee
Abstract:
Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, the first method that leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.
Chinese Summary: 本文提出了一种新颖的框架,利用小型草稿模型精确预测令牌和KV对的重要性,从而提升长上下文大语言模型推理效率,在保持内存使用、延迟和吞吐量改进的同时,相比现有方法实现了更高的准确性。
English Summary: This paper introduces a novel framework that uses small draft models to enhance the efficiency of long-context LLM inference by accurately predicting token and KV pair importance, leading to improved accuracy and performance in memory usage, latency, and throughput compared to existing methods.
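The SpecKV idea can be approximated in a few lines: aggregate the draft model's attention over its generated tokens into a per-position importance score, then keep only the highest-scoring KV positions. A simplified NumPy sketch (the actual method's scoring details may differ):

```python
import numpy as np

def drop_kv_by_draft_attention(draft_attn, keep_ratio=0.25):
    """Rank prompt positions by the total attention the draft model's output
    tokens pay to them, and keep only the top fraction for the target model's
    KV cache. A simplified sketch of the SpecKV idea.
    draft_attn: (draft_out_len, prompt_len) attention weights."""
    importance = draft_attn.sum(axis=0)                  # mass per KV position
    k = max(1, int(keep_ratio * importance.size))
    keep = np.sort(np.argsort(importance)[-k:])          # top-k, original order
    return keep

attn = np.random.rand(16, 1024)
print(drop_kv_by_draft_attention(attn).shape)  # (256,)
```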
Authors:Yanting Mei, Zhilu Zhang, Xiaohe Wu, Wangmeng Zuo
Abstract:
When shooting electronic screens, moiré patterns usually appear in captured images, which seriously affects image quality. Existing image demoiréing methods face great challenges in removing large and heavy moiré. To address the issue, we propose to utilize Dual Camera fusion for Image Demoiréing (DCID), i.e., using the ultra-wide-angle (UW) image to assist the moiré removal of the wide-angle (W) image. This is inspired by two motivations: (1) the two lenses are commonly equipped on modern smartphones, (2) the UW image generally provides normal colors and textures when moiré exists in the W image, mainly due to their different focal lengths. In particular, we propose an efficient DCID method, where a lightweight UW image encoder is integrated into an existing demoiréing network and a fast two-stage image alignment scheme is presented. Moreover, we construct a large-scale real-world dataset with diverse mobile phones and monitors, containing about 9,000 samples. Experiments on the dataset show our method performs better than state-of-the-art methods. Code and dataset are available at https://github.com/Mrduckk/DCID.
中文摘要:本研究提出的DCID方法通过双摄像头融合技术,利用超广角图像辅助去除广角图像中的摩尔纹,采用轻量级编码器和快速两阶段对齐方式,在包含9000个样本的新数据集上实现了优于现有方法的性能。
English Summary: The proposed DCID method uses ultra-wide-angle images to assist in removing moiré patterns from wide-angle images through dual camera fusion, achieving superior performance over existing methods with a lightweight encoder and two-stage alignment on a new 9,000-sample dataset.
Authors:Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Abstract:
Adaptive gradient methods are computationally efficient and converge quickly, but they often suffer from poor generalization. In contrast, second-order methods enhance convergence and generalization but typically incur high computational and memory costs. In this work, we introduce NysAct, a scalable first-order gradient preconditioning method that strikes a balance between state-of-the-art first-order and second-order optimization methods. NysAct leverages an eigenvalue-shifted Nystrom method to approximate the activation covariance matrix, which is used as a preconditioning matrix, significantly reducing time and memory complexities with minimal impact on test accuracy. Our experiments show that NysAct not only achieves improved test accuracy compared to both first-order and second-order methods but also demands considerably less computational resources than existing second-order methods. Code is available at https://github.com/hseung88/nysact.
中文摘要:NysAct是一种可扩展的一阶梯度预处理方法,通过近似激活协方差矩阵平衡了计算效率与测试精度,在泛化性能上优于一阶和二阶方法,同时显著降低了资源消耗。
English Summary: NysAct is a scalable first-order gradient preconditioning method that balances computational efficiency with improved test accuracy by approximating the activation covariance matrix, outperforming both first-order and second-order methods in generalization while using fewer resources.
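A rough sketch of an eigenvalue-shifted Nystrom preconditioner is shown below: sample a subset of coordinates, form a shifted core block, and invert the resulting low-rank-plus-shift matrix with the Woodbury identity. This is one interpretation for illustration, not the paper's exact construction (the full covariance is formed here only for clarity):

```python
import numpy as np

def nystrom_cov_inverse(acts, sketch=64, shift=1e-3):
    """Invert a shifted Nystrom approximation of the activation covariance.
    Sample `sketch` coordinates, approximate C ~ U W^{-1} U^T, and invert
    (shift*I + U W^{-1} U^T) via the Woodbury identity."""
    n, d = acts.shape
    C = acts.T @ acts / n
    idx = np.random.choice(d, sketch, replace=False)
    W = C[np.ix_(idx, idx)] + shift * np.eye(sketch)   # shifted core block
    U = C[:, idx]
    inner = np.linalg.inv(shift * W + U.T @ U)         # Woodbury inner term
    return (np.eye(d) - U @ inner @ U.T) / shift

P = nystrom_cov_inverse(np.random.randn(256, 512))
print(P.shape)  # (512, 512)
```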
Authors:Franck Meyer, Kyunghoon Hur, Edward Choi
Abstract:
Despite the remarkable progress of deep-learning methods generating a target vital sign waveform from a source vital sign waveform, most existing models are designed exclusively for a specific source-to-target pair. This requires distinct model architectures, optimization procedures, and pre-processing pipelines, resulting in multiple models that hinder usability in clinical settings. To address this limitation, we propose the Multi-Directional Vital-Sign Converter (MD-ViSCo), a unified framework capable of generating any target waveform such as electrocardiogram (ECG), photoplethysmogram (PPG), or arterial blood pressure (ABP) from any single input waveform with a single model. MD-ViSCo employs a shallow 1-Dimensional U-Net integrated with a Swin Transformer that leverages Adaptive Instance Normalization (AdaIN) to capture distinct waveform styles. To evaluate the efficacy of MD-ViSCo, we conduct multi-directional waveform generation on two publicly available datasets. Our framework surpasses state-of-the-art baselines (NabNet & PPG2ABP) on average across all waveform types, lowering Mean absolute error (MAE) by 8.8% and improving Pearson correlation (PC) by 4.9% over two datasets. In addition, the generated ABP waveforms satisfy the Association for the Advancement of Medical Instrumentation (AAMI) criterion and achieve Grade B on the British Hypertension Society (BHS) standard, outperforming all baselines. By eliminating the need for developing a distinct model for each task, we believe that this work offers a unified framework that can deal with any kind of vital sign waveforms with a single model in healthcare monitoring.
中文: 研究者提出MD-ViSCo统一框架,结合浅层一维U-Net与Swin Transformer及自适应实例归一化技术,仅用单一模型即可实现任意生命体征波形间的转换,在降低误差和提升相关性方面超越现有方法,并满足医疗器械标准。
English: Researchers propose MD-ViSCo, a unified deep-learning framework using a U-Net and Swin Transformer with AdaIN to convert any vital sign waveform to another type with one model, outperforming existing methods by reducing error and improving correlation while meeting medical standards.
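The AdaIN operation at the heart of the style adaptation is generic and easy to state: renormalize content features to the style features' channel-wise statistics. A minimal 1-D PyTorch version of that building block follows (the generic operation the paper builds on, not the full MD-ViSCo model):

```python
import torch

def adain_1d(content, style, eps=1e-5):
    """Adaptive Instance Normalization on 1-D waveform features: re-scale the
    content features to match the style features' channel-wise statistics.
    content, style: (batch, channels, time)."""
    c_mean = content.mean(dim=2, keepdim=True)
    c_std = content.std(dim=2, keepdim=True) + eps
    s_mean = style.mean(dim=2, keepdim=True)
    s_std = style.std(dim=2, keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

out = adain_1d(torch.randn(4, 64, 500), torch.randn(4, 64, 500))
print(out.shape)  # torch.Size([4, 64, 500])
```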
Authors:Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Abstract:
We introduce AdaAct, a novel optimization algorithm that adjusts learning rates according to activation variance. Our method enhances the stability of neuron outputs by incorporating neuron-wise adaptivity during the training process, which subsequently leads to better generalization -- a complementary approach to conventional activation regularization methods. Experimental results demonstrate AdaAct's competitive performance across standard image classification benchmarks. We evaluate AdaAct on CIFAR and ImageNet, comparing it with other state-of-the-art methods. Importantly, AdaAct effectively bridges the gap between the convergence speed of Adam and the strong generalization capabilities of SGD, all while maintaining competitive execution times. Code is available at https://github.com/hseung88/adaact.
中文: AdaAct是一种新颖的优化算法,通过根据激活方差调整学习率来提高训练稳定性和泛化能力,有效弥合了Adam与SGD之间的性能差距,同时保持高效运行。
English: AdaAct is a novel optimization algorithm that enhances training stability and generalization by adapting learning rates based on activation variance, effectively bridging the performance gap between Adam and SGD while maintaining competitive efficiency.
Authors:Wentao Shi, Yiqing Shen
Abstract:
Large language models (LLMs) can face factual limitations when responding to time-sensitive queries about recent events that arise after their knowledge thresholds in the training corpus. Existing search-augmented approaches fall into two categories, each with distinct limitations: multi-agent search frameworks incur substantial computational overhead by separating search planning and response synthesis across multiple LLMs, while single-LLM tool-calling methods restrict themselves to sequentially planned, single-query searches from a sole search source. We present Reasoning-Search (R-Search), a single-LLM search framework that unifies multi-step planning, multi-source search execution, and answer synthesis within one coherent inference process. Innovatively, it structures the output into four explicitly defined components: reasoning steps that guide the search process, a natural-language directed acyclic graph that represents the search plans with respect to diverse sources, retrieved results from executing the search plans, and synthesized final answers. To enable effective generation of these structured outputs, we propose a specialized Reinforcement Fine-Tuning (ReFT) method based on GRPO, together with a multi-component reward function that optimizes the LLM's answer correctness, the structural validity of the generated DAG, and adherence to the defined output format. Experimental evaluation on FinSearchBench-24, SearchExpertBench-25, and seven question-answering benchmarks demonstrates that R-Search outperforms state-of-the-art methods, while achieving substantial efficiency gains through a 70% reduction in context token usage and an approximately 50% decrease in execution latency. Code is available at https://github.com/wentao0429/Reasoning-search.
Chinese: 大语言模型在处理时效性查询时存在事实性局限,而新型推理搜索框架将多步规划与答案合成统一于单一模型,在减少70%令牌使用和50%延迟的同时实现了更优性能。
English: Large language models struggle with time-sensitive queries due to outdated training data, but the new Reasoning-Search framework unifies search planning and answer synthesis in a single model, achieving superior performance while cutting token usage by 70% and latency by 50%.
Authors:Yinan Huang, Haoteng Yin, Eli Chien, Rongzhe Wei, Pan Li
Abstract:
Learning with relational and network-structured data is increasingly vital in sensitive domains where protecting the privacy of individual entities is paramount. Differential Privacy (DP) offers a principled approach for quantifying privacy risks, with DP-SGD emerging as a standard mechanism for private model training. However, directly applying DP-SGD to relational learning is challenging due to two key factors: (i) entities often participate in multiple relations, resulting in high and difficult-to-control sensitivity; and (ii) relational learning typically involves multi-stage, potentially coupled (interdependent) sampling procedures that make standard privacy amplification analyses inapplicable. This work presents a principled framework for relational learning with formal entity-level DP guarantees. We provide a rigorous sensitivity analysis and introduce an adaptive gradient clipping scheme that modulates clipping thresholds based on entity occurrence frequency. We also extend the privacy amplification results to a tractable subclass of coupled sampling, where the dependence arises only through sample sizes. These contributions lead to a tailored DP-SGD variant for relational data with provable privacy guarantees. Experiments on fine-tuning text encoders over text-attributed network-structured relational data demonstrate the strong utility-privacy trade-offs of our approach. Our code is available at https://github.com/Graph-COM/Node_DP.
中文摘要:本文提出了一个具有形式化实体级差分隐私保障的关系学习框架,通过自适应梯度裁剪和针对耦合采样的扩展分析,解决了敏感性控制和隐私放大难题。
English Summary: This paper introduces a principled framework for relational learning with formal entity-level differential privacy guarantees, addressing challenges in sensitivity control and privacy amplification through adaptive gradient clipping and extended analysis for coupled sampling.
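The adaptive clipping scheme can be illustrated with a toy rule in which each example's threshold shrinks as its entity appears more often; the `1/count` schedule below is an assumption for illustration, not the paper's calibrated scheme:

```python
import torch

def adaptive_clip(per_sample_grads, entity_counts, base_clip=1.0):
    """Clip each example's gradient with a threshold that shrinks with the
    occurrence frequency of the entity it involves, bounding total
    entity-level sensitivity. Minimal sketch; the 1/count schedule is assumed.
    per_sample_grads: (batch, dim); entity_counts: (batch,)."""
    thresholds = base_clip / entity_counts.clamp(min=1).float()
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    scale = (thresholds.unsqueeze(1) / norms).clamp(max=1.0)
    return per_sample_grads * scale

g = adaptive_clip(torch.randn(8, 100), torch.tensor([1, 2, 5, 1, 3, 8, 1, 2]))
print(g.norm(dim=1))  # each at most base_clip / count
```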
Authors:Ao Jin, Qinyi Wang, Sijie Wen, Ya Liu, Ganghui Shen, Panfeng Huang, Fan Zhang
Abstract:
This work focuses on the deployment of tethered space robots in the presence of unknown uncertainty. A data-enabled framework called DEKC, which contains an offline training part and an online execution part, is proposed to deploy tethered space robots under uncertainty. The main idea of this work is to model the unknown uncertainty as a dynamical system, which enables high-accuracy and convergent capture of the uncertainty. The core of the proposed framework is a proxy model of the uncertainty, which is derived from data-driven Koopman theory and is separated from the controller design. In the offline stage, the lifting functions associated with the Koopman operator are parameterized with deep neural networks. Then, by solving an optimization problem, the lifting functions are learned from sampled data. In the online execution stage, the proxy model cooperates with the learned lifting functions obtained in the offline phase to capture the unknown uncertainty. The output of the proxy model is then compensated to the baseline controller such that the effect of uncertainty can be attenuated or even eliminated. Furthermore, considering scenarios in which the performance of the proxy model may weaken, a receding-horizon scheme is proposed to update the proxy model online. Finally, extensive numerical simulations demonstrate the effectiveness of our proposed framework. The implementation of the proposed DEKC framework is publicly available at https://github.com/NPU-RCIR/DEKC.git.
中文: 本研究提出数据驱动的DEKC框架,通过Koopman理论和神经网络将未知不确定性建模为动态系统,结合离线训练与在线执行的补偿机制,有效提升系留空间机器人在不确定环境中的部署精度。
English: This study introduces a data-driven DEKC framework that models unknown uncertainties as dynamic systems using Koopman theory and neural networks, enabling precise uncertainty compensation through offline training and online execution to enhance tethered space robot deployment.
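Stripped of the deep parameterization, the offline fit reduces to extended dynamic mode decomposition (EDMD): lift states with fixed features and solve a least-squares problem for the Koopman matrix. A toy NumPy sketch with a hand-written polynomial lift standing in for the learned networks:

```python
import numpy as np

def fit_koopman(X, Y, lift):
    """EDMD: lift states and solve least squares for the Koopman matrix K,
    so that lift(y) ~ K @ lift(x). DEKC learns `lift` with deep networks
    offline; a fixed polynomial lift stands in for it here.
    X, Y: (n_samples, state_dim), with Y the one-step-ahead states."""
    PhiX = np.stack([lift(x) for x in X])              # (n, p)
    PhiY = np.stack([lift(y) for y in Y])
    B, *_ = np.linalg.lstsq(PhiX, PhiY, rcond=None)    # PhiY ~ PhiX @ B
    return B.T                                         # K = B^T

lift = lambda s: np.array([s[0], s[1], s[0]**2, s[0]*s[1], s[1]**2])
X = np.random.randn(200, 2)
print(fit_koopman(X, 0.9 * X, lift).shape)  # (5, 5), toy linear dynamics
```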
Authors:Alan N. Amin, Nate Gruver, Andrew Gordon Wilson
Abstract:
Discrete diffusion models, like continuous diffusion models, generate high-quality samples by gradually undoing noise applied to datapoints with a Markov process. Gradual generation in theory comes with many conceptual benefits; for example, inductive biases can be incorporated into the noising Markov process, and access to improved sampling algorithms. In practice, however, the consistently best performing discrete diffusion model is, surprisingly, masking diffusion, which does not denoise gradually. Here we explain the superior performance of masking diffusion by noting that it makes use of a fundamental difference between continuous and discrete Markov processes: discrete Markov processes evolve by discontinuous jumps at a fixed rate and, unlike other discrete diffusion models, masking diffusion builds in the known distribution of jump times and only learns where to jump to. We show that we can similarly bake in the known distribution of jump times into any discrete diffusion model. The resulting models - schedule-conditioned discrete diffusion (SCUD) - generalize classical discrete diffusion and masking diffusion. By applying SCUD to models with noising processes that incorporate inductive biases on images, text, and protein data, we build models that outperform masking.
中文摘要:该研究通过揭示离散马尔可夫过程中跳跃时间分布的已知特性解释了掩码扩散的优越性,并提出SCUD模型将这一机制融入各类离散扩散过程,在图像、文本和蛋白质数据上实现了性能突破。
English Summary: The study explains that masking diffusion excels in discrete diffusion models by leveraging the known distribution of jump times in discrete Markov processes, and introduces SCUD models that incorporate this insight to outperform existing methods across various data types.
Authors:Katherine Tieu, Dongqi Fu, Zihao Li, Ross Maciejewski, Jingrui He
Abstract:
Accurate predictions rely on the expressive power of graph deep learning frameworks like graph neural networks and graph transformers, where a positional encoding mechanism has become indispensable in recent state-of-the-art works for recording canonical position information. However, current positional encoding is limited in three aspects: (1) most positional encoding methods use pre-defined, fixed functions, which are inadequate to adapt to complex attributed graphs; (2) a few pioneering works have proposed learnable positional encoding but are still limited to structural information, not considering real-world time-evolving topological and feature information; (3) most positional encoding methods are equipped with transformers' attention mechanism to fully leverage their capabilities, where dense or relational attention is often unaffordable on large-scale structured data. Hence, we aim to develop Learnable Spatial-Temporal Positional Encoding in an effective and efficient manner and propose a simple temporal link prediction model named L-STEP. Briefly, for L-STEP, we (1) prove the proposed positional learning scheme can preserve the graph property from the spatial-temporal spectral viewpoint, (2) verify that MLPs can fully exploit the expressiveness and reach transformers' performance on that encoding, (3) change different initial positional encoding inputs to show robustness, (4) analyze the theoretical complexity and obtain lower empirical running time than SOTA, and (5) demonstrate superior temporal link prediction performance over 10 algorithms on 13 classic datasets in both transductive and inductive settings using 3 different sampling strategies. Also, L-STEP obtains the leading performance in the newest large-scale TGB benchmark. Our code is available at https://github.com/kthrn22/L-STEP.
Chinese: 当前图学习中的位置编码方法在适应性、动态信息整合和计算效率方面存在局限,因此我们开发了L-STEP这一可学习的时空模型,该模型在多个数据集和设置下展现出卓越的性能、效率及鲁棒性。
English: Current positional encoding methods in graph learning face limitations in adaptability, dynamic information integration, and computational efficiency, prompting the development of L-STEP, a learnable spatial-temporal model that demonstrates superior performance, efficiency, and robustness across multiple datasets and settings.
Authors:Kangning Yang, Ling Ouyang, Huiming Sun, Jie Cai, Lan Fu, Jiaming Ding, Chiu Man Ho, Zibo Meng
Abstract:
Reflection removal technology plays a crucial role in photography and computer vision applications. However, existing techniques are hindered by the lack of high-quality in-the-wild datasets. In this paper, we propose a novel paradigm for collecting reflection datasets from a fresh perspective. Our approach is convenient, cost-effective, and scalable, while ensuring that the collected data pairs are of high quality, perfectly aligned, and represent natural and diverse scenarios. Following this paradigm, we collect a Real-world, Diverse, and Pixel-aligned dataset (named OpenRR-1k dataset), which contains 1,000 high-quality transmission-reflection image pairs collected in the wild. Through the analysis of several reflection removal methods and benchmark evaluation experiments on our dataset, we demonstrate its effectiveness in improving robustness in challenging real-world environments. Our dataset is available at https://github.com/caijie0620/OpenRR-1k.
中文: 本文提出了一种新颖、经济高效的高质量反射数据集采集方法,并发布了OpenRR-1k数据集,有效提升了反射消除算法在真实场景中的鲁棒性。
English: This paper introduces a novel, cost-effective method for creating high-quality reflection datasets and presents the OpenRR-1k dataset, which enhances reflection removal algorithms' robustness in real-world settings.
Authors:Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo, Bo Han
Abstract:
While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning, where an LLM must interact with external systems to acquire missing evidence or data, has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills. AR-Bench comprises three task families (detective cases, situation puzzles, and guessing numbers) that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based searching or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., incorporating interactive learning, real-time feedback loops, and environment-aware objectives for training. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench.
中文摘要:AR-Bench是一个专为评估大语言模型主动推理能力设计的新基准,揭示了模型在获取和利用外部信息方面相比被动推理存在显著困难,并强调了改进交互学习等方法的必要性。
English Summary: AR-Bench is a new benchmark designed to evaluate large language models' active reasoning skills, revealing their significant struggles in acquiring and using external information compared to passive reasoning, and highlighting the need for improved methodologies like interactive learning.
Authors:Xie Yi, Zhanke Zhou, Chentao Cao, Qiyu Niu, Tongliang Liu, Bo Han
Abstract:
Multi-agent frameworks can substantially boost the reasoning power of large language models (LLMs), but they typically incur heavy computational costs and lack convergence guarantees. To overcome these challenges, we recast multi-LLM coordination as an incomplete-information game and seek a Bayesian Nash equilibrium (BNE), in which each agent optimally responds to its probabilistic beliefs about the strategies of others. We introduce Efficient Coordination via Nash Equilibrium (ECON), a hierarchical reinforcement-learning paradigm that marries distributed reasoning with centralized final output. Under ECON, each LLM independently selects responses that maximize its expected reward, conditioned on its beliefs about co-agents, without requiring costly inter-agent exchanges. We mathematically prove that ECON attains a markedly tighter regret bound than non-equilibrium multi-agent schemes. Empirically, ECON outperforms existing multi-LLM approaches by 11.2% on average across six benchmarks spanning complex reasoning and planning tasks. Further experiments demonstrate ECON's ability to flexibly incorporate additional models, confirming its scalability and paving the way toward larger, more powerful multi-LLM ensembles. The code is publicly available at: https://github.com/tmlr-group/ECON.
Chinese: ECON通过将多智能体协调建模为不完全信息博弈,提出了一种结合分布式推理与集中输出的分层强化学习范式,在显著降低通信成本的同时实现了更强的性能与理论保障。
English: ECON introduces a hierarchical reinforcement-learning framework that models multi-LLM coordination as an incomplete-information game, achieving superior performance with a tighter regret bound and eliminating costly inter-agent communication.
Authors:Daniel H. Pak, Shubh Thaker, Kyle Baylous, Xiaoran Zhang, Danny Bluestein, James S. Duncan
Abstract:
High-quality volumetric meshing from medical images is a key bottleneck for physics-based simulations in personalized medicine. For volumetric meshing of complex medical structures, recent studies have often utilized deep learning (DL)-based template deformation approaches to enable fast test-time generation with high spatial accuracy. However, these approaches still exhibit limitations, such as limited flexibility at high-curvature areas and unrealistic inter-part distances. In this study, we introduce a simple yet effective snap-and-tune strategy that sequentially applies DL and test-time optimization, which combines fast initial shape fitting with more detailed sample-specific mesh corrections. Our method provides significant improvements in both spatial accuracy and mesh quality, while being fully automated and requiring no additional training labels. Finally, we demonstrate the versatility and usefulness of our newly generated meshes via solid mechanics simulations in two different software platforms. Our code is available at https://github.com/danpak94/Deep-Cardiac-Volumetric-Mesh.
Chinese: 本研究提出一种结合深度学习与测试时优化的快速调整策略,通过自动化体网格生成显著提升了医学模拟中的空间精度和网格质量,并在固体力学应用中验证了其有效性。
English: This study introduces a snap-and-tune strategy combining deep learning with test-time optimization to enhance spatial accuracy and mesh quality in automated volumetric meshing for medical simulations, demonstrating improved performance in solid mechanics applications.
Authors:Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta
Abstract:
Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models in both unimodal and multimodal stimulus settings. More recently, instruction-tuned multimodal models have been shown to generate task-specific representations that align strongly with brain activity. However, prior work evaluating the brain alignment of MLLMs has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigated brain alignment, that is, the degree to which neural activity recorded while participants watched naturalistic movies (video along with audio) can be predicted from representations derived from MLLMs. We utilized instruction-specific embeddings from six video and two audio instruction-tuned MLLMs. Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal models (by 15%) and unimodal models (by 20%). Our evaluation of MLLMs for both video and audio tasks using language-guided instructions shows clear disentanglement in task-specific representations from MLLMs, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain, with early sensory areas showing strong alignment with early layers, while higher-level visual and language regions align more with middle to late layers. These findings provide clear evidence for the role of task-specific instructions in improving the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both systems. We make the code publicly available at https://github.com/subbareddy248/mllm_videos.
Chinese: 指令调优的多模态大语言模型在自然观影场景中显著优于非指令调优模型,其任务特定表征能精确对应大脑层级处理机制,为大脑与计算系统的映射研究开辟了新途径。
English: Instruction-tuned multimodal large language models significantly outperform non-instruction-tuned models in aligning with brain activity during naturalistic movie viewing, showing hierarchical layer correspondence and task-specific representation disentanglement that advances brain-computation mapping.
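To make the encoding analysis concrete, here is a minimal voxelwise encoding sketch in Python with synthetic stand-ins for the MLLM embeddings and fMRI responses; the paper's actual feature extraction, cross-validation scheme, and datasets are not reproduced here.

import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n_trs, n_feat, n_vox = 500, 128, 200
X = rng.standard_normal((n_trs, n_feat))               # stimulus embeddings per movie timepoint
Y = X @ (0.1 * rng.standard_normal((n_feat, n_vox)))   # simulated voxel responses
Y += rng.standard_normal((n_trs, n_vox))               # measurement noise

X_tr, X_te, Y_tr, Y_te = X[:400], X[400:], Y[:400], Y[400:]
enc = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, Y_tr)
pred = enc.predict(X_te)
# "Brain alignment" is then the per-voxel correlation between predicted
# and held-out measured responses.
r = np.array([np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(n_vox)])
print(f"mean voxel correlation: {r.mean():.3f}")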
Authors:Yu-Ang Lee, Guan-Ting Yi, Mei-Yi Liu, Jui-Chao Lu, Guan-Bo Yang, Yun-Nung Chen
Abstract:
Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field. A list of surveyed papers is publicly available at https://github.com/MiuLab/AISysOpt-Survey.
Chinese: 大型语言模型和人工智能系统的进展推动了复合AI系统的发展,这些系统通过整合多个组件处理复杂任务,但其日益增长的复杂性带来了优化单个组件及其交互的挑战,因此本文系统回顾了包括传统技术和新兴自然语言反馈方法在内的优化策略。
English: Recent progress in large language models and AI systems has advanced compound AI systems, which integrate multiple components to handle complex tasks, yet their growing complexity poses challenges in optimizing both individual elements and their interactions, prompting a systematic review of optimization methods including traditional techniques and emerging natural language feedback approaches.
Authors:Huixin Zhan, Jason H. Moore
Abstract:
Surgeons exhibit distinct operating styles shaped by training, experience, and motor behavior, yet most surgical AI systems overlook this personalization signal. We propose a novel agentic modeling approach for surgeon-specific behavior prediction in robotic surgery, combining a discrete diffusion framework with a vision-language-action (VLA) pipeline. Gesture prediction is framed as a structured sequence denoising task, conditioned on multimodal inputs including surgical video, intent language, and personalized embeddings of surgeon identity and skill. These embeddings are encoded through natural language prompts using third-party language models, allowing the model to retain individual behavioral style without exposing explicit identity. We evaluate our method on the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture sequences while learning meaningful motion fingerprints unique to each surgeon. To quantify the privacy implications of personalization, we perform membership inference attacks and find that more expressive embeddings improve task performance but simultaneously increase susceptibility to identity leakage. These findings reveal the importance of balancing personalization with privacy risk in surgical modeling. Code is available at: https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.
Chinese: 本研究提出了一种用于机器人手术的个性化AI模型,通过多模态方法预测外科医生的特定手势,在实现高精度的同时揭示了性能提升与身份泄露隐私风险之间的权衡关系。
English: This study introduces a personalized AI model for robotic surgery that predicts surgeon-specific gestures using a multimodal approach, achieving high accuracy while revealing a trade-off between enhanced performance and increased privacy risks from identity leakage.
Authors:Chupei Wang, Jiaqiu Vince Sun
Abstract:
Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models' ability to suppress irrelevant content during retrieval.
Chinese: 该研究发现大型语言模型存在前摄干扰问题,即先前的信息会干扰对后续更新的回忆,即使目标数据位置明确,随着干扰累积,检索准确率仍会大幅下降。
English: The study reveals that large language models suffer from proactive interference, where earlier information disrupts recall of recent updates, causing retrieval accuracy to decline significantly as interference accumulates despite clear positioning of target data.
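The key-value interference stream the abstract describes is straightforward to render in code. Below is a rough, hypothetical rendering (the prompt wording, key names, and scoring are ours, not the benchmark's):

import random

def build_pi_prompt(n_keys=5, n_updates=8, seed=0):
    # Stream semantically related key-value updates; only the final value of
    # each key is queried, yet earlier (overwritten) values interfere.
    rng = random.Random(seed)
    keys = [f"box_{i}" for i in range(n_keys)]
    final, lines = {}, []
    for _ in range(n_updates):
        for k in keys:
            v = rng.randint(0, 99)
            lines.append(f"The value of {k} is now {v}.")
            final[k] = v
    lines.append("Question: report the current value of every box.")
    return "\n".join(lines), final

prompt, gold = build_pi_prompt()
# Feed `prompt` to an LLM and score its per-key answers against `gold`;
# per the paper, accuracy decays as n_updates (accumulated interference) grows.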
Authors:Sunny Gupta, Nikita Jangid, Amit Sethi
Abstract:
Federated Learning (FL) often suffers from severe performance degradation when faced with non-IID data, largely due to local classifier bias. Traditional remedies such as global model regularization or layer freezing either incur high computational costs or struggle to adapt to feature shifts. In this work, we propose UniVarFL, a novel FL framework that emulates IID-like training dynamics directly at the client level, eliminating the need for global model dependency. UniVarFL leverages two complementary regularization strategies during local training: Classifier Variance Regularization, which aligns class-wise probability distributions with those expected under IID conditions, effectively mitigating local classifier bias; and Hyperspherical Uniformity Regularization, which encourages a uniform distribution of feature representations across the hypersphere, thereby enhancing the model's ability to generalize under diverse data distributions. Extensive experiments on multiple benchmark datasets demonstrate that UniVarFL outperforms existing methods in accuracy, highlighting its potential as a highly scalable and efficient solution for real-world FL deployments, especially in resource-constrained settings. Code: https://github.com/sunnyinAI/UniVarFL
Chinese: UniVarFL提出了一种新颖的联邦学习框架,通过双重正则化策略有效缓解本地分类器偏差并提升特征泛化能力,在非独立同分布数据场景下实现了更优的准确性与可扩展性。
English: UniVarFL introduces a novel federated learning framework that employs dual regularization strategies to mitigate local classifier bias and enhance feature generalization, achieving superior accuracy and scalability in non-IID settings.
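The two regularizers can be sketched in PyTorch. The exact objectives in UniVarFL are defined in the paper; the forms below are a plausible rendering of "classifier variance" and "hyperspherical uniformity" regularization and should be read as assumptions:

import torch
import torch.nn.functional as F

def hyperspherical_uniformity(z, t=2.0):
    # Push normalized features toward a uniform spread on the hypersphere
    # (Gaussian-potential formulation; the paper's exact form may differ).
    z = F.normalize(z, dim=1)
    sq_dists = torch.pdist(z).pow(2)
    return sq_dists.mul(-t).exp().mean().log()

def classifier_variance_reg(logits, num_classes):
    # Penalize deviation of the batch's class-probability profile from the
    # flat profile expected under IID sampling.
    class_mass = F.softmax(logits, dim=1).mean(dim=0)
    return ((class_mass - 1.0 / num_classes) ** 2).sum()

logits, feats = torch.randn(64, 10), torch.randn(64, 128)
loss_reg = classifier_variance_reg(logits, 10) + 0.1 * hyperspherical_uniformity(feats)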
Authors:Hadi Reisizadeh, Jinghan Jia, Zhiqi Bu, Bhanukiran Vinzamuri, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, Sijia Liu, Mingyi Hong
Abstract:
Enabling large language models (LLMs) to unlearn knowledge and capabilities acquired during training has proven vital for ensuring compliance with data regulations and promoting ethical practices in generative AI. Although there is growing interest in developing various unlearning algorithms, it remains unclear how to best formulate the unlearning problem. The most popular formulation uses a weighted sum of forget and retain loss, but it often leads to performance degradation due to the inherent trade-off between forget and retain losses. In this work, we argue that it is important to model the hierarchical structure of the unlearning problem, where the forget problem (which unlearns certain knowledge and/or capabilities) takes priority over the retain problem (which preserves model utility). This hierarchical structure naturally leads to a bi-level optimization formulation where the lower-level objective focuses on minimizing the forget loss, while the upper-level objective aims to maintain the model's utility. Based on this new formulation, we propose a novel algorithm, termed Bi-Level UnleaRning (BLUR), which not only possesses strong theoretical guarantees but, more importantly, delivers superior performance. In particular, our extensive experiments demonstrate that BLUR consistently outperforms all the state-of-the-art algorithms across various unlearning tasks, models, and metrics. Codes are available at https://github.com/OptimAI-Lab/BLURLLMUnlearning.
Chinese: 本文针对大语言模型的知识消除问题提出分层双级优化框架,通过BLUR算法优先实现特定知识遗忘并保持模型性能,在多项任务中显著优于现有方法。
English: This paper introduces a hierarchical bi-level optimization approach for large language model unlearning, proposing the BLUR algorithm that prioritizes forgetting specific knowledge while maintaining model utility, which significantly outperforms existing methods across various tasks.
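The hierarchy (forget has priority, retain preserves utility) can be illustrated with a toy alternating loop. This is a schematic only, on a synthetic linear model; BLUR's actual update rules and guarantees are in the paper:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
ce = nn.CrossEntropyLoss()
xf, yf = torch.randn(64, 16), torch.randint(0, 4, (64,))   # forget set
xr, yr = torch.randn(64, 16), torch.randint(0, 4, (64,))   # retain set

for outer in range(50):
    # Lower level (priority): drive the forget loss toward its minimum,
    # here via gradient ascent on the NLL of data to be unlearned.
    for _ in range(3):
        opt.zero_grad()
        (-ce(model(xf), yf)).backward()
        opt.step()
    # Upper level: one utility-preserving step on retained data.
    opt.zero_grad()
    ce(model(xr), yr).backward()
    opt.step()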
Authors:Lijing Zhu, Qizhen Lan, Qing Tian, Wenbo Sun, Li Yang, Lu Xia, Yixin Xie, Xi Xiao, Tiehang Duan, Cui Tao, Shuteng Niu
Abstract:
Continual Knowledge Graph Embedding (CKGE) seeks to integrate new knowledge while preserving past information. However, existing methods struggle with efficiency and scalability due to two key limitations: (1) suboptimal knowledge preservation between snapshots caused by manually designed node/relation importance scores that ignore graph dependencies relevant to the downstream task, and (2) computationally expensive graph traversal for node/relation importance calculation, leading to slow training and high memory overhead. To address these limitations, we introduce ETT-CKGE (Efficient, Task-driven, Tokens for Continual Knowledge Graph Embedding), a task-guided CKGE method that leverages learnable task-driven tokens for efficient and effective knowledge transfer between snapshots. These tokens directly capture task-relevant signals, eliminating the need for explicit node scoring or traversal, and serve as consistent and reusable guidance across snapshots, enabling efficient token-masked embedding alignment between snapshots. Importantly, knowledge transfer is achieved through simple matrix operations, significantly reducing training time and memory usage. Extensive experiments across six benchmark datasets demonstrate that ETT-CKGE consistently achieves superior or competitive predictive performance, while substantially improving training efficiency and scalability compared to state-of-the-art CKGE methods. The code is available at: https://github.com/lijingzhu1/ETT-CKGE/tree/main
Chinese: ETT-CKGE通过引入可学习的任务驱动标记,利用简单矩阵运算实现快照间高效知识迁移,在显著提升训练效率和可扩展性的同时,取得了优于或媲美现有方法的预测性能。
English: ETT-CKGE introduces learnable task-driven tokens to enable efficient knowledge transfer between snapshots through simple matrix operations, achieving superior performance while significantly improving training efficiency and scalability compared to existing methods.
Authors:Abdellah Ghassel, Ian Robinson, Gabriel Tanase, Hal Cooper, Bryan Thompson, Zhen Han, Vassilis N. Ioannidis, Soji Adeshina, Huzefa Rangwala
Abstract:
Retrieval-Augmented Generation (RAG) grounds large language models in external evidence, yet it still falters when answers must be pieced together across semantically distant documents. We close this gap with the Hierarchical Lexical Graph (HLG), a three-tier index that (i) traces every atomic proposition to its source, (ii) clusters propositions into latent topics, and (iii) links entities and relations to expose cross-document paths. On top of HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG, which performs fine-grained entity-aware beam search over propositions for high-precision factoid questions, and TopicGraphRAG, which selects coarse topics before expanding along entity links to supply broad yet relevant context for exploratory queries. Additionally, existing benchmarks lack the complexity required to rigorously evaluate multi-hop summarization systems, often focusing on single-document queries or limited datasets. To address this, we introduce a synthetic dataset generation pipeline that curates realistic, multi-document question-answer pairs, enabling robust evaluation of multi-hop retrieval systems. Extensive experiments across five datasets demonstrate that our methods outperform naive chunk-based RAG, achieving an average relative improvement of 23.1% in retrieval recall and correctness. An open-source Python library is available at https://github.com/awslabs/graphrag-toolkit.
Chinese: 分层词汇图(HLG)通过构建三层索引和两种互补检索器,解决了RAG在跨语义分散文档整合信息时的不足,在多个数据集上平均将检索性能提升了23.1%。
English: The Hierarchical Lexical Graph (HLG) addresses RAG's limitations in connecting information across distant documents by creating a three-tier index and two complementary retrievers, significantly improving retrieval performance by 23.1% on average across multiple datasets.
Authors:Ziheng Qin, Hailun Xu, Wei Chee Yew, Qi Jia, Yang Luo, Kanchan Sarkar, Danhui Guan, Kai Wang, Yang You
Abstract:
Machine learning relies heavily on data, yet the continuous growth of real-world data poses challenges for efficient dataset construction and training. A fundamental yet unsolved question is: given our current model and data, does a new data sample or batch need annotation/learning? Conventional approaches retain all available data, leading to suboptimal data and training efficiency. Active learning aims to reduce data redundancy by selecting a subset of samples to annotate, but it increases pipeline complexity and introduces bias. In this work, we propose Info-Coevolution, a novel framework that efficiently enables models and data to coevolve through online selective annotation with no bias. Leveraging task-specific models (and open-source models), it selectively annotates and integrates online and web data to improve datasets efficiently. For real-world datasets like ImageNet-1K, Info-Coevolution reduces annotation and training costs by 32% without performance loss. It automatically determines the saving ratio without manual tuning, and can further reduce the annotation ratio to 50% with semi-supervised learning. We also explore retrieval-based dataset enhancement using unlabeled open-source data. Code is available at https://github.com/NUS-HPC-AI-Lab/Info-Coevolution/.
Chinese: 提出的Info-Coevolution框架通过选择性在线标注实现模型与数据的协同进化,在ImageNet-1K数据集上无需性能损失即可降低32%的标注成本,同时消除偏差且无需手动调整比例。
English: The proposed Info-Coevolution framework enables efficient model-data coevolution through selective online annotation, reducing ImageNet-1K annotation costs by 32% without performance loss while eliminating bias and manual ratio tuning.
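The core gating idea (annotate only what the current model cannot already handle) can be sketched as below. The actual selection criterion in Info-Coevolution is more elaborate; the confidence threshold here is an assumed stand-in:

import torch

@torch.no_grad()
def needs_annotation(model, x, threshold=0.85):
    # Gate incoming samples: confident predictions are treated as already
    # covered by the current model/data and skipped; uncertain ones are
    # routed to annotation and training.
    conf = torch.softmax(model(x), dim=1).max(dim=1).values
    return conf < threshold   # boolean mask: True -> annotate

model = torch.nn.Linear(32, 10)
mask = needs_annotation(model, torch.randn(8, 32))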
Authors:Ye Zhu, Duo Xu, Zhiwei Deng, Jonathan C. Tan, Olga Russakovsky
Abstract:
We study Diffusion Schrödinger Bridge (DSB) models in the context of dynamical astrophysical systems, specifically tackling observational inverse prediction tasks within Giant Molecular Clouds (GMCs) for star formation. We introduce the Astro-DSB model, a variant of DSB with the pairwise domain assumption tailored for astrophysical dynamics. By investigating its learning process and prediction performance in both physically simulated data and in real observations (the Taurus B213 data), we present two main takeaways. First, from the astrophysical perspective, our proposed paired DSB method improves interpretability, learning efficiency, and prediction performance over conventional astrostatistical and other machine learning methods. Second, from the generative modeling perspective, probabilistic generative modeling reveals improvements over discriminative pixel-to-pixel modeling in Out-Of-Distribution (OOD) testing cases of physical simulations with unseen initial conditions and different dominant physical processes. Our study expands research into diffusion models beyond the traditional visual synthesis application and provides evidence of the models' learning abilities beyond pure data statistics, paving a path for future physics-aware generative models which can align dynamics between machine learning and real (astro)physical systems.
Chinese: Astro-DSB模型作为针对天体物理系统定制的扩散薛定谔桥变体,在恒星形成研究中提升了可解释性与预测能力,并证明了生成模型在处理未知天体物理条件时的优越性。
English: The Astro-DSB model, a tailored Diffusion Schrödinger Bridge variant, enhances interpretability and prediction in star formation studies while demonstrating generative modeling's superiority in handling unseen astrophysical conditions.
Authors:Livio Tenze, Enrique Canessa
Abstract:
A new extended version of the altiro3D C++ Library -- initially developed to produce glass-free holographic displays from 2D images -- is introduced here to handle 3D video streams from either 2D webcam images or flat video files. These streams are processed in real-time to synthesize light-fields (in Native format) and feed realistic 3D experiences. The core function needed to recreate multiviews is the MiDaS Convolutional Neural Network (CNN), which extracts a depth map from a single 2D image. Artificial Intelligence (AI) computing techniques are applied to improve the overall performance of the extended altiro3D Library. Thus, altiro3D can now treat standard images, video streams, or screen portions of a Desktop where other apps may also be running (such as web browsers or video chats) and render them into 3D. To achieve the latter, a screen region needs to be selected so that the output can be fed directly into a light-field 3D device such as the Looking Glass (LG) Portrait. To simplify the acquisition of a Desktop screen area by the user, a multi-platform Graphical User Interface has also been implemented. Sources available at: https://github.com/canessae/altiro3D/releases/tag/2.0.0
Chinese: 扩展版 altiro3D C++ 库现可利用人工智能深度提取技术,实时处理二维图像、视频流或桌面屏幕区域,生成光场以驱动如 Looking Glass Portrait 等 3D 显示设备,并配备了跨平台图形界面。
English: The extended altiro3D C++ Library now processes 2D images, video streams, or desktop screen areas in real-time using AI-driven depth extraction to generate light-fields for 3D displays like Looking Glass Portrait, supported by a multi-platform GUI.
Authors:Songqiao Hu, Zeyi Liu, Xiao He
Abstract:
The change in data distribution over time, also known as concept drift, poses a significant challenge to the reliability of online learning methods. Existing methods typically require model retraining or drift detection, both of which demand high computational costs and are often unsuitable for real-time applications. To address these limitations, a lightweight, fast and efficient random vector functional-link network termed Lite-RVFL is proposed, capable of adapting to concept drift without drift detection and retraining. Lite-RVFL introduces a novel objective function that assigns exponentially increasing weights to new samples, thereby emphasizing recent data and enabling timely adaptation. Theoretical analysis confirms the feasibility of this objective function for drift adaptation, and an efficient incremental update rule is derived. Experimental results on a real-world safety assessment task validate the efficiency of Lite-RVFL, its effectiveness in adapting to drift, and its potential to capture temporal patterns. The source code is available at https://github.com/songqiaohu/Lite-RVFL.
Chinese: Lite-RVFL是一种轻量级网络,通过指数加权函数强调新数据来适应概念漂移,无需重新训练或漂移检测,并在实际应用中验证了其高效性。
English: Lite-RVFL is a lightweight network that adapts to concept drift by emphasizing recent data through an exponential weighting function, eliminating the need for retraining or drift detection while proving efficient in real-world applications.
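A batch-mode sketch of an exponentially weighted RVFL fit, under assumed notation, is below; the paper's contribution includes an incremental update rule that avoids refitting from scratch, which is not shown here:

import numpy as np

def lite_rvfl_style_fit(X, Y, n_hidden=100, lam=1e-3, gamma=1.02, seed=0):
    # RVFL: fixed random hidden layer plus direct input links; only the output
    # weights are solved for, here via exponentially weighted ridge regression
    # so newer samples dominate (drift adaptation without detection).
    rng = np.random.default_rng(seed)
    W_in = rng.standard_normal((X.shape[1], n_hidden))
    H = np.hstack([X, np.tanh(X @ W_in)])          # direct links + random features
    w = gamma ** np.arange(len(X))                 # exponentially increasing weights
    A = H.T @ (H * w[:, None]) + lam * np.eye(H.shape[1])
    B = H.T @ (Y * w[:, None])
    return W_in, np.linalg.solve(A, B)

X = np.random.default_rng(1).standard_normal((200, 8))
Y = np.eye(3)[np.random.default_rng(2).integers(0, 3, 200)]   # one-hot targets
W_in, beta = lite_rvfl_style_fit(X, Y)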
Authors:Yiming Wang, Hao Peng, Senzhang Wang, Haohua Du, Chunyang Liu, Jia Wu, Guanlin Wu
Abstract:
Traffic data imputation is fundamentally important for supporting various applications in intelligent transportation systems, such as traffic flow prediction. However, existing time-to-space sequential methods often fail to effectively extract features in block-wise missing data scenarios. Meanwhile, the static graph structure for spatial feature propagation significantly constrains the model's flexibility in handling the distribution shift issue for nonstationary traffic data. To address these issues, this paper proposes a SpatioTemporal Attention Mixture of experts network named STAMImputer for traffic data imputation. Specifically, we introduce a Mixture of Experts (MoE) framework to capture latent spatio-temporal features and their influence weights, effectively imputing block missing data. A novel Low-rank guided Sampling Graph ATtention (LrSGAT) mechanism is designed to dynamically balance the local and global correlations across road networks. The sampled attention vectors are utilized to generate dynamic graphs that capture real-time spatial correlations. Extensive experiments are conducted on four traffic datasets for evaluation. The results show STAMImputer achieves significant performance improvement compared with existing SOTA approaches. Our codes are available at https://github.com/RingBDStack/STAMImupter.
Chinese: 本文提出STAMImputer模型,通过动态图生成和专家混合框架有效处理交通数据中的块状缺失问题,在四个数据集上的实验表明其性能显著优于现有最优方法。
English: This paper introduces STAMImputer, a SpatioTemporal Attention Mixture of experts network that effectively handles block-wise missing traffic data through dynamic graph generation and achieves superior performance over existing methods.
Authors:Anh-Quan Cao, Ivan Lopes, Raoul de Charette
Abstract:
Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes image generators for latent regression, adapting a denoising framework with task encoding, per-task conditioning, and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.
Chinese: StableMTL提出了一种基于扩散模型的零样本多任务学习方法,利用部分标注的合成数据集进行训练,通过统一潜在损失和任务注意力机制促进任务间协同,在多个基准测试中超越基线模型。
English: StableMTL introduces a zero-shot multi-task learning method using diffusion models that trains on synthetic datasets with partial labels and employs a unified latent loss with task-attention for cross-task synergy, outperforming baselines across multiple benchmarks.
Authors:Boya Zeng, Yida Yin, Zhiqiu Xu, Zhuang Liu
Abstract:
Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Surprisingly, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. We further show that this memorization cannot be effectively mitigated by modifying modeling factors commonly associated with memorization in image diffusion models, or applying data augmentations. Our findings provide a realistic assessment of what types of data current generative models can model, and highlight the need for more careful evaluation of generative models in new domains. Our code is available at https://github.com/boyazeng/weight_memorization.
Chinese: 本研究发现,现有生成模型在合成神经网络权重时主要依赖记忆或简单插值训练检查点,未能超越基础方法,这源于数据限制及结构先验利用不足。
English: This study finds that current generative models for synthesizing neural network weights primarily produce memorized or interpolated versions of training checkpoints, failing to outperform simple baselines due to data limitations and inadequate use of structural priors.
Authors:Boya Zeng, Yida Yin, Zhiqiu Xu, Zhuang Liu
Abstract:
Generative models have recently been explored for synthesizing neural network weights. These approaches take neural network checkpoints as training data and aim to generate high-performing weights during inference. In this work, we examine four representative, well-known methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Contrary to claims in prior work, we find that these methods synthesize weights largely by memorization: they produce either replicas, or, at best, simple interpolations of the training checkpoints. Moreover, they fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. Our further analysis suggests that this memorization might result from limited data, overparameterized models, and the underuse of structural priors specific to weight data. These findings highlight the need for more careful design and rigorous evaluation of generative models when applied to new domains. Our code is available at https://github.com/boyazeng/weight_memorization.
Chinese: 本研究发现,现有生成模型在合成神经网络权重时主要依赖记忆或简单插值训练检查点,未能超越基础方法,这源于数据限制及结构先验利用不足。
English: This study finds that current generative models for synthesizing neural network weights primarily produce memorized or interpolated versions of training checkpoints, failing to outperform simple baselines due to data limitations and inadequate use of structural priors.
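The two simple baselines both abstracts name (noise perturbation and weight ensembling) are easy to state in code. A minimal sketch, assuming a list of PyTorch state dicts with float tensors:

import torch

def noise_baseline(sd, sigma=1e-3):
    # Perturb a single trained checkpoint with Gaussian noise.
    return {k: v + sigma * torch.randn_like(v) for k, v in sd.items()}

def ensemble_baseline(sds):
    # Average several checkpoints ("simple weight ensemble").
    return {k: torch.stack([sd[k] for sd in sds]).mean(dim=0) for k in sds[0]}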
Authors:Haoguang Lu, Jiacheng Chen, Zhenguo Yang, Aurele Tohokantche Gnanha, Fu Lee Wang, Li Qing, Xudong Mao
Abstract:
Recent advancements in text-guided image editing have achieved notable success by leveraging natural language prompts for fine-grained semantic control. However, certain editing semantics are challenging to specify precisely using textual descriptions alone. A practical alternative involves learning editing semantics from paired source-target examples. Existing exemplar-based editing methods still rely on text prompts describing the change within paired examples or learning implicit text-based editing instructions. In this paper, we introduce PairEdit, a novel visual editing method designed to effectively learn complex editing semantics from a limited number of image pairs or even a single image pair, without using any textual guidance. We propose a target noise prediction that explicitly models semantic variations within paired images through a guidance direction term. Moreover, we introduce a content-preserving noise schedule to facilitate more effective semantic learning. We also propose optimizing distinct LoRAs to disentangle the learning of semantic variations from content. Extensive qualitative and quantitative evaluations demonstrate that PairEdit successfully learns intricate semantics while significantly improving content consistency compared to baseline methods. Code will be available at https://github.com/xudonmao/PairEdit.
Chinese: PairEdit是一种新颖的视觉编辑方法,无需文本指导即可从少量图像对中学习复杂的编辑语义,通过目标噪声预测和内容保留技术有效提升语义学习能力和内容一致性。
English: PairEdit is a novel visual editing method that learns complex editing semantics from a limited number of image pairs without textual guidance, utilizing target noise prediction and content-preserving techniques to enhance semantic learning and content consistency.
Authors:Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
Abstract:
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at https://github.com/Vchitect/TACA.
Chinese: 本文提出的温度调节跨模态注意力方法(TACA)有效解决了多模态扩散变换器中的注意力失衡问题,以微小计算成本显著提升了FLUX和SD3.5等模型的图文对齐效果。
English: The proposed Temperature-Adjusted Cross-modal Attention (TACA) method effectively addresses attention imbalance issues in multimodal diffusion transformers, significantly improving text-image alignment in models like FLUX and SD3.5 with minimal computational cost.
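A minimal sketch of temperature-scaled cross-modal attention follows. The token layout (text keys last) and the timestep schedule (stronger boost at high noise levels) are assumptions for illustration, not the paper's exact rule:

import torch
import torch.nn.functional as F

def taca_style_attention(q, k, v, n_text, t, T=1000, alpha=1.5):
    # Joint attention over [visual | text] tokens. Text-token logits are
    # rescaled by a timestep-dependent temperature so the few text tokens
    # are not drowned out by the many visual tokens.
    logits = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    temp = 1.0 + (alpha - 1.0) * (t / T)          # assumed linear schedule
    logits[..., -n_text:] *= temp                 # text keys occupy the last slots
    return F.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(1, 4096 + 77, 64)         # e.g. 4096 visual + 77 text tokens
out = taca_style_attention(q, k, v, n_text=77, t=800)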
Authors:Jiahao Meng, Shuyang Sun, Yue Tan, Lu Qi, Yunhai Tong, Xiangtai Li, Longyin Wen
Abstract:
Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations could be more severe for models with fewer parameters. To address these limitations, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors forward processes of the MLLM and collects intermediate interpretations, such as attention drift, then the controller determines when and how to trigger self-correction and generate feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance even comparable to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks, such as VideoMME and WorldSense, highlighting its effectiveness and generalization capabilities in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.
Chinese: 针对当前多模态大语言模型在处理长视频或复杂视频时存在的不足,我们提出了CyberV框架,通过引入控制论循环实现推理过程中的自适应监控、自我纠正和动态资源分配,无需重新训练即可显著提升模型性能。
English: To overcome the limitations of current Multimodal Large Language Models (MLLMs) in processing long or complex videos, we propose CyberV, a cybernetic framework that enables adaptive self-monitoring, self-correction, and dynamic resource allocation during inference, significantly improving performance without retraining.
Authors:Jacob Helwig, Sai Sreeharsha Adavi, Xuan Zhang, Yuchao Lin, Felix S. Chim, Luke Takeshi Vizzini, Haiyang Yu, Muhammad Hasnain, Saykat Kumar Biswas, John J. Holloway, Narendra Singh, N. K. Anand, Swagnik Guhathakurta, Shuiwang Ji
Abstract:
We consider the problem of modeling high-speed flows using machine learning methods. While most prior studies focus on low-speed fluid flows in which uniform time-stepping is practical, flows approaching and exceeding the speed of sound exhibit sudden changes such as shock waves. In such cases, it is essential to use adaptive time-stepping methods to allow a temporal resolution sufficient to resolve these phenomena while simultaneously balancing computational costs. Here, we propose a two-phase machine learning method, known as ShockCast, to model high-speed flows with adaptive time-stepping. In the first phase, we propose to employ a machine learning model to predict the timestep size. In the second phase, the predicted timestep is used as an input along with the current fluid fields to advance the system state by the predicted timestep. We explore several physically-motivated components for timestep prediction and introduce timestep conditioning strategies inspired by neural ODE and Mixture of Experts. As ShockCast is the first framework for learning high-speed flows, we evaluate our methods by generating two supersonic flow datasets, available at https://huggingface.co/datasets/divelab. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).
Chinese: 本研究提出ShockCast,一种两阶段机器学习框架,通过自适应时间步长来高效模拟含激波的高速流动,利用时间步长预测和条件策略解决计算挑战。
English: This study introduces ShockCast, a two-phase machine learning framework that employs adaptive time-stepping to efficiently model high-speed flows with shock waves, addressing computational challenges through timestep prediction and conditioning strategies.
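The two-phase structure (predict a timestep, then advance the state by it) can be sketched with minimal stand-in networks; ShockCast's actual architectures and its neural-ODE- and Mixture-of-Experts-inspired conditioning are richer than this skeleton:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TimestepPredictor(nn.Module):
    # Phase 1: map current fields to a positive timestep size.
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
    def forward(self, u):
        return F.softplus(self.net(u))            # dt > 0

class FlowStepper(nn.Module):
    # Phase 2: advance the state by the predicted dt (fed as an extra channel).
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv2d(channels + 1, channels, 3, padding=1)
    def forward(self, u, dt):
        dt_map = dt.view(-1, 1, 1, 1).expand(-1, 1, *u.shape[-2:])
        return u + self.net(torch.cat([u, dt_map], dim=1))   # residual update

u = torch.randn(2, 4, 64, 64)      # e.g. density, pressure, two velocity components
dt = TimestepPredictor()(u)
u_next = FlowStepper()(u, dt)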
Authors:Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, Rongrong Ji
Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs require integrating multiple atomic spatial capabilities to handle complex and dynamic tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150+ hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various evaluation settings like point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins. Through our careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that the shortcoming of counting capability greatly limits the compositional spatial capabilities of existing MLLMs. The evaluation code and benchmark datasets are available at https://github.com/Cuzyoung/SpaCE-10.
Chinese: 针对现有基准难以全面评估多模态大模型空间智能的问题,SpaCE-10基准通过定义10项原子空间能力和8项组合能力,构建了包含5,000余对问答数据的评估体系,发现现有模型在计数能力等关键方面仍远逊于人类表现。
English: To address the limitations of existing benchmarks in evaluating multimodal large language models' spatial intelligence, the SpaCE-10 benchmark introduces 10 atomic and 8 compositional spatial capabilities with over 5,000 QA pairs, revealing that current models significantly trail human performance, particularly in counting skills.
Authors:Wenxin Tang, Jingyu Xiao, Wenxuan Jiang, Xi Xiao, Yuhang Wang, Xuxin Tang, Qing Li, Yuehe Ma, Junliang Liu, Shisong Tang, Michael R. Lyu
Abstract:
Manual slide creation is labor-intensive and requires expert prior knowledge. Existing natural language-based LLM generation methods struggle to capture the visual and structural nuances of slide designs. To address this, we formalize the Reference Image to Slide Generation task and propose Slide2Code, the first benchmark with difficulty-tiered samples based on a novel Slide Complexity Metric. We introduce SlideCoder, a layout-aware, retrieval-augmented framework for generating editable slides from reference images. SlideCoder integrates a Color Gradient-based Segmentation algorithm and a Hierarchical Retrieval-Augmented Generation method to decompose complex tasks and enhance code generation. We also release SlideMaster, a 7B open-source model fine-tuned with improved reverse-engineered data. Experiments show that SlideCoder outperforms state-of-the-art baselines by up to 40.5 points, demonstrating strong performance across layout fidelity, execution accuracy, and visual consistency. Our code is available at https://github.com/vinsontang1/SlideCoder.
Chinese: 本文提出SlideCoder框架,通过结合布局感知算法和分层检索方法,实现了从参考图像生成可编辑幻灯片的功能,在多个评估维度上显著优于现有方法。
English: This paper introduces SlideCoder, a novel framework that generates editable slides from reference images by integrating layout-aware algorithms and hierarchical retrieval methods, achieving significant improvements over existing methods.
Authors:Christopher Subia-Waud
Abstract:
Current AutoML platforms leave substantial performance untapped. Testing 180 fine-tuning tasks across models from 70M to 70B parameters, we found that HuggingFace AutoTrain, TogetherAI, Databricks, and Google Cloud consistently produce suboptimal configurations. Gradients, built on the Bittensor network, attacks this problem through competition. Independent miners race to find optimal hyperparameters, earning rewards proportional to their models' performance. This tournament drives exploration of configuration spaces that single-strategy methods never examine. In our experiments, Gradients achieved a 100% win rate against TogetherAI, Databricks, and Google Cloud, and beat HuggingFace AutoTrain in 82.8% of experiments. Mean improvements reached 42.1% against commercial platforms. Retrieval-augmented generation tasks saw 30-40% gains; diffusion models improved 23.4% on person-specific generation. When miners compete for rewards, they develop optimization strategies that centralized approaches overlook. These findings demonstrate that decentralized systems with economic incentives can systematically outperform traditional AutoML, suggesting market dynamics may be key to achieving superior fine-tuning results. Code is available at https://github.com/rayonlabs/G.O.D.
Chinese: 现有AutoML平台表现不佳,而基于Bittensor网络的去中心化系统Gradients通过竞争性矿工的经济激励,在超参数优化上表现卓越,平均提升达42.1%,显著优于传统方法。
English: Current AutoML platforms underperform, but Gradients, a decentralized system using competitive miners on the Bittensor network, consistently outperforms them by up to 42.1% through economic incentives that drive superior hyperparameter optimization.
Authors:Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Benson Li, Junwei Ma, Jesse C. Cresswell, Rahul G. Krishnan
Abstract:
Causal effect estimation from observational data is fundamental across various applications. However, selecting an appropriate estimator from dozens of specialized methods demands substantial manual effort and domain expertise. We present CausalPFN, a single transformer that amortizes this workflow: trained once on a large library of simulated data-generating processes that satisfy ignorability, it infers causal effects for new observational datasets out-of-the-box. CausalPFN combines ideas from Bayesian causal inference with the large-scale training protocol of prior-fitted networks (PFNs), learning to map raw observations directly to causal effects without any task-specific adjustment. Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to support reliable decision-making based on Bayesian principles. This ready-to-use model does not require any further training or tuning and takes a step toward automated causal inference (https://github.com/vdblm/CausalPFN).
Chinese: CausalPFN是一种基于Transformer的模型,通过模拟数据训练实现观测数据的自动因果效应估计,无需手动调整即可在基准测试中取得优异性能,并提供基于贝叶斯原则的校准不确定性评估以支持可靠决策。
English: CausalPFN is a transformer-based model trained on simulated data to provide automated causal effect estimation for observational datasets without requiring manual tuning, delivering superior performance on benchmarks and calibrated uncertainty estimates for reliable decision-making.
Authors:Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X. -F. Ye, Molei Tao
Abstract:
Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. In contrast, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches heavily rely on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process demands highly accurate encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach on text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
Chinese: 该框架通过解耦噪声调度,使多模态扩散模型能够原生生成跨模态的耦合数据,无需依赖外部预处理,并在单一模型中同时支持无条件生成和条件生成。
English: The proposed framework enables multimodal diffusion models to generate coupled data natively across different modalities without relying on external preprocessing, using a decoupled noise schedule to support both unconditional and conditioned generation within a single model.
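The decoupling is the key mechanism. The framework targets arbitrary state spaces (including discrete ones); the continuous Gaussian sketch below, with an assumed cosine schedule, only illustrates how per-modality noise levels yield both unconditional and conditioned generation:

import torch

def decoupled_noising(x_a, x_b, t_a, t_b):
    # Each modality gets its own noise level. Setting t_b = 0 keeps modality B
    # clean, so the same model covers unconditional and B-conditioned generation.
    def noise(x, t):
        alpha = torch.cos(t * torch.pi / 2)       # assumed cosine schedule, t in [0, 1]
        return alpha * x + (1 - alpha**2).sqrt() * torch.randn_like(x)
    return noise(x_a, t_a), noise(x_b, t_b)

img = torch.randn(2, 3, 32, 32)                   # modality A (e.g. image latents)
txt = torch.randn(2, 16, 64)                      # modality B (e.g. text embeddings)
noisy_img, clean_txt = decoupled_noising(img, txt, torch.tensor(0.7), torch.tensor(0.0))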
Authors:Sifan Wang, Zehao Dou, Tong-Rui Liu, Lu Lu
Abstract:
Recent advances in generative modeling -- particularly diffusion models and flow matching -- have achieved remarkable success in synthesizing discrete data such as images and videos. However, adapting these models to physical applications remains challenging, as the quantities of interest are continuous functions governed by complex physical laws. Here, we introduce FunDiff, a novel framework for generative modeling in function spaces. FunDiff combines a latent diffusion process with a function autoencoder architecture to handle input functions with varying discretizations, generate continuous functions evaluable at arbitrary locations, and seamlessly incorporate physical priors. These priors are enforced through architectural constraints or physics-informed loss functions, ensuring that generated samples satisfy fundamental physical laws. We theoretically establish minimax optimality guarantees for density estimation in function spaces, showing that diffusion-based estimators achieve optimal convergence rates under suitable regularity conditions. We demonstrate the practical effectiveness of FunDiff across diverse applications in fluid dynamics and solid mechanics. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy and low-resolution data. Code and datasets are publicly available at https://github.com/sifanexisted/fundiff.
Chinese: FunDiff是一种创新的函数空间生成框架,它结合了潜在扩散与函数自编码器,能够生成连续且符合物理规律的函数,在多种物理应用中实现了理论最优性和鲁棒性。
English: FunDiff is a novel generative framework for function spaces that integrates latent diffusion with function autoencoders to generate continuous, physically consistent functions while ensuring theoretical optimality and robustness across various physical applications.
Authors:Muhammad Ahmed Humais, Xiaoqian Huang, Hussain Sajwani, Sajid Javed, Yahya Zweiri
Abstract:
Event cameras unlock new frontiers that were previously unthinkable with standard frame-based cameras. One notable example is low-latency motion estimation (optical flow), which is critical for many real-time applications. In such applications, the computational efficiency of algorithms is paramount. Although recent deep learning paradigms such as CNNs, RNNs, and ViTs have shown remarkable performance, they often lack the desired computational efficiency. Conversely, asynchronous event-based methods, including SNNs and GNNs, are computationally efficient; however, these approaches fail to capture sufficient spatio-temporal information, which is required to achieve better performance for optical flow estimation. In this work, we introduce the Spatio-Temporal State Space Model (STSSM) module along with a novel network architecture to develop an extremely efficient solution with competitive performance. Our STSSM module leverages state-space models to effectively capture spatio-temporal correlations in event data, offering higher performance with lower complexity compared to ViT- and CNN-based architectures in similar settings. Our model achieves 4.5x faster inference and 8x lower computation compared to TMA, and 2x lower computation compared to EV-FlowNet, with competitive performance on the DSEC benchmark. Our code will be available at https://github.com/AhmedHumais/E-STMFlow
Chinese: 本研究提出的时空状态空间模型(STSSM)能高效捕捉事件数据中的时空关联,在保持光流估计性能竞争力的同时,实现了显著提升的推理速度和大幅降低的计算开销。
English: This study introduces a Spatio-Temporal State Space Model (STSSM) that efficiently captures spatio-temporal correlations in event data, achieving significantly faster inference and lower computational costs while maintaining competitive optical flow performance on benchmarks.
Authors:Jinxi Li, Ziyang Song, Siyuan Zhou, Bo Yang
Abstract:
In this paper, we aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. By applying various governing PDEs as PINN losses or incorporating physics simulation into neural networks, existing works often fail to learn complex physical motions at boundaries or require object priors such as masks or types. In this paper, we propose FreeGave to learn the physics of complex dynamic 3D scenes without needing any object priors. The key to our approach is to introduce a physics code followed by a carefully designed divergence-free module for estimating a per-Gaussian velocity field, without relying on the inefficient PINN losses. Extensive experiments on three public datasets and a newly collected challenging real-world dataset demonstrate the superior performance of our method for future frame extrapolation and motion segmentation. Most notably, our investigation into the learned physics codes reveals that they truly learn meaningful 3D physical motion patterns in the absence of any human labels in training.
Chinese: 本文提出FreeGave方法,无需任何物体先验知识,仅通过多视角视频学习复杂动态3D场景的物理特性,采用物理编码和无散度模块估计速度场,在帧预测和运动分割任务中展现出优越性能。
English: This paper introduces FreeGave, a method that learns the physics of complex dynamic 3D scenes from multi-view videos without requiring object priors, using a physics code and divergence-free module to estimate velocity fields and demonstrating superior performance in future frame extrapolation and motion segmentation.
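The internal design of FreeGave's divergence-free module is the paper's contribution; the standard construction that guarantees zero divergence, shown below as a sketch, is to output a vector potential and take its curl:

import torch
import torch.nn as nn

def divergence_free_velocity(xyz, potential_net):
    # v = curl(A) satisfies div(v) = 0 identically, for any smooth A.
    xyz = xyz.detach().requires_grad_(True)
    A = potential_net(xyz)                               # (N, 3) vector potential
    J = [torch.autograd.grad(A[:, i].sum(), xyz, create_graph=True)[0]
         for i in range(3)]                              # J[i][:, j] = dA_i/dx_j
    return torch.stack([J[2][:, 1] - J[1][:, 2],
                        J[0][:, 2] - J[2][:, 0],
                        J[1][:, 0] - J[0][:, 1]], dim=1)

net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 3))
v = divergence_free_velocity(torch.randn(8, 3), net)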
Authors:Zihui Zhang, Weisheng Dai, Hongtao Wen, Bo Yang
Abstract:
We study the problem of unsupervised 3D semantic segmentation on raw point clouds without needing human labels in training. Existing methods usually formulate this problem as learning per-point local features followed by a simple grouping strategy, lacking the ability to discover additional and possibly richer semantic priors beyond local features. In this paper, we introduce LogoSP to learn 3D semantics from both local and global point features. The key to our approach is to discover 3D semantic information by grouping superpoints according to their global patterns in the frequency domain, thus generating highly accurate semantic pseudo-labels for training a segmentation network. Extensive experiments on two indoor datasets and one outdoor dataset show that our LogoSP surpasses all existing unsupervised methods by large margins, achieving state-of-the-art performance for unsupervised 3D semantic segmentation. Notably, our investigation into the learned global patterns reveals that they truly represent meaningful 3D semantics in the absence of human labels during training.
Chinese: 本文提出LogoSP方法,通过结合局部和全局点特征,在频域中依据全局模式对超点进行分组以生成精确的语义伪标签,在无需人工标注的情况下实现了无监督3D语义分割的最优性能。
English: This paper introduces LogoSP, an unsupervised 3D semantic segmentation method that learns from both local and global point features by grouping superpoints based on frequency-domain patterns to generate accurate pseudo-labels, achieving state-of-the-art performance on multiple datasets without human annotations.
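One plausible reading of "grouping superpoints by their global patterns in the frequency domain" is sketched below, with assumed inputs (pooled per-superpoint features); the paper's actual pipeline may differ substantially:

import numpy as np
from sklearn.cluster import KMeans

def pseudo_label_superpoints(sp_feats, n_groups=20):
    # sp_feats: (num_superpoints, dim) pooled features, one row per superpoint.
    spectra = np.abs(np.fft.rfft(sp_feats, axis=1))      # global frequency pattern
    spectra /= np.linalg.norm(spectra, axis=1, keepdims=True) + 1e-8
    return KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(spectra)

labels = pseudo_label_superpoints(np.random.default_rng(0).standard_normal((500, 64)))
# `labels` would then serve as semantic pseudo-labels for a segmentation network.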
Authors:Shijie Wang, Yilun Zhang, Zeyu Lai, Dexing Kong
Abstract:
Multimodal large language models (MLLMs) have shown great potential in general domains but perform poorly in some specific domains due to a lack of domain-specific data, such as image-text or video-text data. In some specific domains, abundant graphic and textual data exist but are scattered and lack standardized arrangement. In the field of medical ultrasound, there are ultrasonic diagnostic books, ultrasonic clinical guidelines, ultrasonic diagnostic reports, and so on. However, these ultrasonic materials are often saved in formats such as PDF and images, and cannot be directly used for the training of MLLMs. This paper proposes a novel image-text reasoning supervised fine-tuning data generation pipeline to create domain-specific quadruplets (image, question, thinking trace, and answer) from domain-specific materials. A medical ultrasound domain dataset, ReMUD, is established, containing over 45,000 reasoning and non-reasoning supervised fine-tuning Question Answering (QA) and Visual Question Answering (VQA) samples. The ReMUD-7B model, fine-tuned on Qwen2.5-VL-7B-Instruct, outperforms general-domain MLLMs in the medical ultrasound field. To facilitate research, the ReMUD dataset, data generation codebase, and ReMUD-7B parameters will be released at https://github.com/ShiDaizi/ReMUD, addressing the data shortage issue in specific-domain MLLMs.
Chinese: 本文提出了一种新颖的数据生成流程,能从非结构化材料中创建特定领域四元组,构建了包含超过45,000条问答数据的ReMUD数据集,并证明基于该数据集微调的ReMUD-7B模型在医学超声领域优于通用多模态大模型。
English: This paper introduces a novel data generation pipeline that creates domain-specific quadruplets from unstructured materials, producing the ReMUD dataset with over 45,000 QA/VQA entries and demonstrating that the fine-tuned ReMUD-7B model outperforms general MLLMs in medical ultrasound applications.
Authors:Michael K. Chen, Xikun Zhang, Jiaxing Huang, Dacheng Tao
Abstract:
Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase "ribonucleic acid" as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments ("rib", "on", ...), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token finetuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm
Chinese: 大语言模型受限于逐词预测,但新提出的概念感知微调方法通过多词学习增强了概念理解能力,在多项任务中表现优异。
English: Large language models are limited by next-token prediction, but the new Concept-Aware Fine-Tuning method enables multi-token learning for improved conceptual understanding across various tasks.
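A generic multi-token fine-tuning objective can be sketched as follows, assuming auxiliary prediction heads where head k predicts the token k+1 steps ahead; CAFT's concrete formulation is defined in the paper:

import torch
import torch.nn.functional as F

def multi_token_loss(head_logits, targets):
    # head_logits[k]: (B, T, V) logits from head k, supervised with tokens
    # k+1 steps ahead, so a span like "ribonucleic acid" is learned jointly
    # rather than one fragment at a time.
    loss = 0.0
    for k, logits in enumerate(head_logits):
        shifted = targets[:, k + 1:]
        trimmed = logits[:, :shifted.size(1)]
        loss += F.cross_entropy(trimmed.reshape(-1, trimmed.size(-1)),
                                shifted.reshape(-1))
    return loss / len(head_logits)

B, T, V = 2, 12, 100
heads = [torch.randn(B, T, V) for _ in range(3)]          # 3 future-token heads
tokens = torch.randint(0, V, (B, T))
print(multi_token_loss(heads, tokens))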
Authors:Tieyuan Chen, Huabin Liu, Yi Wang, Chaofan Gan, Mingxi Lyu, Gui Zou, Weiyao Lin
Abstract:
Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant performance degradation. To fill this gap, we introduce a novel task and dataset, Implicit Video Question Answering (I-VQA), which focuses on answering questions in scenarios where explicit visual evidence is inaccessible. Given an implicit question and its corresponding video, I-VQA requires answering based on the contextual visual cues present within the video. To tackle I-VQA, we propose a novel reasoning framework, IRM (Implicit Reasoning Model), incorporating dual-stream modeling of contextual actions and intent clues as implicit reasoning chains. IRM comprises the Action-Intent Module (AIM) and the Visual Enhancement Module (VEM). AIM deduces and preserves question-related dual clues by generating clue candidates and performing relation deduction. VEM enhances contextual visual representation by leveraging key contextual clues. Extensive experiments validate the effectiveness of our IRM in I-VQA tasks, outperforming GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by 0.76%, 1.37%, and 4.87%, respectively. Additionally, IRM achieves state-of-the-art results on similar implicit advertisement understanding and future prediction in traffic-VQA. Datasets and codes are available for double-blind review in the anonymous repo: https://github.com/tychen-SJTU/Implicit-VideoQA.
Chinese: 本文提出了隐式视频问答(I-VQA)新任务及数据集,针对无法直接获取显式视觉证据的场景,并设计了结合动作与意图双线索推理的隐式推理模型(IRM),在多项任务中性能超越现有先进模型。
English: The paper introduces a new task and dataset called Implicit Video Question Answering (I-VQA), which addresses questions requiring symbolic or deeper understanding beyond explicit visual evidence, and proposes a dual-stream Implicit Reasoning Model (IRM) that outperforms existing models in performance.
Authors:Jie Bao, Chuangyin Dang, Rui Luo, Hanwei Zhang, Zhixin Zhou
Abstract:
As deep learning models are increasingly deployed in high-risk applications, robust defenses against adversarial attacks and reliable performance guarantees become paramount. Moreover, accuracy alone does not provide sufficient assurance or reliable uncertainty estimates for these models. This study advances adversarial training by leveraging principles from Conformal Prediction. Specifically, we develop an adversarial attack method, termed OPSA (OPtimal Size Attack), designed to reduce the efficiency of conformal prediction at any significance level by maximizing model uncertainty without requiring coverage guarantees. Correspondingly, we introduce OPSA-AT (Adversarial Training), a defense strategy that integrates OPSA within a novel conformal training paradigm. Experimental evaluations demonstrate that our OPSA attack method induces greater uncertainty compared to baseline approaches for various defenses. Conversely, our OPSA-AT defensive model significantly enhances robustness not only against OPSA but also other adversarial attacks, and maintains reliable prediction. Our findings highlight the effectiveness of this integrated approach for developing trustworthy and resilient deep learning models for safety-critical domains. Our code is available at https://github.com/bjbbbb/Enhancing-Adversarial-Robustness-with-Conformal-Prediction.
中文: 本研究提出了OPSA对抗性攻击方法以增加模型不确定性,并开发了OPSA-AT防御策略,通过整合对抗训练与保形预测来提升模型鲁棒性并保持可靠预测性能。
English: This study introduces OPSA, an adversarial attack that increases model uncertainty, and OPSA-AT, a defense strategy using conformal training to enhance robustness and maintain reliable predictions in deep learning models.
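For readers unfamiliar with the efficiency notion OPSA attacks, the following sketch shows plain split conformal prediction and the average prediction-set size it produces; all data here are synthetic stand-ins for a real classifier's softmax outputs.

```python
# Minimal sketch of split conformal prediction. OPSA targets the *size*
# of these prediction sets: larger average sets mean lower efficiency.
# Synthetic Dirichlet samples stand in for real model probabilities.
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    # Nonconformity score: 1 - probability assigned to the true class.
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected (1 - alpha) quantile of calibration scores.
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)
    # A class enters the set when its nonconformity is below the threshold.
    return test_probs >= 1.0 - q   # boolean [n_test, n_classes] mask

rng = np.random.default_rng(0)
cal = rng.dirichlet(np.ones(10), size=500)
sets = conformal_sets(cal, cal.argmax(1), rng.dirichlet(np.ones(10), 200))
print("average set size:", sets.sum(axis=1).mean())
```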
Authors:Yixuan Huang, Jie Yang, Shuqiang Xia, Chao-Kai Wen, Shi Jin
Abstract:
The low-altitude economy is emerging as a key driver of future economic growth, necessitating effective flight activity surveillance using existing mobile cellular network sensing capabilities. However, traditional monostatic and localization-based sensing methods face challenges in fusing sensing results and matching channel parameters. To address these challenges, we model low-altitude surveillance as a compressed sensing (CS)-based imaging problem by leveraging the cooperation of multiple base stations and the inherent sparsity of aerial images. Additionally, we derive the point spread function to analyze the influences of different antenna, subcarrier, and resolution settings on the imaging performance. Given the random spatial distribution of unmanned aerial vehicles (UAVs), we propose a physics-embedded learning method to mitigate off-grid errors in traditional CS-based approaches. Furthermore, to enhance rare UAV detection in vast low-altitude airspace, we integrate an online hard example mining scheme into the loss function design, enabling the network to adaptively focus on samples with significant discrepancies from the ground truth during training. Simulation results demonstrate the effectiveness of the proposed low-altitude surveillance framework. The proposed physics-embedded learning algorithm achieves a 97.55% detection rate, significantly outperforming traditional CS-based methods under off-grid conditions. Part of the source code for this paper will soon be available at https://github.com/kiwi1944/LAEImager.
中文: 本文提出了一种基于压缩感知的低空监视成像框架,通过多基站协作和物理嵌入学习方法克服离网误差并提升无人机检测能力,在仿真中实现了97.55%的检测率。
English: This paper proposes a compressed sensing-based imaging framework for low-altitude surveillance that utilizes multi-base station cooperation and physics-embedded learning to overcome off-grid errors and enhance UAV detection, achieving a 97.55% detection rate in simulations.
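The classical CS baseline the paper improves on can be illustrated with a few lines of ISTA; the measurement matrix `A` below is a random stand-in for the actual multi-base-station channel model, and the paper's physics-embedded network replaces this iterative solver.

```python
# Minimal sketch of the CS formulation behind the imaging model: recover
# a sparse aerial "image" x from linear measurements y = A x + noise via
# ISTA. Illustrative assumptions throughout; not the paper's algorithm.
import numpy as np

def ista(A, y, lam=0.05, n_iter=300):
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - (A.T @ (A @ x - y)) / L    # gradient step on 0.5*||Ax - y||^2
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(1)
x_true = np.zeros(200)
x_true[rng.choice(200, 5, replace=False)] = 1.0   # 5 sparse "UAVs"
A = rng.standard_normal((80, 200)) / np.sqrt(80)
x_hat = ista(A, A @ x_true + 0.01 * rng.standard_normal(80))
print("recovered support:", np.argsort(-np.abs(x_hat))[:5])
```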
Authors:Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang
Abstract:
Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs. Our code is available at https://github.com/yannqi/RCTS-RAG.
Chinese: 本研究提出名为RCTS的多模态RAG框架,通过构建富含推理背景的知识库和树搜索重排序方法增强大视觉语言模型,在多项VQA任务中实现了最先进的性能表现。
English: This study introduces RCTS, a multimodal RAG framework that enhances Large Vision Language Models by enriching the knowledge base with reasoning contexts and employing a tree search re-ranking method, achieving state-of-the-art performance on VQA tasks.
Authors:Mohamed Djilani, Nassim Ali Ousalah, Nidhal Eddine Chenni
Abstract:
We introduce a trend-aware and visually-grounded fashion recommendation system that integrates deep visual representations, garment-aware segmentation, semantic category similarity and user behavior simulation. Our pipeline extracts focused visual embeddings by masking non-garment regions via semantic segmentation followed by feature extraction using pretrained CNN backbones (ResNet-50, DenseNet-121, VGG16). To simulate realistic shopping behavior, we generate synthetic purchase histories influenced by user-specific trendiness and item popularity. Recommendations are computed using a weighted scoring function that fuses visual similarity, semantic coherence and popularity alignment. Experiments on the DeepFashion dataset demonstrate consistent gender alignment and improved category relevance, with ResNet-50 achieving 64.95% category similarity and the lowest popularity MAE. An ablation study confirms the complementary roles of visual and popularity cues. Our method provides a scalable framework for personalized fashion recommendations that balances individual style with emerging trends. Our implementation is available at https://github.com/meddjilani/FashionRecommender
Chinese: 该系统通过融合服装视觉分析与用户行为模拟,实现了兼顾个人风格与流行趋势的个性化时尚推荐。
English: This system combines visual garment analysis with user behavior simulation to deliver personalized fashion recommendations that balance individual style with current trends.
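The weighted scoring function is straightforward to sketch; the weights and inputs below are illustrative assumptions, not the paper's tuned values.

```python
# Minimal sketch of the weighted scoring idea: fuse visual similarity,
# category coherence, and popularity alignment into one ranking score.
import numpy as np

def score_items(query_vec, item_vecs, same_category, popularity,
                user_trendiness, w=(0.6, 0.25, 0.15)):
    # Cosine similarity between garment embeddings (non-garment pixels
    # are assumed to have been masked out before feature extraction).
    sims = item_vecs @ query_vec / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(query_vec))
    # Popularity alignment: trendy users prefer popular items.
    pop_align = 1.0 - np.abs(popularity - user_trendiness)
    return w[0] * sims + w[1] * same_category + w[2] * pop_align

rng = np.random.default_rng(2)
items = rng.standard_normal((100, 512))
scores = score_items(items[0], items, rng.integers(0, 2, 100),
                     rng.random(100), user_trendiness=0.8)
print("top-5 recommendations:", np.argsort(-scores)[:5])
```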
Authors:Adam Breuer
Abstract:
In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result by showing a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity) -- exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.
Chinese: 本文首次提出具有可证明保证的LDA主题推断实用算法,通过创新组合方法实现指数级加速,同时保持可解释性以支持下游因果分析。
English: This paper introduces the first practical algorithms with provable guarantees for LDA topic inference, using a novel combinatorial approach that achieves exponential speedup and maintains interpretability for downstream causal analysis.
Authors:Seungho Baek, Taegeon Park, Jongchan Park, Seungjun Oh, Yusung Kim
Abstract:
Existing offline hierarchical reinforcement learning methods rely on high-level policy learning to generate subgoal sequences. However, their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. By embedding states into a Temporal Distance Representation (TDR) space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency (TE) metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior offline HRL methods across locomotion, navigation, and manipulation tasks. Notably, in the most stitching-critical task, it achieves a score of 88.3, dramatically surpassing the previous state-of-the-art score of 1.0. Our source code is available at: https://github.com/qortmdgh4141/GAS.
中文: 提出的图辅助拼接(GAS)框架通过将子目标选择构建为图搜索问题,利用时序距离表示和时序效率指标优化状态转换拼接,在多项任务中显著超越了现有离线分层强化学习方法。
English: The proposed Graph-Assisted Stitching (GAS) framework replaces explicit high-level policy learning with graph-based subgoal selection, utilizing Temporal Distance Representation and Temporal Efficiency metrics to enhance transition stitching and achieve superior performance across various tasks compared to prior offline hierarchical reinforcement learning methods.
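The graph-search view of subgoal selection can be shown with a toy graph; `networkx` shortest paths stand in for GAS's TDR clustering and TE filtering, which are omitted here.

```python
# Minimal sketch of graph-assisted subgoal selection: nodes are state
# clusters, edges are stitched transitions weighted by estimated
# temporal distance, and subgoals are read off a shortest path.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("start", "a", 1.0), ("a", "b", 1.0), ("b", "goal", 1.0),
    ("start", "c", 0.5), ("c", "goal", 3.5),   # a longer detour
])

subgoals = nx.shortest_path(G, "start", "goal", weight="weight")
print("subgoal sequence for the low-level policy:", subgoals)
```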
Authors:Shadi Hamdan, Chonghao Sima, Zetong Yang, Hongyang Li, Fatma Güney
Abstract:
How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models for a timely response to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real-time inference speed at 50 ms.
中文: ETA采用异步双系统架构,将密集计算前移至历史时间步,使大模型能及时响应自动驾驶的每一帧,在保持近实时推理速度的同时将性能提升8%。
English: ETA introduces an asynchronous dual-system that shifts intensive computations to previous time steps, enabling large models to respond promptly in self-driving systems while maintaining near-real-time inference speeds and improving performance by 8%.
Authors:Mengyang Qiu, Tran Minh Nguyen, Zihao Huang, Zelong Li, Yang Gu, Qingyu Gao, Siliang Liu, Jungyeul Park
Abstract:
Grammatical Error Correction (GEC) relies on accurate error annotation and evaluation, yet existing frameworks, such as $\texttt{errant}$, face limitations when extended to typologically diverse languages. In this paper, we introduce a standardized, modular framework for multilingual grammatical error annotation. Our approach combines a language-agnostic foundation with structured language-specific extensions, enabling both consistency and flexibility across languages. We reimplement $\texttt{errant}$ using $\texttt{stanza}$ to support broader multilingual coverage, and demonstrate the framework's adaptability through applications to English, German, Czech, Korean, and Chinese, ranging from general-purpose annotation to more customized linguistic refinements. This work supports scalable and interpretable GEC annotation across languages and promotes more consistent evaluation in multilingual settings. The complete codebase and annotation tools can be accessed at https://github.com/open-writing-evaluation/jp_errant_bea.
中文: 本文提出了一种标准化的模块化多语言语法错误标注框架,结合语言无关的基础和特定语言扩展,在英语、德语、捷克语、韩语和汉语等多种语言中实现了标注的一致性与灵活性。
English: This paper introduces a standardized, modular framework for multilingual grammatical error annotation that combines language-agnostic foundations with language-specific extensions, enhancing consistency and flexibility across diverse languages including English, German, Czech, Korean, and Chinese.
Authors:Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang
Abstract:
As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose $\underline{\textbf{F}}$inetuning-$\underline{\textbf{a}}$ligned $\underline{\textbf{S}}$equential $\underline{\textbf{T}}$raining ($\textit{FAST}$), a novel training method specifically tailored for instruct models. $\textit{FAST}$ aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, $\textit{FAST}$ achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, $\textit{FAST}$ yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, $21.1\%$ scored in the top range, compared to $7.0\%$ and $10.2\%$ for $\textit{BT(P)}$ and $\textit{BT(F)}$. Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at https://github.com/Geaming2002/FAST.
中文摘要:FAST方法通过将训练过程与指令模型的数据分布和激活模式对齐,显著提升了稀疏自编码器在指令模型上的重建质量和特征可解释性,同时揭示了通过特殊令牌干预来精细调控模型行为的新途径。
English Summary: The proposed FAST method significantly enhances sparse autoencoder performance for instruct models by aligning training with their data distribution and activation patterns, achieving superior reconstruction accuracy and feature interpretability while revealing new opportunities for model behavior control.
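For context, a sparse autoencoder of the kind FAST trains reduces to a linear encoder/decoder with an L1 sparsity penalty; this sketch uses random tensors in place of instruct-model activations and omits FAST's sequential data alignment.

```python
# Minimal sketch of a sparse autoencoder over residual-stream
# activations. Illustrative only; FAST's data ordering and alignment
# to instruct-model activations are not reproduced here.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_hidden=4 * 768):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))     # sparse, interpretable features
        return self.dec(f), f

sae = SparseAutoencoder()
acts = torch.randn(32, 768)             # stand-in for LLM activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # MSE + L1
print(loss.item())
```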
Authors:Yunhe Yan, Shihe Wang, Jiajun Du, Yexuan Yang, Yuxuan Shan, Qichen Qiu, Xianqing Jia, Xinge Wang, Xin Yuan, Xu Han, Mao Qin, Yinxiao Chen, Chen Peng, Shangguang Wang, Mengwei Xu
Abstract:
(M)LLM-powered computer use agents (CUA) are emerging as a transformative technique to automate human-computer interaction. However, existing CUA benchmarks predominantly target GUI agents, whose evaluation methods are susceptible to UI changes and ignore function interactions exposed by application APIs, e.g., Model Context Protocol (MCP). To this end, we propose MCPWorld, the first automatic CUA testbed for API, GUI, and API-GUI hybrid agents. A key principle of MCPWorld is the use of "white-box apps", i.e., those whose source code is available and can be revised/re-compiled as needed (e.g., adding MCP support), with two notable advantages:
(1) It greatly broadens the design space of CUA, such as what and how the app features to be exposed/extracted as CUA-callable APIs.
(2) It allows MCPWorld to programmatically verify task completion by directly monitoring application behavior through techniques like dynamic code instrumentation, offering robust, accurate CUA evaluation decoupled from specific agent implementations or UI states.
Currently, MCPWorld includes 201 well-curated and annotated user tasks, covering diversified use cases and difficulty levels. MCPWorld is also fully containerized with GPU acceleration support for flexible adoption on different OS/hardware environments. Our preliminary experiments, using a representative LLM-powered CUA framework, achieve 75.12% task completion accuracy, simultaneously providing initial evidence on the practical effectiveness of agent automation leveraging MCP. Overall, we anticipate MCPWorld to facilitate and standardize the benchmarking of next-generation computer use agents that can leverage rich external tools. Our code and dataset are publicly available at https://github.com/SAAgent/MCPWorld.
中文: MCPWorld是首个针对API、GUI及混合模式计算机使用代理的自动化测试平台,通过白盒应用实现可靠的任务验证并扩展代理功能,推动下一代人机交互代理的标准化评测。
English: MCPWorld is the first automated testbed designed for evaluating computer use agents across API, GUI, and hybrid environments, utilizing white-box applications to enable robust task verification and broaden agent capabilities.
Authors:Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu
Abstract:
Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose to extract structural information with generated problem-solving code from mathematical reasoning and guide data generation with structured solutions. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will contribute to future research in enhancing LLM reasoning capabilities. Our code and data are available at https://github.com/OpenCausaLab/StructuralGeneration.
中文摘要:本研究提出通过生成带标注步骤的结构化数据来增强大语言模型的数学推理能力,创建了包含3.9万问题数据集和6.1千问题的高难度基准测试,实验表明模型性能随推理链增长而下降,微调结果验证了该数据集的有效性。
English Summary: This study introduces a method to improve LLM mathematical reasoning by generating structured data with labeled steps, creating a 39K-problem dataset and a 6.1K-problem benchmark that shows performance declines with longer reasoning chains, with fine-tuning experiments confirming the dataset's effectiveness.
Authors:Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, Ngai Wong
Abstract:
While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comments generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at https://github.com/YuanChang98/tree-review.
中文摘要:TreeReview框架通过将论文评审建模为分层双向问答过程,在提升评审深度和全面性的同时,比传统方法减少高达80%的计算资源消耗。
English Summary: TreeReview is a hierarchical framework that enhances peer review efficiency by decomposing it into a bidirectional question-answering process, achieving comprehensive feedback while reducing computational costs by up to 80%.
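The bidirectional question-tree resolution can be sketched as a simple recursion; `ask_llm` below is a hypothetical placeholder for the actual model call.

```python
# Minimal sketch of the question-tree idea: decompose a review question
# into sub-questions, answer leaves, and aggregate answers bottom-up.
def ask_llm(prompt: str) -> str:
    return f"<answer to: {prompt}>"     # placeholder for a real LLM call

def resolve(question: str, subquestions: dict) -> str:
    children = subquestions.get(question, [])
    if not children:                    # leaf: answer directly
        return ask_llm(question)
    # Internal node: aggregate child answers into the parent's answer.
    child_answers = [resolve(q, subquestions) for q in children]
    return ask_llm(f"Given {child_answers}, answer: {question}")

tree = {"Is the paper sound?": ["Are the proofs correct?",
                                "Are the experiments adequate?"]}
print(resolve("Is the paper sound?", tree))
```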
Authors:Yuchong Long, Wen Sun, Ningxiao Sun, Wenxiao Wang, Chao Li, Shan Yin
Abstract:
Automated pollen recognition is vital to paleoclimatology, biodiversity monitoring, and public health, yet conventional methods are hampered by inefficiency and subjectivity. Existing deep learning models often struggle to achieve the requisite localization accuracy for microscopic targets like pollen, which are characterized by their minute size, indistinct edges, and complex backgrounds. To overcome this limitation, we introduce HieraEdgeNet, a multi-scale edge-enhancement framework. The framework's core innovation is the introduction of three synergistic modules: the Hierarchical Edge Module (HEM), which explicitly extracts a multi-scale pyramid of edge features that corresponds to the semantic hierarchy at early network stages; the Synergistic Edge Fusion (SEF) module, for deeply fusing these edge priors with semantic information at each respective scale; and the Cross Stage Partial Omni-Kernel Module (CSPOKM), which maximally refines the most detail-rich feature layers using an Omni-Kernel operator (comprising anisotropic large-kernel convolutions and mixed-domain attention), all within a computationally efficient Cross-Stage Partial (CSP) framework. On a large-scale dataset comprising 120 pollen classes, HieraEdgeNet achieves a mean Average Precision (mAP@.5) of 0.9501, significantly outperforming state-of-the-art baseline models such as YOLOv12n and RT-DETR. Furthermore, qualitative analysis confirms that our approach generates feature representations that are more precisely focused on object boundaries. By systematically integrating edge information, HieraEdgeNet provides a robust and powerful solution for high-precision, high-efficiency automated detection of microscopic objects.
中文: HieraEdgeNet提出了一种多尺度边缘增强框架,通过三个协同模块解决深度学习在定位微观花粉时的局限性,在大规模数据集上实现了卓越的准确性和效率。
English: HieraEdgeNet introduces a multi-scale edge-enhancement framework with three synergistic modules to overcome deep learning challenges in localizing microscopic pollen, achieving superior accuracy and efficiency on a large-scale dataset.
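A rough analogue of the hierarchical edge extraction is a fixed Sobel pyramid; HEM learns its kernels, so this PyTorch sketch only conveys the multi-scale-edge idea, not the paper's module.

```python
# Minimal sketch of a multi-scale edge pyramid with fixed Sobel filters,
# loosely mirroring what HEM extracts with learned kernels.
import torch
import torch.nn.functional as F

sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
sobel = torch.stack([sobel_x, sobel_x.t()]).unsqueeze(1)  # [2,1,3,3]

def edge_pyramid(gray, levels=3):
    """gray: [B,1,H,W] image; returns edge-magnitude maps at `levels` scales."""
    maps = []
    for _ in range(levels):
        g = F.conv2d(gray, sobel, padding=1)               # dx, dy responses
        maps.append(g.pow(2).sum(1, keepdim=True).sqrt())  # gradient magnitude
        gray = F.avg_pool2d(gray, 2)                       # next coarser scale
    return maps

pyr = edge_pyramid(torch.rand(1, 1, 64, 64))
print([m.shape for m in pyr])
```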
Authors:Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, Yuxiao Dong
Abstract:
Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, have offered end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address this issue, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for building SWE-Dev. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models can achieve top performance among all open SWE agents. Specifically, the success rates of the SWE-Dev 7B and 32B parameter models reach 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at https://github.com/THUDM/SWE-Dev.
Chinese: 大型语言模型正从对话任务转向现实世界的软件工程应用,而基于开源LLM构建的SWE-Dev代理,通过合成测试用例和扩展训练数据,在公开SWE代理中表现最佳,其70亿和320亿参数模型的成功率分别达到23.4%和36.6%。
English: Large language models are advancing from conversational tasks to real-world software engineering applications, and the SWE-Dev agent, built on open-source LLMs with synthesized test cases and scaled training data, achieves top performance among open SWE agents with success rates of 23.4% and 36.6% for its 7B and 32B models, respectively.
Authors:Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li
Abstract:
Generating music with coherent structure, harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $\textbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https://cypress-yang.github.io/SongBloom_demo. The code and model weights have been released on https://github.com/Cypress-Yang/SongBloom.
中文摘要:SongBloom是一种创新框架,通过结合自回归草图和扩散式精修的方法来生成长篇歌曲,在保持全局连贯性与局部保真度方面表现优异,其性能超越现有方法并达到业界领先商业平台水平。
English Summary: SongBloom is a novel framework that combines autoregressive sketching and diffusion-based refinement to generate full-length songs with superior coherence and musicality, outperforming existing methods and matching state-of-the-art commercial platforms.
Authors:Roman Kyslyi, Yuliia Maksymiuk, Ihor Pysmennyi
Abstract:
In this paper we introduce the first effort to adapt large language models (LLMs) to the Ukrainian dialect (in our case Hutsul), a low-resource and morphologically complex dialect spoken in the Carpathian Highlands. We created a parallel corpus of 9852 dialect-to-standard Ukrainian sentence pairs and a dictionary of 7320 dialectal word mappings. We also addressed data shortage by proposing an advanced Retrieval-Augmented Generation (RAG) pipeline to generate synthetic parallel translation pairs, expanding the corpus with 52142 examples. We have fine-tuned multiple open-source LLMs using LoRA and evaluated them on a standard-to-dialect translation task, also comparing with few-shot GPT-4o translation. In the absence of human annotators, we adopt a multi-metric evaluation strategy combining BLEU, chrF++, TER, and LLM-based judgment (GPT-4o). The results show that even small (7B) fine-tuned models outperform zero-shot baselines such as GPT-4o across both automatic and LLM-evaluated metrics. All data, models, and code are publicly released at: https://github.com/woters/vuyko-hutsul
Chinese: 本文首次将大语言模型适配于低资源的胡楚尔方言,通过构建平行语料库和采用检索增强生成技术扩充数据,实验表明微调后的小型模型在翻译任务中优于GPT-4o。
English: This paper presents the first adaptation of large language models to the low-resource Hutsul dialect by creating parallel datasets and employing a RAG pipeline for data augmentation, with fine-tuned small models outperforming GPT-4o in translation tasks.
Authors:Fabian Lander, Diaaeldin Taha
Abstract:
We present an effective method for visualizing flat surfaces using ray marching. Our approach provides an intuitive way to explore translation surfaces, mirror rooms, unfolded polyhedra, and translation prisms while maintaining computational efficiency. We demonstrate the utility of the method through various examples and provide implementation insights for programmers. Finally, we discuss the use of our visualizations in outreach. We make our simulations and code available online.
Chinese: 本文提出了一种利用光线行进有效可视化平坦表面的方法,能够直观地探索平移曲面和多面体等几何结构,同时保持计算效率。
English: This paper introduces an efficient ray marching method for visualizing flat surfaces, enabling intuitive exploration of geometric structures like translation surfaces and polyhedra while maintaining computational performance.
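The core marching loop is easy to convey in 2D: a ray in a unit-square mirror room advances to the nearest wall and reflects, which is equivalent to traveling straight through the square's unfolding. A minimal sketch, with illustrative starting values:

```python
# Minimal 2D sketch of ray tracing in a unit-square mirror room.
# Reflecting at each wall is equivalent to walking a straight line
# through an unfolded grid of copies of the square.
import math

def trace(x, y, angle, steps=5, eps=1e-9):
    dx, dy = math.cos(angle), math.sin(angle)
    for _ in range(steps):
        # Distance along the ray to the next vertical / horizontal wall.
        tx = ((1 - x) if dx > 0 else x) / (abs(dx) + eps)
        ty = ((1 - y) if dy > 0 else y) / (abs(dy) + eps)
        t = min(tx, ty)
        x, y = x + t * dx, y + t * dy
        if tx < ty:
            dx = -dx            # hit a vertical wall: reflect x-direction
        else:
            dy = -dy            # hit a horizontal wall: reflect y-direction
        print(f"bounce at ({x:.3f}, {y:.3f})")

trace(0.2, 0.3, angle=0.9)
```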
Authors:Mengsong Wu, Di Zhang, Yuqiang Li, Dongzhan Zhou, Wenliang Chen
Abstract:
While Large Language Models (LLMs) have achieved remarkable success in a wide range of applications, their performance often degrades in complex reasoning tasks. In this work, we introduce SELT (Self-Evaluation LLM Tree Search), a novel framework that leverages a modified Monte Carlo Tree Search (MCTS) to enhance LLM reasoning without relying on external reward models. By redefining the Upper Confidence Bound scoring to align with intrinsic self-evaluation capabilities of LLMs and decomposing the inference process into atomic subtasks augmented with semantic clustering at each node, SELT effectively balances exploration and exploitation, reduces redundant reasoning paths, and mitigates hallucination. We validate our approach on challenging benchmarks, including the knowledge-based MMLU and the Tool Learning dataset Seal-Tools, where SELT achieves significant improvements in answer accuracy and reasoning robustness compared to baseline methods. Notably, our framework operates without task-specific fine-tuning, demonstrating strong generalizability across diverse reasoning tasks. Relevant results and code are available at https://github.com/fairyshine/SELT.
中文: SELT是一种创新框架,通过将自评估与改进的蒙特卡洛树搜索相结合,无需任务特定微调即可提升大语言模型在复杂推理任务中的准确性和鲁棒性。
English: SELT is a novel framework that enhances LLM reasoning by integrating self-evaluation with a modified Monte Carlo Tree Search, improving accuracy and robustness across complex tasks without task-specific fine-tuning.
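The modified UCB selection can be sketched as follows, with a node's accumulated value coming from the LLM's own self-evaluation scores rather than an external reward model; the exact scoring used in SELT may differ.

```python
# Minimal sketch of UCB-based child selection where node values are
# accumulated self-evaluation scores in [0, 1] that the LLM assigns to
# its own partial reasoning. Illustrative analogue, not SELT's formula.
import math

def ucb(node_value, visits, parent_visits, c=1.4):
    if visits == 0:
        return float("inf")             # always expand unvisited nodes
    return node_value / visits + c * math.sqrt(
        math.log(parent_visits) / visits)

candidates = [(3.2, 5), (1.0, 1), (0.0, 0)]   # (sum of self-evals, visits)
best = max(candidates, key=lambda nv: ucb(nv[0], nv[1], parent_visits=6))
print("selected child:", best)
```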
Authors:Jingchao Wang, Haote Yang, Jiang Wu, Yifan He, Xingjian Wei, Yinfan Wang, Chengjin Liu, Lingli Ge, Lijun Wu, Bin Wang, Dahua Lin, Conghui He
Abstract:
Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge by converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism that emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of Faithfully Recognize What You've Seen, which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for a fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in scenarios involving molecular images with functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points, both in SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to more effectively meet real-world needs, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at https://github.com/opendatalab/GTR-CoT.
中文摘要:GTR-Mol-VLM通过引入图遍历视觉思维链和“忠实识别所见”的数据中心原则,解决了光学化学结构识别中的关键难题,并借助大规模数据集和新基准实现了卓越性能。
English Summary: GTR-Mol-VLM introduces a novel framework with Graph Traversal as Visual Chain of Thought and a data-centric principle to overcome limitations in optical chemical structure recognition, achieving superior performance through a large-scale dataset and new benchmark.
Authors:Mengsong Wu, YaFei Wang, Yidong Ming, Yuqi An, Yuwei Wan, Wenliang Chen, Binbin Lin, Yuqiang Li, Tong Xie, Dongzhan Zhou
Abstract:
Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools, ranging from basic information retrieval to complex reaction prediction, and a dataset curation pipeline to generate the dataset ChemToolBench that facilitates both effective tool selection and precise parameter filling during fine-tuning and evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search (HE-MCTS) framework, enabling independent optimization of tool planning and execution. By leveraging self-generated data, our approach supports step-level fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM that surpass GPT-4o. Experimental evaluations demonstrate that our approach significantly improves performance in Chemistry QA and discovery tasks, offering a robust solution to integrate specialized tools with LLMs for advanced chemical applications. All datasets and code are available at https://github.com/AI4Chem/ChemistryAgent .
中文摘要:本研究提出了一种基于大语言模型的化学智能体,通过整合137种专业工具和新型分层进化蒙特卡洛树搜索框架,实现了化学问答与发现任务的性能突破,有效解决了专业工具与大模型融合的挑战。
English Summary: This study introduces a chemistry-focused LLM agent that integrates 137 specialized tools and a novel HE-MCTS framework to significantly enhance chemical QA and discovery tasks through optimized tool utilization and self-improving data generation.
Authors:Zhangchi Zhao, Jun Shu, Deyu Meng, Zongben Xu
Abstract:
Inspired by the Kolmogorov-Arnold representation theorem, KANs offer a novel framework for function approximation by replacing traditional neural network weights with learnable univariate functions. This design demonstrates significant potential as an efficient and interpretable alternative to traditional MLPs. However, KANs are characterized by a substantially larger number of trainable parameters, leading to challenges in memory efficiency and higher training costs compared to MLPs. To address this limitation, we propose to generate weights for KANs via a smaller meta-learner, called MetaKANs. By training KANs and MetaKANs in an end-to-end differentiable manner, MetaKANs achieve comparable or even superior performance while significantly reducing the number of trainable parameters and maintaining promising interpretability. Extensive experiments on diverse benchmark tasks, including symbolic regression, partial differential equation solving, and image classification, demonstrate the effectiveness of MetaKANs in improving parameter efficiency and memory usage. The proposed method provides an alternative technique for training KANs, that allows for greater scalability and extensibility, and narrows the training cost gap with MLPs reported in the original KAN paper. Our code is available at https://github.com/Murphyzc/MetaKAN.
中文: MetaKANs通过引入元学习器为KANs生成权重,在多种任务中显著减少了可训练参数和训练成本,同时保持或提升了性能与可解释性。
English: MetaKANs introduce a meta-learner to generate weights for Kolmogorov-Arnold Networks (KANs), significantly reducing trainable parameters and training costs while maintaining or enhancing performance and interpretability across various tasks.
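The weight-generation idea can be sketched with a toy meta-learner that emits per-edge coefficients of a univariate function (a polynomial here for simplicity, instead of KANs' splines); it illustrates the parameter savings, not the MetaKAN code.

```python
# Minimal sketch of weight generation: a small meta-learner emits the
# coefficients of each learnable univariate function instead of storing
# them all directly. Illustrative assumption: polynomial basis, not splines.
import torch
import torch.nn as nn

class TinyMetaLearner(nn.Module):
    def __init__(self, n_edges, n_coeffs=4, d_embed=16):
        super().__init__()
        self.edge_embed = nn.Embedding(n_edges, d_embed)  # one id per edge
        self.gen = nn.Linear(d_embed, n_coeffs)           # emits coefficients

    def forward(self, edge_ids, x):
        c = self.gen(self.edge_embed(edge_ids))           # [E, n_coeffs]
        # Evaluate each edge's univariate polynomial sum_k c_k * x^k.
        powers = torch.stack([x ** k for k in range(c.size(1))], dim=-1)
        return (c * powers).sum(-1)

meta = TinyMetaLearner(n_edges=8)
out = meta(torch.arange(8), torch.randn(8))
print(out.shape)   # one scalar per edge; far fewer trainable parameters
```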
Authors:Weiqiang Jin, Hongyang Du, Guizhong Liu, Dong In Kim
Abstract:
Multi-agent reinforcement learning (MARL) has achieved strong performance in cooperative adversarial tasks. However, most existing methods typically train agents against fixed opponent strategies and rely on these static difficulty conditions, which limits their adaptability to changing environments and often leads to suboptimal policies. Inspired by the success of curriculum learning (CL) in supervised tasks, we propose a dynamic CL framework for MARL that employs a self-adaptive difficulty adjustment mechanism. This mechanism continuously modulates opponent strength based on real-time agent training performance, allowing agents to progressively learn from easier to more challenging scenarios. However, the dynamic nature of CL introduces instability due to nonstationary environments and sparse global rewards. To address this challenge, we develop a Counterfactual Group Relative Policy Advantage (CGRPA), which is tightly coupled with the curriculum by providing intrinsic credit signals that reflect each agent's impact under evolving task demands. CGRPA constructs a counterfactual action advantage function that isolates each agent's contribution within group behavior, providing intrinsic rewards that enhance credit assignment and stabilize policy updates under non-stationary conditions throughout the curriculum. Extensive experiments demonstrate that our method improves both training stability and final performance, achieving competitive results against state-of-the-art methods. The code is available at https://github.com/NICE-HKU/CL2MARL-SMAC.
中文摘要:本文提出一种多智能体强化学习的动态课程学习框架,通过自适应调整对手难度并采用反事实优势函数,在非平稳环境中稳定训练并提升智能体性能。
English Summary: This paper introduces a dynamic curriculum learning framework for multi-agent reinforcement learning that adaptively adjusts opponent difficulty and employs a counterfactual advantage function to stabilize training and enhance agent performance in non-stationary environments.
Authors:Hongyu Wang, Chuyan Xiong, Ruiping Wang, Xilin Chen
Abstract:
Vision-Language-Action (VLA) models have shown impressive capabilities across a wide range of robotics manipulation tasks. However, their growing model size poses significant challenges for deployment on resource-constrained robotic systems. While 1-bit pretraining has proven effective for enhancing the inference efficiency of large language models with minimal performance loss, its application to VLA models remains underexplored. In this work, we present BitVLA, the first 1-bit VLA model for robotics manipulation, in which every parameter is ternary, i.e., {-1, 0, 1}. To further reduce the memory footprint of the vision encoder, we propose the distillation-aware training strategy that compresses the full-precision encoder to 1.58-bit weights. During this process, a full-precision encoder serves as a teacher model to better align latent representations. Despite the lack of large-scale robotics pretraining, BitVLA achieves performance comparable to the state-of-the-art model OpenVLA-OFT with 4-bit post-training quantization on the LIBERO benchmark, while consuming only 29.8% of the memory. These results highlight BitVLA's promise for deployment on memory-constrained edge devices. We release the code and model weights in https://github.com/ustcwhy/BitVLA.
中文:BitVLA推出了首个用于机器人技术的1比特视觉语言动作模型,通过三值参数和蒸馏感知压缩技术实现了顶尖效率,同时在操作任务上保持优异性能。
English: BitVLA introduces the first 1-bit Vision-Language-Action model for robotics, achieving state-of-the-art efficiency with ternary parameters and distillation-aware compression while maintaining competitive performance on manipulation tasks.
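BitVLA's ternary parameters follow the BitNet b1.58 line of work, where absmean scaling maps full-precision weights to {-1, 0, 1}; a minimal sketch of that style of quantizer, not BitVLA's training code:

```python
# Minimal sketch of absmean ternary quantization in the BitNet b1.58
# style: weights are scaled by their mean absolute value, rounded to
# {-1, 0, 1}, and the scale is kept for dequantization.
import torch

def ternary_quantize(w, eps=1e-8):
    scale = w.abs().mean().clamp(min=eps)        # absmean scaling factor
    q = (w / scale).round().clamp(-1, 1)         # ternary {-1, 0, 1}
    return q, scale

w = torch.randn(4, 4)
q, s = ternary_quantize(w)
print(q)                                         # every entry is -1, 0, or 1
print("dequantization error:", (q * s - w).abs().mean().item())
```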
Authors:Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, Jianwei Yu, Yixuan Zhang, Chenyu Yang, Haina Zhu, Shuai Wang, Zhiyong Wu, Dong Yu
Abstract:
Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeLM is capable of parallelly modeling two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and DPO post-training. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics. Ablation studies further justify the effectiveness of our designs. Audio examples are available at https://levo-demo.github.io/. Code is released at https://github.com/tencent-ailab/songgeneration.
中文: LeVo是一种基于语言模型的创新框架,通过建模混合和双轨token实现人声与伴奏的和谐,并采用多偏好对齐方法提升音乐性和指令跟随能力,在评估中优于现有方法。
English: LeVo is an innovative LM-based framework that enhances lyrics-to-song generation by modeling mixed and dual-track tokens for vocal-instrument harmony and employs a multi-preference alignment method to improve musicality and instruction following, outperforming existing methods in evaluations.
Authors:Shuqiang Zhang, Yuchao Zhang, Jinkun Chen, Haochen Sui
Abstract:
Recommendation systems (RS) aim to provide personalized content, but they face a challenge in unbiased learning due to selection bias, where users only interact with items they prefer. This bias leads to a distorted representation of user preferences, which hinders the accuracy and fairness of recommendations. To address the issue, various methods such as error imputation based, inverse propensity scoring, and doubly robust techniques have been developed. Despite the progress, from the structural causal model perspective, previous debiasing methods in RS assume the independence of the exogenous variables. In this paper, we relax this assumption and propose a learning algorithm based on likelihood maximization to learn a prediction model. We first discuss the correlation and difference between unmeasured confounding and our scenario, then we propose a unified method that effectively handles latent exogenous variables. Specifically, our method models the data generation process with latent exogenous variables under mild normality assumptions. We then develop a Monte Carlo algorithm to numerically estimate the likelihood function. Extensive experiments on synthetic datasets and three real-world datasets demonstrate the effectiveness of our proposed method. The code is at https://github.com/WallaceSUI/kdd25-background-variable.
中文: 本文针对推荐系统中的选择偏差问题,提出了一种基于似然最大化的新型学习算法,通过蒙特卡洛方法建模潜在外生变量,并在合成与真实数据集上验证了其有效性。
English: This paper addresses selection bias in recommendation systems by proposing a novel learning algorithm that models latent exogenous variables using likelihood maximization and a Monte Carlo estimation method, demonstrating effectiveness through synthetic and real-world datasets.
Authors:Solee Im, Wonjun Lee, Jinmyeong An, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee
Abstract:
We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing. Our source code is publicly available at: https://github.com/solee0022/deragec
Chinese: DeRAGEC通过合成去噪原理过滤噪声命名实体候选,并利用语音相似性和上下文学习进行优化,无需额外训练即可显著降低ASR系统的词错率,实现28%的相对改进。
English: DeRAGEC enhances Named Entity correction in ASR systems by filtering noisy candidates with synthetic denoising rationales and refining them through phonetic similarity and in-context learning, achieving a 28% relative WER reduction without extra training.
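The phonetic filtering step can be approximated with a simple string-similarity gate; `difflib` below is a crude stand-in for a real phonetic distance, and DeRAGEC additionally uses synthetic rationales and augmented definitions.

```python
# Minimal sketch of filtering retrieved NE candidates by similarity to
# the ASR hypothesis. String similarity stands in for phonetic distance.
import difflib

def filter_candidates(asr_span, candidates, threshold=0.6):
    keep = []
    for c in candidates:
        sim = difflib.SequenceMatcher(None, asr_span.lower(), c.lower()).ratio()
        if sim >= threshold:
            keep.append((c, round(sim, 2)))
    return sorted(keep, key=lambda x: -x[1])

print(filter_candidates("jon stuart",
                        ["Jon Stewart", "John Stuart Mill", "Stewart Island"]))
```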
Authors:Shoon Kit Lim, Melissa Jia Ying Chong, Jing Huey Khor, Ting Yang Ling
Abstract:
Recent advances in agentic and physical artificial intelligence (AI) have largely focused on ground-based platforms such as humanoid and wheeled robots, leaving aerial robots relatively underexplored. Meanwhile, state-of-the-art unmanned aerial vehicle (UAV) multimodal vision-language systems typically rely on closed-source models accessible only to well-resourced organizations. To democratize natural language control of autonomous drones, we present an open-source agentic framework that integrates PX4-based flight control, Robot Operating System 2 (ROS 2) middleware, and locally hosted models using Ollama. We evaluate performance both in simulation and on a custom quadcopter platform, benchmarking four large language model (LLM) families for command generation and three vision-language model (VLM) families for scene understanding.
Chinese: 近期智能体和物理人工智能的进展主要集中于地面机器人,而空中机器人领域相对滞后且多依赖闭源模型,为此我们提出了一个开源框架,通过整合飞行控制、中间件和本地托管模型,旨在实现自主无人机的自然语言控制民主化。
English: Recent progress in agentic and physical AI has primarily targeted ground-based robots, while aerial systems remain underdeveloped and often rely on proprietary models, prompting the introduction of an open-source framework for democratizing natural language control of autonomous drones through integrated flight control, middleware, and locally hosted models.
Authors:Haotian Guo, Jing Han, Yongfeng Tu, Shihao Gao, Shengfan Shen, Wulong Xiang, Weihao Gan, Zixing Zhang
Abstract:
Despite extensive research on textual and visual disambiguation, disambiguation through speech (DTS) remains underexplored. This is largely due to the lack of high-quality datasets that pair spoken sentences with richly ambiguous text. To address this gap, we present DEBATE, a unique public Chinese speech-text dataset designed to study how speech cues and patterns (pronunciation, pause, stress, and intonation) can help resolve textual ambiguity and reveal a speaker's true intent. DEBATE contains 1,001 carefully selected ambiguous utterances, each recorded by 10 native speakers, capturing diverse linguistic ambiguities and their disambiguation through speech. We detail the data collection pipeline and provide rigorous quality analysis. Additionally, we benchmark three state-of-the-art large speech and audio-language models, illustrating clear and huge performance gaps between machine and human understanding of spoken intent. DEBATE represents the first effort of its kind and offers a foundation for building similar DTS datasets across languages and cultures. The dataset and associated code are available at: https://github.com/SmileHnu/DEBATE.
Chinese: 本文推出了首个中文语音-文本数据集DEBATE,旨在研究语音特征如何消解文本歧义,揭示了人类与机器在口语意图理解上的显著性能差距。
English: This paper introduces DEBATE, the first public Chinese speech-text dataset designed to explore how speech cues resolve textual ambiguity, revealing significant performance gaps between human and machine understanding of spoken intent.
Authors:Libo Wang
Abstract:
Each subchain in the chain-of-model (CoM) paradigm relies only on information from the preceding subchain and can lose long-range dependencies because the causal mask blocks global context flow between multi-level subchains. To address this, this work proposes a graph of causal evolution (GoCE). Its core principle is to map implicit token representations into a differentiable, sparse causal adjacency matrix, then propagate causal constraints through each layer of computation using causal-masked attention and causal-MoE. By combining an intervention consistency loss with a self-evolution gate, GoCE achieves a dynamic balance between causal structure learning and adaptive updating of the transformer architecture. Experimental environments were built in sandboxes based on Claude Sonnet 4, o4-mini-high, and DeepSeek R1, each running the transformer-variant architecture introduced with GoCE. The method is evaluated on publicly available datasets including CLUTRR, CLADDER, EX-FEVER, and CausalQA and compared with baseline LLMs. The results show that GoCE strengthens the transformer's ability to capture long-range causal dependencies while improving its capacity to self-evolve. It not only surpasses CoM in its design principles but also offers guidance for future research on causal learning and continuous adaptive improvement.
中文: 本研究提出因果演化图(GoCE)以解决链式模型中因因果掩码导致长程依赖丢失的问题,通过将隐式表征映射为稀疏因果邻接矩阵并结合因果注意力机制,增强了Transformer捕捉长程因果依赖与自我演化的能力,其性能优于基线模型。
English: This work introduces the Graph of Causal Evolution (GoCE) to address the loss of long-range dependencies in chain-of-model architectures by mapping token representations into a sparse causal adjacency matrix and integrating causal constraints through attention mechanisms, ultimately enhancing transformers' ability to capture causal relationships and self-evolve beyond baseline models.
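The central object, a differentiable sparse causal adjacency over tokens, can be sketched as follows; the parameterization is an illustrative assumption, not the paper's exact construction.

```python
# Minimal sketch of a differentiable sparse adjacency over tokens:
# score pairwise links, sparsify with a biased sigmoid, keep only
# causal (non-future) edges, and use the result as an attention bias.
import torch

def causal_adjacency(h, tau=1.0, sparsity_bias=-2.0):
    """h: [seq, d] token representations -> [seq, seq] soft adjacency."""
    scores = h @ h.t() / h.size(-1) ** 0.5 + sparsity_bias  # bias toward 0
    adj = torch.sigmoid(scores / tau)
    return torch.tril(adj)              # drop edges pointing to the future

h = torch.randn(6, 32)
A = causal_adjacency(h)
attn_bias = torch.log(A + 1e-9)         # add to attention logits as a mask
print((A > 0.5).float().mean().item(), attn_bias.shape)
```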
Authors:Dasol Hong, Wooju Lee, Hyun Myung
Abstract:
Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptations. The core challenge in prompt tuning is improving specialization for a specific task and generalization for unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization. To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware weights (CoA-weights), which adjust the weights of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-weights, outperforms state-of-the-art methods by enhancing specialization and generalization. Our code is publicly available at https://github.com/url-kaist/CoCoA-Mix.
Chinese: 提示调优通过仅优化提示来有效适配视觉语言模型,但面临特征错位和泛化难题;提出的CoCoA-Mix方法结合混淆感知损失和权重,在提升专业化和泛化能力方面均优于现有最优方法。
English: Prompt tuning effectively adapts vision-language models by optimizing only the prompt, but faces challenges with misaligned features and generalization; the proposed CoCoA-Mix method, incorporating confusion-aware loss and weights, enhances both specialization and generalization, outperforming state-of-the-art approaches.
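One way to picture a confusion-aware penalty is a margin term between the true class and its strongest rival on top of cross-entropy; this is an illustrative analogue, not the paper's exact CoA-loss.

```python
# Minimal sketch of a confusion-aware penalty: on top of cross-entropy,
# penalize small logit margins between the true class and its strongest
# rival, sharpening boundaries between confusing classes.
import torch
import torch.nn.functional as F

def confusion_aware_loss(logits, labels, margin=1.0, beta=0.5):
    ce = F.cross_entropy(logits, labels)
    true = logits.gather(1, labels[:, None]).squeeze(1)
    rival = logits.scatter(1, labels[:, None], float("-inf")).max(1).values
    # Hinge on the margin: pushes the most-confused pair apart.
    return ce + beta * F.relu(margin - (true - rival)).mean()

logits = torch.randn(8, 10, requires_grad=True)
loss = confusion_aware_loss(logits, torch.randint(0, 10, (8,)))
loss.backward()
print(loss.item())
```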
Authors:Haoyuan Li, Rui Zhang, Snigdha Chaturvedi
Abstract:
Fairness in multi-document summarization (MDS) is crucial for providing comprehensive views across documents with diverse social attribute values, which can significantly impact decision-making. For example, a summarization system that tends to overrepresent negative reviews of products can mislead customers into disregarding good products. Previous works measure fairness in MDS at two levels: summary-level and corpus-level. While summary-level fairness focuses on individual summaries, corpus-level fairness focuses on a corpus of summaries. Recent methods primarily focus on summary-level fairness. We propose FairPO, a preference tuning method that focuses on both summary-level and corpus-level fairness in MDS. To improve summary-level fairness, we propose to generate preference pairs by perturbing document sets. To improve corpus-level fairness, we propose fairness-aware preference tuning by dynamically adjusting the weights of preference pairs. Our experiments show that FairPO outperforms strong baselines while maintaining the critical qualities of summaries. The code is available at https://github.com/leehaoyuan/coverage_fairnes.
中文: FairPO是一种新颖的偏好调优方法,通过生成扰动偏好对并动态调整其权重,在保持摘要质量的同时显著提升了多文档摘要的摘要级和语料库级公平性,性能优于现有基线。
English: FairPO is a novel preference tuning method that enhances both summary-level and corpus-level fairness in multi-document summarization by generating perturbed preference pairs and dynamically adjusting their weights, outperforming baselines while preserving summary quality.
Authors:Thomas Zhu, Joshua Clune, Jeremy Avigad, Albert Qiaochu Jiang, Sean Welleck
Abstract:
Neural methods are transforming automated reasoning for proof assistants, yet integrating these advances into practical verification workflows remains challenging. Hammers are tools that interface with external automatic theorem provers to automate tedious reasoning steps. They have dramatically improved productivity in proof assistants, but the Lean proof assistant still does not have a hammer despite its growing popularity. We present LeanHammer, the first end-to-end domain-general hammer for Lean, built on a novel neural premise selection system for a hammer in dependent type theory. Unlike existing Lean premise selectors, our approach dynamically adapts to user-specific contexts and combines with symbolic proof search and reconstruction to create a practical hammer. With comprehensive evaluations, we show that our premise selector enables LeanHammer to solve 21\% more goals relative to existing premise selectors, and generalize well to diverse domains. Our work bridges the gap between neural retrieval and symbolic reasoning, making formal verification more accessible to researchers and practitioners.
中文摘要:LeanHammer是首个面向Lean证明助手的端到端通用锤子系统,其采用新型神经前提选择技术,能动态适应用户上下文并与符号证明搜索相结合,显著提升了定理自动证明能力。
English Summary: LeanHammer is the first comprehensive hammer for the Lean proof assistant, featuring a novel neural premise selection system that adapts to user contexts and integrates with symbolic proof search to significantly enhance automated theorem proving capabilities.
Authors:Jie Peng, Hongwei Yang, Jing Zhao, Hengji Dong, Hui He, Weizhe Zhang, Haoyu He
Abstract:
Deep neural networks are vulnerable to backdoor attacks, where malicious behaviors are implanted during training. While existing defenses can effectively purify compromised models, they typically require labeled data or specific training procedures, making them difficult to apply beyond supervised learning settings. Notably, recent studies have shown successful backdoor attacks across various learning paradigms, highlighting a critical security concern. To address this gap, we propose Two-stage Symmetry Connectivity (TSC), a novel backdoor purification defense that operates independently of data format and requires only a small fraction of clean samples. Through theoretical analysis, we prove that by leveraging permutation invariance in neural networks and quadratic mode connectivity, TSC amplifies the loss on poisoned samples while maintaining bounded clean accuracy. Experiments demonstrate that TSC achieves robust performance comparable to state-of-the-art methods in supervised learning scenarios. Furthermore, TSC generalizes to self-supervised learning frameworks, such as SimCLR and CLIP, maintaining its strong defense capabilities. Our code is available at https://github.com/JiePeng104/TSC.
中文: 提出的两阶段对称连通性(TSC)防御方法利用少量干净样本和理论原理,有效清除受后门攻击的模型,在监督学习和自监督学习框架中均展现出强大的防御性能。
English: The proposed Two-stage Symmetry Connectivity (TSC) defense effectively purifies backdoor-infected models using minimal clean samples and theoretical principles, demonstrating robust performance in both supervised and self-supervised learning frameworks.
Authors:Vahid Azizi, Fatemeh Koochaki
Abstract:
Recent advances in Large Language Models (LLMs) have driven their adoption in recommender systems through Retrieval-Augmented Generation (RAG) frameworks. However, existing RAG approaches predominantly rely on flat, similarity-based retrieval that fails to leverage the rich relational structure inherent in user-item interactions. We introduce LlamaRec-LKG-RAG, a novel single-pass, end-to-end trainable framework that integrates personalized knowledge graph context into LLM-based recommendation ranking. Our approach extends the LlamaRec architecture by incorporating a lightweight user preference module that dynamically identifies salient relation paths within a heterogeneous knowledge graph constructed from user behavior and item metadata. These personalized subgraphs are seamlessly integrated into prompts for a fine-tuned Llama-2 model, enabling efficient and interpretable recommendations through a unified inference step. Comprehensive experiments on ML-100K and Amazon Beauty datasets demonstrate consistent and significant improvements over LlamaRec across key ranking metrics (MRR, NDCG, Recall). LlamaRec-LKG-RAG demonstrates the critical value of structured reasoning in LLM-based recommendations and establishes a foundation for scalable, knowledge-aware personalization in next-generation recommender systems. Code is available at~\href{https://github.com/VahidAz/LlamaRec-LKG-RAG}{repository}.
中文摘要:本文提出LlamaRec-LKG-RAG框架,通过将个性化知识图谱融入基于大语言模型的推荐系统,利用结构化推理显著提升了推荐排序性能。
English Summary: The paper introduces LlamaRec-LKG-RAG, a novel framework that integrates personalized knowledge graphs into LLM-based recommender systems, demonstrating improved ranking performance through structured reasoning.
Authors:Alexander Kolpakov, Igor Rivin
Abstract:
Computing classical centrality measures such as betweenness and closeness is computationally expensive on large-scale graphs. In this work, we introduce an efficient force layout algorithm that embeds a graph into a low-dimensional space, where the radial distance from the origin serves as a proxy for various centrality measures. We evaluate our method on multiple graph families and demonstrate strong correlations with degree, PageRank, and path-based centralities. As an application, the proposed embedding makes it possible to find high-influence nodes in a network, providing a fast and scalable alternative to the standard greedy algorithm.
中文: 本研究提出一种高效的力导向布局算法,将图嵌入低维空间,利用径向距离作为中心性度量的替代指标,为识别网络中有影响力的节点提供了一种快速且可扩展的解决方案。
English: The study presents an efficient force layout algorithm that embeds graphs into low-dimensional space, using radial distance as a proxy for centrality measures and offering a scalable alternative for identifying influential nodes.
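As a rough illustration of the idea, the sketch below uses networkx's built-in spring (force-directed) layout as a stand-in for the paper's force layout algorithm and checks how well radial distance tracks standard centralities; the graph and correlation targets here are illustrative, not the authors' setup.

    import networkx as nx
    import numpy as np
    from scipy.stats import spearmanr

    G = nx.barabasi_albert_graph(500, 3, seed=0)

    # Embed the graph in 2D with a force layout, centered at the origin.
    pos = nx.spring_layout(G, dim=2, seed=0, center=(0.0, 0.0))
    radial = {v: np.linalg.norm(p) for v, p in pos.items()}

    # Compare radial distance against standard centrality measures.
    for name, cent in [("degree", nx.degree_centrality(G)),
                       ("pagerank", nx.pagerank(G))]:
        rho, _ = spearmanr([radial[v] for v in G], [cent[v] for v in G])
        print(f"{name}: Spearman rho = {rho:.3f}")

Because force layouts tend to pull hubs toward the center, high centrality typically shows up as a small radial distance, i.e., a strong negative rank correlation.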
Authors:Janghyeon Yun, Sang-goo Lee
Abstract:
Text-to-SQL enables non-experts to retrieve data from databases by converting natural language queries into SQL. However, state-of-the-art text-to-SQL studies rely on the BIRD dataset, which assumes that evidence is provided along with questions. Although BIRD facilitates research advancements, it assumes that users have expertise and domain knowledge, contradicting the fundamental goal of text-to-SQL. In addition, human-generated evidence in BIRD contains defects, including missing or erroneous evidence, which affects model performance. To address this issue, we propose SEED (System for Evidence Extraction and Domain knowledge generation), an approach that automatically generates evidence to improve performance and practical usability in real-world scenarios. SEED systematically analyzes database schema, description files, and values to extract relevant information. We evaluated SEED on BIRD and Spider, demonstrating that it significantly improves SQL generation accuracy in the no-evidence scenario, and in some cases, even outperforms the setting where BIRD evidence is provided. Our results highlight that SEED-generated evidence not only bridges the gap between research and real-world deployment but also improves the adaptability and robustness of text-to-SQL models. Our code is available at https://github.com/felix01189/SEED
中文: SEED系统通过自动分析数据库结构生成证据,显著提升了无证据场景下文本转SQL的准确性,增强了模型在实际应用中的适应性和鲁棒性。
English: SEED is an automated evidence generation system that enhances text-to-SQL model performance by analyzing database components, achieving higher accuracy in no-evidence scenarios and improving real-world applicability.
Authors:Changsheng Gao, Wei Zhou, Guosheng Lin, Weisi Lin
Abstract:
The widespread deployment of large models in resource-constrained environments has underscored the need for efficient transmission of intermediate feature representations. In this context, feature coding, which compresses features into compact bitstreams, becomes a critical component for scenarios involving feature transmission, storage, and reuse. However, this compression process inevitably introduces semantic degradation that is difficult to quantify with traditional metrics. To address this, we formalize the research problem of Compressed Feature Quality Assessment (CFQA), aiming to evaluate the semantic fidelity of compressed features. To advance CFQA research, we propose the first benchmark dataset, comprising 300 original features and 12000 compressed features derived from three vision tasks and four feature codecs. Task-specific performance degradation is provided as true semantic distortion for evaluating CFQA metrics. We systematically assess three widely used metrics -- MSE, cosine similarity, and Centered Kernel Alignment (CKA) -- in terms of their ability to capture semantic degradation. Our findings demonstrate the representativeness of the proposed dataset while underscoring the need for more sophisticated metrics capable of measuring semantic distortion in compressed features. This work advances the field by establishing a foundational benchmark and providing a critical resource for the community to explore CFQA. To foster further research, we release the dataset and all associated source code at https://github.com/chansongoal/Compressed-Feature-Quality-Assessment.
中文: 本研究提出了压缩特征质量评估(CFQA)基准,旨在量化压缩特征的语义保真度,通过构建首个包含多任务多编码器的数据集,揭示了现有评估指标的不足,并为该领域研究提供了基础资源。
English: This study introduces the Compressed Feature Quality Assessment (CFQA) benchmark to evaluate the semantic fidelity of compressed features, highlighting the limitations of current metrics and providing a dataset and tools for advancing research in this area.
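For reference, the three baseline metrics assessed in the benchmark can be computed in a few lines. The sketch below operates on feature matrices of shape (num_samples, dim) and uses the linear variant of CKA; array names, sizes, and the toy codec simulation are illustrative.

    import numpy as np

    def mse(f_orig, f_comp):
        return np.mean((f_orig - f_comp) ** 2)

    def cosine_sim(f_orig, f_comp):
        num = np.sum(f_orig * f_comp, axis=1)
        den = np.linalg.norm(f_orig, axis=1) * np.linalg.norm(f_comp, axis=1)
        return np.mean(num / (den + 1e-12))

    def linear_cka(f_orig, f_comp):
        # Center features, then compare cross-covariance structure.
        x = f_orig - f_orig.mean(axis=0)
        y = f_comp - f_comp.mean(axis=0)
        hsic = np.linalg.norm(x.T @ y, "fro") ** 2
        return hsic / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

    rng = np.random.default_rng(0)
    orig = rng.normal(size=(300, 256))
    comp = orig + 0.1 * rng.normal(size=orig.shape)   # stand-in for codec output
    print(mse(orig, comp), cosine_sim(orig, comp), linear_cka(orig, comp))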
Authors:Philip R. Liu, Sparsh Bansal, Jimmy Dinh, Aditya Pawar, Ramani Satishkumar, Shail Desai, Neeraj Gupta, Xin Wang, Shu Hu
Abstract:
The integration of deep learning-based glaucoma detection with large language models (LLMs) presents an automated strategy to mitigate ophthalmologist shortages and improve clinical reporting efficiency. However, applying general LLMs to medical imaging remains challenging due to hallucinations, limited interpretability, and insufficient domain-specific medical knowledge, which can potentially reduce clinical accuracy. Although recent approaches combining imaging models with LLM reasoning have improved reporting, they typically rely on a single generalist agent, restricting their capacity to emulate the diverse and complex reasoning found in multidisciplinary medical teams. To address these limitations, we propose MedChat, a multi-agent diagnostic framework and platform that combines specialized vision models with multiple role-specific LLM agents, all coordinated by a director agent. This design enhances reliability, reduces hallucination risk, and enables interactive diagnostic reporting through an interface tailored for clinical review and educational use. Code available at https://github.com/Purdue-M2/MedChat.
中文总结:MedChat是一个多智能体诊断框架,通过结合专业视觉模型与角色定制的大语言模型,提高了青光眼检测的可靠性,减少幻觉现象,并提供交互式临床报告功能。
English Summary: MedChat is a multi-agent diagnostic framework that integrates specialized vision models with role-specific LLMs to enhance glaucoma detection reliability, reduce hallucinations, and provide interactive clinical reporting.
Authors:Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan
Abstract:
Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both $\textit{high-level, generalizable insights}$ that enable the system to leverage cross-trial knowledge, and $\textit{fine-grained, condensed interaction trajectories}$ that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to $20.89\%$ and $10.12\%$, respectively, without any modifications to the original frameworks. Our codes are available at https://github.com/bingreeky/GMemory.
中文: G-Memory作为一种层次化、智能的记忆系统,通过三层图结构管理多智能体系统的交互,无需修改原有框架即可显著提升任务成功率和准确性。
English: G-Memory is a hierarchical, agentic memory system designed to enhance multi-agent systems by managing interactions through a three-tier graph structure, which significantly improves task success rates and accuracy without altering existing frameworks.
Authors:Xin-Cheng Wen, Yijun Yang, Cuiyun Gao, Yang Xiao, Deheng Ye
Abstract:
Large language models (LLMs) demonstrate considerable proficiency in numerous coding-related tasks; however, their capabilities in detecting software vulnerabilities remain limited. This limitation primarily stems from two factors: (1) the absence of reasoning data related to vulnerabilities, which hinders the models' ability to capture underlying vulnerability patterns; and (2) their focus on learning semantic representations rather than the reason behind them, thus failing to recognize semantically similar vulnerability samples. Furthermore, the development of LLMs specialized in vulnerability detection is challenging, particularly in environments characterized by the scarcity of high-quality datasets. In this paper, we propose a novel framework, ReVD, which excels at mining vulnerability patterns through reasoning data synthesis and vulnerability-specific preference optimization. Specifically, we construct forward and backward reasoning processes for vulnerabilities and the corresponding fixed code, ensuring the synthesis of high-quality reasoning data. Moreover, we design triplet supervised fine-tuning followed by curriculum online preference optimization to enable ReVD to better understand vulnerability patterns. The extensive experiments conducted on the PrimeVul and SVEN datasets demonstrate that ReVD sets a new state of the art for LLM-based software vulnerability detection, e.g., a 12.24\%-22.77\% improvement in accuracy. The source code and data are available at https://github.com/Xin-Cheng-Wen/PO4Vul.
中文: 大语言模型在软件漏洞检测方面存在局限,主要因缺乏推理数据和语义学习,而提出的ReVD框架通过合成推理过程和优化偏好,显著提升了检测准确率,创下最新性能记录。
English: Large language models struggle with software vulnerability detection due to lacking reasoning data and semantic focus, but the proposed ReVD framework overcomes this by synthesizing reasoning processes and optimizing preferences, achieving state-of-the-art performance with significant accuracy improvements.
Authors:Bruno Moreira Coimbra, Marco Mambelli
Abstract:
GlideinWMS has been one of the first middleware systems in the WLCG community to transition from X.509 to also supporting tokens. The first step was to get from the prototype in 2019 to using tokens in production in 2022. This paper will present the challenges introduced by the wider adoption of tokens and the evolution plans for securing the pilot infrastructure of GlideinWMS and supporting the new requirements. In the last couple of years, the GlideinWMS team supported the migration of experiments and resources to tokens. Inadequate support in the current infrastructure, more stringent requirements, and the higher spatial and temporal granularity forced GlideinWMS to revisit once more how credentials are generated, used, and propagated. The new credential modules have been designed to be used in multiple systems (GlideinWMS, HEPCloud) and use a model where credentials have a type, a purpose, and different flows. Credentials are dynamically generated in order to customize the duration and limit the scope to the targeted resource. This makes it possible to enforce the least-privilege principle. Finally, we also considered adding credential storage, renewal, and invalidation mechanisms within the GlideinWMS infrastructure to better serve the experiments' needs.
中文: GlideinWMS 已从 X.509 过渡到支持令牌认证,通过设计动态凭据模块解决了基础设施不足和更严格需求等挑战,实施最小权限原则并集成存储和续订机制。
English: GlideinWMS has evolved from X.509 to token-based authentication, addressing challenges like infrastructure limitations and stricter requirements by developing dynamic credential modules that enforce least privilege and include storage and renewal mechanisms.
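A hypothetical sketch of the credential model described above, where each credential carries a type, a purpose, and a flow, and is generated dynamically with a narrowed scope and lifetime to enforce least privilege. Class and field names are invented for illustration and are not the actual GlideinWMS API.

    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone

    @dataclass
    class Credential:
        cred_type: str      # e.g., "scitoken" or "x509_proxy"
        purpose: str        # e.g., "pilot_submission"
        flow: str           # e.g., "factory_to_entry"
        scope: str          # narrowed to the targeted resource
        expires_at: datetime

    def generate_credential(resource: str, purpose: str,
                            lifetime: timedelta = timedelta(hours=1)) -> Credential:
        # Duration and scope are customized per target resource,
        # implementing the least-privilege principle.
        return Credential(
            cred_type="scitoken",
            purpose=purpose,
            flow="factory_to_entry",
            scope=f"compute.create:/{resource}",
            expires_at=datetime.now(timezone.utc) + lifetime,
        )

    print(generate_credential("ce01.example.org", "pilot_submission"))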
Authors:Jiaying He, Yitong Lin, Jiahe Chen, Honghui Xu, Jianwei Zheng
Abstract:
For the inherent challenge of insufficiently annotated samples in the medical field, semi-supervised medical image segmentation (SSMIS) offers a promising solution. Despite achieving impressive results in delineating primary target areas, most current methodologies struggle to precisely capture the subtle details of boundaries. This deficiency often leads to significant diagnostic inaccuracies. To tackle this issue, we introduce C3S3, a novel semi-supervised segmentation model that synergistically integrates complementary competition and contrastive selection. This design significantly sharpens boundary delineation and enhances overall precision. Specifically, we develop an Outcome-Driven Contrastive Learning module dedicated to refining boundary localization. Additionally, we incorporate a Dynamic Complementary Competition module that leverages two high-performing sub-networks to generate pseudo-labels, thereby further improving segmentation quality. The proposed C3S3 undergoes rigorous validation on two publicly accessible datasets, covering both MRI and CT scans. The results demonstrate that our method achieves superior performance compared to previous cutting-edge competitors. In particular, on the 95HD and ASD metrics, our approach achieves a notable improvement of at least 6%, highlighting significant advances. The code is available at https://github.com/Y-TARL/C3S3.
中文: C3S3模型通过融合结果驱动的对比学习和动态互补竞争机制,有效解决了半监督医学图像分割中的边界模糊问题,在MRI和CT数据集的关键指标上实现了超过6%的性能提升。
English: The C3S3 model addresses boundary imprecision in semi-supervised medical image segmentation by integrating outcome-driven contrastive learning and dynamic complementary competition, achieving over 6% improvement on key metrics across MRI and CT datasets.
Authors:Chengchao Shen, Dawei Liu, Jianxin Wang
Abstract:
Contrastive learning for single object centric images has achieved remarkable progress on unsupervised representation, but suffers from inferior performance on the widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single object centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in multi-object images, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves the leading unsupervised representation performance on both single object centric images and multi-object ones. The source code is available at https://github.com/visresearch/MultipleObjectStitching.
中文: 本文提出的多对象拼接(MOS)方法通过将单对象图像合成为多对象图像,无需人工标注即可提供额外对象对应关系,从而在多对象图像的无监督表征学习上取得领先性能。
English: The proposed Multiple Object Stitching (MOS) method enhances unsupervised representation learning for multi-object images by synthesizing them from single-object images, providing additional object correspondences without human annotation and achieving superior performance on various datasets.
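A minimal sketch of the stitching operation, assuming torch tensors: four single-object crops are tiled into one 2x2 multi-object image, and a layout map records which object occupies which cell, so object correspondences come for free.

    import torch

    def stitch_multi_object(images: torch.Tensor):
        """images: (4, C, H, W) single-object crops -> (C, 2H, 2W) image."""
        assert images.shape[0] == 4
        top = torch.cat([images[0], images[1]], dim=2)     # concat along width
        bottom = torch.cat([images[2], images[3]], dim=2)
        stitched = torch.cat([top, bottom], dim=1)         # concat along height
        layout = torch.arange(4).reshape(2, 2)             # object id per cell
        return stitched, layout

    imgs = torch.rand(4, 3, 224, 224)
    multi, layout = stitch_multi_object(imgs)
    print(multi.shape, layout)   # torch.Size([3, 448, 448]), [[0, 1], [2, 3]]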
Authors:Weijie Guan, Haohui Wang, Jian Kang, Lihui Liu, Dawei Zhou
Abstract:
Graph learning has been crucial to many real-world tasks, but it is often studied under a closed-world assumption, with all possible labels of data known a priori. To enable effective graph learning in an open and noisy environment, it is critical to inform the model users when the model makes a wrong prediction on in-distribution data of a known class, i.e., misclassification detection, or when the model encounters out-of-distribution data from novel classes, i.e., out-of-distribution detection. This paper introduces Evidential Reasoning Network (EVINET), a framework that addresses these two challenges by integrating Beta embedding within a subjective logic framework. EVINET includes two key modules: Dissonance Reasoning for misclassification detection and Vacuity Reasoning for out-of-distribution detection. Extensive experiments demonstrate that EVINET outperforms state-of-the-art methods across multiple metrics in the tasks of in-distribution classification, misclassification detection, and out-of-distribution detection. EVINET demonstrates the necessity of uncertainty estimation and logical reasoning for misclassification detection and out-of-distribution detection and paves the way for open-world graph learning. Our code and data are available at https://github.com/SSSKJ/EviNET.
中文: EVINET框架通过将Beta嵌入与主观逻辑相结合,有效解决了图学习中的误分类和分布外检测问题,在多项指标上表现优异,推动了开放世界图学习的发展。
English: EVINET is a novel framework that tackles misclassification and out-of-distribution detection in graph learning by integrating Beta embeddings with subjective logic, demonstrating superior performance across multiple metrics and advancing open-world graph learning.
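For orientation, the two subjective-logic quantities the framework reasons with can be computed directly from per-class evidence; the sketch below follows the standard subjective-logic formulas (vacuity for evidence scarcity, dissonance for conflicting evidence) and omits EVINET's Beta-embedding machinery, so treat it as background math rather than the paper's implementation.

    import numpy as np

    def vacuity_and_dissonance(evidence: np.ndarray):
        """evidence: non-negative per-class evidence vector of length K."""
        K = len(evidence)
        S = (evidence + 1.0).sum()          # Dirichlet strength
        belief = evidence / S               # per-class belief masses
        vacuity = K / S                     # high when evidence is scarce

        def balance(bj, bk):
            return 1.0 - abs(bj - bk) / (bj + bk) if bj + bk > 0 else 0.0

        dissonance = 0.0
        for k in range(K):
            others = [j for j in range(K) if j != k]
            denom = sum(belief[j] for j in others)
            if denom > 0:
                num = sum(belief[j] * balance(belief[j], belief[k]) for j in others)
                dissonance += belief[k] * num / denom
        return vacuity, dissonance

    print(vacuity_and_dissonance(np.array([0.1, 0.1, 0.1])))   # high vacuity: OOD-like
    print(vacuity_and_dissonance(np.array([10.0, 9.0, 0.1])))  # high dissonance: confusable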
Authors:Olga Kellert, Nemika Tyagi, Muhammad Imran, Nelvin Licona-Guevara, Carlos Gómez-Rodríguez
Abstract:
Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Parser, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Parser achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments. Data and source code are available at https://github.com/N3mika/ParsingProject
中文:BiLingua Parser 是一种基于大语言模型的注释流程,能有效为语码转换文本生成通用依存关系标注,经专家修订后达到 95.29% 的LAS值,在低资源环境中显著优于现有解析器。
English: The BiLingua Parser, an LLM-based annotation pipeline, effectively generates Universal Dependencies annotations for code-switched text, achieving up to 95.29% LAS after expert revision and outperforming existing parsers in low-resource settings.
Authors:Nada Aboudeshish, Dmitry Ignatov, Radu Timofte
Abstract:
Data augmentation is a crucial technique in deep learning, particularly for tasks with limited dataset diversity, such as skeleton-based datasets. This paper proposes a comprehensive data augmentation framework that integrates geometric transformations (random cropping, rotation, and zooming) and intensity-based transformations (brightness and contrast adjustments) to simulate real-world variations. Random cropping ensures the preservation of spatio-temporal integrity while addressing challenges such as viewpoint bias and occlusions. The augmentation pipeline generates three augmented versions for each sample in addition to the original sample, thus quadrupling the dataset size and enriching the diversity of gesture representations. The proposed augmentation strategy is evaluated on three models: multi-stream e2eET, FPPR point cloud-based hand gesture recognition (HGR), and DD-Network. Experiments are conducted on benchmark datasets including DHG14/28, SHREC'17, and JHMDB. The e2eET model is recognized as the state of the art for hand gesture recognition on DHG14/28 and SHREC'17. The FPPR-PCD model, the second-best performing model on SHREC'17, excels in point cloud-based gesture recognition. DD-Net, a lightweight and efficient architecture for skeleton-based action recognition, is evaluated on SHREC'17 and the Human Motion Data Base (JHMDB). The results underline the effectiveness and versatility of the proposed augmentation strategy, significantly improving model generalization and robustness across diverse datasets and architectures. This framework not only establishes state-of-the-art results on all three evaluated models but also offers a scalable solution to advance HGR and action recognition applications in real-world scenarios. The framework is available at https://github.com/NadaAbodeshish/Random-Cropping-augmentation-HGR
中文: 本文提出了一种全面的数据增强框架,通过几何和强度变换使骨架手势识别数据集多样性翻两番,在多个模型和基准数据集上均取得了最优性能。
English: This paper introduces a comprehensive data augmentation framework that enhances skeleton-based gesture recognition by quadrupling dataset diversity through geometric and intensity transformations, achieving state-of-the-art results across multiple models and datasets.
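A minimal sketch of the geometric side of such a pipeline, applied to a skeleton sequence of shape (frames, joints, 3): random rotation, zooming, and a random temporal crop, producing three extra views per sample as the abstract describes. Parameter ranges are illustrative, not the paper's exact settings.

    import numpy as np

    def augment_skeleton(seq: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        # Random rotation around the vertical (y) axis.
        theta = rng.uniform(-np.pi / 6, np.pi / 6)
        c, s = np.cos(theta), np.sin(theta)
        rot_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
        out = seq @ rot_y.T

        # Random zoom (uniform scaling about the origin).
        out = out * rng.uniform(0.8, 1.2)

        # Random temporal crop, resampled back to the original length.
        t = seq.shape[0]
        start = rng.integers(0, max(1, t // 10))
        end = t - rng.integers(0, max(1, t // 10))
        idx = np.linspace(start, end - 1, t).astype(int)
        return out[idx]

    rng = np.random.default_rng(0)
    sample = rng.normal(size=(64, 22, 3))        # 64 frames, 22 joints
    views = [augment_skeleton(sample, rng) for _ in range(3)]  # quadruples the data
    print(len(views), views[0].shape)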
Authors:Tianci Bu, Chuanrui Wang, Hao Ma, Haoren Zheng, Xin Lu, Tailin Wu
Abstract:
Generating graphs with hierarchical structures remains a fundamental challenge due to the limitations of Euclidean geometry in capturing exponential complexity. Here we introduce \textbf{GGBall}, a novel hyperbolic framework for graph generation that integrates geometric inductive biases with modern generative paradigms. GGBall combines a Hyperbolic Vector-Quantized Autoencoder (HVQVAE) with a Riemannian flow matching prior defined via closed-form geodesics. This design enables flow-based priors to model complex latent distributions, while vector quantization helps preserve the curvature-aware structure of the hyperbolic space. We further develop a suite of hyperbolic GNN and Transformer layers that operate entirely within the manifold, ensuring stability and scalability. Empirically, our model reduces degree MMD by over 75\% on Community-Small and over 40\% on Ego-Small compared to state-of-the-art baselines, demonstrating an improved ability to preserve topological hierarchies. These results highlight the potential of hyperbolic geometry as a powerful foundation for the generative modeling of complex, structured, and hierarchical data domains. Our code is available at \href{https://github.com/AI4Science-WestlakeU/GGBall}{here}.
中文摘要:GGBall提出了一种基于双曲几何的图生成框架,通过结合双曲向量量化自编码器和黎曼流匹配技术,能更好地捕捉层次结构,在保持拓扑特性方面显著优于现有方法。
English Summary: GGBall introduces a hyperbolic geometry-based framework for graph generation, combining a Hyperbolic Vector-Quantized Autoencoder with Riemannian flow matching to better capture hierarchical structures and significantly outperform existing methods in preserving topological properties.
Authors:Qi Liu, Jingqing Ruan, Hao Li, Haodong Zhao, Desheng Wang, Jiansong Chen, Wan Guanglu, Xunliang Cai, Zhi Zheng, Tong Xu
Abstract:
Existing multi-objective preference alignment methods for large language models (LLMs) face limitations: (1) the inability to effectively balance various preference dimensions, and (2) reliance on auxiliary reward/reference models introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO's capability to achieve dimension-aware preference alignment, highlighting its superiority. Our codes and datasets are available at https://github.com/Javkonline/AMoPO.
中文: AMoPO提出了一种无需辅助奖励模型即可动态平衡大语言模型中多维度偏好的新框架,相比现有方法性能提升28.5%,并在不同规模模型上展现出优秀的扩展能力。
English: AMoPO introduces a novel framework that dynamically balances multiple preference dimensions in large language models without requiring auxiliary reward models, achieving a 28.5% performance improvement over existing methods while demonstrating strong scalability across different model sizes.
Authors:Van Nguyen Nguyen, Christian Forster, Sindi Shkodrani, Vincent Lepetit, Bugra Tekin, Cem Keskin, Tomas Hodan
Abstract:
We introduce GoTrack, an efficient and accurate CAD-based method for 6DoF object pose refinement and tracking, which can handle diverse objects without any object-specific training. Unlike existing tracking methods that rely solely on an analysis-by-synthesis approach for model-to-frame registration, GoTrack additionally integrates frame-to-frame registration, which saves compute and stabilizes tracking. Both types of registration are realized by optical flow estimation. The model-to-frame registration is noticeably simpler than in existing methods, relying only on standard neural network blocks (a transformer is trained on top of DINOv2) and producing reliable pose confidence scores without a scoring network. For the frame-to-frame registration, which is an easier problem as consecutive video frames are typically nearly identical, we employ a light off-the-shelf optical flow model. We demonstrate that GoTrack can be seamlessly combined with existing coarse pose estimation methods to create a minimal pipeline that reaches state-of-the-art RGB-only results on standard benchmarks for 6DoF object pose estimation and tracking. Our source code and trained models are publicly available at https://github.com/facebookresearch/gotrack
中文: GoTrack是一种高效的基于CAD的六自由度物体姿态优化与跟踪方法,通过光流估计结合模型到帧和帧到帧的配准,无需物体特定训练即可达到最先进的性能。
English: GoTrack is an efficient CAD-based method for 6DoF object pose refinement and tracking that integrates both model-to-frame and frame-to-frame registration through optical flow estimation, achieving state-of-the-art results without object-specific training.
Authors:Xintao Yan, Erdao Liang, Jiawei Wang, Haojie Zhu, Henry X. Liu
Abstract:
Datasets pertaining to autonomous vehicles (AVs) hold significant promise for a range of research fields, including artificial intelligence (AI), autonomous driving, and transportation engineering. Nonetheless, these datasets often encounter challenges related to the states of traffic signals, such as missing or inaccurate data. Such issues can compromise the reliability of the datasets and adversely affect the performance of models developed using them. This research introduces a fully automated approach designed to tackle these issues by utilizing available vehicle trajectory data alongside knowledge from the transportation domain to effectively impute and rectify traffic signal information within the Waymo Open Motion Dataset (WOMD). The proposed method is robust and flexible, capable of handling diverse intersection geometries and traffic signal configurations in real-world scenarios. Comprehensive validations have been conducted on the entire WOMD, focusing on over 360,000 relevant scenarios involving traffic signals, out of a total of 530,000 real-world driving scenarios. In the original dataset, 71.7% of traffic signal states are either missing or unknown, all of which were successfully imputed by our proposed method. Furthermore, in the absence of ground-truth signal states, the accuracy of our approach is evaluated based on the rate of red-light violations among vehicle trajectories. Results show that our method reduces the estimated red-light running rate from 15.7% in the original data to 2.9%, thereby demonstrating its efficacy in rectifying data inaccuracies. This paper significantly enhances the quality of AV datasets, contributing to the wider AI and AV research communities and benefiting various downstream applications. The code and improved traffic signal data are open-sourced at https://github.com/michigan-traffic-lab/WOMD-Traffic-Signal-Data-Improvement
中文: 本研究提出一种自动化方法,利用车辆轨迹数据和交通领域知识对Waymo开放运动数据集中的交通信号状态进行有效填补与修正,将闯红灯率从15.7%降至2.9%,显著提升了自动驾驶数据集的可靠性。
English: This study presents an automated method that enhances the Waymo Open Motion Dataset by accurately imputing missing traffic signal states using vehicle trajectory data and transportation knowledge, reducing red-light violations from 15.7% to 2.9% and improving dataset reliability for autonomous vehicle research.
Authors:Hao Tang, Chengchao Shen
Abstract:
Large multimodal models (LMMs) suffer significant computational challenges due to the high cost of Large Language Models (LLMs) and the quadratic complexity of processing long vision token sequences. In this paper, we explore the spatial redundancy among vision tokens and shorten the length of vision token sequences for inference acceleration. Specifically, we propose a Spatial Token Fusion (STF) method to learn compact vision tokens for a short vision token sequence, where spatially adjacent tokens are fused into one. Meanwhile, the weight-frozen vision encoder cannot adapt well to the demands of diverse downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the STF and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities. Experimental results demonstrate that our method based on LLaVA-1.5 achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks with only $25\%$ of the baseline's vision tokens. The source code and trained weights are available at https://github.com/visresearch/LLaVA-STF.
中文摘要:本文提出空间令牌融合方法,通过压缩视觉令牌并补充多粒度特征来降低大型多模态模型的计算成本,仅用基线25%的令牌即可实现相当甚至更优的性能。
English Summary: The paper introduces a Spatial Token Fusion method to reduce computational costs in large multimodal models by compressing vision tokens and enhancing them with multi-granularity features, achieving comparable performance with only 25% of baseline tokens.
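A minimal sketch of the fusion step, assuming a square grid of vision tokens: each 2x2 neighborhood is concatenated and linearly projected back to one token, shortening the sequence by 4x (matching the 25% token budget above). Shapes and module names are illustrative.

    import torch
    import torch.nn as nn

    class SpatialTokenFusion(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.proj = nn.Linear(4 * dim, dim)   # fuse 4 adjacent tokens into 1

        def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
            """tokens: (B, h*w, C) -> (B, h*w/4, C)."""
            b, _, c = tokens.shape
            x = tokens.view(b, h, w, c)
            # Group each 2x2 spatial neighborhood along the channel axis.
            x = x.view(b, h // 2, 2, w // 2, 2, c).permute(0, 1, 3, 2, 4, 5)
            x = x.reshape(b, (h // 2) * (w // 2), 4 * c)
            return self.proj(x)

    fuse = SpatialTokenFusion(dim=1024)
    vision_tokens = torch.randn(2, 24 * 24, 1024)   # e.g., a 24x24 token grid
    print(fuse(vision_tokens, 24, 24).shape)        # torch.Size([2, 144, 1024])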
Authors:Changhong Fu, Hua Lin, Haobo Zuo, Liangliang Yao, Liguo Zhang
Abstract:
Text spotting for industrial panels is a key task for intelligent monitoring. However, achieving efficient and accurate text spotting for complex industrial panels remains challenging due to issues such as cross-scale localization and ambiguous boundaries in dense text regions. Moreover, most existing methods primarily focus on representing a single text shape, neglecting a comprehensive exploration of multi-scale feature information across different texts. To address these issues, this work proposes a novel multi-scale dense text spotter for edge AI-based vision systems (EdgeSpotter) to achieve accurate and robust industrial panel monitoring. Specifically, a novel Transformer with an efficient mixer is developed to learn the interdependencies among multi-level features, integrating multi-layer spatial and semantic cues. In addition, a new feature sampling scheme with Catmull-Rom splines is designed, which explicitly encodes the shape, position, and semantic information of text, thereby alleviating missed detections and reducing recognition errors caused by multi-scale or dense text regions. Furthermore, a new benchmark dataset for industrial panel monitoring (IPM) is constructed. Extensive qualitative and quantitative evaluations on this challenging benchmark dataset validate the superior performance of the proposed method in different challenging panel monitoring tasks. Finally, practical tests based on the self-designed edge AI-based vision system demonstrate the practicality of the method. The code and demo will be available at https://github.com/vision4robotics/EdgeSpotter.
中文: 本文提出EdgeSpotter多尺度密集文本识别器,通过新型Transformer和特征采样方法解决工业仪表复杂文本检测难题,并基于新建数据集和边缘AI系统验证了其优越性能。
English: This paper introduces EdgeSpotter, a multi-scale dense text spotter using a novel Transformer and feature sampling method to overcome challenges in industrial panel text recognition, validated by a new benchmark dataset and edge AI system tests.
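The Catmull-Rom sampling can be illustrated in isolation: given sparse control points along a text boundary, the spline interpolates a smooth, dense sequence of sample points. The formula below is the standard Catmull-Rom form; how the samples feed the detector's features is omitted.

    import numpy as np

    def catmull_rom(p0, p1, p2, p3, num=10):
        """Interpolate points between p1 and p2 (all inputs are 2D arrays)."""
        t = np.linspace(0.0, 1.0, num)[:, None]
        return 0.5 * ((2 * p1)
                      + (-p0 + p2) * t
                      + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                      + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)

    def sample_curve(control: np.ndarray, per_segment=10) -> np.ndarray:
        """control: (N, 2) control points -> dense points along the spline."""
        pts = [catmull_rom(control[i - 1], control[i], control[i + 1],
                           control[i + 2], per_segment)
               for i in range(1, len(control) - 2)]
        return np.concatenate(pts, axis=0)

    ctrl = np.array([[0, 0], [1, 2], [3, 3], [5, 1], [6, 0]], dtype=float)
    print(sample_curve(ctrl).shape)   # dense samples along the text boundary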
Authors:Rong-Xi Tan, Ming Chen, Ke Xue, Yao Wang, Yaoyuan Wang, Sheng Fu, Chao Qian
Abstract:
The pursuit of universal black-box optimization (BBO) algorithms is a longstanding goal. However, unlike domains such as language or vision, where scaling structured data has driven generalization, progress in offline BBO remains hindered by the lack of unified representations for heterogeneous numerical spaces. Thus, existing offline BBO approaches are constrained to single-task and fixed-dimensional settings, failing to achieve cross-domain universal optimization. Recent advances in language models (LMs) offer a promising path forward: their embeddings capture latent relationships in a unifying way, enabling universal optimization across different data types. In this paper, we discuss multiple potential approaches, including an end-to-end learning framework in the form of next-token prediction, as well as prioritizing the learning of latent spaces with strong representational capabilities. To validate the effectiveness of these methods, we collect offline BBO tasks and data from open-source academic works for training. Experiments demonstrate the universality and effectiveness of our proposed methods. Our findings suggest that unifying language model priors and learning string embedding space can overcome traditional barriers in universal BBO, paving the way for general-purpose BBO algorithms. The code is provided at https://github.com/lamda-bbo/universal-offline-bbo.
Chinese: 语言模型的最新进展通过利用其统一嵌入和学习框架,为克服异构数值空间中的传统障碍,实现通用黑盒优化提供了有希望的途径。
English: Recent advances in language models provide a promising path toward universal black-box optimization by leveraging their unified embeddings and learning frameworks to overcome traditional barriers in heterogeneous numerical spaces.
Authors:Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, Erchin Serpedin
Abstract:
Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.
中文:定理思维(ToTh)框架通过将推理建模为三个采用不同推理模式的智能体协作过程,将其输出构建为推理图并通过贝叶斯信念传播评估一致性,从而优于现有方法并提供可解释、逻辑严密的推理结果。
English: The Theorem-of-Thought (ToTh) framework enhances LLM reasoning by modeling it as a collaborative process among three agents using distinct inference modes, structuring their outputs into reasoning graphs evaluated for coherence via Bayesian belief propagation, which outperforms existing methods and provides interpretable, logically grounded results.
Authors:Wenying He, Jieling Huang, Junhua Gu, Ji Zhang, Yude Bai
Abstract:
Missing data in spatiotemporal systems presents a significant challenge for modern applications, ranging from environmental monitoring to urban traffic management. The integrity of spatiotemporal data often deteriorates due to hardware malfunctions and software failures in real-world deployments. Current approaches based on machine learning and deep learning struggle to model the intricate interdependencies between spatial and temporal dimensions effectively and, more importantly, suffer from cumulative errors during the data imputation process, which propagate and amplify through iterations. To address these limitations, we propose CoFILL, a novel Conditional Diffusion Model for spatiotemporal data imputation. CoFILL builds on the inherent advantages of diffusion models to generate high-quality imputations without relying on potentially error-prone prior estimates. It incorporates an innovative dual-stream architecture that processes temporal and frequency domain features in parallel. By fusing these complementary features, CoFILL captures both rapid fluctuations and underlying patterns in the data, which enables more robust imputation. The extensive experiments reveal that CoFILL's noise prediction network successfully transforms random noise into meaningful values that align with the true data distribution. The results also show that CoFILL outperforms state-of-the-art methods in imputation accuracy. The source code is publicly available at https://github.com/joyHJL/CoFILL.
Chinese Summary: 针对时空数据修复中复杂依赖关系和误差传播的难题,CoFILL创新性地采用条件扩散模型与双流架构,通过并行处理时域和频域特征实现高质量数据补全,在修复精度上显著超越现有最优方法。
English Summary: Spatiotemporal data imputation is challenging due to complex interdependencies and error propagation in existing methods, but CoFILL, a novel conditional diffusion model with dual-stream architecture, overcomes these limitations by generating high-quality imputations that outperform state-of-the-art approaches.
Authors:Kai Xiong, Xiao Ding, Yixin Cao, Yuxiong Yan, Li Du, Yufei Zhang, Jinglong Gao, Jiaqian Liu, Bing Qin, Ting Liu
Abstract:
Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training, enabling them to achieve human-like performance in simple commonsense reasoning. Nevertheless, LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simple ones (such as understanding the long-term effects of certain events), an aspect humans tend to focus on more. Existing works focus on complex tasks like math and code, while complex commonsense reasoning remains underexplored due to its uncertainty and lack of structure. To fill this gap and align with real-world concerns, we propose a benchmark Com$^2$ focusing on complex commonsense reasoning. We first incorporate causal event graphs to serve as structured complex commonsense. Then we adopt causal theory~(e.g., intervention) to modify the causal event graphs and obtain different scenarios that meet human concerns. Finally, an LLM is employed to synthesize examples with slow thinking, which is guided by the logical relationships in the modified causal graphs. Furthermore, we use detective stories to construct a more challenging subset. Experiments show that LLMs struggle in reasoning depth and breadth, while post-training and slow thinking can alleviate this. The code and data are available at https://github.com/Waste-Wood/Com2.
中文摘要:大型语言模型擅长处理简单常识推理,但在复杂隐性知识方面表现不足,为此提出的Com²基准通过结构化因果推理和慢思考方法,旨在评估并提升其在此类场景下的能力。
English Summary: Large language models excel at simple commonsense reasoning but struggle with complex, implicit scenarios, prompting the creation of the Com² benchmark to evaluate and improve their performance in this area through structured causal reasoning and slow-thinking methodologies.
Authors:Dongryung Lee, Sejune Joo, Kimin Lee, Beomjoon Kim
Abstract:
The problem of relocating a set of objects to designated areas amidst movable obstacles can be framed as a Geometric Task and Motion Planning (G-TAMP) problem, a subclass of task and motion planning (TAMP). Traditional approaches to G-TAMP have relied either on domain-independent heuristics or on learning from planning experience to guide the search, both of which typically demand significant computational resources or data. In contrast, humans often use common sense to intuitively decide which objects to manipulate in G-TAMP problems. Inspired by this, we propose leveraging Large Language Models (LLMs), which have common sense knowledge acquired from internet-scale data, to guide task planning in G-TAMP problems. To enable LLMs to perform geometric reasoning, we design a predicate-based prompt that encodes geometric information derived from a motion planning algorithm. We then query the LLM to generate a task plan, which is used to search for a feasible set of continuous parameters. Since LLMs are prone to mistakes, instead of committing to the LLM's outputs, we extend Monte Carlo Tree Search (MCTS) to a hybrid action space and use the LLM to guide the search. Unlike the previous approach that calls an LLM at every node and incurs high computational costs, we use it to warm-start the MCTS with the nodes explored in completing the LLM's task plan. On six different G-TAMP problems, we show that our method outperforms previous LLM planners and pure search algorithms. Code can be found at: https://github.com/iMSquared/prime-the-search
中文: 本研究提出一种创新方法,利用具备常识的大型语言模型指导几何任务与运动规划中的任务规划,通过谓词提示进行几何推理,并将模型输出与蒙特卡洛树搜索结合以高效求解,在六类问题中表现优于现有方法。
English: This study introduces a novel approach that leverages Large Language Models (LLMs) with common sense to guide task planning in Geometric Task and Motion Planning (G-TAMP), using a predicate-based prompt for geometric reasoning and integrating LLM outputs with Monte Carlo Tree Search (MCTS) to efficiently find feasible solutions, outperforming prior methods in six G-TAMP problems.
Authors:Zheng Wang, Kai Ying, Bin Xu, Chunjiao Wang, Cong Bai
Abstract:
Accurate near-real-time precipitation retrieval has been enhanced by satellite-based technologies. However, infrared-based algorithms have low accuracy due to the weak relation between infrared observations and surface precipitation, whereas passive microwave and radar-based methods are more accurate but limited in range. This challenge motivates the Precipitation Retrieval Expansion (PRE) task, which aims to enable accurate, infrared-based full-disc precipitation retrievals beyond the scanning swath. We introduce Multimodal Knowledge Expansion, a two-stage pipeline with the proposed PRE-Net model. In the Swath-Distilling stage, PRE-Net transfers knowledge from a multimodal data integration model to an infrared-based model within the scanning swath via Coordinated Masking and Wavelet Enhancement (CoMWE). In the Full-Disc Adaptation stage, Self-MaskTune refines predictions across the full disc by balancing multimodal and full-disc infrared knowledge. Experiments on the introduced PRE benchmark demonstrate that PRE-Net significantly advances precipitation retrieval performance, outperforming leading products like PERSIANN-CCS, PDIR, and IMERG. The code will be available at https://github.com/Zjut-MultimediaPlus/PRE-Net.
中文摘要:PRE-Net模型通过多模态知识扩展的两阶段流程,在扫描范围内通过协同掩蔽与小波增强实现知识迁移,并在全圆盘范围内通过自掩蔽调优平衡多模态与红外数据,显著提升了基于红外的全圆盘降水反演精度。
English Summary: The PRE-Net model introduces a two-stage multimodal knowledge expansion approach that significantly improves infrared-based full-disc precipitation retrieval accuracy by transferring knowledge from multimodal data and refining predictions across the entire observation area.
Authors:Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua
Abstract:
As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to internal activations of LLMs during inference, which further induces refusal behavior in the LLM. However, indiscriminately applying activation steering fundamentally suffers from the trade-off between safety and utility, since the same steering vector can also lead to over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it considers activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero vector for steering benign data, under null-space constraints. For safety enhancement, it learns to construct a refusal direction vector for steering malicious data, with the help of linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising general capabilities. Our codes are available at https://github.com/AlphaLab-USTC/AlphaSteer.
中文: AlphaSteer是一种基于理论的激活引导方法,通过为恶意提示学习拒绝向量来增强大语言模型安全性,同时利用近零向量保持良性输入的实用性,有效平衡了安全性与性能。
English: AlphaSteer is a theoretically grounded activation steering method that enhances LLM safety by learning refusal vectors for malicious prompts while preserving utility through nearly zero vectors for benign inputs, effectively balancing safety and performance.
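A minimal sketch of the null-space intuition: if benign activations occupy a low-dimensional subspace, a projector onto its orthogonal complement sends every benign activation to (nearly) zero while leaving a usable component of the refusal direction. The learned-regression part of AlphaSteer is simplified away; dimensions and data here are synthetic.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64
    basis = rng.normal(size=(16, d))                 # benign activations assumed low-rank
    benign = rng.normal(size=(200, 16)) @ basis      # benign prompt activations
    refusal_dir = rng.normal(size=d)                 # stand-in refusal direction

    # Projector onto the null space of the benign activations.
    _, s, vt = np.linalg.svd(benign, full_matrices=True)
    rank = int(np.sum(s > 1e-8))
    null_basis = vt[rank:]                           # directions benign data never uses
    P = null_basis.T @ null_basis

    def steer(h: np.ndarray, strength: float = 1.0) -> np.ndarray:
        # Only the steering component outside the benign span survives,
        # so utility on benign prompts is (approximately) preserved.
        return h + strength * (P @ refusal_dir)

    print(np.linalg.norm(P @ benign[0]))    # ~0: benign activations untouched
    print(np.linalg.norm(P @ refusal_dir))  # nonzero: refusal steering remains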
Authors:Arun Sharma, Mingzhou Yang, Majid Farhadloo, Subhankar Ghosh, Bharat Jayaprakash, Shashi Shekhar
Abstract:
Given trajectory data, a domain-specific study area, and a user-defined threshold, we aim to find anomalous trajectories indicative of possible GPS spoofing (e.g., fake trajectories). The problem is societally important to curb illegal activities in international waters, such as unauthorized fishing and illicit oil transfers. The problem is challenging due to advances in AI-generated deep fakes (e.g., additive noise, fake trajectories) and the lack of an adequate amount of labeled samples for ground-truth verification. Recent literature shows promising results for anomalous trajectory detection using generative models despite data sparsity. However, these methods do not consider fine-scale spatiotemporal dependencies and prior physical knowledge, resulting in higher false-positive rates. To address these limitations, we propose a physics-informed diffusion model that integrates kinematic constraints to identify trajectories that do not adhere to physical laws. Experimental results on real-world datasets in the maritime and urban domains show that the proposed framework results in higher prediction accuracy and lower estimation error rate for anomaly detection and trajectory generation methods, respectively. Our implementation is available at https://github.com/arunshar/Physics-Informed-Diffusion-Probabilistic-Model.
中文摘要:本研究提出了一种融合运动学约束的物理信息扩散模型,用于检测GPS欺骗等异常轨迹,在海上和城市数据集中实现了更高的检测精度和更低的误差率。
English Summary: This study introduces a physics-informed diffusion model that incorporates kinematic constraints to detect anomalous trajectories, such as GPS spoofing, achieving higher accuracy and lower error rates in maritime and urban datasets.
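A minimal sketch of the kinematic-constraint idea outside the diffusion model: flag trajectory segments whose implied speed or acceleration exceeds physically plausible bounds for a vessel. Thresholds and the toy spoofed jump are illustrative.

    import numpy as np

    def kinematic_violations(xy: np.ndarray, dt: float,
                             max_speed=15.0, max_accel=1.0) -> np.ndarray:
        """xy: (T, 2) positions in meters; dt: seconds between fixes.
        Returns a boolean mask over segments breaking the constraints."""
        vel = np.diff(xy, axis=0) / dt
        speed = np.linalg.norm(vel, axis=1)
        accel = np.abs(np.diff(speed)) / dt
        bad_speed = speed > max_speed
        bad_accel = np.concatenate([[False], accel > max_accel])
        return bad_speed | bad_accel

    t = np.arange(0, 60, 1.0)
    xy = np.stack([5.0 * t, np.zeros_like(t)], axis=1)    # steady 5 m/s track
    xy[30] += np.array([500.0, 0.0])                      # spoofed GPS jump
    print(np.where(kinematic_violations(xy, dt=1.0))[0])  # flags around index 30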
Authors:Mingyi Li, Michael R. Metel, Akiko Takeda
Abstract:
The K-means algorithm is one of the most widely studied clustering algorithms in machine learning. While extensive research has focused on its ability to achieve a globally optimal solution, there still lacks a rigorous analysis of its local optimality guarantees. In this paper, we first present conditions under which the K-means algorithm converges to a locally optimal solution. Based on this, we propose simple modifications to the K-means algorithm which ensure local optimality in both the continuous and discrete sense, with the same computational complexity as the original K-means algorithm. As the dissimilarity measure, we consider a general Bregman divergence, which is an extension of the squared Euclidean distance often used in the K-means algorithm. Numerical experiments confirm that the K-means algorithm does not always find a locally optimal solution in practice, while our proposed methods provide improved locally optimal solutions with reduced clustering loss. Our code is available at https://github.com/lmingyi/LO-K-means.
Chinese: 本文分析了K-means算法的局部最优性,并提出了在保持相同计算复杂度的前提下,能够确保算法在广义Bregman散度下收敛到局部最优解的改进方法。
English: This paper analyzes the local optimality of the K-means algorithm and proposes modifications that ensure convergence to locally optimal solutions under general Bregman divergence, maintaining the same computational complexity as the original algorithm.
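A minimal sketch of K-means under a Bregman divergence, shown with the generalized KL divergence on positive data. The assignment step uses the chosen divergence, while the centroid update stays the plain mean, since the mean minimizes the total Bregman divergence to cluster members for every Bregman divergence; the paper's local-optimality modifications are not reproduced here.

    import numpy as np

    def gen_kl(x, y):
        """Generalized KL divergence d(x, y), a Bregman divergence."""
        return np.sum(x * np.log(x / y) - x + y, axis=-1)

    def bregman_kmeans(X, k, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            # Assignment: nearest center under the Bregman divergence.
            d = gen_kl(X[:, None, :], centers[None, :, :])
            labels = d.argmin(axis=1)
            # Update: cluster means (optimal for any Bregman divergence);
            # keep the old center if a cluster goes empty.
            centers = np.stack([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        return labels, centers

    X = np.abs(np.random.default_rng(1).normal(size=(300, 8))) + 0.1
    labels, centers = bregman_kmeans(X, k=3)
    print(np.bincount(labels))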
Authors:Senqi Yang, Dongyu Zhang, Jing Ren, Ziqi Xu, Xiuzhen Zhang, Yiliao Song, Hongfei Lin, Feng Xia
Abstract:
Metaphors are pervasive in communication, making them crucial for natural language processing (NLP). Previous research on automatic metaphor processing predominantly relies on training data consisting of English samples, which often reflect Western European or North American biases. This cultural skew can lead to an overestimation of model performance and contributions to NLP progress. However, the impact of cultural bias on metaphor processing, particularly in multimodal contexts, remains largely unexplored. To address this gap, we introduce MultiMM, a Multicultural Multimodal Metaphor dataset designed for cross-cultural studies of metaphor in Chinese and English. MultiMM consists of 8,461 text-image advertisement pairs, each accompanied by fine-grained annotations, providing a deeper understanding of multimodal metaphors beyond a single cultural domain. Additionally, we propose Sentiment-Enriched Metaphor Detection (SEMD), a baseline model that integrates sentiment embeddings to enhance metaphor comprehension across cultural backgrounds. Experimental results validate the effectiveness of SEMD on metaphor detection and sentiment analysis tasks. We hope this work increases awareness of cultural bias in NLP research and contributes to the development of fairer and more inclusive language models. Our dataset and code are available at https://github.com/DUTIR-YSQ/MultiMM.
中文摘要:本文提出MultiMM多文化多模态隐喻数据集,通过提供带标注的中英文广告对解决NLP中的文化偏见问题,并开发了情感增强检测模型,有效提升了跨文化隐喻理解能力。
English Summary: This paper introduces MultiMM, a multicultural multimodal metaphor dataset addressing cultural bias in NLP by providing Chinese and English advertisement pairs with annotations, and proposes a sentiment-enhanced detection model that demonstrates improved cross-cultural metaphor understanding.
Authors:Anastasia Koloskova, Youssef Allouah, Animesh Jha, Rachid Guerraoui, Sanmi Koyejo
Abstract:
We address the problem of machine unlearning, where the goal is to remove the influence of specific training data from a model upon request, motivated by privacy concerns and regulatory requirements such as the "right to be forgotten." Unfortunately, existing methods rely on restrictive assumptions or lack formal guarantees. To this end, we propose a novel method for certified machine unlearning, leveraging the connection between unlearning and privacy amplification by stochastic post-processing. Our method uses noisy fine-tuning on the retain data, i.e., data that does not need to be removed, to ensure provable unlearning guarantees. This approach requires no assumptions about the underlying loss function, making it broadly applicable across diverse settings. We analyze the theoretical trade-offs in efficiency and accuracy and demonstrate empirically that our method not only achieves formal unlearning guarantees but also performs effectively in practice, outperforming existing baselines. Our code is available at https://github.com/stair-lab/certified-unlearning-neural-networks-icml-2025
中文: 本文提出了一种经过认证的机器遗忘方法,通过对保留数据进行噪声微调来提供无需严格假设的形式化遗忘保证,在理论和实践中均表现出优越性能。
English: This paper introduces a certified machine unlearning method that uses noisy fine-tuning on retained data to provide formal unlearning guarantees without restrictive assumptions, demonstrating both theoretical and practical effectiveness.
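A minimal sketch of the core recipe as described: fine-tune only on retain data while injecting Gaussian noise into the weights (the stochastic post-processing that yields the privacy-style guarantee). The noise scale, schedule, and toy model are illustrative; the calibrated constants come from the paper's analysis.

    import torch
    import torch.nn as nn

    def noisy_finetune(model: nn.Module, retain_loader, steps=100,
                       lr=1e-3, noise_std=1e-3):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        it = iter(retain_loader)
        for _ in range(steps):
            try:
                x, y = next(it)
            except StopIteration:
                it = iter(retain_loader)
                x, y = next(it)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            # Stochastic post-processing: perturb weights with Gaussian noise.
            with torch.no_grad():
                for p in model.parameters():
                    p.add_(noise_std * torch.randn_like(p))
        return model

    X, y = torch.randn(128, 20), torch.randint(0, 2, (128,))   # toy retain set
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(X, y), batch_size=32, shuffle=True)
    model = noisy_finetune(nn.Linear(20, 2), loader, steps=50)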
Authors:Mellon M. Zhang, Glen Chou, Saibal Mukhopadhyay
Abstract:
Accurate and efficient object detection is essential for autonomous vehicles, where real-time perception requires low latency and high throughput. LiDAR sensors provide robust depth information, but conventional methods process full 360° scans in a single pass, introducing significant delay. Streaming approaches address this by sequentially processing partial scans in the native polar coordinate system, yet they rely on translation-invariant convolutions that are misaligned with polar geometry -- resulting in degraded performance or requiring complex distortion mitigation. Recent Mamba-based state space models (SSMs) have shown promise for LiDAR perception, but only in the full-scan setting, relying on geometric serialization and positional embeddings that are memory-intensive and ill-suited to streaming. We propose Polar Hierarchical Mamba (PHiM), a novel SSM architecture designed for polar-coordinate streaming LiDAR. PHiM uses local bidirectional Mamba blocks for intra-sector spatial encoding and a global forward Mamba for inter-sector temporal modeling, replacing convolutions and positional encodings with distortion-aware, dimensionally-decomposed operations. PHiM sets a new state-of-the-art among streaming detectors on the Waymo Open Dataset, outperforming the previous best by 10\% and matching full-scan baselines at twice the throughput. Code will be available at https://github.com/meilongzhang/Polar-Hierarchical-Mamba.
中文: 提出的极坐标分层Mamba(PHiM)架构通过用感知畸变的Mamba模块替代传统卷积,解决了流式LiDAR检测的低效问题,在Waymo开放数据集中以超越先前最佳方法10%的性能刷新了最高水平,并实现了与全扫描相当的吞吐量。
English: The proposed Polar Hierarchical Mamba (PHiM) architecture addresses inefficiencies in streaming LiDAR detection by replacing conventional convolutions with distortion-aware Mamba blocks, achieving state-of-the-art performance on the Waymo Open Dataset with a 10% improvement over prior methods and matching full-scan throughput.
Authors:Nima Jamali, Matina Mahdizadeh Sani, Hanieh Naderi, Shohreh Kasaei
Abstract:
Deep neural networks (DNNs) have demonstrated remarkable performance in analyzing 3D point cloud data. However, their vulnerability to adversarial attacks, such as point dropping, shifting, and adding, poses a critical challenge to the reliability of 3D vision systems. These attacks can compromise the semantic and structural integrity of point clouds, rendering many existing defense mechanisms ineffective. To address this issue, a defense strategy named KNN-Defense is proposed, grounded in the manifold assumption and nearest-neighbor search in feature space. Instead of reconstructing surface geometry or enforcing uniform point distributions, the method restores perturbed inputs by leveraging the semantic similarity of neighboring samples from the training set. KNN-Defense is lightweight and computationally efficient, enabling fast inference and making it suitable for real-time and practical applications. Empirical results on the ModelNet40 dataset demonstrated that KNN-Defense significantly improves robustness across various attack types. In particular, under point-dropping attacks, where many existing methods underperform due to the targeted removal of critical points, the proposed method achieves accuracy gains of 20.1%, 3.6%, 3.44%, and 7.74% on PointNet, PointNet++, DGCNN, and PCT, respectively. These findings suggest that KNN-Defense offers a scalable and effective solution for enhancing the adversarial resilience of 3D point cloud classifiers. (An open-source implementation of the method, including code and data, is available at https://github.com/nimajam41/3d-knn-defense).
中文: KNN-Defense是一种轻量高效的防御方法,通过利用训练数据的语义相似性恢复受扰动的输入,显著提升了三维点云分类器对抗多种攻击的鲁棒性,并在不同模型上实现了明显的准确率提升。
English: KNN-Defense is a lightweight and efficient method that enhances the robustness of 3D point cloud classifiers against adversarial attacks by restoring perturbed inputs using semantic similarity from training data, achieving significant accuracy improvements across various models.
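A minimal sketch of the nearest-neighbor idea, assuming a frozen feature extractor and precomputed training features; `feat_extractor`, `train_feats`, and `train_labels` are hypothetical names, and the paper's exact restoration procedure may differ:

```python
import numpy as np

def knn_defense_predict(feat_extractor, train_feats, train_labels, x, k=5):
    """Classify a (possibly perturbed) point cloud by majority vote over
    the k nearest training samples in feature space, rather than trusting
    the classifier head on the perturbed input directly. Assumes integer
    class labels and L2 distances on fixed features."""
    f = feat_extractor(x)                        # (d,) feature of the input
    d = np.linalg.norm(train_feats - f, axis=1)  # distances to training set
    nn = np.argsort(d)[:k]                       # indices of k nearest feats
    votes = train_labels[nn]
    return np.bincount(votes).argmax()           # majority-vote label
```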
Authors:Ziheng Qiao, Houquan Zhou, Zhenghua Li
Abstract:
In the era of large language models (LLMs), various LLM-based methods have been developed for the Chinese Spelling Check (CSC) task, yet their performance remains unsatisfactory. In contrast, fine-tuned BERT-based models, relying on high-quality in-domain data, show excellent performance but suffer from edit-pattern overfitting. This paper proposes a novel dynamic mixture approach that effectively combines the probability distributions of small models and LLMs during the beam search decoding phase, balancing the precise corrections of small models with the fluency of LLMs. This approach also eliminates the need for fine-tuning LLMs, saving significant time and resources and facilitating domain adaptation. Comprehensive experiments demonstrate that our mixture approach significantly boosts error correction capabilities, achieving state-of-the-art results across multiple datasets. Our code is available at https://github.com/zhqiao-nlp/MSLLM.
中文: 本文提出一种动态混合方法,在集束搜索中结合小模型的精确修正与大语言模型的流畅性,无需微调即可实现最先进的中文拼写纠错效果。
English: This paper introduces a dynamic mixture approach that integrates small models' precision with LLMs' fluency during beam search, achieving state-of-the-art Chinese spelling correction without fine-tuning LLMs.
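The core decoding step can be sketched as a per-token mixture of the two models' distributions; `alpha` below is a fixed placeholder for the paper's dynamic weighting:

```python
import numpy as np

def mixed_next_token_dist(p_small, p_llm, alpha=0.7):
    """Mix the small CSC model's next-token distribution with the LLM's at
    each beam-search step. `alpha` stands in for the paper's dynamic
    weight; the mixture keeps the small model's precise corrections while
    borrowing the LLM's fluency."""
    p = alpha * p_small + (1.0 - alpha) * p_llm
    return p / p.sum()  # renormalize against numerical drift
```

During beam search, the log of this mixed distribution would score each candidate continuation, letting the small model veto implausible corrections while the LLM keeps the output fluent.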
Authors:Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Abstract:
In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in Vision-Language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLMs. Yet training exit classifiers in VLMs is challenging, particularly with limited labeled training data. To address this, we introduce FREE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier. The transformer layer is adversarially trained to produce feature representations similar to the final layer, while a feature classifier serves as the discriminator. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. Experimental results demonstrate the effectiveness of our approach in enhancing accuracy and model robustness by mitigating overthinking and the "mid-crisis" phenomenon that we highlight. We experimentally validate that our method speeds up the inference process by more than 1.51x while retaining comparable performance. The source code is available at https://github.com/Div290/FREE.
Chinese: 本文提出FREE方法,通过基于GAN框架的对抗性训练在视觉语言模型中实现早期退出策略,在保持性能的同时将推理速度提升超过1.51倍。
English: The paper introduces FREE, an adversarial training method using a GAN-based framework to implement Early Exit strategies in Vision-Language Models, significantly accelerating inference by over 1.51x while maintaining performance.
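A sketch of the input-adaptive inference loop that exit classifiers enable, assuming batch size 1 and a max-softmax confidence rule; the paper's adversarially trained exits would plug into `exit_heads`, and the threshold rule here is a common choice rather than necessarily theirs:

```python
import torch

@torch.no_grad()
def early_exit_forward(layers, exit_heads, x, threshold=0.9):
    """Input-adaptive inference: after each transformer layer, an attached
    exit classifier scores the running representation; if its max softmax
    confidence clears `threshold`, we stop early instead of running the
    full stack. Assumes x of shape (1, seq, dim); the adversarial training
    of the heads is not shown."""
    h = x
    for layer, head in zip(layers, exit_heads):
        h = layer(h)
        probs = torch.softmax(head(h.mean(dim=1)), dim=-1)  # pool, classify
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:
            return pred, probs          # confident enough: exit early
    return pred, probs                  # fall through to the final exit
```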
Authors:Armin Behnamnia, Gholamali Aminian, Alireza Aghaei, Chengchun Shi, Vincent Y. F. Tan, Hamid R. Rabiee
Abstract:
Off-policy learning and evaluation leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator's bias and variance. In the off-policy learning scenario, we establish bounds on the regret -- the performance gap between our LSE estimator and the optimal policy -- assuming a bounded $(1+\epsilon)$-th moment of the weighted reward. Notably, we achieve a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$ for the regret bounds, where $\epsilon \in [0,1]$ and $n$ is the size of the logged bandit feedback dataset. Theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach. The code for our estimator is available at the following link: https://github.com/armin-behnamnia/lse-offpolicy-learning.
中文摘要:作者提出了一种基于对数求和指数(LSE)的新型估计器,通过减少方差并在重尾奖励分布下保持鲁棒性,有效解决了离线策略学习和评估中的关键挑战,并提供了理论保证和实证验证。
English Summary: The authors propose a novel log-sum-exponential (LSE) estimator that addresses challenges in off-policy learning and evaluation by reducing variance and demonstrating robustness under heavy-tailed reward distributions, with theoretical guarantees and empirical validation.
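For intuition, here is a numerically stable sketch contrasting the standard IPS estimate with a log-sum-exponential variant; the exact form and tuning of `lam` in the paper may differ from this illustration:

```python
import numpy as np

def ips_estimate(rewards, target_probs, logged_props):
    """Vanilla inverse-propensity-score estimate: mean of weighted rewards."""
    w = target_probs / logged_props
    return np.mean(w * rewards)

def lse_estimate(rewards, target_probs, logged_props, lam=-1.0):
    """Log-sum-exponential estimate: (1/lam) * log mean(exp(lam * w * r)).
    A negative lam damps heavy-tailed weighted rewards, trading a little
    bias for much lower variance (the choice of lam here is illustrative,
    not the paper's schedule)."""
    z = (target_probs / logged_props) * rewards
    m = np.max(lam * z)  # stabilize the exponentials
    return (m + np.log(np.mean(np.exp(lam * z - m)))) / lam
```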
Authors:Ilya Kaufman Sirot, Omri Azencot
Abstract:
Deep learning models with a large number of parameters, often referred to as over-parameterized models, have achieved exceptional performance across various tasks. Despite concerns about overfitting, these models frequently generalize well to unseen data, thanks to effective regularization techniques, with data augmentation being among the most widely used. While data augmentation has shown great success in classification tasks using label-preserving transformations, its application in regression problems has received less attention. Recently, a novel manifold learning approach for generating synthetic data was proposed, utilizing a first-order approximation of the data manifold. Building on this foundation, we present a theoretical framework and practical tools for approximating and sampling general data manifolds. Furthermore, we introduce the Curvature-Enhanced Manifold Sampling (CEMS) method for regression tasks. CEMS leverages a second-order representation of the data manifold to enable efficient sampling and reconstruction of new data points. Extensive evaluations across multiple datasets and comparisons with state-of-the-art methods demonstrate that CEMS delivers superior performance in both in-distribution and out-of-distribution scenarios, while introducing only minimal computational overhead. Code is available at https://github.com/azencot-group/CEMS.
Chinese: 过参数化深度学习模型通过数据增强等正则化技术实现良好泛化,而提出的曲率增强流形采样(CEMS)方法利用二阶流形近似,以最小计算开销显著提升回归任务的性能。
English: Over-parameterized deep learning models achieve strong generalization through regularization like data augmentation, and the proposed Curvature-Enhanced Manifold Sampling (CEMS) method leverages second-order manifold approximations to significantly improve regression performance with minimal computational cost.
Authors:Chao Yin, Hao Li, Kequan Yang, Jide Li, Pinpin Zhu, Xiaoqiang Li
Abstract:
While promptable segmentation (e.g., SAM) has shown promise for various segmentation tasks, it still requires manual visual prompts for each object to be segmented. In contrast, task-generic promptable segmentation aims to reduce the need for such detailed prompts by employing only a task-generic prompt to guide segmentation across all test samples. However, when applied to Camouflaged Object Segmentation (COS), current methods still face two critical issues: 1) semantic ambiguity in getting instance-specific text prompts, which arises from insufficient discriminative cues in holistic captions, leading to foreground-background confusion; 2) semantic discrepancy combined with spatial separation in getting instance-specific visual prompts, which results from global background sampling far from object boundaries with low feature correlation, causing SAM to segment irrelevant regions. To address the issues above, we propose RDVP-MSD, a novel training-free test-time adaptation framework that synergizes Region-constrained Dual-stream Visual Prompting (RDVP) via a Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT). MSD-CoT progressively disentangles image captions to eliminate semantic ambiguity, while RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy and spatial separation. Without requiring any training or supervision, RDVP-MSD achieves state-of-the-art segmentation results on multiple COS benchmarks and delivers faster inference than previous methods, demonstrating significantly improved accuracy and efficiency. The code will be available at https://github.com/ycyinchao/RDVP-MSD
中文:提出的RDVP-MSD框架通过结合多模态逐步分解和区域约束双流视觉提示,解决了伪装物体分割中的语义模糊和空间分离问题,无需训练即可实现最先进的性能。
English: The proposed RDVP-MSD framework addresses semantic ambiguity and spatial separation in camouflaged object segmentation by combining multimodal stepwise decomposition and region-constrained dual-stream visual prompting, achieving state-of-the-art results without requiring training.
Authors:Tianjie Ju, Yujia Chen, Hao Fei, Mong-Li Lee, Wynne Hsu, Pengzhou Cheng, Zongru Wu, Zhuosheng Zhang, Gongshen Liu
Abstract:
Previous work has showcased the intriguing capabilities of Large Language Models (LLMs) in instruction-following and rhetorical fluency. However, their dual capabilities to autonomously persuade and resist persuasion, particularly in contexts involving psychological rhetoric, remain systematically unexplored. In this paper, we first evaluate four commonly adopted LLMs by tasking them to alternately act as persuaders and listeners in adversarial dialogues. Empirical results show that persuader LLMs predominantly employ repetitive strategies, leading to low success rates. Then we introduce eleven comprehensive psychological persuasion strategies, finding that explicitly instructing LLMs to adopt specific strategies such as Fluency Effect and Repetition Effect significantly improves persuasion success rates. However, no "one-size-fits-all" strategy proves universally effective, with performance heavily dependent on contextual counterfactuals. Motivated by these observations, we propose an adaptive framework based on direct preference optimization that trains LLMs to autonomously select optimal strategies by leveraging persuasion results from strategy-specific responses as preference pairs. Experiments on three open-source LLMs confirm that the proposed adaptive psychological persuasion method effectively enables persuader LLMs to select optimal strategies, significantly enhancing their success rates while maintaining general capabilities. Our code is available at https://github.com/KalinaEine/PsychologicalPersuasion.
中文: 本研究评估了大语言模型的说服能力,发现明确的心理策略可提高成功率,并提出一种自适应框架训练模型自主选择最优策略,显著提升了说服效果。
English: This study evaluates large language models' persuasion capabilities, finding that explicit psychological strategies improve success rates, and proposes an adaptive framework that trains models to autonomously select optimal strategies, significantly enhancing performance.
Authors:Walter Paci, Alessandro Panunzi, Sandro Pezzelle
Abstract:
Implicit content plays a crucial role in political discourse, where speakers systematically employ pragmatic strategies such as implicatures and presuppositions to influence their audiences. Large Language Models (LLMs) have demonstrated strong performance in tasks requiring complex semantic and pragmatic understanding, highlighting their potential for detecting and explaining the meaning of implicit content. However, their ability to do this within political discourse remains largely underexplored. Leveraging, for the first time, the large IMPAQTS corpus, which comprises Italian political speeches with the annotation of manipulative implicit content, we propose methods to test the effectiveness of LLMs in this challenging problem. Through a multiple-choice task and an open-ended generation task, we demonstrate that all tested models struggle to interpret presuppositions and implicatures. We conclude that current LLMs lack the key pragmatic capabilities necessary for accurately interpreting highly implicit language, such as that found in political discourse. At the same time, we highlight promising trends and future directions for enhancing model performance. We release our data and code at https://github.com/WalterPaci/IMPAQTS-PID
中文摘要:当前大型语言模型尚缺乏准确解读政治话语中预设和言外之意等隐性内容的关键语用能力,但研究显示出未来提升模型性能的积极趋势。
English Summary: Large Language Models currently lack the pragmatic capabilities to accurately interpret implicit content like presuppositions and implicatures in political discourse, though promising trends suggest potential for future improvement.
Authors:Mohammad-Maher Nakshbandi, Ziad Sharawy, Dorian Cojocaru, Sorin Grigorescu
Abstract:
In this study, we introduce LoopDB, a challenging loop closure dataset comprising over 1000 images captured across diverse environments, including parks, indoor scenes, parking spaces, and scenes centered on individual objects. Each scene is represented by a sequence of five consecutive images. The dataset was collected using a high-resolution camera, providing suitable imagery for benchmarking the accuracy of loop closure algorithms typically used in simultaneous localization and mapping. As ground truth, we provide the computed rotations and translations between consecutive images. In addition to its benchmarking goal, the dataset can be used to train and fine-tune loop closure methods based on deep neural networks. LoopDB is publicly available at https://github.com/RovisLab/LoopDB.
中文: 本研究推出LoopDB数据集,包含1000多张多样化环境图像,专为SLAM中的闭环检测算法基准测试和深度神经网络训练而设计,提供真实位姿数据并公开共享。
English: This study presents LoopDB, a challenging dataset of over 1000 images across varied environments designed for benchmarking and training loop closure algorithms in SLAM applications, with ground truth data and public availability.
Authors:Nidheesh Gorthi, Kartik Thakral, Rishabh Ranjan, Richa Singh, Mayank Vatsa
Abstract:
Biometric authentication systems are increasingly being deployed in critical applications, but they remain susceptible to spoofing. Since most of the research efforts focus on modality-specific anti-spoofing techniques, building a unified, resource-efficient solution across multiple biometric modalities remains a challenge. To address this, we propose LitMAS, a Lightweight and generalizable Multi-modal Anti-Spoofing framework designed to detect spoofing attacks in speech, face, iris, and fingerprint-based biometric systems. At the core of LitMAS is a Modality-Aligned Concentration Loss, which enhances inter-class separability while preserving cross-modal consistency and enabling robust spoof detection across diverse biometric traits. With just 6M parameters, LitMAS surpasses state-of-the-art methods by 1.36% in average EER across seven datasets, demonstrating high efficiency, strong generalizability, and suitability for edge deployment. Code and trained models are available at https://github.com/IAB-IITJ/LitMAS.
Chinese: LitMAS是一种轻量级、可泛化的多模态防伪框架,通过仅600万参数即可有效检测语音、人脸、虹膜和指纹生物识别系统中的欺骗攻击,在七个数据集上的平均等错误率比现有最优方法提升1.36%。
English: LitMAS is a lightweight, generalizable multi-modal anti-spoofing framework that effectively detects spoofing attacks across speech, face, iris, and fingerprint biometric systems using only 6M parameters, outperforming existing methods by 1.36% in average EER.
Authors:Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, Terence L. van Zyl
Abstract:
Mitigating human-wildlife conflict means resolving unwanted encounters between humans and animals. Computer vision offers a way to identify animals whose presence might escalate into conflict, such as members of the Big Five African species. However, environments often contain several varied species. Current state-of-the-art animal classification models are trained under a closed-world assumption and almost always remain overconfident in their predictions, even when presented with unknown classes. This study investigates out-of-distribution (OOD) detection of wildlife, specifically the Big Five. To this end, we select a parametric Nearest Class Mean (NCM) approach and a non-parametric contrastive learning approach as baselines to take advantage of pretrained and projected features from popular classification encoders. Moreover, we compare our baselines to various common OOD methods in the literature. The results show that feature-based methods reflect stronger generalisation capability across varying classification thresholds. Specifically, NCM with ImageNet pre-trained features achieves a 2%, 4% and 22% improvement on AUPR-IN, AUPR-OUT and AUTC over the best OOD methods, respectively. The code can be found at https://github.com/pxpana/BIG5OOD
中文摘要:本研究采用基于特征的方法探索非洲五大动物的分布外检测,相比现有方法实现了AUTC指标22%的性能提升。
English Summary: This study explores out-of-distribution detection for African Big Five wildlife using feature-based methods, demonstrating superior generalization with a 22% AUTC improvement over existing approaches.
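The parametric NCM baseline is simple enough to sketch end-to-end: fit one mean per known class on frozen pretrained features, then score a test image by its distance to the nearest mean (function names are illustrative):

```python
import numpy as np

def fit_class_means(feats, labels):
    """Per-class mean of (pretrained, frozen) features.
    feats: (n, d) array; labels: (n,) integer class labels."""
    return np.stack([feats[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

def ncm_ood_score(class_means, f):
    """Nearest-Class-Mean OOD score: distance to the closest class mean.
    In-distribution Big Five images land near some mean; unknown species
    sit far from all of them, so a simple threshold flags them as OOD."""
    return np.linalg.norm(class_means - f, axis=1).min()
```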
Authors:Qianqian Zhao, Chunle Guo, Tianyi Zhang, Junpei Zhang, Peiyang Jia, Tan Su, Wenjie Jiang, Chongyi Li
Abstract:
Omnidirectional image and video super-resolution is a crucial research topic in low-level vision, playing an essential role in virtual reality and augmented reality applications. Its goal is to reconstruct high-resolution images or video frames from low-resolution inputs, thereby enhancing detail preservation and enabling more accurate scene analysis and interpretation. In recent years, numerous innovative and effective approaches have been proposed, predominantly based on deep learning techniques, involving diverse network architectures, loss functions, projection strategies, and training datasets. This paper presents a systematic review of recent progress in omnidirectional image and video super-resolution, focusing on deep learning-based methods. Given that existing datasets predominantly rely on synthetic degradation and fall short in capturing real-world distortions, we introduce a new dataset, 360Insta, that comprises authentically degraded omnidirectional images and videos collected under diverse conditions, including varying lighting, motion, and exposure settings. This dataset addresses a critical gap in current omnidirectional benchmarks and enables more robust evaluation of the generalization capabilities of omnidirectional super-resolution methods. We conduct comprehensive qualitative and quantitative evaluations of existing methods on both public datasets and our proposed dataset. Furthermore, we provide a systematic overview of the current status of research and discuss promising directions for future exploration. All datasets, methods, and evaluation metrics introduced in this work are publicly available and will be regularly updated. Project page: https://github.com/nqian1/Survey-on-ODISR-and-ODVSR.
中文: 本文系统综述了基于深度学习的全景图像与视频超分辨率方法,并引入包含真实场景退化的新数据集360Insta以弥补现有基准不足,同时提供了全面评估和未来研究方向。
English: This paper systematically reviews deep learning-based omnidirectional image and video super-resolution methods and introduces a new real-world dataset, 360Insta, to address limitations in existing benchmarks while providing comprehensive evaluations and future research directions.
Authors:Fudong Lin, Wanrou Du, Jinchan Liu, Tarikul Milon, Shelby Meche, Wu Xu, Xiaoqi Qin, Xu Yuan
Abstract:
Deep neural networks, particularly Transformers, have been widely adopted for predicting the functional properties of proteins. In this work, we focus on exploring whether Protein Transformers can capture biological intelligence among protein sequences. To achieve our goal, we first introduce a protein function dataset, namely Protein-FN, providing over 9000 protein samples with meaningful labels. Second, we devise a new Transformer architecture, namely Sequence Protein Transformers (SPT), for computationally efficient protein function predictions. Third, we develop a novel Explainable Artificial Intelligence (XAI) technique called Sequence Score, which can efficiently interpret the decision-making processes of protein models, thereby overcoming the difficulty of deciphering the biological intelligence hidden in Protein Transformers. Remarkably, even our smallest SPT-Tiny model, which contains only 5.4M parameters, demonstrates impressive predictive accuracy, achieving 94.3% on the Antibiotic Resistance (AR) dataset and 99.6% on the Protein-FN dataset, all accomplished by training from scratch. In addition, our Sequence Score technique helps reveal that our SPT models can discover several meaningful patterns underlying the sequence structures of protein data, with these patterns aligning closely with the domain knowledge in the biology community. We have officially released our Protein-FN dataset on Hugging Face Datasets https://huggingface.co/datasets/Protein-FN/Protein-FN. Our code is available at https://github.com/fudong03/BioIntelligence.
中文: 本研究提出了Protein-FN蛋白质功能数据集、新型序列蛋白质Transformer(SPT)预测模型,以及可解释AI技术Sequence Score,揭示了SPT模型如何捕捉蛋白质序列中的生物学模式,其最小模型仅用540万参数即实现了优异预测精度。
English: This study introduces a Protein-FN dataset, a novel Sequence Protein Transformer (SPT) model for efficient protein function prediction, and an explainable AI technique called Sequence Score that reveals how SPT models capture biologically meaningful patterns in protein sequences, achieving high accuracy with minimal parameters.
Authors:Yuan Yuan, Yukun Liu, Chonghua Han, Jie Feng, Yong Li
Abstract:
Foundation models have revolutionized fields such as natural language processing and computer vision by enabling general-purpose learning across diverse tasks and datasets. However, building analogous models for human mobility remains challenging due to the privacy-sensitive nature of mobility data and the resulting data silos across institutions. To bridge this gap, we propose MoveGCL, a scalable and privacy-preserving framework for training mobility foundation models via generative continual learning. Without sharing raw data, MoveGCL enables decentralized and progressive model evolution by replaying synthetic trajectories generated from a frozen teacher model, and reinforces knowledge retention through a tailored distillation strategy that mitigates catastrophic forgetting. To address the heterogeneity of mobility patterns, MoveGCL incorporates a Mixture-of-Experts Transformer with a mobility-aware expert routing mechanism, and employs a layer-wise progressive adaptation strategy to stabilize continual updates. Experiments on six real-world urban datasets demonstrate that MoveGCL achieves performance comparable to joint training and significantly outperforms federated learning baselines, while offering strong privacy protection. MoveGCL marks a crucial step toward unlocking foundation models for mobility, offering a practical blueprint for open, scalable, and privacy-preserving model development in the era of foundation models. To facilitate reproducibility and future research, we have released the code and models at https://github.com/tsinghua-fib-lab/MoveGCL.
中文:MoveGCL是一个通过生成式持续学习实现去中心化移动基础模型训练的隐私保护框架,在保护敏感数据的同时,在异构城市数据集上取得了与联合训练相当的性能表现。
English: MoveGCL is a privacy-preserving framework that enables decentralized training of mobility foundation models through generative continual learning, achieving performance comparable to joint training while protecting sensitive data across heterogeneous urban datasets.
Authors:Chunyuan Deng, Ruidi Chang, Hanjie Chen
Abstract:
Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform effectively in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-wise interventions provide a more comprehensive method for steering model behavior and enabling finer-grained control over language models. The code is at: https://github.com/chili-lab/D-Intervention.
中文: 本研究提出语言模型的分布级干预方法,通过扩展概念子空间的调控范围实现更精细的行为控制,在常识推理与算术推理任务中展现出比点态干预更强的可控性与鲁棒性。
English: This study introduces distribution-wise interventions for language models, which extend beyond pointwise control to learn transformations across broader concept subspaces, demonstrating superior controllability and robustness in commonsense and arithmetic reasoning benchmarks compared to traditional methods.
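A minimal sketch of the pointwise-versus-distributional contrast, assuming a Gaussian region with a reparameterized sample; the paper's exact parameterization of the concept subspace may differ:

```python
import torch
import torch.nn as nn

class DistributionIntervention(nn.Module):
    """Pointwise vs. distribution-wise steering: a pointwise intervention
    adds a fixed learned offset to the hidden state, while a
    distribution-wise one samples the offset from a learned Gaussian over
    the concept region (reparameterization keeps it trainable)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(hidden_dim))
        self.log_sigma = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, h, pointwise=False):
        if pointwise:
            return h + self.mu                  # fixed pointwise edit
        eps = torch.randn_like(h)               # sample the neighborhood
        return h + self.mu + self.log_sigma.exp() * eps
```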
Authors:Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, Jose M. Alvarez
Abstract:
End-to-end multi-modal planning is a promising paradigm in autonomous driving, enabling decision-making with diverse trajectory candidates. A key component is a robust trajectory scorer capable of selecting the optimal trajectory from these candidates. While recent trajectory scorers focus on scoring either large sets of static trajectories or small sets of dynamically generated ones, both approaches face significant limitations in generalization. Static vocabularies provide effective coarse discretization but struggle to make fine-grained adaptation, while dynamic proposals offer detailed precision but fail to capture broader trajectory distributions. To overcome these challenges, we propose GTRS (Generalized Trajectory Scoring), a unified framework for end-to-end multi-modal planning that combines coarse and fine-grained trajectory evaluation. GTRS consists of three complementary innovations: (1) a diffusion-based trajectory generator that produces diverse fine-grained proposals; (2) a vocabulary generalization technique that trains a scorer on super-dense trajectory sets with dropout regularization, enabling its robust inference on smaller subsets; and (3) a sensor augmentation strategy that enhances out-of-domain generalization while incorporating refinement training for critical trajectory discrimination. As the winning solution of the Navsim v2 Challenge, GTRS demonstrates superior performance even with sub-optimal sensor inputs, approaching privileged methods that rely on ground-truth perception. Code will be available at https://github.com/NVlabs/GTRS.
中文摘要:GTRS是一个用于自动驾驶的统一框架,通过基于扩散的轨迹生成、词汇泛化技术和传感器增强策略,结合粗粒度与细粒度轨迹评估,有效解决了泛化能力不足的问题。
English Summary: GTRS is a unified framework for autonomous driving that combines coarse and fine-grained trajectory evaluation through diffusion-based generation, vocabulary generalization, and sensor augmentation to overcome limitations in generalization.
Authors:Nikhita Vedula, Dushyanta Dhyani, Laleh Jalali, Boris Oreshkin, Mohsen Bayati, Shervin Malmasi
Abstract:
Large Language Models (LLMs) have shown promise in structured prediction tasks, including regression, but existing approaches primarily focus on point estimates and lack systematic comparison across different methods. We investigate probabilistic regression using LLMs for unstructured inputs, addressing challenging text-to-distribution prediction tasks such as price estimation where both nuanced text understanding and uncertainty quantification are critical. We propose a novel quantile regression approach that enables LLMs to produce full predictive distributions, improving upon traditional point estimates. Through extensive experiments across three diverse price prediction datasets, we demonstrate that a Mistral-7B model fine-tuned with quantile heads significantly outperforms traditional approaches for both point and distributional estimations, as measured by three established metrics each for prediction accuracy and distributional calibration. Our systematic comparison of LLM approaches, model architectures, training approaches, and data scaling reveals that Mistral-7B consistently outperforms encoder architectures, embedding-based methods, and few-shot learning methods. Our experiments also reveal the effectiveness of LLM-assisted label correction in achieving human-level accuracy without systematic bias. Our curated datasets are made available at https://github.com/vnik18/llm-price-quantile-reg/ to support future research.
中文摘要:本研究提出了一种利用大语言模型的新型分位数回归方法,用于文本到分布预测任务(如价格估计),通过在多数据集和指标上的系统比较,证明了该方法在预测准确性和分布校准方面均优于传统方法。
English Summary: This study introduces a novel quantile regression method using Large Language Models (LLMs) to generate predictive distributions for text-to-distribution tasks like price estimation, demonstrating superior performance over traditional approaches through systematic comparisons across multiple datasets and metrics.
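The quantile heads are trained with the standard pinball loss, which a short sketch makes concrete (shapes and head layout are assumptions; the paper's architecture details may differ):

```python
import torch

def pinball_loss(preds, target, quantiles):
    """Quantile (pinball) loss for training multiple quantile heads at once.
    preds: (batch, num_quantiles) predicted prices; target: (batch,);
    quantiles: e.g. torch.tensor([0.1, 0.5, 0.9]). Minimizing this makes
    head q estimate the q-th conditional quantile, so the heads together
    trace out a full predictive distribution instead of a point estimate."""
    err = target.unsqueeze(1) - preds          # (batch, num_quantiles)
    return torch.maximum(quantiles * err, (quantiles - 1) * err).mean()
```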
Authors:Minghao Zou, Qingtian Zeng, Yongping Miao, Shangkun Liu, Zilong Wang, Hantao Liu, Wei Zhou
Abstract:
Visual parsing of images and videos is critical for a wide range of real-world applications. However, progress in this field is constrained by limitations of existing datasets: (1) insufficient annotation granularity, which impedes fine-grained scene understanding and high-level reasoning; (2) limited coverage of domains, particularly a lack of datasets tailored for educational scenarios; and (3) lack of explicit procedural guidance, with minimal logical rules and insufficient representation of structured task processes. To address these gaps, we introduce PhysLab, the first video dataset that captures students conducting complex physics experiments. The dataset includes four representative experiments that feature diverse scientific instruments and rich human-object interaction (HOI) patterns. PhysLab comprises 620 long-form videos and provides multilevel annotations that support a variety of vision tasks, including action recognition, object detection, HOI analysis, etc. We establish strong baselines and perform extensive evaluations to highlight key challenges in the parsing of procedural educational videos. We expect PhysLab to serve as a valuable resource for advancing fine-grained visual parsing, facilitating intelligent classroom systems, and fostering closer integration between computer vision and educational technologies. The dataset and the evaluation toolkit are publicly available at https://github.com/ZMH-SDUST/PhysLab.
中文摘要:PhysLab数据集通过提供620个包含多层次标注的物理实验长视频,解决了现有数据集在标注粒度、领域覆盖和过程指导方面的不足,推动了精细视觉解析与教育技术的融合发展。
English Summary: The PhysLab dataset addresses limitations in visual parsing by providing 620 long-form videos of physics experiments with multilevel annotations to advance fine-grained analysis and educational technology integration.
Authors:Xinyu Luo, Cedar Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, Brian Bullins
Abstract:
While popular optimization methods such as SGD, AdamW, and Lion depend on steepest descent updates in either $\ell_2$ or $\ell_\infty$ norms, there remains a critical gap in handling the non-Euclidean structure observed in modern deep networks training. In this work, we address this need by introducing a new accelerated $\ell_p$ steepest descent algorithm, called Stacey, which uses interpolated primal-dual iterate sequences to effectively navigate non-Euclidean smooth optimization tasks. In addition to providing novel theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against these popular methods on tasks including image classification and language model (LLM) pretraining, demonstrating both faster convergence and higher final accuracy. We further evaluate different values of $p$ across various models and datasets, underscoring the importance and efficiency of non-Euclidean approaches over standard Euclidean methods. Code can be found at https://github.com/xinyuluo8561/Stacey.
中文: 本文提出了一种名为Stacey的加速$\ell_p$最速下降算法,通过插值原始-对偶迭代序列有效处理深度学习中的非欧几里得优化问题,在图像分类和语言模型预训练任务中展现出比现有方法更快的收敛速度和更高的最终精度。
English: This paper introduces Stacey, an accelerated $\ell_p$ steepest descent algorithm that effectively addresses non-Euclidean optimization in deep learning, demonstrating superior convergence and accuracy over existing methods in both theoretical and empirical evaluations.
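For intuition, the unaccelerated $\ell_p$ steepest-descent step can be derived from Holder duality; the sketch below shows that base update only (Stacey's interpolated primal-dual acceleration is omitted):

```python
import numpy as np

def lp_steepest_step(x, grad, lr=0.1, p=3.0):
    """One unaccelerated l_p steepest-descent step on a flattened parameter
    vector. By Holder duality, the direction maximizing <grad, d> over
    ||d||_p = 1 has coordinates sign(g_i) * |g_i|^(q-1) / ||g||_q^(q-1),
    with 1/p + 1/q = 1. p=2 recovers normalized gradient descent; p -> inf
    approaches sign descent (Lion-like)."""
    q = p / (p - 1.0)
    gq = np.abs(grad) ** (q - 1.0)
    norm = np.linalg.norm(grad, ord=q) ** (q - 1.0) + 1e-12
    d = np.sign(grad) * gq / norm       # unit-l_p-norm descent direction
    return x - lr * d
```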
Authors:Joseph T Colonel, Carolyn Hagler, Guiselle Wismer, Laura Curtis, Jacqueline Becker, Juan Wisnivesky, Alex Federman, Gaurav Pandey
Abstract:
Several machine learning algorithms have been developed for the prediction of Alzheimer's disease and related dementia (ADRD) from spontaneous speech. However, none of these algorithms have been translated for the prediction of broader cognitive impairment (CI), which in some cases is a precursor and risk factor of ADRD. In this paper, we evaluated several speech-based open-source methods originally proposed for the prediction of ADRD, as well as methods from multimodal sentiment analysis for the task of predicting CI from patient audio recordings. Results demonstrated that multimodal methods outperformed unimodal ones for CI prediction, and that acoustics-based approaches performed better than linguistics-based ones. Specifically, interpretable acoustic features relating to affect and prosody were found to significantly outperform BERT-based linguistic features and interpretable linguistic features, respectively. All the code developed for this study is available at https://github.com/JTColonel/catch.
中文: 本研究将原本用于预测阿尔茨海默病的语音机器学习方法应用于更广泛的认知障碍检测,发现多模态和声学方法,特别是捕捉情感和韵律特征的方法,优于单模态和语言学方法。
English: This study adapts existing speech-based machine learning methods, originally designed for Alzheimer's disease prediction, to detect broader cognitive impairment, finding that multimodal and acoustic approaches, particularly those capturing affect and prosody, outperform unimodal and linguistic methods.
Authors:Jacqueline He, Howard Yen, Margaret Li, Shuyue Stella Li, Zhiyuan Zeng, Weijia Shi, Yulia Tsvetkov, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer
Abstract:
A central challenge in modern language models (LMs) is intrinsic hallucination: the generation of information that is plausible but unsubstantiated relative to input context. To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, known as verifiable claims, without adding any unsupported ones. For comprehensiveness, PIC includes a full setting that tests a model's ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still intrinsically hallucinate in over 70% of outputs. To alleviate this lack of faithfulness, we introduce a post-training framework, using a weakly supervised preference data construction method, to train an 8B PIC-LM with stronger PIC ability--improving from 69.1% to 91.0% F1 in the full PIC setting. When integrated into end-to-end factual generation pipelines, PIC-LM improves exact match recall by 17.1% on ambiguous QA with retrieval, and factual precision by 30.5% on a birthplace verification task, underscoring the potential of precisely grounded generation.
中文: 该研究提出精确信息控制(PIC)任务以解决语言模型的幻觉问题,通过要求模型仅基于给定声明生成内容,发现即使先进模型仍有超过70%的幻觉率,并开发了后训练框架将模型性能提升至91% F1分数,显著增强了事实生成的可信度。
English: The study introduces Precise Information Control (PIC) to address language model hallucinations by generating outputs strictly based on provided claims, revealing that even advanced models hallucinate over 70% of the time, and proposes a post-training framework that significantly improves faithfulness in factual generation tasks.
Authors:Ho Yin 'Sam' Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka, Franck Dernoncourt, Dongwon Lee, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ting-Hao 'Kenneth' Huang
Abstract:
Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better-quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite advances in language model personalization (LaMP), these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document--each with its image, caption, and figure-mentioning paragraphs--as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.
Chinese: 本文介绍了用于个性化图表标题生成的多模态数据集LaMP-Cap,实验表明,结合多模态背景信息(特别是图像)能显著提升AI生成标题与作者原创标题的契合度,优于纯文本方法。
English: The paper introduces LaMP-Cap, a multimodal dataset for personalized figure caption generation, demonstrating through experiments that incorporating multimodal profile information, especially images, significantly improves the alignment of AI-generated captions with author-written ones compared to text-only approaches.
Authors:Mihir Dharmadhikari, Kostas Alexis
Abstract:
This paper presents a novel semantics-aware inspection path planning paradigm called "Semantics-aware Predictive Planning" (SPP). Industrial environments that require the inspection of specific objects or structures (called "semantics"), such as ballast water tanks inside ships, often present structured and repetitive spatial arrangements of the semantics of interest. Motivated by this, we first contribute an algorithm that identifies spatially repeating patterns of semantics, exact or inexact, in a semantic scene graph representation and makes predictions about the evolution of the graph in the unseen parts of the environment using these patterns. Furthermore, we propose two inspection path planning strategies, tailored to ballast water tank inspection, that exploit these predictions. To assess the performance of the novel predictive planning paradigm, both simulation and experimental evaluations are performed. First, we conduct a simulation study comparing the method against relevant state-of-the-art techniques and further present tests showing its ability to handle imperfect patterns. Second, we deploy our method onboard a collision-tolerant aerial robot operating inside the ballast tanks of two real ships. The results, both in simulation and field experiments, demonstrate significant improvement over the state-of-the-art in terms of inspection time while maintaining equal or better semantic surface coverage. A set of videos describing the different parts of the method and the field deployments is available at https://tinyurl.com/spp-videos. The code for this work is made available at https://github.com/ntnu-arl/predictive_planning_ros.
中文摘要:本文提出了一种新颖的语义感知预测规划方法,通过识别重复语义模式来预测未探测区域,在模拟和真实船舶压载舱实验中显著提升了检测效率。
English Summary: This paper introduces Semantics-aware Predictive Planning (SPP), a novel inspection path planning method that identifies repeating semantic patterns to predict unseen environments and demonstrates significant efficiency improvements in both simulations and real-world ship tank inspections.
Authors:Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Sivaprasad Sudhir, Om Chabra, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska
Abstract:
Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical expertise, and even project-specific insights. AI systems have shown remarkable reasoning, coding, and understanding capabilities. However, it remains unclear to what extent these capabilities translate into successful design and execution of such complex pipelines. We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines spanning 1700 data files from 24 data sources in 6 different domains. We show that these pipelines test the end-to-end capabilities of AI systems on data processing, requiring data discovery, wrangling and cleaning, efficient processing, statistical reasoning, and orchestrating data processing steps given a high-level task. Our evaluation tests 5 general models and 3 code generation models using our reference framework, DS-GURU, which instructs the AI model to decompose a question into a sequence of subtasks, reason through each step, and synthesize Python code that implements the proposed design. Our results on KRAMABENCH show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, when extensive data processing and domain knowledge are required to construct real-world data science pipelines, existing out-of-the-box models fall short. Progress on KramaBench represents crucial steps towards developing autonomous data science agents for real-world applications. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench.
中文: KRAMABENCH通过104个真实数据科学流程构建的基准测试表明,尽管AI模型能胜任明确规范的编程任务,但在需要领域知识和复杂数据处理的实际场景中仍存在不足。
English: KRAMABENCH introduces a benchmark of 104 real-world data science pipelines to evaluate AI systems' end-to-end capabilities, revealing that while models excel at well-defined coding tasks, they struggle with complex data processing requiring domain knowledge.
Authors:Zhiyuan Zhao, Juntong Ni, Shangqing Xu, Haoxin Liu, Wei Jin, B. Aditya Prakash
Abstract:
Time-series forecasting is an essential task with wide real-world applications across domains. While recent advances in deep learning have enabled time-series forecasting models with accurate predictions, there remains considerable debate over which architectures and design components, such as series decomposition or normalization, are most effective under varying conditions. Existing benchmarks primarily evaluate models at a high level, offering limited insight into why certain designs work better. To bridge this gap, we propose TimeRecipe, a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components across a diverse range of datasets, forecasting horizons, and task settings. Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods and uncover meaningful intuitions linking specific design choices to forecasting scenarios. Furthermore, we release a practical toolkit within TimeRecipe that recommends suitable model architectures based on these empirical insights. The benchmark is available at: https://github.com/AdityaLab/TimeRecipe.
Chinese: TimeRecipe框架通过大量实验在模块级别系统评估时间序列预测方法,揭示了超越现有最优方法的设计选择,并提供了基于实证的架构推荐工具包。
English: The TimeRecipe framework systematically evaluates time-series forecasting models at the module level through extensive experiments, revealing optimal design choices that outperform state-of-the-art methods and providing a toolkit for architecture recommendations.
Authors:Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
Abstract:
Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration-efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key-value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron, and our project homepage is at https://q-rz.github.io/p/saffron.
中文: 现有大语言模型安全方法易受越狱攻击,因此本研究提出SAFFRON创新推理扩展框架,通过多叉奖励模型显著提升安全防御能力并降低计算开销。
English: Current safety methods for large language models are vulnerable to jailbreak attacks, so this research introduces SAFFRON, a novel inference scaling approach using a multifurcation reward model to enhance safety while reducing computational costs.
Authors:Luis Pinto
Abstract:
Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we challenge this convention by conducting a comprehensive layer-wise analysis of five diverse molecular encoders across 22 ADMET property prediction tasks. Our results demonstrate that embeddings from intermediate layers consistently outperform final-layer representations. Specifically, using fixed embeddings from the optimal intermediate layers improved downstream performance by an average of 5.4%, reaching gains up to 28.6%. Furthermore, finetuning up to these intermediate layers yielded even greater average improvements of 8.5%, with performance increases as high as 40.8%, achieving new state-of-the-art results on several benchmarks. Additionally, a strong positive correlation between fixed embedding performance and finetuning outcomes supports an efficient evaluate-then-finetune approach, enabling identification of optimal layers with reduced computational cost. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code is made publicly available at https://github.com/luispintoc/Unlocking-Chemical-Insights.
中文: 本研究表明,在ADMET性质预测中,分子编码器的中间层嵌入始终优于最终层表示,固定嵌入可将性能提升高达28.6%,微调更可实现40.8%的增益,同时强相关性为高效选择最优层提供了依据。
English: This study reveals that intermediate-layer embeddings from molecular encoders consistently outperform final-layer representations in ADMET property prediction, with fixed embeddings improving performance by up to 28.6% and fine-tuning achieving gains of up to 40.8%, while demonstrating a strong correlation that enables efficient layer selection.
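The evaluate-then-finetune recipe reduces to a cheap layer sweep with a linear probe; here is a sketch with scikit-learn (the probe choice and cross-validation setup are illustrative, not the paper's exact protocol):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def best_probe_layer(layer_embeddings, y):
    """Step one of evaluate-then-finetune: fit a cheap ridge probe on the
    frozen embeddings from every encoder layer and keep the layer whose
    cross-validated score is best. The reported correlation between fixed
    embedding performance and finetuning outcomes makes this cheap ranking
    a good proxy for which layer to fine-tune up to.
    layer_embeddings: list of (n_molecules, dim) arrays, one per layer."""
    scores = [cross_val_score(Ridge(alpha=1.0), E, y, cv=5).mean()
              for E in layer_embeddings]
    best = int(np.argmax(scores))
    return best, scores
```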
Authors:Dor Tsur, Carol Xuan Long, Claudio Mayrink Verdun, Hsiang Hsu, Chen-Fu Chen, Haim Permuter, Sajani Vithana, Flavio P. Calmon
Abstract:
Large language model (LLM) watermarks enable authentication of text provenance, curb misuse of machine-generated text, and promote trust in AI systems. Current watermarks operate by changing the next-token predictions output by an LLM. The updated (i.e., watermarked) predictions depend on random side information produced, for example, by hashing previously generated tokens. LLM watermarking is particularly challenging in low-entropy generation tasks, such as coding, where next-token predictions are near-deterministic. In this paper, we propose an optimization framework for watermark design. Our goal is to understand how to most effectively use random side information in order to maximize the likelihood of watermark detection and minimize the distortion of generated text. Our analysis informs the design of two new watermarks: HeavyWater and SimplexWater. Both watermarks are tunable, gracefully trading off between detection accuracy and text distortion. They can also be applied to any LLM and are agnostic to side information generation. We examine the performance of HeavyWater and SimplexWater through several benchmarks, demonstrating that they can achieve high watermark detection accuracy with minimal compromise of text generation quality, particularly in the low-entropy regime. Our theoretical analysis also reveals surprising new connections between LLM watermarking and coding theory. The code implementation can be found at https://github.com/DorTsur/HeavyWater_SimplexWater
中文: 本文提出了一种可优化的水印设计框架,开发了HeavyWater和SimplexWater两种水印技术,能在保持文本质量的同时实现高检测精度,适用于各类语言模型和辅助信息生成方式。
English: This paper introduces an optimization framework for designing tunable LLM watermarks, HeavyWater and SimplexWater, which effectively balance detection accuracy and text quality while being applicable to any language model and side information generation method.
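For illustration, a generic skeleton showing where hash-derived side information enters next-token sampling. The green-list bias below is a simple stand-in: HeavyWater and SimplexWater instead use an optimized coupling between the side information and the token distribution, which this sketch does not reproduce.

import hashlib
import numpy as np

def side_info(prev_tokens, vocab_size, key=0):
    # Pseudo-random side information from a hash of recent context tokens.
    h = hashlib.sha256(f"{key}:{prev_tokens[-4:]}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(h[:8], "little"))
    return rng.random(vocab_size) < 0.5          # a "green" half of the vocabulary

def watermarked_sample(logits, prev_tokens, delta=2.0):
    green = side_info(prev_tokens, logits.shape[0])
    z = logits - logits.max() + delta * green    # bias green tokens, keep softmax stable
    p = np.exp(z)
    p /= p.sum()
    return int(np.random.default_rng().choice(len(p), p=p))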
Authors:Ali Murad, Bo Hui, Wei-Shinn Ku
Abstract:
Federated Learning (FL) is a distributed framework for collaborative model training over large-scale distributed data, enabling higher performance while maintaining client data privacy. However, the nature of model aggregation at the centralized server can result in a performance drop in the presence of non-IID data across different clients. We remark that training a client locally on more data than necessary does not benefit the overall performance of all clients. In this paper, we devise a novel framework that leverages a Deep Reinforcement Learning (DRL) agent to select an optimized amount of data necessary to train a client model without oversharing information with the server. Starting without awareness of the client's performance, the DRL agent utilizes the change in training loss as a reward signal and learns to optimize the amount of training data necessary for improving the client's performance. Specifically, after each aggregation round, the DRL algorithm considers the local performance as the current state and outputs the optimized weights for each class, in the training data, to be used during the next round of local training. In doing so, the agent learns a policy that creates an optimized partition of the local training dataset during the FL rounds. After FL, the client utilizes the entire local training dataset to further enhance its performance on its own data distribution, mitigating the non-IID effects of aggregation. Through extensive experiments, we demonstrate that training FL clients through our algorithm results in superior performance on multiple benchmark datasets and FL frameworks. Our code is available at https://github.com/amuraddd/optimized_client_training.git.
中文: 联邦学习在非独立同分布数据下性能会下降,本文提出一种深度强化学习框架,通过优化本地训练数据量来提升客户端性能,同时保护数据隐私。
English: Federated Learning can suffer from performance degradation with non-IID data, but this paper introduces a Deep Reinforcement Learning framework that optimizes local data usage during training to enhance client performance while maintaining privacy.
Authors:Jiazheng Kang, Mingming Ji, Zhe Zhao, Ting Bai
Abstract:
Large Language Models (LLMs) face a crucial challenge from fixed context windows and inadequate memory management, leading to a severe shortage of long-term memory capabilities and limited personalization in the interactive experience with AI agents. To overcome this challenge, we innovatively propose a Memory Operating System, i.e., MemoryOS, to achieve comprehensive and efficient memory management for AI agents. Inspired by the memory management principles in operating systems, MemoryOS designs a hierarchical storage architecture and consists of four key modules: Memory Storage, Updating, Retrieval, and Generation. Specifically, the architecture comprises three levels of storage units: short-term memory, mid-term memory, and long-term personal memory. Key operations within MemoryOS include dynamic updates between storage units: short-term to mid-term updates follow a dialogue-chain-based FIFO principle, while mid-term to long-term updates use a segmented page organization strategy. Our pioneering MemoryOS enables hierarchical memory integration and dynamic updating. Extensive experiments on the LoCoMo benchmark show an average improvement of 49.11% on F1 and 46.18% on BLEU-1 over the baselines on GPT-4o-mini, showing contextual coherence and personalized memory retention in long conversations. The implementation code is open-sourced at https://github.com/BAI-LAB/MemoryOS.
Chinese: 为解决大语言模型中固定上下文窗口和内存管理不足的问题,我们提出MemoryOS这一分层内存管理系统,通过设计短期、中期和长期记忆存储单元及动态更新机制,显著提升了AI代理的长期记忆能力和个性化交互体验。
English: To address the limitations of fixed context windows and inadequate memory management in LLMs, we propose MemoryOS, a hierarchical memory management system that enhances long-term memory and personalization for AI agents, achieving significant performance improvements in contextual coherence and memory retention.
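A toy sketch of the short-to-mid-term promotion described above, assuming a dialogue-chain FIFO rule; retrieval, the segmented page organization, and long-term personal memory are all omitted.

from collections import deque

class TinyMemoryOS:
    def __init__(self, short_capacity=8):
        self.short_term = deque()
        self.short_capacity = short_capacity
        self.mid_term = []                 # stand-in for segmented page storage
    def add_turn(self, turn):
        if len(self.short_term) == self.short_capacity:
            # FIFO: the oldest dialogue-chain entry is promoted, not dropped.
            self.mid_term.append(self.short_term.popleft())
        self.short_term.append(turn)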
Authors:Masoud Rahimi, Reza Karbasi, Abdol-Hossein Vahabie
Abstract:
We introduce an open-source Python framework for generating synthetic ECG image datasets to advance critical deep learning-based tasks in ECG analysis, including ECG digitization, lead region and lead name detection, and pixel-level waveform segmentation. Using the PTB-XL signal dataset, our proposed framework produces four open-access datasets: (1) ECG images in various lead configurations paired with time-series signals for ECG digitization, (2) ECG images annotated with YOLO-format bounding boxes for detection of lead region and lead name, (3)-(4) cropped single-lead images with segmentation masks compatible with U-Net-based models in normal and overlapping versions. In the overlapping case, waveforms from neighboring leads are superimposed onto the target lead image, while the segmentation masks remain clean. The open-source Python framework and datasets are publicly available at https://github.com/rezakarbasi/ecg-image-and-signal-dataset and https://doi.org/10.5281/zenodo.15484519, respectively.
中文: 该开源Python框架基于PTB-XL信号生成合成心电图图像数据集,支持心电图数字化、导联区域识别和波形分割等深度学习任务,四类开放数据集可通过GitHub和Zenodo公开获取。
English: This open-source Python framework generates synthetic ECG image datasets from PTB-XL signals to support deep learning tasks like ECG digitization, lead detection, and waveform segmentation, with four publicly available datasets accessible via GitHub and Zenodo.
Authors:Xiaoyu Sun, Yang Yang, Xunde Dong
Abstract:
In the field of automatic Electrocardiogram (ECG) diagnosis, due to the relatively limited amount of labeled data, how to build a robust ECG pretrained model based on unlabeled data is a key area of focus for researchers. Recent advancements in contrastive learning-based ECG pretrained models highlight the potential of exploiting the additional patient-level self-supervisory signals inherent in ECG. These approaches are referred to as patient contrastive learning. The rationale is that multiple physical recordings from the same patient may share commonalities, termed patient consistency, so redefining positive and negative pairs in contrastive learning as intra-patient and inter-patient samples provides more shared context to learn an effective representation. However, these methods still fail to efficiently exploit patient consistency due to the insufficient number of intra- and inter-patient samples within a batch. Hence, we propose a contrastive learning-based ECG pretrained model enhanced by the Patient Memory Queue (PMQ), which incorporates a large patient memory queue to mitigate the model degeneration that can arise from insufficient intra- and inter-patient samples. To further enhance the performance of the pretrained model, we introduce two extra data augmentation methods to provide more perspectives of positive and negative pairs for pretraining. Extensive experiments were conducted on three public datasets with three different data ratios. The experimental results show that the comprehensive performance of our method outperforms previous contrastive learning methods and exhibits greater robustness in scenarios with limited labeled data. The code is available at https://github.com/3hiuwoo/PMQ.
中文: 本研究提出了一种基于患者记忆队列增强的心电图对比学习预训练模型,通过解决批次内患者样本不足的问题提升了模型鲁棒性,在标注数据有限的情况下显著优于现有方法。
English: This study introduces a Patient Memory Queue-enhanced contrastive learning model for ECG diagnosis, which improves robustness by addressing insufficient intra-inter patient samples and outperforms previous methods in limited labeled data scenarios.
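An illustrative patient-keyed memory queue for contrastive pretraining, under the assumption that queued embeddings carry patient IDs so intra- and inter-patient pairs can be drawn far beyond the current batch; the queue size, temperature, and exact loss form are placeholders.

import torch

class PatientQueue:
    def __init__(self, dim, size=4096):
        self.feats = torch.zeros(size, dim)
        self.pids = torch.full((size,), -1, dtype=torch.long)  # -1 marks empty slots
        self.ptr, self.size = 0, size

    @torch.no_grad()
    def enqueue(self, feats, pids):
        idx = torch.arange(self.ptr, self.ptr + feats.shape[0]) % self.size
        self.feats[idx], self.pids[idx] = feats, pids
        self.ptr = (self.ptr + feats.shape[0]) % self.size

    def contrastive_loss(self, q, q_pids, temp=0.1):
        sim = q @ self.feats.T / temp                            # (batch, size)
        valid = (self.pids >= 0).unsqueeze(0)
        pos = (q_pids.unsqueeze(1) == self.pids.unsqueeze(0)) & valid
        log_prob = sim - torch.logsumexp(sim.masked_fill(~valid, -1e9), 1, keepdim=True)
        return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()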
Authors:Junyi Liu, Stanley Kok
Abstract:
Agencies such as Standard & Poor's and Moody's provide bank credit ratings that influence economic stability and decision-making by stakeholders. Accurate and timely predictions support informed decision-making, regulatory actions, and investor protection. However, a complete interbank connection graph is often unavailable due to privacy concerns, complicating the direct application of Graph Neural Networks (GNNs) for rating prediction. Our research utilizes persistent homology to construct a network that captures relationships among banks and combines this with a traditional lending network to create a heterogeneous network that integrates information from both sources, leading to improved predictions. Experiments on a global, real-world dataset validate the effectiveness of our proposed model, HTGNN. This research has implications for investors and regulatory bodies in enhancing proactive risk mitigation and the implementation of effective market interventions. The code can be found at https://github.com/Liu-Jun-Yi/HTGNN.
Chinese: 本研究通过持久同调构建银行关系网络,并将其与借贷网络结合成异质结构,提升了银行信用评级预测的准确性,经真实数据验证有助于投资者和监管机构进行风险管理。
English: This study enhances bank credit rating predictions by using persistent homology to build a bank relationship network and combining it with a lending network into a heterogeneous structure, which is validated on real-world data to aid investors and regulators in risk management.
Authors:Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan, Salman Khan
Abstract:
Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land cover. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models are publicly available at: https://github.com/mbzuai-oryx/TerraFM.
中文摘要:TerraFM是一种自监督基础模型,通过整合全球哨兵卫星影像和创新的训练策略,在地球观测任务中实现了卓越性能。
English Summary: TerraFM is a self-supervised foundation model that integrates global Sentinel-1 and Sentinel-2 imagery with innovative training strategies to achieve superior performance in Earth observation tasks.
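A schematic of the modality-specific patch embeddings and cross-attention fusion named in the abstract; the channel counts, patch size, and single direction of attention are illustrative assumptions rather than TerraFM's actual configuration.

import torch
import torch.nn as nn

class RadarOpticalFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.s1_embed = nn.Conv2d(2, dim, kernel_size=16, stride=16)   # Sentinel-1: VV, VH
        self.s2_embed = nn.Conv2d(12, dim, kernel_size=16, stride=16)  # Sentinel-2 bands
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, s1, s2):
        t1 = self.s1_embed(s1).flatten(2).transpose(1, 2)  # (B, N, dim) radar tokens
        t2 = self.s2_embed(s2).flatten(2).transpose(1, 2)  # (B, N, dim) optical tokens
        fused, _ = self.cross_attn(query=t2, key=t1, value=t1)  # optical attends to radar
        return fused + t2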
Authors:Yuping He, Yifei Huang, Guo Chen, Lidong Lu, Baoqi Pei, Jilan Xu, Tong Lu, Yoichi Sato
Abstract:
Perceiving the world from both egocentric (first-person) and exocentric (third-person) perspectives is fundamental to human cognition, enabling rich and complementary understanding of dynamic environments. In recent years, allowing the machines to leverage the synergistic potential of these dual perspectives has emerged as a compelling research direction in video understanding. In this survey, we provide a comprehensive review of video understanding from both exocentric and egocentric viewpoints. We begin by highlighting the practical applications of integrating egocentric and exocentric techniques, envisioning their potential collaboration across domains. We then identify key research tasks to realize these applications. Next, we systematically organize and review recent advancements into three main research directions: (1) leveraging egocentric data to enhance exocentric understanding, (2) utilizing exocentric data to improve egocentric analysis, and (3) joint learning frameworks that unify both perspectives. For each direction, we analyze a diverse set of tasks and relevant works. Additionally, we discuss benchmark datasets that support research in both perspectives, evaluating their scope, diversity, and applicability. Finally, we discuss limitations in current works and propose promising future research directions. By synthesizing insights from both perspectives, our goal is to inspire advancements in video understanding and artificial intelligence, bringing machines closer to perceiving the world in a human-like manner. A GitHub repo of related works can be found at https://github.com/ayiyayi/Awesome-Egocentric-and-Exocentric-Vision.
中文摘要:本综述全面探讨了自我中心与异我中心视角在视频理解中的协同应用,系统梳理了双向增强方法与联合学习框架的研究进展,旨在推动人工智能实现更接近人类的感知能力。
English Summary: This survey comprehensively reviews video understanding by integrating egocentric and exocentric perspectives, exploring their synergistic applications, current advancements, and future directions to enhance AI's human-like perception.
Authors:Jingyu Xiao, Ming Wang, Man Ho Lam, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development has become predominant in modern front-end programming, current benchmarks fail to incorporate mainstream development frameworks. (2) Existing evaluations focus solely on the UI code generation task, whereas practical UI development involves several iterations, including refining edits and repairing issues. (3) Current benchmarks employ unidimensional evaluation, lacking investigation into influencing factors like task difficulty, input context variations, and in-depth code-level analysis. To bridge these gaps, we introduce DesignBench, a multi-framework, multi-task evaluation benchmark for assessing MLLMs' capabilities in automated front-end engineering. DesignBench encompasses three widely-used UI frameworks (React, Vue, and Angular) alongside vanilla HTML/CSS, and evaluates on three essential front-end tasks (generation, edit, and repair) in real-world development workflows. DesignBench contains 900 webpage samples spanning over 11 topics, 9 edit types, and 6 issue categories, enabling detailed analysis of MLLM performance across multiple dimensions. Our systematic evaluation reveals critical insights into MLLMs' framework-specific limitations, task-related bottlenecks, and performance variations under different conditions, providing guidance for future research in automated front-end development. Our code and data are available at https://github.com/WebPAI/DesignBench.
中文:DesignBench作为一个综合性基准,通过整合多种UI框架和任务解决了当前多模态大语言模型评估的局限性,能够对实际前端开发工作流中的性能表现进行多维度分析。
English: DesignBench is a comprehensive benchmark addressing limitations in current multimodal large language model evaluations by incorporating multiple UI frameworks and tasks, enabling detailed analysis of performance across real-world front-end development workflows.
Authors:Akram Zaytar, Caleb Robinson, Girmaw Abebe Tadesse, Tammy Glazer, Gilles Hacheme, Anthony Ortiz, Rahul M Dodhia, Juan M Lavista Ferres
Abstract:
Training deep learning models on petabyte-scale Earth observation (EO) data requires separating compute resources from data storage. However, standard PyTorch data loaders cannot keep modern GPUs utilized when streaming GeoTIFF files directly from cloud storage. In this work, we benchmark GeoTIFF loading throughput from both cloud object storage and local SSD, systematically testing different loader configurations and data parameters. We focus on tile-aligned reads and worker thread pools, using Bayesian optimization to find optimal settings for each storage type. Our optimized configurations increase remote data loading throughput by 20x and local throughput by 4x compared to default settings. On three public EO benchmarks, models trained with optimized remote loading achieve the same accuracy as local training within identical time budgets. We improve validation IoU by 6-15% and maintain 85-95% GPU utilization versus 0-30% with standard configurations. Code is publicly available at https://github.com/microsoft/pytorch-cloud-geotiff-optimization
中文摘要:优化的GeoTIFF加载配置大幅提升了地球观测深度学习模型的数据吞吐量和GPU利用率,在保持与本地训练同等精度的同时显著提高了效率。
English Summary: Optimized GeoTIFF loading configurations significantly enhance data throughput and GPU utilization for Earth observation deep learning models, achieving comparable accuracy to local training while improving efficiency.
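A hedged sketch of the two knobs benchmarked above, tile-aligned windowed reads and a worker thread pool, using rasterio; the URL, tile size, and worker count are placeholders to be tuned per storage backend rather than the paper's optimized settings.

from concurrent.futures import ThreadPoolExecutor
import rasterio
from rasterio.windows import Window

URL = "https://example.com/scene.tif"  # placeholder cloud-hosted GeoTIFF

def read_tile(window):
    # One dataset handle per task: sharing a handle across threads is unsafe.
    with rasterio.open(URL) as src:
        return src.read(1, window=window)

def load_tiles(tile=512, workers=16):
    with rasterio.open(URL) as src:
        windows = [Window(c, r, min(tile, src.width - c), min(tile, src.height - r))
                   for r in range(0, src.height, tile)
                   for c in range(0, src.width, tile)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_tile, windows))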
Authors:Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin, David Schinagl, Samuel Schulter, Horst Possegger
Abstract:
We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applying the framework to the NuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advances that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.
中文: STSBench提出了一种基于场景的框架来评估自动驾驶中的视觉语言模型,通过NuScenes数据集创建了STSnu基准测试,用于检验模型在多视角和多帧下的时空推理能力,揭示了现有模型在理解复杂交通动态方面存在严重不足。
English: STSBench introduces a scenario-based framework for evaluating vision-language models in autonomous driving, creating the STSnu benchmark from the NuScenes dataset to test spatio-temporal reasoning across multiple views and frames, revealing critical gaps in current models' understanding of traffic dynamics.
Authors:Hengzhi Li, Brendon Jiang, Alexander Naehu, Regan Song, Justin Zhang, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang
Abstract:
Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.
中文: PuzzleWorld是一个包含667个解谜式问题的新基准,用于评估开放式多模态推理,当前最优模型仅解决14%的谜题,并暴露出短视推理和缺乏草图能力等局限。
English: PuzzleWorld is a new benchmark of 667 puzzlehunt problems that tests open-ended multimodal reasoning, where current top models struggle with only 14% solved and reveal limitations like myopic reasoning and lack of sketching skills.
Authors:Dimitrios Proios, Alban Bornet, Anthony Yazdani, Jose F Rodrigues, Douglas Teodoro
Abstract:
Patient stratification, the identification of clinically meaningful subgroups, is essential for advancing personalized medicine through improved diagnostics and treatment strategies. Electronic health records (EHRs), particularly those from intensive care units (ICUs), contain rich temporal clinical data that can be leveraged for this purpose. In this work, we introduce ICU-TSB (Temporal Stratification Benchmark), the first comprehensive benchmark for evaluating patient stratification based on temporal patient representation learning using three publicly available ICU EHR datasets. A key contribution of our benchmark is a novel hierarchical evaluation framework utilizing disease taxonomies to measure the alignment of discovered clusters with clinically validated disease groupings. In our experiments with ICU-TSB, we compared statistical methods and several recurrent neural networks, including LSTM and GRU, for their ability to generate effective patient representations for subsequent clustering of patient trajectories. Our results demonstrate that temporal representation learning can rediscover clinically meaningful patient cohorts; nevertheless, it remains a challenging task, with v-measure scores varying from up to 0.46 at the top level of the taxonomy to up to 0.40 at the lowest level. To further enhance the practical utility of our findings, we also evaluate multiple strategies for assigning interpretable labels to the identified clusters. The experiments and benchmark are fully reproducible and available at https://github.com/ds4dh/CBMS2025stratification.
中文: ICU-TSB是首个基于ICU电子健康记录时序表征学习的患者分层综合基准,其分层评估框架表明时序模型能识别临床相关患者群组,但在聚类准确性方面仍存在挑战。
English: ICU-TSB is the first comprehensive benchmark for evaluating patient stratification using temporal representation learning from ICU EHR data, featuring a hierarchical evaluation framework that demonstrates temporal models can identify clinically meaningful patient cohorts while revealing persistent challenges in clustering accuracy.
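A small sketch of the hierarchical evaluation idea: score one clustering against disease-taxonomy labels at every level with v-measure. The per-level label arrays are assumed to be precomputed from the taxonomy.

from sklearn.metrics import v_measure_score

def hierarchical_v_measure(cluster_ids, labels_by_level):
    """labels_by_level: {level_name: taxonomy code per patient at that level}."""
    return {level: v_measure_score(labels, cluster_ids)
            for level, labels in labels_by_level.items()}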
Authors:Jinyu Yang, Cheng Yang, Shanyuan Cui, Zeyuan Guo, Liangwei Yang, Muhan Zhang, Zhiqiang Zhang, Chuan Shi
Abstract:
Heterogeneous graph neural networks (HGNNs) excel at capturing structural and semantic information in heterogeneous graphs (HGs), while struggling to generalize across domains and tasks. With the rapid advancement of large language models (LLMs), a recent study explored the integration of HGNNs with LLMs for generalizable heterogeneous graph learning. However, this approach typically encodes structural information as HG tokens using HGNNs, and disparities in embedding spaces between HGNNs and LLMs have been shown to bias the LLM's comprehension of HGs. Moreover, since these HG tokens are often derived from node-level tasks, the model's ability to generalize across tasks remains limited. To this end, we propose a simple yet effective Masked Language Modeling-based method, called MLM4HG. MLM4HG introduces metapath-based textual sequences instead of HG tokens to extract structural and semantic information inherent in HGs, and designs customized textual templates to unify different graph tasks into a coherent cloze-style 'mask' token prediction paradigm. Specifically, MLM4HG first converts HGs from various domains to texts based on metapaths, and subsequently combines them with the unified task texts to form an HG-based corpus. Moreover, the corpus is fed into a pretrained LM for fine-tuning with a constrained target vocabulary, enabling the fine-tuned LM to generalize to unseen target HGs. Extensive cross-domain and multi-task experiments on four real-world datasets demonstrate the superior generalization performance of MLM4HG over state-of-the-art methods in both few-shot and zero-shot scenarios. Our code is available at https://github.com/BUPT-GAMMA/MLM4HG.
中文摘要:MLM4HG是一种创新方法,通过将异质图转换为基于元路径的文本序列,并利用填空式模板统一不同图任务,在语言模型微调后实现了优异的跨领域泛化性能。
English Summary: MLM4HG is a novel method that converts heterogeneous graphs into metapath-based textual sequences and unifies graph tasks through cloze-style templates, enabling superior cross-domain generalization when fine-tuned with language models.
Authors:Wenyuan Li, Shunlin Liang, Yuxiang Zhang, Liqin Liu, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Zhenwei Shi
Abstract:
Fine-grained crop type classification serves as the fundamental basis for large-scale crop mapping and plays a vital role in ensuring food security. It requires simultaneous capture of both phenological dynamics (obtained from multi-temporal satellite data like Sentinel-2) and subtle spectral variations (demanding nanometer-scale spectral resolution from hyperspectral imagery). Research combining these two modalities currently remains scarce due to challenges in hyperspectral data acquisition and the cost of crop type annotation. To address these issues, we construct a hierarchical hyperspectral crop dataset (H2Crop) by integrating 30m-resolution EnMAP hyperspectral data with Sentinel-2 time series. With over one million annotated field parcels organized in a four-tier crop taxonomy, H2Crop establishes a vital benchmark for fine-grained agricultural crop classification and hyperspectral image processing. We propose a dual-stream Transformer architecture that synergistically processes these modalities. It coordinates two specialized pathways: a spectral-spatial Transformer extracts fine-grained signatures from hyperspectral EnMAP data, while a temporal Swin Transformer extracts crop growth patterns from Sentinel-2 time series. The designed hierarchical classification head with hierarchical fusion then simultaneously delivers multi-level crop type classification across all taxonomic tiers. Experiments demonstrate that adding hyperspectral EnMAP data to Sentinel-2 time series yields a 4.2% average F1-score improvement (peaking at 6.3%). Extensive comparisons also confirm our method's higher accuracy over existing deep learning approaches for crop type classification and the consistent benefits of hyperspectral data across varying temporal windows and crop change scenarios. Codes and dataset are available at https://github.com/flyakon/H2Crop.
中文: 本研究提出了H2Crop分层高光谱作物数据集及双流Transformer模型,通过协同处理高光谱与多时相卫星数据,显著提升了细粒度作物分类精度,相关代码与数据已开源。
English: This study introduces H2Crop, a hierarchical hyperspectral crop dataset, and a dual-stream Transformer model that synergistically combines hyperspectral and multi-temporal satellite data to significantly improve fine-grained crop classification accuracy, with codes and data publicly available.
Authors:Haowei Wang, Rupeng Zhang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, Qing Wang
Abstract:
Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by retrieving relevant documents from external corpora before generating responses. This approach significantly expands LLM capabilities by leveraging vast, up-to-date external knowledge. However, this reliance on external knowledge makes RAG systems vulnerable to corpus poisoning attacks that manipulate generated outputs via poisoned document injection. Existing poisoning attack strategies typically treat the retrieval and generation stages as disjointed, limiting their effectiveness. We propose Joint-GCG, the first framework to unify gradient-based attacks across both retriever and generator models through three innovations: (1) Cross-Vocabulary Projection for aligning embedding spaces, (2) Gradient Tokenization Alignment for synchronizing token-level gradient signals, and (3) Adaptive Weighted Fusion for dynamically balancing attacking objectives. Evaluations demonstrate that Joint-GCG achieves up to 25% and an average of 5% higher attack success rates than previous methods across multiple retrievers and generators. While optimized under a white-box assumption, the generated poisons show unprecedented transferability to unseen models. Joint-GCG's innovative unification of gradient-based attacks across retrieval and generation stages fundamentally reshapes our understanding of vulnerabilities within RAG systems. Our code is available at https://github.com/NicerWang/Joint-GCG.
中文摘要:Joint-GCG作为首个统一检索与生成阶段的梯度攻击框架,通过跨词汇投影、梯度标记对齐和自适应加权融合三大创新,显著提升了RAG系统投毒攻击成功率并具备前所未有的跨模型迁移能力。
English Summary: Joint-GCG is a unified gradient-based attack framework that enhances poisoning effectiveness in RAG systems by simultaneously targeting both retrieval and generation stages, achieving up to 25% higher success rates with strong transferability across models.
Authors:David Wan, Han Wang, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Abstract:
Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic training dataset built on MultiVENT 2.0 (event-centric videos in various languages paired with queries) with modality-targeted queries. Next, we propose a modality-aware loss that jointly trains according to a standard contrastive objective alongside an objective for learning correct modality usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation strategies, such as averaging similarities for baseline retrievers, degrade performance by introducing noise from irrelevant modalities. In contrast, CLaMR consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4 over the best multi-modality retriever. We illustrate CLaMR's downstream utility on long-video QA, retrieving relevant frames and obtaining a 3.50% boost over LanguageBind on Video-MME and 1.42% over dense sampling on LongVideoBench.
中文: 本研究提出CLaMR多模态视频检索系统,通过动态选择相关模态并联合编码四种内容类型,结合合成数据集和模态感知训练方法,显著超越了现有检索模型的性能表现。
English: The study introduces CLaMR, a multimodal video retriever that dynamically selects relevant modalities and jointly encodes four types of content, achieving significant performance improvements over existing methods by using a synthetic dataset and a modality-aware training approach.
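To make the late-interaction scoring concrete, a ColBERT-style MaxSim sketch over per-modality token embeddings, with a plain weight vector standing in for CLaMR's learned dynamic modality selection; all shapes and the weighting scheme are assumptions.

import torch

def maxsim(query_tokens, doc_tokens):
    # query_tokens: (Q, d), doc_tokens: (T, d); each query token keeps its best match.
    return (query_tokens @ doc_tokens.T).max(dim=1).values.sum()

def score_video(query_tokens, modality_tokens, modality_weights):
    # modality_tokens: dict of name -> (T_m, d); weights stand in for learned selection.
    return sum(w * maxsim(query_tokens, toks)
               for (_, toks), w in zip(modality_tokens.items(), modality_weights))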
Authors:Xudong Zhang, Renato Cordeiro de Amorim
Abstract:
Unsupervised feature selection is critical for improving clustering performance in high-dimensional data, where irrelevant features can obscure meaningful structure. In this work, we introduce the Minkowski weighted $k$-means++, a novel initialisation strategy for the Minkowski Weighted $k$-means. Our initialisation selects centroids probabilistically using feature relevance estimates derived from the data itself. Building on this, we propose two new feature selection algorithms, FS-MWK++, which aggregates feature weights across a range of Minkowski exponents to identify stable and informative features, and SFS-MWK++, a scalable variant based on subsampling. We support our approach with a theoretical guarantee under mild assumptions and extensive experiments showing that our methods consistently outperform existing alternatives. Our software can be found at https://github.com/xzhang4-ops1/FSMWK.
中文: 本文提出了Minkowski加权$k$-means++初始化方法及两种基于特征相关性的特征选择算法,通过理论保证和实验验证了其在提升高维数据聚类性能上的显著优势。
English: This paper introduces Minkowski weighted $k$-means++ initialization and two feature selection algorithms that leverage feature relevance to enhance clustering in high-dimensional data, supported by theoretical guarantees and superior experimental results.
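A compact sketch of the weighted, Minkowski-exponent-aware k-means++-style seeding described above; the feature-weight vector w is assumed to be given (in the method it is estimated from the data itself).

import numpy as np

def mwk_plusplus_init(X, k, w, p=2.0, seed=None):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Weighted Minkowski distance to the nearest already-chosen centroid.
        d = np.min([np.sum(w * np.abs(X - c) ** p, axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.stack(centroids)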
Authors:Cheng-Long Wang, Qi Li, Zihang Xiang, Yinzhi Cao, Di Wang
Abstract:
Growing concerns over data privacy and security highlight the importance of machine unlearning--removing specific data influences from trained models without full retraining. Techniques like Membership Inference Attacks (MIAs) are widely used to externally assess successful unlearning. However, existing methods face two key limitations: (1) maximizing MIA effectiveness (e.g., via online attacks) requires prohibitive computational resources, often exceeding retraining costs; (2) MIAs, designed for binary inclusion tests, struggle to capture granular changes in approximate unlearning. To address these challenges, we propose the Interpolated Approximate Measurement (IAM), a framework natively designed for unlearning inference. IAM quantifies sample-level unlearning completeness by interpolating the model's generalization-fitting behavior gap on queried samples. IAM achieves strong performance in binary inclusion tests for exact unlearning and high correlation for approximate unlearning--scalable to LLMs using just one pre-trained shadow model. We theoretically analyze how IAM's scoring mechanism maintains performance efficiently. We then apply IAM to recent approximate unlearning algorithms, revealing general risks of both over-unlearning and under-unlearning, underscoring the need for stronger safeguards in approximate unlearning systems. The code is available at https://github.com/Happy2Git/Unlearning_Inference_IAM.
中文摘要:提出的插值近似测量(IAM)框架通过泛化拟合差距量化样本级遗忘完整性,有效评估机器遗忘效果,克服了现有方法的计算和粒度限制,并揭示了近似遗忘系统中的普遍风险。
English Summary: The proposed Interpolated Approximate Measurement (IAM) framework efficiently evaluates machine unlearning by quantifying sample-level completeness through generalization-fitting gaps, overcoming computational and granular limitations of existing methods while revealing risks in approximate unlearning systems.
Authors:Rujikorn Charakorn, Edoardo Cetin, Yujin Tang, Robert Tjarko Lange
Abstract:
While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repeated fine-tuning of the underlying model. Fine-tuning techniques enable practitioners to adapt foundation models for many new applications but require expensive and lengthy training while being notably sensitive to hyperparameter choices. To overcome these limitations, we introduce Text-to-LoRA (T2L), a model capable of adapting large language models (LLMs) on the fly solely based on a natural language description of the target task. T2L is a hypernetwork trained to construct LoRAs in a single inexpensive forward pass. After training T2L on a suite of 9 pre-trained LoRA adapters (GSM8K, Arc, etc.), we show that the ad-hoc reconstructed LoRA instances match the performance of task-specific adapters across the corresponding test sets. Furthermore, T2L can compress hundreds of LoRA instances and zero-shot generalize to entirely unseen tasks. This approach provides a significant step towards democratizing the specialization of foundation models and enables language-based adaptation with minimal compute requirements.
Our code is available at https://github.com/SakanaAI/text-to-lora
中文: 基础模型通常需要针对特定任务进行微调,这一过程成本高昂且对设置敏感,而Text-to-LoRA (T2L) 通过自然语言描述实现即时适配,能以极低计算量生成高效的LoRA适配器。
English: Foundation models often need task-specific fine-tuning, which is costly and sensitive to settings, but Text-to-LoRA (T2L) enables on-the-fly adaptation using natural language descriptions, generating effective LoRA adapters with minimal computational effort.
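A toy hypernetwork in the spirit of T2L: map a task-description embedding to the A/B factors of a single LoRA layer in one forward pass. The sizes, the embedding source, and the restriction to one target layer are simplifying assumptions.

import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    def __init__(self, task_dim=768, hidden=1024, d_in=4096, d_out=4096, rank=8):
        super().__init__()
        self.rank, self.d_in, self.d_out = rank, d_in, d_out
        self.body = nn.Sequential(nn.Linear(task_dim, hidden), nn.ReLU())
        self.to_A = nn.Linear(hidden, rank * d_in)
        self.to_B = nn.Linear(hidden, d_out * rank)

    def forward(self, task_emb):
        h = self.body(task_emb)
        A = self.to_A(h).view(self.rank, self.d_in)
        B = self.to_B(h).view(self.d_out, self.rank)
        return A, B   # delta_W = B @ A is added to the frozen base weight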
Authors:Felix Koulischer, Florian Handke, Johannes Deleu, Thomas Demeester, Luca Ambrogioni
Abstract:
While Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models, it can harm diversity and induce memorization by applying constant guidance regardless of whether a particular sample needs correction. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need. Our approach is derived from first principles by assuming the learned conditional distribution is linearly corrupted by the unconditional distribution, contrasting with CFG's implicit multiplicative assumption. Our scheme relies on feedback of its own predictions about the conditional signal informativeness to adapt guidance dynamically during inference, challenging the view of guidance as a fixed hyperparameter. The approach is benchmarked on ImageNet512x512, where it significantly outperforms Classifier-Free Guidance and is competitive with Limited Interval Guidance (LIG) while benefiting from a strong mathematical framework. On Text-To-Image generation, we demonstrate that, as anticipated, our approach automatically applies higher guidance scales for complex prompts than for simpler ones and that it can be easily combined with existing guidance schemes such as CFG or LIG.
中文摘要:提出的反馈引导(FBG)方法根据条件信号的信息性在推理过程中自适应调节引导强度,在ImageNet512x512上显著优于分类器自由引导,同时保持数学严谨性并能与现有引导方案兼容。
English Summary: The proposed FeedBack Guidance (FBG) adaptively regulates guidance during inference based on conditional signal informativeness, outperforming Classifier-Free Guidance on ImageNet512x512 while maintaining mathematical rigor and compatibility with existing methods.
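For intuition, a simplified state-dependent guidance step: the coefficient is adjusted during sampling from the gap between conditional and unconditional predictions rather than held fixed. The update rule below is a heuristic stand-in, not FBG's derived feedback law.

import torch

def guided_eps(eps_cond, eps_uncond, w):
    # Standard guidance combination applied at each denoising step.
    return eps_uncond + w * (eps_cond - eps_uncond)

def feedback_update(w, eps_cond, eps_uncond, gain=0.1, w_min=1.0, w_max=10.0):
    # Heuristic feedback: a large conditional/unconditional gap suggests the
    # condition is informative for this sample, so guidance is raised.
    gap = (eps_cond - eps_uncond).flatten(1).norm(dim=1).mean().item()
    return min(max(w + gain * (gap - 1.0), w_min), w_max)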
Authors:Julio Silva-Rodríguez, Leo Fillioux, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis, Ismail Ben Ayed, Jose Dolz
Abstract:
Vision-language models (VLMs) pre-trained at large scale have shown unprecedented transferability capabilities and are being progressively integrated into medical image analysis. Although their discriminative potential has been widely explored, their reliability remains overlooked. This work investigates their behavior under the increasingly popular split conformal prediction (SCP) framework, which theoretically guarantees a given error level on output sets by leveraging a labeled calibration set. However, the zero-shot performance of VLMs is inherently limited, and common practice involves few-shot transfer learning pipelines, which violate the rigid exchangeability assumptions of SCP. To alleviate this issue, we propose full conformal adaptation, a novel setting for jointly adapting and conformalizing pre-trained foundation models, which operates transductively over each test data point using a few-shot adaptation set. Moreover, we complement this framework with SS-Text, a novel training-free linear probe solver for VLMs that alleviates the computational cost of such a transductive approach. We provide comprehensive experiments using 3 different modality-specialized medical VLMs and 9 adaptation tasks. Our framework requires exactly the same data as SCP, and provides consistent relative improvements of up to 27% on set efficiency while maintaining the same coverage guarantees.
Chinese: 本研究提出了全适应性共形预测的新框架,用于联合调整和共形化预训练的视觉语言模型在医学图像分析中的应用,在保持覆盖保证的同时将集合效率提升高达27%,并辅以无需训练的SS-Text求解器来降低计算成本。
English: This study introduces full conformal adaptation, a novel framework that jointly adapts and conformalizes pre-trained vision-language models for medical image analysis, enhancing set efficiency by up to 27% while maintaining coverage guarantees, and complements it with SS-Text, a training-free solver to reduce computational costs.
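As background for the setting above, a minimal split conformal prediction routine for classification: calibrate a nonconformity threshold on held-out labeled data, then emit prediction sets with 1 - alpha coverage. The softmax-based score is one common choice, not the paper's adaptation scheme.

import numpy as np

def scp_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score: 1 - softmax probability of the true class.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(test_probs, q):
    # Keep every class whose score clears the calibrated threshold.
    return np.where(1.0 - test_probs <= q)[0]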
Authors:Maor Ashkenazi, Ofir Brenner, Tal Furman Shohet, Eran Treister
Abstract:
Detecting Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection, when considering both the code and the corresponding task prompt that generated it. Our key insight is that when evaluating the probability distribution of code tokens using an LLM, there is little difference between LLM-generated and human-written code. However, conditioning on the task reveals notable differences. This contrasts with natural language text, where differences exist even in the unconditional distributions. Leveraging this, we propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet and then evaluates token-level entropy under the approximated task conditioning (ATC). We further provide a mathematical intuition, contextualizing our method relative to previous approaches. ATC requires neither access to the generator LLM nor the original task prompts, making it practical for real-world applications. To the best of our knowledge, it achieves state-of-the-art results across benchmarks and generalizes across programming languages, including Python, CPP, and Java. Our findings highlight the importance of task-level conditioning for LLM-generated code detection. The supplementary materials and code are available at https://github.com/maorash/ATC, including the dataset gathering implementation, to foster further research in this area.
中文: 本研究提出了一种新颖的零样本检测方法,通过评估近似任务条件下的标记熵来区分LLM生成的代码与人工编写的代码,该方法无需访问生成模型或原始提示,就在多种编程语言中实现了最优性能。
English: This study introduces a novel zero-shot detection method that leverages task-level conditioning to distinguish LLM-generated code from human-written code by evaluating token entropy under approximated task conditions, achieving state-of-the-art performance across multiple programming languages without requiring access to the generator model or original prompts.
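A hedged sketch of entropy-under-task-conditioning with a Hugging Face causal LM; the gpt2 checkpoint is a placeholder, and ATC's task-approximation step (recovering a task prompt for a given snippet) is not shown.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder scoring LM
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_token_entropy(task_prompt, code):
    # Entropy of the LM's next-token distribution at each code position,
    # conditioned on the (approximated) task prompt prepended to the code.
    n_prompt = tok(task_prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(task_prompt + code, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0]                       # (T, vocab)
    probs = logits[n_prompt - 1:-1].softmax(-1)          # predictions over code tokens
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    return ent.mean().item()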
Authors:Taoran Yue, Xiaojin Lu, Jiaxi Cai, Yuanping Chen, Shibing Chu
Abstract:
Current CNN-based infrared small target detection (IRSTD) methods generally overlook the heterogeneity between shallow and deep features, leading to inefficient collaboration between shallow fine-grained structural information and deep high-level semantic representations. Additionally, the dependency relationships and fusion mechanisms across different feature hierarchies lack systematic modeling, which fails to fully exploit the complementarity of multilevel features. These limitations hinder IRSTD performance while incurring substantial computational costs. To address these challenges, this paper proposes a shallow-deep synergistic detection network (SDS-Net) that efficiently models multilevel feature representations to increase both the detection accuracy and computational efficiency in IRSTD tasks. SDS-Net introduces a dual-branch architecture that separately models the structural characteristics and semantic properties of features, effectively preserving shallow spatial details while capturing deep semantic representations, thereby achieving high-precision detection with significantly improved inference speed. Furthermore, the network incorporates an adaptive feature fusion module to dynamically model cross-layer feature correlations, enhancing overall feature collaboration and representation capability. Comprehensive experiments on three public datasets (NUAA-SIRST, NUDT-SIRST, and IRSTD-1K) demonstrate that SDS-Net outperforms state-of-the-art IRSTD methods while maintaining low computational complexity and high inference efficiency, showing superior detection performance and broad application prospects. Our code will be made public at https://github.com/PhysiLearn/SDS-Net.
中文: 本文提出的SDS-Net通过双分支结构和自适应特征融合模块,有效协同浅层与深层特征,在提升红外小目标检测精度的同时显著提高了计算效率。
English: This paper introduces SDS-Net, a dual-branch network that effectively models shallow and deep features through adaptive fusion, enhancing infrared small target detection accuracy and efficiency while reducing computational costs.
Authors:Joscha Diehl, Rasheed Ibraheem, Leonard Schmitz, Yue Wu
Abstract:
Data in the form of images or higher-order tensors is ubiquitous in modern deep learning applications. Owing to their inherent high dimensionality, the need for subquadratic layers processing such data is even more pressing than for sequence data. We propose a novel tensor-to-tensor layer with linear cost in the input size, utilizing the mathematical gadget of ``corner trees'' from the field of permutation counting. In particular, for order-two tensors, we provide an image-to-image layer that can be plugged into image processing pipelines. On the one hand, our method can be seen as a higher-order generalization of state-space models. On the other hand, it is based on a multiparameter generalization of the signature of iterated integrals (or sums). The proposed tensor-to-tensor concept is used to build a neural network layer called the Fast Iterated Sums (FIS) layer which integrates seamlessly with other layer types. We demonstrate the usability of the FIS layer with both classification and anomaly detection tasks. By replacing some layers of a smaller ResNet architecture with FIS, a similar accuracy (with a difference of only 0.1\%) was achieved in comparison to a larger ResNet while reducing the number of trainable parameters and multi-add operations. The FIS layer was also used to build an anomaly detection model that achieved an average AUROC of 97.3\% on the texture images of the popular MVTec AD dataset. The processing and modelling codes are publicly available at https://github.com/diehlj/fast-iterated-sums.
中文摘要:本文提出了一种基于排列计数中“角树”数学工具的新型张量到张量线性计算层,该层既是状态空间模型的高阶推广,也是迭代积分的多参数扩展,在保持与大型网络相近精度的同时显著减少了参数数量和计算量。
English Summary: This paper introduces a novel tensor-to-tensor layer with linear computational cost using "corner trees" from permutation counting, which serves as both a higher-order generalization of state-space models and a multiparameter extension of iterated integrals, achieving comparable accuracy to larger networks while reducing parameters and computational operations.
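To illustrate the linear-cost primitive, a two-axis prefix sum yields every corner sum of an order-two tensor in a single pass, and such sums can be composed; the composition below ignores strict inequalities and is only a simplified analogue of the corner-tree-indexed sums in the paper.

import torch

def iterated_sum_2d(x):
    # S[..., i, j] = sum over i' <= i and j' <= j of x[..., i', j']; linear cost.
    return torch.cumsum(torch.cumsum(x, dim=-1), dim=-2)

def depth_two(x, y):
    # Compose two levels: prefix-sum y, gate it pointwise by x, prefix-sum again.
    return iterated_sum_2d(x * iterated_sum_2d(y))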
Authors:Yuhao Sun, Jiacheng Zhang, Zesheng Ye, Chaowei Xiao, Feng Liu
Abstract:
Diffusion-based purification (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected into the input sample is key to the success of DBP methods, and it is controlled by a constant noise level $t^*$ for all samples in existing methods. In this paper, we discover that the optimal $t^*$ can in fact differ from sample to sample. Intuitively, the cleaner a sample is, the less noise it should be injected with, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate distinct noise levels to different samples in DBP methods. Our code is available at: https://github.com/tmlr-group/SSNI.
中文: 本文提出SSNI框架,通过预训练评分网络评估样本偏离干净数据分布的程度,自适应调整每个样本的噪声注入水平,在CIFAR-10和ImageNet-1K数据集上显著提升了基于扩散的净化方法的准确性和鲁棒性。
English: The paper introduces SSNI, a framework that adaptively adjusts noise injection levels in diffusion-based purification methods based on each sample's deviation from the clean data distribution, improving accuracy and robustness on benchmark datasets.
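A sketch of the sample-specific noise level: map each sample's score norm to its own t* with a monotone reweighting between t_min and t_max. The score-network interface and the min-max reweighting are assumptions; SSNI's actual reweighting functions differ.

import torch

def sample_noise_levels(score_net, x, t_probe, t_min, t_max):
    # score_net(x, t) -> estimated score; the interface is a placeholder.
    with torch.no_grad():
        s = score_net(x, t_probe)
    norms = s.flatten(1).norm(dim=1)
    z = (norms - norms.min()) / (norms.max() - norms.min() + 1e-8)
    # Cleaner samples (small score norm) receive smaller t*, noisier ones larger.
    return (t_min + z * (t_max - t_min)).round().long()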
Authors:Lorenzo Mur-Labadia, Maria Santos-Villafranca, Jesus Bermudez-Cameo, Alejandro Perez-Yus, Ruben Martinez-Cantin, Jose J. Guerrero
Abstract:
Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego$\leftrightarrow$Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects. Our approach, O-MaMa, achieves the state of the art on the Ego-Exo4D Correspondences benchmark, obtaining relative gains of +22% and +76% in Ego2Exo and Exo2Ego IoU against the official challenge baselines, and gains of +13% and +6% over the previous SOTA while using only 1% of its training parameters.
Chinese: 本文提出O-MaMa方法,将跨图像分割重新定义为掩码匹配任务,通过在Ego-Exo4D基准测试中取得最先进性能,以极少的训练参数实现了显著性能提升。
English: The paper presents O-MaMa, a novel cross-image segmentation method that redefines the task as mask matching and achieves state-of-the-art results on the Ego-Exo4D benchmark with significant performance gains using minimal training parameters.
Authors:Shilong Tao, Zhe Feng, Haonan Sun, Zhanxing Zhu, Yunhuai Liu
Abstract:
Multi-solid systems are foundational to a wide range of real-world applications, yet modeling their complex interactions remains challenging. Existing deep learning methods predominantly rely on implicit modeling, where the factors influencing solid deformation are not explicitly represented but are instead indirectly learned. However, as the number of solids increases, these methods struggle to accurately capture intricate physical interactions. In this paper, we introduce a novel explicit modeling paradigm that incorporates factors influencing solid deformation through structured modules. Specifically, we present Unisoma, a unified and flexible Transformer-based model capable of handling variable numbers of solids. Unisoma directly captures physical interactions using contact modules and an adaptive interaction allocation mechanism, and learns the deformation through a triplet relationship. Compared to implicit modeling techniques, explicit modeling is better suited for multi-solid systems with diverse coupling patterns, as it enables detailed treatment of each solid while preventing information blending and confusion. Experimentally, Unisoma achieves consistent state-of-the-art performance across seven well-established datasets and two complex multi-solid tasks. Code is available at https://github.com/therontau0054/Unisoma.
中文: 本文提出Unisoma,一种基于Transformer的显式建模方法,通过结构化模块精确捕捉多固体系统中的物理相互作用,在多个数据集和复杂任务中均实现了最先进的性能。
English: This paper introduces Unisoma, an explicit Transformer-based model that accurately captures physical interactions in multi-solid systems through structured modules, achieving state-of-the-art performance across diverse datasets and tasks.
Authors:Zeqi Zhou, Fang Wu, Shayan Talaei, Haokai Zhao, Cheng Meixin, Tinson Xu, Amin Saberi, Yejin Choi
Abstract:
Large language models frequently encounter conflicts between their parametric knowledge and contextual input, often resulting in factual inconsistencies or hallucinations. We propose Self-Reflective Debate for Contextual Reliability (SR-DCR), a lightweight framework that integrates token-level self-confidence with an asymmetric multi-agent debate to adjudicate such conflicts. A critic, deprived of context, challenges a defender who argues from the given passage; a judge model evaluates the debate and determines the context's reliability. The final answer is selected by combining the verdict with model confidence. Experiments on the ClashEval benchmark demonstrate that SR-DCR consistently enhances robustness to misleading context while maintaining accuracy on trustworthy inputs, outperforming both classical debate and confidence-only baselines with minimal computational overhead. The code is available at https://github.com/smiles724/Self-Reflective-Debates.
中文: SR-DCR框架通过多智能体辩论机制解决参数知识与上下文输入的冲突,在保持准确性的同时显著提升对误导性语境的鲁棒性,且计算开销极低。
English: The SR-DCR framework uses a multi-agent debate process to resolve conflicts between parametric knowledge and contextual input, improving robustness against misleading contexts while maintaining accuracy with minimal computational cost.
Authors:Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li
Abstract:
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains. However, designing high-performing agentic systems remains challenging. Existing agent search methods suffer from three major limitations: (1) an emphasis on optimizing agentic workflows while under-utilizing proven human-designed components such as memory, planning, and tool use; (2) high evaluation costs, as each newly generated agent must be fully evaluated on benchmarks; and (3) inefficient search in a large search space. In this work, we introduce a comprehensive framework to address these challenges. First, we propose a hierarchical search space that jointly models agentic workflow and composable functional components, enabling richer agentic system designs. Building on this structured design space, we introduce a predictive value model that estimates agent performance given an agentic system and task description, allowing for efficient, low-cost evaluation during the search process. Finally, we present a hierarchical Monte Carlo Tree Search (MCTS) strategy informed by uncertainty to guide the search. Experiments on seven benchmarks, covering embodied, math, web, tool, and game domains, show that our method achieves an average performance gain of 8.34\% over state-of-the-art baselines and exhibits faster search progress with steeper improvement trajectories. Code repo is available at https://github.com/Ericccc02/AgentSwift.
中文: 本文提出一个综合框架,通过整合分层搜索空间、预测性能模型和不确定性引导的MCTS,显著提升LLM智能体性能,在七大基准测试中平均提升8.34%。
English: This paper introduces a comprehensive framework that enhances LLM agent performance by integrating hierarchical search spaces, predictive value models, and uncertainty-guided MCTS, achieving an 8.34% average improvement across seven benchmarks.
Authors:Kaiyuan Chen, Zhengjie Hu, Shaolin Zhang, Yuanqing Xia, Wannian Liang, Shuo Wang
Abstract:
The rapid detection of abnormal body temperatures in urban populations is essential for managing public health risks, especially during outbreaks of infectious diseases. Multi-drone thermal screening systems offer promising solutions for fast, large-scale, and non-intrusive human temperature monitoring. However, trajectory planning for multiple drones in complex urban environments poses significant challenges, including collision avoidance, coverage efficiency, and constrained flight environments. In this study, we propose an enhanced trust region sequential convex optimization (TR-SCO) algorithm for optimal trajectory planning of multiple drones performing thermal screening tasks. Our improved algorithm integrates a refined convex optimization formulation within a trust region framework, effectively balancing trajectory smoothness, obstacle avoidance, altitude constraints, and maximum screening coverage. Simulation results demonstrate that our approach significantly improves trajectory optimality and computational efficiency compared to conventional convex optimization methods. This research provides critical insights and practical contributions toward deploying efficient multi-drone systems for real-time thermal screening in urban areas. For readers interested in our research, we release our source code at https://github.com/Cherry0302/Enhanced-TR-SCO.
中文: 本研究提出了一种改进的信任域序列凸优化算法,用于城市区域的多无人机热筛查,在确保安全约束的同时显著提升了轨迹规划效率和覆盖范围。
English: This study introduces an improved trust region sequential convex optimization algorithm for multi-drone thermal screening in urban areas, which enhances trajectory planning efficiency and coverage while ensuring safety constraints.
Authors:Haoke Zhang, Xiaobo Liang, Cunxiang Wang, Juntao Li, Min Zhang
Abstract:
The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose AvR: Alignment via Refinement, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize refinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available at GitHub (https://github.com/Banner-Z/AvR.git).
中文摘要:AvR方法通过引入结合可微分学习的优化过程,利用长链思维增强大语言模型的递归推理能力,仅用少量合成数据就显著超越传统方法,使LLaMA-3-8B-Instruct模型在AlpacaEval 2.0上的胜率提升超20%。
English Summary: The AvR method enhances LLMs' recursive reasoning through long-form Chain of Thought by integrating refinement processes with differentiable learning, significantly outperforming traditional methods and boosting LLaMA-3-8B-Instruct's performance by over 20% with minimal data.
Authors:Motoki Omura, Kazuki Ota, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
Abstract:
For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality. The code for this study is available at https://github.com/motokiomura/annealed-q-learning.
Chinese: 本研究提出了一种退火方法,在连续动作空间的行动者-评论家框架中从贝尔曼最优算子过渡到贝尔曼算子,既加速了学习过程又缓解了高估偏差,在各种任务中显著提升了性能和鲁棒性。
English: This study introduces an annealing approach that transitions from the Bellman optimality operator to the Bellman operator in actor-critic frameworks for continuous action spaces, accelerating learning while mitigating overestimation bias and significantly improving performance and robustness in various tasks.
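The annealed backup lends itself to a short sketch. Below is a minimal Python version, assuming a linear annealing schedule and a set of sampled candidate actions for the max; the paper's exact schedule and estimators may differ.

    import numpy as np

    def annealed_td_target(reward, next_q_candidates, next_q_policy,
                           step, total_steps, gamma=0.99):
        # lam = 1 applies the Bellman optimality backup (max over candidate
        # actions); lam = 0 applies the standard Bellman backup at the
        # policy's action. Annealing lam from 1 to 0 accelerates early
        # learning while limiting overestimation bias later on.
        lam = max(0.0, 1.0 - step / total_steps)
        optimality_backup = np.max(next_q_candidates)
        return reward + gamma * (lam * optimality_backup
                                 + (1.0 - lam) * next_q_policy)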
Authors:Tianjun Yao, Haoxuan Li, Yongqiang Chen, Tongliang Liu, Le Song, Eric Xing, Zhiqiang Shen
Abstract:
Graph Neural Networks (GNNs) often encounter significant performance degradation under distribution shifts between training and test data, hindering their applicability in real-world scenarios. Recent studies have proposed various methods to address the out-of-distribution generalization challenge, with many methods in the graph domain focusing on directly identifying an invariant subgraph that is predictive of the target label. However, we argue that identifying the edges from the invariant subgraph directly is challenging and error-prone, especially when some spurious edges exhibit strong correlations with the targets. In this paper, we propose PrunE, the first pruning-based graph OOD method that eliminates spurious edges to improve OOD generalizability. By pruning spurious edges, PrunE retains the invariant subgraph more comprehensively, which is critical for OOD generalization. Specifically, PrunE employs two regularization terms to prune spurious edges: 1) graph size constraint to exclude uninformative spurious edges, and 2) ε-probability alignment to further suppress the occurrence of spurious edges. Through theoretical analysis and extensive experiments, we show that PrunE achieves superior OOD performance and outperforms previous state-of-the-art methods significantly. Codes are available at: https://github.com/tianyao-aka/PrunE-GraphOOD.
Chinese: 本文提出PrunE方法,通过图大小约束和ε概率对齐剪除虚假边,提升图神经网络在分布外场景下的泛化能力,实验证明其显著优于现有最优方法。
English: The paper introduces PrunE, a novel pruning-based method that enhances out-of-distribution generalization in Graph Neural Networks by eliminating spurious edges through graph size constraints and ε-probability alignment, achieving superior performance over existing approaches.
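As a rough illustration of the two regularizers named in the abstract, here is a hedged Python sketch; the sigmoid edge mask, mean-based size penalty, and squared alignment term are assumptions, not PrunE's exact losses.

    import torch

    def prune_regularizers(edge_logits, eps=0.1):
        # Per-edge keep probabilities from learned logits.
        p = torch.sigmoid(edge_logits)
        # Graph size constraint: penalize the expected fraction of kept
        # edges to exclude uninformative spurious edges.
        size_penalty = p.mean()
        # Epsilon-probability alignment: pull keep probabilities toward a
        # small target eps to further suppress spurious edges.
        align_penalty = ((p - eps) ** 2).mean()
        return size_penalty, align_penalty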
Authors:Jana Straková, Milan Straka
Abstract:
We introduce NameTag 3, an open-source tool and cloud-based web service for multilingual, multidataset, and multitagset named entity recognition (NER), supporting both flat and nested entities. NameTag 3 achieves state-of-the-art results on 21 test datasets in 15 languages and remains competitive on the rest, even against larger models. It is available as a command-line tool and as a cloud-based service, enabling use without local installation. NameTag 3 web service currently provides flat NER for 17 languages, trained on 21 corpora and three NE tagsets, all powered by a single 355M-parameter fine-tuned model; and nested NER for Czech, powered by a 126M fine-tuned model. The source code is licensed under open-source MPL 2.0, while the models are distributed under non-commercial CC BY-NC-SA 4.0. Documentation is available at https://ufal.mff.cuni.cz/nametag, source code at https://github.com/ufal/nametag3, and trained models via https://lindat.cz. The REST service and the web application can be found at https://lindat.mff.cuni.cz/services/nametag/. A demonstration video is available at https://www.youtube.com/watch?v=-gaGnP0IV8A.
Chinese: NameTag 3 是一款开源工具及云端服务,支持多语言命名实体识别,在 15 种语言的 21 个数据集上达到顶尖性能,并通过精调模型提供扁平与嵌套实体识别功能。
English: NameTag 3 is an open-source tool and cloud service for multilingual named entity recognition, achieving top performance across 21 datasets in 15 languages and offering both flat and nested entity support through fine-tuned models.
Authors:Xinjie Zhang, Wenxuan Wang, Qin Jin
Abstract:
In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies, inadvertently imposing their expectations or solutions on the seeker. Clearly defined intentions are essential for guiding both the supporter's motivations and the overall emotional support process. In this paper, we propose the Intention-centered Emotional Support Conversation (IntentionESC) framework, which defines the possible intentions of supporters in emotional support conversations, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies. While Large Language Models (LLMs) excel at text generation, they fundamentally operate as probabilistic models trained on extensive datasets, lacking a true understanding of human thought processes and intentions. To address this limitation, we introduce the Intention Centric Chain-of-Thought (ICECoT) mechanism. ICECoT enables LLMs to mimic human reasoning by analyzing emotional states, inferring intentions, and selecting suitable support strategies, thereby generating more effective emotional support responses. To train the model with ICECoT and integrate expert knowledge, we design an automated annotation pipeline that produces high-quality training data. Furthermore, we develop a comprehensive evaluation scheme to assess emotional support efficacy and conduct extensive experiments to validate our framework. Our data and code are available at https://github.com/43zxj/IntentionESC_ICECoT.
中文摘要:本文提出的IntentionESC框架和ICECoT机制通过明确支持者意图并将其与适当策略关联,提升了情感支持对话的效果,并通过自动化标注和综合评估验证了其有效性。
English Summary: The IntentionESC framework and ICECoT mechanism are introduced to enhance emotional support in conversations by clarifying supporter intentions and linking them to appropriate strategies, with automated annotation and comprehensive evaluation validating their effectiveness.
Authors:Yixuan Zhu, Haolin Wang, Shilin Ma, Wenliang Zhao, Yansong Tang, Lei Chen, Jie Zhou
Abstract:
Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult due to the heavy computational demands that prevent the direct application of previous image editing techniques. To overcome these limitations, we introduce FADE, a training-free yet highly effective video editing approach that fully leverages the inherent priors from pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across different components. Building on these insights, we propose a factorization strategy to optimize each component's specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic and temporally coherent editing results both qualitatively and quantitatively. Code is available at https://github.com/EternalEvan/FADE.
中文摘要:FADE是一种无需训练的视频编辑方法,通过频域感知分解和频谱引导调制技术,充分利用预训练视频扩散模型的内在先验,在保持时空连贯性的同时实现高效高质量的视频编辑。
English Summary: FADE is a training-free video editing method that leverages pre-trained video diffusion models through frequency-aware factorization and spectrum-guided modulation to achieve high-quality, temporally coherent edits without heavy computational costs.
Authors:Jie Cao, Tianwei Lin, Hongyang He, Rolan Yan, Wenqiao Zhang, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang
Abstract:
Recent studies integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to further enhance the performance of parameter-efficient fine-tuning (PEFT) methods in Large Language Model (LLM) applications. Existing methods employ homogeneous MoE-LoRA architectures composed of LoRA experts with either similar or identical structures and capacities. However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a heterogeneous Mixture-of-Adapters (MoA) approach. This method dynamically integrates PEFT adapter experts with diverse structures, leveraging their complementary representational capabilities to foster expert specialization, thereby enhancing the effective transfer of pre-trained knowledge to downstream tasks. MoA supports two variants: (i) Soft MoA achieves fine-grained integration by performing a weighted fusion of all expert outputs; (ii) Sparse MoA activates adapter experts sparsely based on their contribution, achieving this with negligible performance degradation. Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Our project is available at https://github.com/DCDmllm/MoA.
中文: 提出的异构混合适配器方法通过动态整合多样化适配器专家,克服了同构MoE-LoRA架构的局限性,在大语言模型微调中实现了性能与参数效率的双重提升。
English: The proposed heterogeneous Mixture-of-Adapters (MoA) method overcomes limitations of homogeneous MoE-LoRA architectures by dynamically integrating diverse adapter experts, enhancing both performance and parameter efficiency in large language model fine-tuning.
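A minimal sketch of the Soft MoA variant follows, assuming two illustrative expert structures (a LoRA-style low-rank pair and a bottleneck adapter with a nonlinearity) and a linear router; Sparse MoA would instead zero out all but the top-weighted experts.

    import torch
    import torch.nn as nn

    class SoftMoA(nn.Module):
        def __init__(self, d_model, r=8):
            super().__init__()
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, r, bias=False),
                              nn.Linear(r, d_model, bias=False)),  # LoRA-style
                nn.Sequential(nn.Linear(d_model, r), nn.GELU(),
                              nn.Linear(r, d_model)),              # bottleneck
            ])
            self.router = nn.Linear(d_model, len(self.experts))

        def forward(self, h):
            w = torch.softmax(self.router(h), dim=-1)              # (..., n)
            outs = torch.stack([e(h) for e in self.experts], dim=-1)
            return h + (outs * w.unsqueeze(-2)).sum(dim=-1)        # weighted fusion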
Authors:Xiaofei Xu, Xiuzhen Zhang, Ke Deng
Abstract:
Fake news and misinformation pose a significant threat to society, making efficient mitigation essential. However, manual fact-checking is costly and lacks scalability. Large Language Models (LLMs) offer promise in automating counter-response generation to mitigate misinformation, but a critical challenge lies in their tendency to hallucinate non-factual information. Existing models mainly rely on LLM self-feedback to reduce hallucination, but this approach is computationally expensive. In this paper, we propose MisMitiFact, Misinformation Mitigation grounded in Facts, an efficient framework for generating fact-grounded counter-responses at scale. MisMitiFact generates simple critique feedback to refine LLM outputs, ensuring responses are grounded in evidence. We develop lightweight, fine-grained critique models trained on data sourced from readily available fact-checking sites to identify and correct errors in key elements such as numerals, entities, and topics in LLM generations. Experiments show that MisMitiFact generates counter-responses of comparable quality to LLMs' self-feedback while using significantly smaller critique models. Importantly, it achieves a ~5x increase in feedback generation throughput, making it highly suitable for cost-effective, large-scale misinformation mitigation. Code and LLM prompt templates are at https://github.com/xxfwin/MisMitiFact.
Chinese: MisMitiFact 是一个高效框架,通过轻量级评论模型生成基于事实的反驳回应来对抗虚假信息,在达到与大型语言模型自反馈相当质量的同时,将反馈生成吞吐量提升约5倍,适用于大规模应用。
English: MisMitiFact is an efficient framework that uses lightweight critique models to generate fact-grounded counter-responses against misinformation, achieving comparable quality to LLM self-feedback while significantly increasing throughput by approximately 5 times for scalable mitigation.
Authors:Xin Zhang, Dongdong Meng, Sheng Li
Abstract:
Medical segmentation plays an important role in clinical applications like radiation therapy and surgical guidance, but acquiring clinically acceptable results is difficult. In recent years, progress has been made with transformer-like models, such as those combining the attention mechanism with CNNs. In particular, transformer-based segmentation models can extract global information more effectively, compensating for the drawbacks of CNN modules that focus on local features. However, utilizing transformer architectures is not easy, because training transformer-based models can be resource-demanding. Moreover, due to the distinct characteristics of the medical field, especially when encountering mid-sized and small organs with compact regions, their results often seem unsatisfactory. For example, using ViT to segment medical images directly only gives a DSC of less than 50%, which is far lower than the clinically acceptable score of 80%. In this paper, we used Mask2Former with deformable attention to reduce computation and proposed offset adjustment strategies to encourage sampling points within the same organs during attention weights computation, thereby integrating compact foreground information better. Additionally, we utilized the 4th feature map in Mask2Former to provide a coarse location of organs, and employed an FCN-based auxiliary head to help train Mask2Former more quickly using Dice loss. We show that our model achieves SOTA (state-of-the-art) performance on the HaNSeg and SegRap2023 datasets, especially on mid-sized and small organs. Our code is available at https://github.com/earis/Offsetadjustment_Background-location_Decoder_Mask2former.
Chinese: 本文提出了一种基于Mask2Former的医学图像分割模型,采用可变形注意力和偏移调整策略来更好地捕捉紧凑器官结构,在降低计算需求的同时,在基准数据集上实现了最先进的性能。
English: This paper introduces a Mask2Former-based medical image segmentation model with deformable attention and offset adjustment strategies to better capture compact organ structures, achieving state-of-the-art performance on benchmark datasets while reducing computational demands.
Authors:Adrien Petralia, Paul Boniol, Philippe Charpentier, Themis Palpanas
Abstract:
Improving smart grid system management is crucial in the fight against climate change, and enabling consumers to play an active role in this effort is a significant challenge for electricity suppliers. In this regard, millions of smart meters have been deployed worldwide in the last decade, recording the main electricity power consumed in individual households. This data produces valuable information that can help them reduce their electricity footprint; nevertheless, the collected signal aggregates the consumption of the different appliances running simultaneously in the house, making it difficult to interpret. Non-Intrusive Load Monitoring (NILM) refers to the challenge of estimating the power consumption, pattern, or on/off state activation of individual appliances using the main smart meter signal. Recent methods proposed to tackle this task are based on a fully supervised deep-learning approach that requires both the aggregate signal and the ground truth of individual appliance power. However, such labels are expensive to collect and extremely scarce in practice, as they require conducting intrusive surveys in households to monitor each appliance. In this paper, we introduce CamAL, a weakly supervised approach for appliance pattern localization that only requires information on the presence of an appliance in a household to be trained. CamAL merges an ensemble of deep-learning classifiers combined with an explainable classification method to be able to localize appliance patterns. Our experimental evaluation, conducted on 4 real-world datasets, demonstrates that CamAL significantly outperforms existing weakly supervised baselines and that current SotA fully supervised NILM approaches require significantly more labels to reach CamAL's performance. The source code of our experiments is available at: https://github.com/adrienpetralia/CamAL. This paper appeared in ICDE 2025.
中文摘要:提升智能电网管理对应对气候变化至关重要,本文提出CamAL弱监督方法,仅需家电存在信息即可定位用电模式,在减少标注需求的同时显著优于现有技术。
English Summary: Improving smart grid management is vital for climate action, and this paper introduces CamAL, a weakly supervised method that localizes appliance usage patterns using only household presence data, outperforming existing approaches with fewer labels.
Authors:Yiheng Li, Yang Yang, Zichang Tan, Huan Liu, Weihua Chen, Xu Zhou, Zhen Lei
Abstract:
To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local content, usually resulting in an inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained perception ability of forgery for DGM4. Two branches for image and text modalities are established, each of which contains two cascaded decoders, i.e., Contextual Consistency Decoder (CCD) and Semantic Consistency Decoder (SCD), to capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD adhere to the same criteria for capturing fine-grained forgery details. To be specific, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Then, forgery-aware reasoning or aggregating is adopted to deeply seek forgery cues based on the consistency features. Extensive experiments on DGM4 datasets prove that CSCL achieves new state-of-the-art performance, especially for the results of grounding manipulated content. Codes and weights are available at https://github.com/liyih/CSCL.
Chinese Summary: 针对现有方法在检测多模态虚假新闻时难以感知细粒度伪造内容的问题,本文提出上下文语义一致性学习(CSCL)方法,通过双分支解码器分别捕捉模态内上下文一致性和跨模态语义一致性,在内容篡改检测与定位任务中取得了最优性能。
English Summary: To address the limitations of existing methods in detecting fine-grained inconsistencies in multimodal fake news, this paper introduces a Contextual-Semantic Consistency Learning (CSCL) approach that employs dual decoders to capture both within-modality contextual and cross-modality semantic consistencies, achieving state-of-the-art performance in manipulation detection and localization.
Authors:Junpeng Lin, Tian Lan, Bo Zhang, Ke Lin, Dandan Miao, Huiru He, Jiantao Ye, Chen Zhang, Yan-fu Li
Abstract:
Forecasting non-stationary time series is a challenging task because their statistical properties often change over time, making it hard for deep models to generalize well. Instance-level normalization techniques can help address shifts in temporal distribution. However, most existing methods overlook the multi-component nature of time series, where different components exhibit distinct non-stationary behaviors. In this paper, we propose Wavelet-based Disentangled Adaptive Normalization (WDAN), a model-agnostic framework designed to address non-stationarity in time series forecasting. WDAN uses discrete wavelet transforms to break down the input into low-frequency trends and high-frequency fluctuations. It then applies tailored normalization strategies to each part. For trend components that exhibit strong non-stationarity, we apply first-order differencing to extract stable features used for predicting normalization parameters. Extensive experiments on multiple benchmarks demonstrate that WDAN consistently improves forecasting accuracy across various backbone models. Code is available at this repository: https://github.com/MonBG/WDAN.
中文: 本文提出WDAN框架,通过小波变换将时间序列分解为趋势和波动分量,并针对各分量应用定制化归一化策略,有效处理非平稳性以提升预测精度。
English: This paper introduces WDAN, a model-agnostic framework that uses wavelet transforms to decompose time series into trends and fluctuations, applying customized normalization to each component to enhance forecasting accuracy by addressing non-stationarity.
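A minimal sketch of the decompose-then-normalize step described above, assuming the PyWavelets package, a db4 wavelet, and plain mean/std normalization; how WDAN actually predicts the normalization parameters is detailed in the repository.

    import numpy as np
    import pywt

    def wdan_split_and_normalize(x, wavelet="db4", level=2):
        # Low-frequency trend: reconstruct from approximation coefficients only.
        coeffs = pywt.wavedec(x, wavelet, level=level)
        zeroed = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
        trend = pywt.waverec(zeroed, wavelet)[:len(x)]
        fluct = x - trend                       # high-frequency fluctuations
        # First-order differencing yields stable statistics for the strongly
        # non-stationary trend component, as the abstract suggests.
        d = np.diff(trend)
        trend_n = (trend - trend.mean()) / (d.std() + 1e-8)
        fluct_n = (fluct - fluct.mean()) / (fluct.std() + 1e-8)
        return trend_n, fluct_n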
Authors:Yuqian Fu, Runze Wang, Yanwei Fu, Danda Pani Paudel, Luc Van Gool
Abstract:
In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to predict the corresponding object masks in another perspective (e.g., exo view). To tackle this task, we propose a multimodal condition fusion module that enhances object localization by leveraging both visual masks and textual descriptions as segmentation conditions. Furthermore, to address the visual domain gap between ego and exo views, we introduce a cross-view object alignment module that enforces object-level consistency across perspectives, thereby improving the model's robustness to viewpoint changes. Our proposed method ranked second on the leaderboard of the large-scale Ego-Exo4D object correspondence benchmark. Code will be made available at https://github.com/lovelyqian/ObjectRelator.
中文: 本报告针对2025年Ego-Exo4D对应挑战赛提出了一种跨视角多模态物体分割方法,通过多模态条件融合模块和跨视角物体对齐模块提升物体定位与一致性,在基准测试中排名第二。
English: This report introduces a cross-view multi-modal object segmentation method for the Ego-Exo4D Correspondence Challenges 2025, featuring a multimodal condition fusion module and a cross-view object alignment module to enhance object localization and consistency, achieving second place on the benchmark leaderboard.
Authors:Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenqiao Zhang, Haoyuan Li, Hao Jiang, Fengda Zhang, Qishan Chen, Jun Xiao, Yueting Zhuang, Beng Chin Ooi
Abstract:
We present Heartcare Suite, a multimodal comprehensive framework for fine-grained electrocardiogram (ECG) understanding. It comprises three key components: (i) Heartcare-220K, a high-quality, structured, and comprehensive multimodal ECG dataset covering essential tasks such as disease diagnosis, waveform morphology analysis, and rhythm interpretation; (ii) Heartcare-Bench, a systematic and multi-dimensional benchmark designed to evaluate diagnostic intelligence and guide the optimization of Medical Multimodal Large Language Models (Med-MLLMs) in ECG scenarios; and (iii) HeartcareGPT with a tailored tokenizer, Bidirectional ECG Abstract Tokenization (Beat), which compresses raw multi-lead signals into semantically rich discrete tokens via dual-level vector quantization and a query-guided bidirectional diffusion mechanism. Built upon Heartcare-220K, HeartcareGPT achieves strong generalization and SoTA performance across multiple clinically meaningful tasks. Extensive experiments demonstrate that Heartcare Suite is highly effective in advancing ECG-specific multimodal understanding and evaluation. Our project is available at https://github.com/DCDmllm/Heartcare-Suite.
中文:Heartcare Suite 是一个用于精细心电图分析的多模态框架,包含全面数据集 Heartcare-220K、评估基准 Heartcare-Bench 以及采用定制分词器的 HeartcareGPT,在临床任务中实现了顶尖性能。
English: Heartcare Suite is a multimodal framework for detailed ECG analysis, featuring a comprehensive dataset (Heartcare-220K), an evaluation benchmark (Heartcare-Bench), and HeartcareGPT with specialized tokenization that achieves state-of-the-art performance in clinical tasks.
Authors:Quansong He, Xiangde Min, Kaishen Wang, Tao He
Abstract:
Medical image segmentation is a critical task in computer vision, with UNet serving as a milestone architecture. A typical component of the UNet family is the skip connection; however, skip connections face two significant limitations: (1) they lack effective interaction between features at different scales, and (2) they rely on simple concatenation or addition operations, which constrain efficient information integration. While recent improvements to UNet have focused on enhancing encoder and decoder capabilities, these limitations remain overlooked. To overcome these challenges, we propose a novel multi-scale feature fusion method that reimagines the UNet decoding process as solving an initial value problem (IVP), treating skip connections as discrete nodes. By leveraging principles from the linear multistep method, we propose an adaptive ordinary differential equation method to enable effective multi-scale feature fusion. Our approach is independent of the encoder and decoder architectures, making it adaptable to various U-Net-like networks. Experiments on ACDC, KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets demonstrate improved feature utilization, reduced network parameters, and maintained high performance. The code is available at https://github.com/nayutayuki/FuseUNet.
中文摘要:本研究提出了一种新颖的多尺度特征融合方法,将UNet解码过程重新定义为求解初值问题,通过自适应常微分方程克服跳跃连接的局限性,同时保持与多种架构的兼容性。
English Summary: The study introduces a novel multi-scale feature fusion method that redefines UNet's decoding process as solving an initial value problem, overcoming limitations in skip connections through adaptive ordinary differential equations while maintaining compatibility with various architectures.
Authors:Ziwei Zhao, Zhixing Zhang, Yuhang Liu, Zhao Zhang, Haojun Yu, Dong Wang, Liwei Wang
Abstract:
In the field of 3D medical imaging, accurately extracting and representing the blood vessels with curvilinear structures holds paramount importance for clinical diagnosis. Previous methods have commonly relied on discrete representations like masks, often resulting in local fractures or scattered fragments due to the inherent limitations of the per-pixel classification paradigm. In this work, we introduce DeformCL, a new continuous representation based on Deformable Centerlines, where centerline points act as nodes connected by edges that capture spatial relationships. Compared with previous representations, DeformCL offers three key advantages: natural connectivity, noise robustness, and interaction facility. We present a comprehensive training pipeline structured in a cascaded manner to fully exploit these favorable properties of DeformCL. Extensive experiments on four 3D vessel segmentation datasets demonstrate the effectiveness and superiority of our method. Furthermore, the visualization of curved planar reformation images validates the clinical significance of the proposed framework. We release the code at https://github.com/barry664/DeformCL.
Chinese: DeformCL 提出了一种基于可变形中心线的连续表示方法,用于三维血管提取,具有更好的连通性、抗噪性和交互性,在多个数据集上验证了其优越性能和临床意义。
English: DeformCL introduces a continuous representation using deformable centerlines for 3D blood vessel extraction, offering enhanced connectivity, noise resistance, and interaction capabilities, validated by superior performance on multiple datasets and clinical relevance.
Authors:Yogesh Verma, Amauri H. Souza, Vikas Garg
Abstract:
The local inductive bias of message-passing graph neural networks (GNNs) hampers their ability to exploit key structural information (e.g., connectivity and cycles). Positional encoding (PE) and Persistent Homology (PH) have emerged as two promising approaches to mitigate this issue. PE schemes endow GNNs with location-aware features, while PH methods enhance GNNs with multiresolution topological features. However, a rigorous theoretical characterization of the relative merits and shortcomings of PE and PH has remained elusive. We bridge this gap by establishing that neither paradigm is more expressive than the other, providing novel constructions where one approach fails but the other succeeds. Our insights inform the design of a novel learnable method, PiPE (Persistence-informed Positional Encoding), which is provably more expressive than both PH and PE. PiPE demonstrates strong performance across a variety of tasks (e.g., molecule property prediction, graph classification, and out-of-distribution generalization), thereby advancing the frontiers of graph representation learning. Code is available at https://github.com/Aalto-QuML/PIPE.
中文摘要:研究表明位置编码和持续同调在图神经网络中表达能力相当,由此开发的PiPE新方法在多项任务中超越了二者的性能。
English Summary: The study shows that neither positional encoding nor persistent homology is strictly more expressive than the other for graph neural networks, motivating PiPE, a novel method that is provably more expressive than both and delivers strong performance across various tasks.
Authors:Ivan Rodin, Tz-Ying Wu, Kyle Min, Sharath Nittur Sridhar, Antonino Furnari, Subarna Tripathi, Giovanni Maria Farinella
Abstract:
We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: https://github.com/fpv-iplab/EASG-bench.
中文: EASG-Bench是一个基于动态场景图构建问答对的以自我为中心视频基准测试,通过系统评估发现现有模型在时序理解方面存在明显差距,为长上下文视频理解领域指明了研究方向。
English: EASG-Bench is a new question-answering benchmark for egocentric videos that uses dynamic scene graphs to generate questions, revealing a performance gap in temporal understanding among current models and highlighting the need for improved long-context video comprehension.
Authors:Wei-Cheng Lin, Chih-Ming Lien, Chen Lo, Chia-Hung Yeh
Abstract:
This report presents our solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2025. Egocentric video captures the scene from the wearer's perspective, where gaze serves as a key non-verbal communication cue that reflects visual attention and offers insights into human intention and cognition. Motivated by this, we propose a novel approach, GazeNLQ, which leverages gaze to retrieve video segments that match given natural language queries. Specifically, we introduce a contrastive learning-based pretraining strategy for gaze estimation directly from video. The estimated gaze is used to augment video representations within the proposed model, thereby enhancing localization accuracy. Experimental results show that GazeNLQ achieves R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively. Our code is available at https://github.com/stevenlin510/GazeNLQ.
中文: 本报告提出GazeNLQ方法,通过利用视线数据增强视频表征,有效提升了自然语言查询对应视频片段的定位精度,在R1@IoU0.3和R1@IoU0.5指标上分别达到27.82和18.68的得分。
English: This report introduces GazeNLQ, a novel method that utilizes gaze data to enhance video representation and improve the accuracy of retrieving video segments corresponding to natural language queries, achieving R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively.
Authors:Yupeng Hou, Jiacheng Li, Ashley Shin, Jinsung Jeon, Abhishek Santhanam, Wei Shao, Kaveh Hassani, Ning Yao, Julian McAuley
Abstract:
Semantic ID-based recommendation models tokenize each item into a small number of discrete tokens that preserve specific semantics, leading to better performance, scalability, and memory efficiency. While recent models adopt a generative approach, they often suffer from inefficient inference due to the reliance on resource-intensive beam search and multiple forward passes through the neural sequence model. As a result, the length of semantic IDs is typically restricted (e.g. to just 4 tokens), limiting their expressiveness. To address these challenges, we propose RPG, a lightweight framework for semantic ID-based recommendation. The key idea is to produce unordered, long semantic IDs, allowing the model to predict all tokens in parallel. We train the model to predict each token independently using a multi-token prediction loss, directly integrating semantics into the learning objective. During inference, we construct a graph connecting similar semantic IDs and guide decoding to avoid generating invalid IDs. Experiments show that scaling up semantic ID length to 64 enables RPG to outperform generative baselines by an average of 12.6% on the NDCG@10, while also improving inference efficiency. Code is available at: https://github.com/facebookresearch/RPG_KDD2025.
Chinese: 提出的RPG框架通过并行生成无序的长语义ID,提升了推荐性能和推理效率,在NDCG@10指标上平均优于基线模型12.6%。
English: The proposed RPG framework generates long, unordered semantic IDs in parallel to enhance recommendation performance and inference efficiency, achieving a 12.6% average improvement in NDCG@10 over baselines.
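The parallel objective admits a one-function sketch. The layout below, with L parallel heads emitting logits of shape (batch, L, vocab), is an assumption, and the graph-guided decoding used at inference is omitted.

    import torch
    import torch.nn.functional as F

    def multi_token_prediction_loss(logits, targets):
        # Each of the L semantic-ID tokens is predicted independently and in
        # parallel, avoiding beam search and repeated forward passes.
        b, l, v = logits.shape
        return F.cross_entropy(logits.reshape(b * l, v), targets.reshape(b * l))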
Authors:Taiga Shinozaki, Tomoki Doi, Amane Watahiki, Satoshi Nishida, Hitomi Yanaka
Abstract:
Humans are susceptible to optical illusions, which serve as valuable tools for investigating sensory and cognitive processes. Inspired by human vision studies, research has begun exploring whether machines, such as large vision language models (LVLMs), exhibit similar susceptibilities to visual illusions. However, studies have often used non-abstract images and have not distinguished actual and apparent features, leading to ambiguous assessments of machine cognition. To address these limitations, we introduce a visual question answering (VQA) dataset, categorized into genuine and fake illusions, along with corresponding control images. Genuine illusions present discrepancies between actual and apparent features, whereas fake illusions have the same actual and apparent features even though they look illusory due to the similar geometric configuration. We evaluate the performance of LVLMs for genuine and fake illusion VQA tasks and investigate whether the models discern actual and apparent features. Our findings indicate that although LVLMs may appear to recognize illusions by correctly answering questions about both feature types, they predict the same answers for both Genuine Illusion and Fake Illusion VQA questions. This suggests that their responses might be based on prior knowledge of illusions rather than genuine visual understanding. The dataset is available at https://github.com/ynklab/FILM.
Chinese: 本研究通过引入新型数据集评估大型视觉语言模型对视觉错觉的真实感知能力,发现尽管模型能正确回答问题,但其反应可能基于先验知识而非真正的视觉理解。
English: This study introduces a dataset to assess whether large vision language models genuinely perceive visual illusions or rely on prior knowledge, revealing that their responses may not reflect true visual understanding despite correct answers.
Authors:Guang-Xing Li
Abstract:
Physics has been transforming our view of nature for centuries. While combining physical knowledge with computational approaches has enabled detailed modeling of physical systems' evolution, understanding the emergence of patterns and structures remains limited. Correlations between quantities are the most reliable approach to describe relationships between different variables. However, for complex patterns, directly searching for correlations is often impractical, as complexity and spatial inhomogeneity can obscure correlations. We discovered that the key is to search for correlations in local regions and developed a new method, adjacent correlation analysis, to extract such correlations and represent them in phase space. When multiple observations are available, a useful way to study a system is to analyze distributions in phase space using the Probability Density Function (PDF). Adjacent correlation analysis evaluates vectors representing local correlations, which can be overlaid on the PDF plot to form the adjacent correlation plot. These correlation vectors often exhibit remarkably regular patterns and may lead to the discovery of new laws. The vectors we derive are equivalent to the vector field in dynamical systems on the attracting manifold. By efficiently representing spatial patterns as correlation vectors in phase space, our approach opens avenues for classification, prediction, parameter fitting, and forecasting.
Chinese: 本研究提出了相邻相关性分析方法,通过识别复杂系统中的局部相关性并在相空间中以向量形式可视化,揭示了可导致新物理定律的规律性模式,并为分类和预测等应用开辟了新途径。
English: The study introduces adjacent correlation analysis, a method that identifies local correlations in complex systems and visualizes them as vectors in phase space, revealing regular patterns that can lead to new physical laws and applications in classification and prediction.
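A minimal sketch of the local-correlation idea, assuming 2D maps of two quantities and a gradient-based estimator of their joint local variation; the published estimator may differ.

    import numpy as np

    def adjacent_correlation_vectors(p, q):
        # Local variations of p and q along the direction where p changes
        # fastest; each (dp, dq) is a correlation vector attached to the
        # phase-space point (p, q), e.g. for plt.quiver(p, q, dp, dq).
        dpy, dpx = np.gradient(p)
        dqy, dqx = np.gradient(q)
        norm = np.hypot(dpx, dpy) + 1e-12
        ux, uy = dpx / norm, dpy / norm
        dp = dpx * ux + dpy * uy
        dq = dqx * ux + dqy * uy
        return p.ravel(), q.ravel(), dp.ravel(), dq.ravel()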
Authors:Guang-Xing Li
Abstract:
The development of science has been transforming man's view towards nature for centuries. Observing structures and patterns is an effective approach to discovering regularities from data and a key step toward theory-building. With increasingly complex data being obtained, revealing regularities systematically has become a challenge. Correlation is one of the most commonly used and effective approaches to describe regularities in data, yet for complex patterns, spatial inhomogeneity and complexity can often undermine the correlations. We present an algorithm to derive maps representing the type and degree of correlations, by taking the two-fold symmetry of the correlation vector into full account using the Stokes parameter. The method allows for a spatially resolved view of the nature and strength of correlations between physical quantities. In the correlation view, a region can often be separated into different subregions with different types of correlations. Subregions correspond to physical regimes for physical systems, or climate zones for climate maps. The simplicity of the method makes it widely applicable to a variety of data, where the correlation-based approach makes the map particularly useful in revealing regularities in physical systems and alike. As a new and efficient approach to represent data, the method should facilitate the development of new computational approaches to regularity discovery.
中文: 该研究提出一种算法,利用斯托克斯参数生成反映复杂数据中相关类型与强度的图谱,能实现模式的空间解析和子区域划分,有助于在多个领域提升规律发现的效率。
English: The study introduces an algorithm that uses the Stokes parameter to generate maps depicting the type and strength of correlations in complex data, enabling spatial resolution of patterns and subregions for improved regularity discovery in various fields.
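A minimal sketch of the Stokes-parameter encoding of a correlation direction, which respects its two-fold symmetry (a correlation and its reverse are the same correlation); the smoothing operator and normalization here are assumptions.

    import numpy as np

    def stokes_correlation_map(dp, dq, smooth=lambda a: a):
        theta = np.arctan2(dq, dp)            # local correlation angle
        amp = np.hypot(dp, dq)
        # Doubling the angle maps (dp, dq) and (-dp, -dq) to the same point.
        Q = smooth(amp * np.cos(2 * theta))
        U = smooth(amp * np.sin(2 * theta))
        degree = np.hypot(Q, U)               # strength of the correlation
        angle = 0.5 * np.arctan2(U, Q)        # type/orientation of correlation
        return degree, angle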
Authors:Ruining Sun, Hongsheng Hu, Wei Luo, Zhaoxi Zhang, Yanjun Zhang, Haizhuan Yuan, Leo Yu Zhang
Abstract:
With the rapid advancement of deep learning technology, pre-trained encoder models have demonstrated exceptional feature extraction capabilities, playing a pivotal role in the research and application of deep learning. However, their widespread use has raised significant concerns about the risk of training data privacy leakage. This paper systematically investigates the privacy threats posed by membership inference attacks (MIAs) targeting encoder models, focusing on contrastive learning frameworks. Through experimental analysis, we reveal the significant impact of model architecture complexity on membership privacy leakage: As more advanced encoder frameworks improve feature-extraction performance, they simultaneously exacerbate privacy-leakage risks. Furthermore, this paper proposes a novel membership inference attack method based on the p-norm of feature vectors, termed the Embedding Lp-Norm Likelihood Attack (LpLA). This method infers membership status, by leveraging the statistical distribution characteristics of the p-norm of feature vectors. Experimental results across multiple datasets and model architectures demonstrate that LpLA outperforms existing methods in attack performance and robustness, particularly under limited attack knowledge and query volumes. This study not only uncovers the potential risks of privacy leakage in contrastive learning frameworks, but also provides a practical basis for privacy protection research in encoder models. We hope that this work will draw greater attention to the privacy risks associated with self-supervised learning models and shed light on the importance of a balance between model utility and training data privacy. Our code is publicly available at: https://github.com/SeroneySun/LpLA_code.
中文: 本文研究了对比学习框架中编码器模型的成员推理攻击隐私风险,发现先进模型架构会加剧隐私泄露,并提出基于p-范数的新型攻击方法LpLA,其性能优于现有方法。
English: This paper investigates privacy risks from membership inference attacks in contrastive learning encoder models, revealing that advanced architectures increase leakage risks while proposing a novel Lp-norm based attack method (LpLA) that outperforms existing approaches.
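A minimal sketch of an Lp-norm likelihood attack in the spirit of LpLA, assuming reference member/non-member feature sets (e.g., from a shadow encoder) and kernel density estimates; the paper's exact statistical model may differ.

    import numpy as np
    from scipy.stats import gaussian_kde

    def lp_la_scores(member_feats, nonmember_feats, query_feats, p=2):
        # Score queries by the likelihood ratio of their feature-vector
        # p-norms under the member vs. non-member norm distributions.
        norms = lambda f: np.linalg.norm(f, ord=p, axis=1)
        kde_in = gaussian_kde(norms(member_feats))
        kde_out = gaussian_kde(norms(nonmember_feats))
        x = norms(query_feats)
        return kde_in(x) / (kde_out(x) + 1e-12)   # > 1 suggests membership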
Authors:Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Yifan Li, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Abstract:
Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.
中文摘要:本文提出了一种知识遗忘评估框架,通过知识图谱和大型语言模型作为评判者,更准确地评估大语言模型中已遗忘事实的隐性存留,发现现有方法高估了遗忘效果。
English Summary: This paper introduces a knowledge unlearning evaluation framework that uses knowledge graphs and LLM judges to more accurately assess the implicit persistence of forgotten facts in large language models, revealing that current methods overestimate unlearning effectiveness.
Authors:Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi K. Potluru, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Abstract:
Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.
中文摘要:本文提出了一种知识遗忘评估框架,通过知识图谱和大型语言模型作为评判者,更准确地评估大语言模型中已遗忘事实的隐性存留,发现现有方法高估了遗忘效果。
English Summary: This paper introduces a knowledge unlearning evaluation framework that uses knowledge graphs and LLM judges to more accurately assess the implicit persistence of forgotten facts in large language models, revealing that current methods overestimate unlearning effectiveness.
Authors:Fang Wu, Vijay Prakash Dwivedi, Jure Leskovec
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across various domains, yet their application to relational deep learning (RDL) remains underexplored. Existing approaches adapt LLMs by traversing relational links between entities in a database and converting the structured data into flat text documents. Still, this text-based serialization disregards critical relational structures, introduces redundancy, and often exceeds standard LLM context lengths. We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for LLMs within a retrieval-augmented generation (RAG) framework. Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to effectively process and reason over complex entity relationships. Specifically, the GNN encoder extracts a local subgraph around an entity to build feature representations that contain relevant entity relationships and temporal dependencies. These representations are transformed into structured prompts using a denormalization process, effectively allowing the LLM to reason over relational structures. Through extensive experiments, we demonstrate that Rel-LLM outperforms existing methods on key RDL tasks, offering a scalable and efficient approach to integrating LLMs with structured data sources. Code is available at https://github.com/smiles724/Rel-LLM.
Chinese: Rel-LLM提出了一种新颖架构,通过基于图神经网络的编码器在检索增强生成框架中生成结构化关系提示,既保留了数据库的内在关系结构,又使大语言模型能有效处理复杂实体关系,在关键关系深度学习任务上超越了现有方法。
English: Rel-LLM introduces a novel architecture that uses a GNN-based encoder to generate structured relational prompts within a RAG framework, preserving database structures and enabling LLMs to effectively reason over complex entity relationships, outperforming existing methods on key RDL tasks.
Authors:Dumindu Tissera, Omar Awadallah, Muhammad Umair Danish, Ayan Sadhu, Katarina Grolinger
Abstract:
Multi-label Classification (MLC) assigns an instance to one or more non-exclusive classes. A challenge arises when the dataset contains a large proportion of instances with no assigned class, referred to as negative data, which can overwhelm the learning process and hinder the accurate identification and classification of positive instances. Nevertheless, it is common in MLC applications such as industrial defect detection, agricultural disease identification, and healthcare diagnosis to encounter large amounts of negative data. Assigning a separate negative class to these instances further complicates the learning objective and introduces unnecessary redundancies. To address this challenge, we redesign standard MLC loss functions by deriving a likelihood of any class being present, formulated as a normalized weighted geometric mean of the predicted class probabilities. We introduce a regularization parameter that controls the relative contribution of the absent class probabilities to the any-class presence likelihood in positive instances. The any-class presence likelihood complements multi-label learning by encouraging the network to become more aware of implicit positive instances and improve label classification within those positive instances. Experiments on large-scale datasets with negative data (SewerML, modified COCO, and ChestX-ray14), across various networks and base loss functions, show that our loss functions consistently improve MLC performance over their standard counterparts, achieving gains of up to 6.01 percentage points in F1, 8.06 in F2, and 3.11 in mean average precision, all without additional parameters or computational complexity. Code available at: https://github.com/ML-for-Sensor-Data-Western/gmean-mlc
中文摘要:本研究通过归一化加权几何均值重构多标签分类损失函数,有效处理负数据,在不增加参数或计算复杂度的情况下提升了分类性能。
English Summary: This study redesigns standard multi-label classification loss functions using a normalized weighted geometric mean to handle negative data, improving performance without extra parameters or computational cost.
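One hedged reading of the any-class presence likelihood is sketched below with uniform weights: the geometric mean of the per-class absence probabilities (1 - p_k) estimates "no class present", and its complement estimates "some class present". The paper's weighted formulation and its regularization parameter are more refined than this.

    import torch

    def any_class_presence(probs):
        # probs: (batch, num_classes) sigmoid outputs. Supervise the returned
        # likelihood with a binary target: 1 for positive instances, 0 for
        # negative (label-free) ones.
        absent = (1.0 - probs).clamp(min=1e-6)
        log_gmean_absent = absent.log().mean(dim=-1)   # uniform-weight geo-mean
        return 1.0 - log_gmean_absent.exp()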
Authors:Zhan Zhuang, Xiequn Wang, Wei Li, Yulong Zhang, Qiushi Huang, Shuhao Chen, Xuehao Wang, Yanbin Wei, Yuhe Nie, Kede Ma, Yu Zhang, Ying Wei
Abstract:
Low-rank adaptation (LoRA) has emerged as a leading parameter-efficient fine-tuning technique for adapting large foundation models, yet it often locks adapters into suboptimal minima near their initialization. This hampers model generalization and limits downstream operators such as adapter merging and pruning. Here, we propose CoTo, a progressive training strategy that gradually increases adapters' activation probability over the course of fine-tuning. By stochastically deactivating adapters, CoTo encourages more balanced optimization and broader exploration of the loss landscape. We provide a theoretical analysis showing that CoTo promotes layer-wise dropout stability and linear mode connectivity, and we adopt a cooperative-game approach to quantify each adapter's marginal contribution. Extensive experiments demonstrate that CoTo consistently boosts single-task performance, enhances multi-task merging accuracy, improves pruning robustness, and reduces training overhead, all while remaining compatible with diverse LoRA variants. Code is available at https://github.com/zwebzone/coto.
中文: CoTo是一种渐进式训练策略,通过逐步提高适配器激活概率来增强LoRA,促进更均衡的优化和更广泛的损失空间探索,从而提升模型泛化能力和下游任务表现。
English: CoTo is a progressive training strategy that enhances LoRA by gradually increasing adapter activation probability, promoting balanced optimization and broader loss landscape exploration to improve generalization and downstream task performance.
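A minimal sketch of the progressive activation schedule, assuming a linear ramp from an initial keep probability p0 to 1; the paper's exact schedule and per-layer treatment may differ.

    import torch

    def coto_adapter_gates(num_adapters, step, total_steps, p0=0.5):
        # Keep each adapter with probability p(t) that grows to 1 over
        # fine-tuning; early stochastic deactivation encourages balanced
        # optimization and broader exploration of the loss landscape.
        p = min(1.0, p0 + (1.0 - p0) * step / max(1, total_steps))
        return (torch.rand(num_adapters) < p).float()

    # Usage: scale each LoRA branch's output by its gate during training; at
    # p = 1 (late training and inference) all adapters stay active.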
Authors:Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, Jinsong Su
Abstract:
Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate reasoning. Despite its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, covering fact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph construction and knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions when GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analyses are collected for the community at https://github.com/GraphRAG-Bench/GraphRAG-Benchmark.
Chinese: GraphRAG-Bench是一个全面基准,通过测试不同难度任务中的层次知识检索和上下文推理,旨在评估GraphRAG何时及为何优于传统RAG。
English: GraphRAG-Bench is a comprehensive benchmark introduced to evaluate when and why GraphRAG outperforms traditional RAG by testing hierarchical knowledge retrieval and contextual reasoning across tasks of varying difficulty.
Authors:Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah
Abstract:
Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to design the proposed \textbf{BAQ} (Bit Allocation Quantization) algorithm. The proposed algorithm achieves a good trade-off between loss minimization and complexity and allows BAQ to be integrated into standard quantization pipelines with minimal overhead. Experimental results show that BAQ consistently outperforms GPTQ, achieving up to 56$\times$ lower perplexity at the same bitwidth on large language models ranging from 125M to 30B parameters. Leveraging our analytical results derived from solving the optimal bit allocation problem, we also provide a theoretical explanation for the observed gains. All codes of this paper are available at https://github.com/CSU-ModelCompression/BAQ.
中文摘要:本文提出BAQ量化框架,通过基于Hessian的敏感度指标和凸优化实现最优位宽分配,在多种规模的大语言模型上相比GPTQ实现高达56倍的困惑度降低。
English Summary: This paper introduces BAQ, a novel quantization framework that optimally allocates bitwidths using Hessian-based sensitivity metrics and convex optimization, significantly outperforming GPTQ with up to 56× lower perplexity across various LLM sizes.
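The closed-form, equal-loss structure described above lends itself to a compact illustration. Below is a minimal sketch (ours, not the authors' released code) of bit allocation under the assumed per-group loss model s_i * 2^(-2*b_i); the sensitivity values, clipping range, and function names are hypothetical.

```python
import numpy as np

def allocate_bits(sensitivities, avg_bits=4.0, b_min=2, b_max=8):
    """Closed-form bit allocation under an equal-loss structure.

    Assumes the layer-wise quantization loss is ~ s_i * 2^(-2*b_i), so
    equalizing per-group loss under a total bit budget gives
    b_i = avg_bits + 0.5 * log2(s_i / geometric_mean(s)).
    """
    s = np.asarray(sensitivities, dtype=np.float64)
    log_gm = np.mean(np.log2(s))                  # log2 of the geometric mean
    bits = avg_bits + 0.5 * (np.log2(s) - log_gm)
    # Round and clip to hardware-friendly bitwidths; rounding/clipping
    # perturbs the exact budget, so a real pipeline would redistribute slack.
    return np.clip(np.round(bits), b_min, b_max).astype(int)

# Hypothetical Hessian-proxy sensitivities for five weight groups.
sens = [0.5, 2.0, 8.0, 0.1, 1.0]
print(allocate_bits(sens))   # more sensitive groups receive more bits
```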
Authors:Md Jueal Mia, M. Hadi Amini
Abstract:
Federated Learning (FL) offers a decentralized framework for training and fine-tuning Large Language Models (LLMs) by leveraging computational resources across organizations while keeping sensitive data on local devices. It addresses privacy and security concerns while navigating challenges associated with the substantial computational demands of LLMs, which can be prohibitive for small and medium-sized organizations. FL supports the development of task-specific LLMs for cross-silo applications through fine-tuning but remains vulnerable to inference attacks, such as membership inference and gradient inversion, which threaten data privacy. Prior studies have utilized Differential Privacy (DP) in LLM fine-tuning, which, despite being effective at preserving privacy, can degrade model performance. To overcome these challenges, we propose a novel method, FedShield-LLM, that uses pruning with Fully Homomorphic Encryption (FHE) for Low-Rank Adaptation (LoRA) parameters, enabling secure computations on encrypted model updates while mitigating the attack surface by deactivating less important LoRA parameters. Furthermore, optimized federated algorithms for cross-silo environments enhance scalability and efficiency. Parameter-efficient fine-tuning techniques like LoRA substantially reduce computational and communication overhead, making FL feasible for resource-constrained clients. Experimental results show that the proposed method outperforms existing methods while maintaining robust privacy protection, enabling organizations to collaboratively train secure and efficient LLMs.
The code and data are available at https://github.com/solidlabnetwork/fedshield-llm
中文摘要:联邦学习为大型语言模型提供了去中心化训练框架,而提出的FedShield-LLM方法通过结合全同态加密和LoRA参数剪枝技术,在保护数据隐私的同时提升了模型性能,优于现有方法。
English Summary: Federated Learning enables decentralized training of Large Language Models while preserving data privacy, and the proposed FedShield-LLM method enhances security through pruning with Fully Homomorphic Encryption for LoRA parameters, outperforming existing approaches.
Authors:Aaron Schild, Sreenivas Gollapudi, Anupam Gupta, Kostas Kollias, Ali Sinop
Abstract:
Users of routing services like Apple Maps, Google Maps, and Waze frequently wonder why a given route is proposed. This question particularly arises when dynamic conditions like traffic and road closures cause unusual routes to be proposed. While many dynamic conditions may exist in a road network at any time, only a small fraction of those conditions are typically relevant to a given user's route. In this work, we introduce the concept of a simple valid explanation (SVE), which consists of a small set of traffic-laden road segments that answer the following question: Which traffic conditions cause a particular shortest traffic-aware route to differ from the shortest traffic-free route? We give an efficient algorithm for finding SVEs and show that they theoretically and experimentally lead to small and interpretable answers to the question.
Chinese: 本文提出简单有效解释(SVE)概念,通过识别导致交通感知最短路径偏离无交通最短路径的关键路段,为导航系统路线选择提供高效且易于理解的解释。
English: The paper introduces Simple Valid Explanations (SVEs) to identify key traffic conditions that cause routing services to deviate from traffic-free shortest paths, providing efficient and interpretable justifications for route choices.
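The abstract does not spell out the algorithm, so the sketch below is an illustrative greedy stand-in rather than the paper's method: it keeps adding the most traffic-laden unexplained edge on the currently preferred route until the traffic-aware route wins. The edge attributes `free` and `live` are assumed names.

```python
import networkx as nx

def greedy_explanation(G, source, target):
    """Illustrative greedy (not the paper's SVE algorithm): find a small set
    of traffic-laden edges that suffices to make the traffic-aware shortest
    route preferred over the traffic-free one. Each edge carries a 'free'
    (traffic-free) and a 'live' (with traffic) weight.
    """
    target_route = nx.shortest_path(G, source, target, weight="live")
    explanation = set()

    def w(u, v, d):  # free-flow everywhere, live traffic on explained edges
        return d["live"] if (u, v) in explanation else d["free"]

    while True:
        route = nx.shortest_path(G, source, target, weight=w)
        if route == target_route:
            return explanation
        # Penalize the currently-preferred route by explaining its
        # largest-delay unexplained edge.
        edges = list(zip(route, route[1:]))
        candidates = [e for e in edges if e not in explanation]
        u, v = max(candidates,
                   key=lambda e: G.edges[e]["live"] - G.edges[e]["free"])
        explanation.add((u, v))

G = nx.DiGraph()
G.add_edge("A", "B", free=1, live=5)   # congested segment
G.add_edge("B", "D", free=1, live=1)
G.add_edge("A", "C", free=3, live=3)
G.add_edge("C", "D", free=1, live=1)
print(greedy_explanation(G, "A", "D"))  # {('A', 'B')} explains the detour
```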
Authors:Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish
Abstract:
Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area.
In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
中文:MMTU是一个包含超过3万个问题、涵盖25种专家级表格任务的大规模基准,旨在全面评估模型对真实表格的理解、推理和操作能力,结果显示即使顶级模型得分也仅约60%,表明仍有巨大改进空间。
English: MMTU is a large-scale benchmark with over 30,000 questions across 25 expert-level table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables, revealing significant room for improvement as even top models score only around 60%.
Authors:Vlastimil Martinek, Andrea Gariboldi, Dimosthenis Tzimotoudis, Aitor Alberdi Escudero, Edward Blake, David Cechak, Luke Cassar, Alessandro Balestrucci, Panagiotis Alexiou
Abstract:
The adoption of machine learning (ML) and deep learning methods has revolutionized molecular medicine by driving breakthroughs in genomics, transcriptomics, drug discovery, and biological systems modeling. The increasing quantity, multimodality, and heterogeneity of biological datasets demand automated methods that can produce generalizable predictive models. Recent developments in large language model-based agents have shown promise for automating end-to-end ML experimentation on structured benchmarks. However, when applied to heterogeneous computational biology datasets, these methods struggle with generalization and success rates. Here, we introduce Agentomics-ML, a fully autonomous agent-based system designed to produce a classification model and the necessary files for reproducible training and inference. Our method follows predefined steps of an ML experimentation process, repeatedly interacting with the file system through Bash to complete individual steps. Once an ML model is produced, training and validation metrics provide scalar feedback to a reflection step to identify issues such as overfitting. This step then creates verbal feedback for future iterations, suggesting adjustments to steps such as data representation, model architecture, and hyperparameter choices. We have evaluated Agentomics-ML on several established genomic and transcriptomic benchmark datasets and show that it outperforms existing state-of-the-art agent-based methods in both generalization and success rates. While state-of-the-art models built by domain experts still lead in absolute performance on the majority of the computational biology datasets used in this work, Agentomics-ML narrows the gap for fully autonomous systems and achieves state-of-the-art performance on one of the used benchmark datasets. The code is available at https://github.com/BioGeMT/Agentomics-ML.
中文: Agentomics-ML是一种基于智能体的全自动系统,通过自动化机器学习实验流程在计算生物学数据集上实现了更好的泛化能力和成功率,显著缩小了与专家构建模型之间的性能差距。
English: Agentomics-ML is an autonomous agent-based system that automates end-to-end machine learning experimentation for computational biology datasets, outperforming existing agent methods in generalization and success rates while narrowing the performance gap with expert-built models.
Authors:Jie Cai, Kangning Yang, Ling Ouyang, Lan Fu, Jiaming Ding, Jinglin Shen, Zibo Meng
Abstract:
Removing reflections is a crucial task in computer vision, with significant applications in photography and image enhancement. Nevertheless, existing methods are constrained by the absence of large-scale, high-quality, and diverse datasets. In this paper, we present a novel benchmark for Single Image Reflection Removal (SIRR). We have developed a large-scale dataset containing 5,300 high-quality, pixel-aligned image pairs, each consisting of a reflection image and its corresponding clean version. Specifically, the dataset is divided into two parts: 5,000 images are used for training, and 300 images are used for validation. Additionally, we have included 100 real-world testing images without ground truth (GT) to further evaluate the practical performance of reflection removal methods. All image pairs are precisely aligned at the pixel level to guarantee accurate supervision. The dataset encompasses a broad spectrum of real-world scenarios, featuring various lighting conditions, object types, and reflection patterns, and is segmented into training, validation, and test sets to facilitate thorough evaluation. To validate the usefulness of our dataset, we train a U-Net-based model and evaluate it using five widely-used metrics, including PSNR, SSIM, LPIPS, DISTS, and NIQE. We will release both the dataset and the code on https://github.com/caijie0620/OpenRR-5k to facilitate future research in this field.
中文: 本文提出了一个包含5,300组像素级对齐图像对的大规模单图像反射去除基准数据集,解决了现有方法因缺乏高质量数据而受限的问题,并通过多种评估指标验证了其有效性。
English: This paper introduces a large-scale benchmark dataset of 5,300 pixel-aligned image pairs for Single Image Reflection Removal, addressing the limitations of existing methods and enabling comprehensive evaluation across diverse real-world scenarios.
Authors:Hongbo Zhao, Fei Zhu, Haiyang Guo, Meng Wang, Rundong Wang, Gaofeng Meng, Zhaoxiang Zhang
Abstract:
Recent Multimodal Large Language Models (MLLMs) excel in vision-language understanding but face challenges in adapting to dynamic real-world scenarios that require continuous integration of new knowledge and skills. While continual learning (CL) offers a potential solution, existing benchmarks and methods suffer from critical limitations. In this paper, we introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning, where the former focuses on independently and identically distributed (IID) evaluation across evolving mainstream domains, whereas the latter evaluates on non-IID scenarios with new model abilities. Methodologically, we propose preventing catastrophic interference through parameter isolation and an MLLM-based routing mechanism. Extensive experiments demonstrate that our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods. Our benchmark and code are available at https://github.com/bjzhb666/MLLM-CL.
中文: 多模态大语言模型在适应动态环境中的新知识方面面临挑战,为此提出的MLLM-CL基准和方法通过参数隔离和路由机制,在持续学习场景中显著减少遗忘并提升性能。
English: Multimodal Large Language Models struggle with adapting to new knowledge in dynamic environments, leading to the development of MLLM-CL, a benchmark and method that uses parameter isolation and routing to minimize forgetting and enhance performance in continual learning scenarios.
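Parameter isolation plus routing can be pictured with a toy module. The sketch below is an assumed design, not the MLLM-CL implementation (which routes with an MLLM rather than a linear scorer): each task gets its own low-rank adapter that is never overwritten, and a router decides which adapter fires at inference.

```python
import torch
import torch.nn as nn

class IsolatedAdapterRouter(nn.Module):
    """Minimal sketch of parameter isolation with routing: a frozen backbone
    projection plus one isolated low-rank adapter per learned domain/ability,
    with a lightweight per-task scoring head choosing the active adapter."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.adapters = nn.ModuleList()      # one entry per task, never shared
        self.router = nn.ModuleList()        # per-task scoring heads
        self.rank, self.dim = rank, dim

    def add_task(self):
        a = nn.Sequential(nn.Linear(self.dim, self.rank, bias=False),
                          nn.Linear(self.rank, self.dim, bias=False))
        self.adapters.append(a)
        self.router.append(nn.Linear(self.dim, 1))

    def forward(self, x):
        # Route on the mean-pooled input; only the winning adapter contributes,
        # so earlier tasks' parameters are never updated (no forgetting).
        scores = torch.cat([r(x.mean(dim=1)) for r in self.router], dim=-1)
        idx = scores.argmax(dim=-1)
        out = self.base(x)
        for i, a in enumerate(self.adapters):
            mask = (idx == i).view(-1, 1, 1).float()
            out = out + mask * a(x)
        return out

m = IsolatedAdapterRouter(dim=32)
m.add_task(); m.add_task()
print(m(torch.randn(4, 10, 32)).shape)  # torch.Size([4, 10, 32])
```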
Authors:Andrei Mircea, Supriyo Chakraborty, Nima Chitsazan, Milind Naphade, Sambit Sahu, Irina Rish, Ekaterina Lobacheva
Abstract:
This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training: an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl
中文摘要:本研究发现语言模型扩展通过降低损失减速发生的临界点并改善减速后的学习效率,来缓解早期训练中的损失减速现象,该现象源于零和学习机制中的梯度相互抵消问题。
English Summary: This study reveals that language model scaling mitigates early training loss deceleration by reducing its onset loss and improving post-deceleration learning rates, attributing this phenomenon to zero-sum learning dynamics where gradient interference limits progress.
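Zero-sum learning is defined by systematically opposed per-example gradients, which suggests a direct diagnostic. The snippet below (our sketch, not the released code) estimates interference as the mean pairwise cosine similarity of per-example gradients; persistently negative values indicate the destructive regime the abstract describes.

```python
import torch
import torch.nn as nn

def grad_interference(model, loss_fn, xs, ys):
    """Diagnostic sketch: mean pairwise cosine similarity between per-example
    gradients. Systematically negative values mean examples pull the weights
    in opposing directions (zero-sum-learning-style interference)."""
    grads = []
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = torch.autograd.grad(loss, list(model.parameters()))
        grads.append(torch.cat([t.flatten() for t in g]))
    G = torch.stack(grads)
    G = G / G.norm(dim=1, keepdim=True)
    sim = G @ G.T
    n = len(grads)
    off_diag = (sim.sum() - n) / (n * (n - 1))   # exclude self-similarity
    return off_diag.item()

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
xs, ys = torch.randn(6, 8), torch.randint(0, 4, (6,))
print(grad_interference(model, nn.CrossEntropyLoss(), xs, ys))
```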
Authors:Kimberley M. Bird, Xujiong Ye, Alan M. Race, James M. Brown
Abstract:
Registration of histological and mass spectrometry imaging (MSI) allows for more precise identification of structural changes and chemical interactions in tissue. With histology and MSI having entirely different image formation processes and dimensionalities, registration of the two modalities remains an ongoing challenge. This work proposes a solution that synthesises histological images from MSI, using a pix2pix model, to effectively enable unimodal registration. Preliminary results show promising synthetic histology images with limited artifacts, achieving increases in mutual information (MI) and structural similarity index measures (SSIM) of +0.924 and +0.419, respectively, compared to a baseline U-Net model. Our source code is available on GitHub: https://github.com/kimberley/MIUA2025.
中文: 本研究提出了一种基于pix2pix的方法,可从质谱成像数据合成组织学图像,实现有效的单模态配准,并在互信息和结构相似性指标上展现出优于基准模型的性能提升。
English: This study introduces a pix2pix-based method to synthesize histological images from mass spectrometry imaging, enabling effective unimodal registration and demonstrating improved mutual information and structural similarity over baseline models.
Authors:Ludovic Arnould, Salim Khazem, Hugues Ali Mehenni
Abstract:
Visual Language Models (VLMs) are now sufficiently advanced to support a broad range of applications, including answering complex visual questions, and are increasingly expected to interact with images in varied ways. To evaluate them, current benchmarks often focus on specific domains (e.g., reading charts), constructing datasets of annotated real images paired with pre-defined Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However, such benchmarks entail high annotation costs, risk information leakage, and do not clarify whether failures stem from limitations in visual perception, reasoning, or general knowledge. We propose a new evaluation methodology, inspired by ophthalmologic diagnostics, leveraging procedural generation of synthetic images to obtain control over visual attributes and precisely reveal perception failures in VLMs. Specifically, we build collections of images with gradually more challenging variations in the content of interest (e.g., number of objects in a counting task) while holding other visual parameters constant. This diagnostic allows systematic stress testing and fine-grained failure analysis, shifting the focus from coarse benchmarking toward targeted and interpretable assessment of VLM capabilities. Our code is available at https://github.com/byoeval/BYO-EVAL.
Chinese: 该摘要提出了一种新的视觉语言模型评估方法,通过程序生成合成图像来系统性地测试并精确识别感知缺陷,超越了依赖标注真实图像和总体准确率的传统基准测试。
English: The abstract proposes a new diagnostic evaluation method for Visual Language Models (VLMs) using procedurally generated synthetic images to systematically test and precisely identify perception failures, moving beyond traditional benchmarks that rely on annotated real images and aggregate accuracy scores.
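The core idea, procedurally generating images in which one attribute of interest varies while everything else is held fixed, is easy to reproduce in miniature. The sketch below is an assumed setup, not the BYO-EVAL code: it renders counting images whose ground truth is exact by construction, so any model error is attributable to perception.

```python
import random
from PIL import Image, ImageDraw

def make_counting_image(n_objects, size=256, radius=12, seed=0):
    """Sketch of the procedural-diagnostic idea: render n non-overlapping
    circles while holding size, color, and layout statistics fixed, so a
    counting question has an exact ground-truth answer by construction."""
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    centers = []
    while len(centers) < n_objects:
        x = rng.randint(radius, size - radius)
        y = rng.randint(radius, size - radius)
        if all((x - cx) ** 2 + (y - cy) ** 2 > (2 * radius) ** 2
               for cx, cy in centers):
            centers.append((x, y))
            draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                         fill="black")
    return img

# Difficulty ramps by increasing only the count; everything else is constant.
for n in (3, 7, 15):
    make_counting_image(n).save(f"count_{n}.png")
```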
Authors:Zikang Liu, Tongtian Yue, Yepeng Tang, Longteng Guo, Junxian Cai, Qingbin Liu, Xi Chen, Jing Liu
Abstract:
Group Relative Policy Optimization (GRPO) enhances policy learning by computing gradients from relative comparisons among candidate outputs that share a common input prefix. Despite its effectiveness, GRPO introduces substantial computational overhead when processing long shared prefixes, which must be redundantly encoded for each group member. This inefficiency becomes a major scalability bottleneck in long-context learning scenarios. We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefix computation via a Shared-Prefix Forward strategy. In particular, by restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once, while preserving full differentiability and compatibility with end-to-end training. We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO: it yields identical forward outputs and backward gradients, ensuring that the optimization dynamics and final policy performance remain unchanged. Empirically, our experiments confirm that Prefix Grouper achieves consistent results while significantly reducing the computational cost of training, particularly in long-prefix scenarios. The proposed method is fully plug-and-play: it is compatible with existing GRPO-based architectures and can be seamlessly integrated into current training pipelines as a drop-in replacement, requiring no structural modifications and only minimal changes to input construction and attention computation. Prefix Grouper enables the use of larger group sizes under the same computational budget, thereby improving the scalability of GRPO to more complex tasks and larger models. Code is now available at https://github.com/johncaged/PrefixGrouper
中文摘要:Prefix Grouper是一种高效训练算法,通过重构自注意力机制消除GRPO中的冗余前缀计算,在保持性能不变的同时显著降低计算成本。
English Summary: Prefix Grouper is an efficient training algorithm that eliminates redundant prefix computations in Group Relative Policy Optimization (GRPO) by restructuring self-attention, maintaining identical performance while significantly reducing computational costs.
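The Shared-Prefix Forward can be seen at the level of a single attention call. The sketch below is a simplified illustration of the idea, not the released kernels: prefix keys/values are computed once (batch dimension 1) and broadcast across the G group members, while a mask reproduces the attention pattern of the naive repeated-prefix forward. Tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def shared_prefix_attention(q_suf, k_pre, v_pre, k_suf, v_suf):
    """One-layer sketch of a shared-prefix forward: suffix queries attend
    causally within the suffix and fully over the prefix, so the result
    matches re-encoding the prefix for every group member."""
    G, H, Ls, D = q_suf.shape
    Lp = k_pre.shape[2]
    k = torch.cat([k_pre.expand(G, -1, -1, -1), k_suf], dim=2)
    v = torch.cat([v_pre.expand(G, -1, -1, -1), v_suf], dim=2)
    mask = torch.zeros(Ls, Lp + Ls, dtype=torch.bool)
    mask[:, :Lp] = True                                  # see whole prefix
    mask[:, Lp:] = torch.tril(torch.ones(Ls, Ls, dtype=torch.bool))
    return F.scaled_dot_product_attention(q_suf, k, v, attn_mask=mask)

G, H, Lp, Ls, D = 4, 2, 128, 16, 64                      # long shared prefix
q = torch.randn(G, H, Ls, D)
k_pre, v_pre = torch.randn(1, H, Lp, D), torch.randn(1, H, Lp, D)
k_suf, v_suf = torch.randn(G, H, Ls, D), torch.randn(G, H, Ls, D)
print(shared_prefix_attention(q, k_pre, v_pre, k_suf, v_suf).shape)
```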
Authors:Zishan Shu, Yufan Deng, Hongyu Zhang, Zhiwei Nie, Jie Chen
Abstract:
Activity cliff prediction is a critical task in drug discovery and material design. Existing computational methods are limited to handling single binding targets, which restricts the applicability of these prediction models. In this paper, we present the Multi-Grained Target Perception network (MTPNet) to incorporate the prior knowledge of interactions between the molecules and their target proteins. Specifically, MTPNet is a unified framework for activity cliff prediction, which consists of two components: Macro-level Target Semantic (MTS) guidance and Micro-level Pocket Semantic (MPS) guidance. In this way, MTPNet dynamically optimizes molecular representations through multi-grained protein semantic conditions. To our knowledge, this is the first time receptor proteins have been employed as guiding information to effectively capture critical interaction details. Extensive experiments on 30 representative activity cliff datasets demonstrate that MTPNet significantly outperforms previous approaches, achieving an average RMSE improvement of 18.95% on top of several mainstream GNN architectures. Overall, MTPNet internalizes interaction patterns through conditional deep learning to achieve unified predictions of activity cliffs, helping to accelerate compound optimization and design. Codes are available at: https://github.com/ZishanShu/MTPNet.
Chinese: MTPNet通过多粒度蛋白质语义引导动态优化分子表征,提出了统一的活性悬崖预测框架,在30个代表性数据集上显著超越现有方法,平均预测精度提升18.95%。
English: MTPNet introduces a unified framework for activity cliff prediction by dynamically optimizing molecular representations through multi-grained protein semantic guidance, significantly outperforming existing methods with an average RMSE improvement of 18.95% across 30 datasets.
Authors:Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
Abstract:
In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose \textbf{T2MIR} (\textbf{T}oken- and \textbf{T}ask-wise \textbf{M}oE for \textbf{I}n-context \textbf{R}L), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
中文:T2MIR框架通过引入标记级和任务级混合专家机制到基于Transformer的决策模型中,有效解决了情境强化学习中的多模态和任务异质性挑战,并在实验中展现出优于各类基准方法的性能。
English: The T2MIR framework introduces token- and task-wise mixture-of-experts to transformer-based decision models, effectively addressing multi-modality and task heterogeneity in in-context reinforcement learning while demonstrating superior performance over existing methods.
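A toy version of the two parallel MoE layers makes the design concrete. The sketch below is illustrative, not the T2MIR code: soft routing stands in for the paper's routers, the contrastive routing loss is omitted, and dimensions are arbitrary. Token-wise experts are routed per token, task-wise experts by a pooled task representation, and the two halves are concatenated back to the model width.

```python
import torch
import torch.nn as nn

class TwoAxisMoE(nn.Module):
    """Minimal sketch of parallel token-wise and task-wise MoE layers that
    replace a transformer FFN. Each expert maps dim -> dim/2 so that the
    concatenated output returns to the original width."""
    def __init__(self, dim, n_experts=4):
        super().__init__()
        make = lambda: nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                           nn.Linear(dim, dim // 2)) for _ in range(n_experts)])
        self.tok_experts, self.task_experts = make(), make()
        self.tok_router = nn.Linear(dim, n_experts)
        self.task_router = nn.Linear(dim, n_experts)

    @staticmethod
    def _route(x, router, experts):
        w = router(x).softmax(dim=-1)                       # soft routing proxy
        out = torch.stack([e(x) for e in experts], dim=-2)  # (..., E, dim/2)
        return (w.unsqueeze(-1) * out).sum(dim=-2)

    def forward(self, x):                                   # x: (B, L, dim)
        tok = self._route(x, self.tok_router, self.tok_experts)
        task_repr = x.mean(dim=1, keepdim=True)             # pooled task context
        task = self._route(task_repr, self.task_router, self.task_experts)
        return torch.cat([tok, task.expand_as(tok)], dim=-1)

layer = TwoAxisMoE(dim=64)
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```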
Authors:Seunghwan Shin, Yusung Kim
Abstract:
In the field of Multi-Person Pose Estimation (MPPE), Radio Frequency (RF)-based methods can operate effectively regardless of lighting conditions and obscured line-of-sight situations. Existing RF-based MPPE methods typically involve either 1) converting RF signals into heatmap images through complex preprocessing, or 2) applying a deep embedding network directly to raw RF signals. The first approach, while delivering decent performance, is computationally intensive and time-consuming. The second method, though simpler in preprocessing, results in lower MPPE accuracy and generalization performance. This paper proposes an efficient and lightweight one-stage MPPE model based on raw RF signals. By sub-grouping RF signals and embedding them using a shared single-layer CNN followed by multi-head attention, this model outperforms previous methods that embed all signals at once through a large and deep CNN. Additionally, we propose a new self-supervised learning (SSL) method that takes inputs from both one unmasked subgroup and the remaining masked subgroups to predict the latent representations of the masked data. Empirical results demonstrate that our model improves MPPE accuracy by up to 15 points in PCKh@0.5 compared to previous methods using raw RF signals. In particular, the proposed SSL method significantly enhances performance when the RF antennas are placed in new locations or in front of obstacles, with gains growing as the number of people increases. Our code and dataset are available on GitHub: https://github.com/sshnan7/SOSPE
中文: 本文提出了一种基于原始射频信号的轻量级单阶段多人姿态估计模型,通过分组信号并采用共享CNN与多头注意力机制,在精度和泛化性能上优于现有方法,同时结合自监督学习有效提升了在陌生环境和存在障碍物时的表现。
English: This paper introduces a lightweight one-stage model for multi-person pose estimation using raw RF signals, which groups signals and employs a shared CNN with multi-head attention to achieve higher accuracy and better generalization than previous methods, alongside a self-supervised learning approach that enhances performance in new environments and with obstacles.
Authors:Jeongsoo Ha, Kyungsoo Kim, Yusung Kim
Abstract:
Model-based reinforcement learning (MBRL) has been used to efficiently solve vision-based control tasks with high-dimensional image observations. Although recent MBRL algorithms perform well on trained observations, they fail when faced with visual distractions in observations. These task-irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real-world scenarios. In this study, we propose a novel self-supervised method, Dream to Generalize (Dr. G), for zero-shot MBRL. Dr. G trains its encoder and world model with dual contrastive learning, which efficiently captures task-relevant features among multi-view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure. The proposed methods can enhance the robustness of the world model against visual distractions. To evaluate the generalization performance, we first train Dr. G on simple backgrounds and then test it on complex natural video backgrounds in the DeepMind Control suite, and the randomizing environments in Robosuite. Dr. G yields a performance improvement of 117% and 14% over prior works, respectively. Our code is open-sourced and available at https://github.com/JeongsooHa/DrG.git
中文摘要:基于模型的强化学习在视觉干扰下表现不佳,而本研究提出的自监督方法Dr. G通过双对比学习和循环状态逆动力学模型捕捉任务相关特征,在零样本泛化测试中显著提升了抗干扰能力和性能。
English Summary: Model-based reinforcement learning (MBRL) often struggles with visual distractions, but the proposed self-supervised method Dr. G uses dual contrastive learning and a recurrent state inverse dynamics model to enhance robustness, achieving significant performance improvements in zero-shot generalization tests.
Authors:Kyungsoo Kim, Jeongsoo Ha, Yusung Kim
Abstract:
Vision-based reinforcement learning requires efficient and robust representations of image-based observations, especially when the images contain distracting (task-irrelevant) elements such as shadows, clouds, and light. It becomes more important if those distractions are not exposed during training. We design a Self-Predictive Dynamics (SPD) method to extract task-relevant features efficiently, even in unseen observations after training. SPD uses weak and strong augmentations in parallel, and learns representations by predicting inverse and forward transitions across the two-way augmented versions. In a set of MuJoCo visual control tasks and an autonomous driving task (CARLA), SPD outperforms previous studies in complex observations, and significantly improves the generalization performance for unseen observations. Our code is available at https://github.com/unigary/SPD.
中文:自预测动态(SPD)方法通过双向增强预测状态转换,能有效提取含干扰图像中的任务相关特征,在视觉控制任务中表现优异,并显著提升对未见观察的泛化能力。
English: The Self-Predictive Dynamics (SPD) method effectively extracts task-relevant features from images with distractions by using dual augmentations to predict transitions, demonstrating superior performance in visual control tasks and enhanced generalization for unseen observations.
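The two-way augmented prediction objective can be written compactly. The sketch below is an assumed form of the SPD loss rather than the released code, with stand-in linear networks and additive-noise augmentations: a forward model predicts the next latent across views and an inverse model recovers the action, so only features that survive both augmentations remain useful.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spd_loss(encoder, forward_model, inverse_model, obs, next_obs, action,
             weak_aug, strong_aug):
    """Sketch of a Self-Predictive Dynamics objective: encode weakly- and
    strongly-augmented views, then predict transitions across the two-way
    augmented versions (forward: next latent; inverse: the action)."""
    z_w, z_s = encoder(weak_aug(obs)), encoder(strong_aug(obs))
    zn_w, zn_s = encoder(weak_aug(next_obs)), encoder(strong_aug(next_obs))
    # Cross-view forward prediction: strong view must predict weak next latent.
    fwd = F.mse_loss(forward_model(torch.cat([z_s, action], -1)), zn_w.detach())
    # Cross-view inverse prediction: recover the action from (z_w, zn_s).
    inv = F.mse_loss(inverse_model(torch.cat([z_w, zn_s], -1)), action)
    return fwd + inv

enc = nn.Linear(16, 8)                       # stand-in image encoder
fwd_m = nn.Linear(8 + 2, 8)
inv_m = nn.Linear(16, 2)
obs, nxt, act = torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 2)
weak = lambda x: x + 0.01 * torch.randn_like(x)
strong = lambda x: x + 0.30 * torch.randn_like(x)
print(spd_loss(enc, fwd_m, inv_m, obs, nxt, act, weak, strong).item())
```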
Authors:Patrik Czakó, Gábor Kertész, Sándor Szénási
Abstract:
We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers, by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30\% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at https://github.com/czakop/smoothrot.
中文: SmoothRot是一种新颖的训练后量化技术,通过将激活异常值转化为量化友好形式,显著提升大型语言模型的4位量化效率,在不增加延迟的情况下将性能差距缩小10-30%。
English: SmoothRot is a novel post-training quantization technique that enhances 4-bit quantization efficiency in Large Language Models by transforming activation outliers into quantization-friendly forms, significantly reducing performance gaps by 10-30% without added latency.
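The two ingredients, channel-wise scaling and a Hadamard rotation, compose into an exact reparameterization of a linear layer. The sketch below shows the math only (illustrative, not the released SmoothRot kernels), using a SmoothQuant-style scale rule as an assumption: scaling migrates outlier magnitude from activations into weights, and the Hadamard transform spreads what remains across channels.

```python
import numpy as np
from scipy.linalg import hadamard

def smooth_and_rotate(W, X, alpha=0.5):
    """Conceptual sketch: channel-wise scaling followed by an orthogonal
    Hadamard rotation, preserving X @ W.T exactly while flattening the
    activation range that a 4-bit quantizer has to cover."""
    d = W.shape[1]
    # SmoothQuant-style per-channel scales from activation/weight magnitudes.
    a_max = np.abs(X).max(axis=0)
    w_max = np.abs(W).max(axis=0)
    s = (a_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)
    W_s, X_s = W * s, X / s                    # exact reparameterization
    H = hadamard(d) / np.sqrt(d)               # orthogonal: H @ H.T = I
    return W_s @ H, X_s @ H                    # (X H)(W H).T == X @ W.T

rng = np.random.default_rng(0)
d = 64                                         # Hadamard needs a power of two
X = rng.normal(size=(32, d)); X[:, 3] *= 50    # one massive outlier channel
W = rng.normal(size=(128, d))
W_r, X_r = smooth_and_rotate(W, X)
print(np.abs(X).max(), "->", np.abs(X_r).max())  # activation range shrinks
```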
Authors:Jianqing Zhang, Yang Liu, Jie Fu, Yang Hua, Tianyuan Zou, Jian Cao, Qiang Yang
Abstract:
The rise of generative APIs has fueled interest in privacy-preserving synthetic data generation. While the Private Evolution (PE) algorithm generates Differential Privacy (DP) synthetic images using diffusion model APIs, it struggles with few-shot private data due to the limitations of its DP-protected similarity voting approach. In practice, the few-shot private data challenge is particularly prevalent in specialized domains like healthcare and industry. To address this challenge, we propose a novel API-assisted algorithm, Private Contrastive Evolution (PCEvolve), which iteratively mines inherent inter-class contrastive relationships in few-shot private data beyond individual data points and seamlessly integrates them into an adapted Exponential Mechanism (EM) to optimize DP's utility in an evolution loop. We conduct extensive experiments on four specialized datasets, demonstrating that PCEvolve outperforms PE and other API-assisted baselines. These results highlight the potential of leveraging API access with private data for quality evaluation, enabling the generation of high-quality DP synthetic images and paving the way for more accessible and effective privacy-preserving generative API applications. Our code is available at https://github.com/TsingZ0/PCEvolve.
中文: 提出的私有对比进化算法通过挖掘少样本私有数据中的类间对比关系并将其融入改进的指数机制,有效克服了现有方法的局限,显著提升了专业领域差分隐私合成图像的生成质量。
English: The proposed Private Contrastive Evolution (PCEvolve) algorithm overcomes the limitations of existing methods by mining inter-class contrastive relationships in few-shot private data and integrating them into an adapted Exponential Mechanism, significantly improving the quality of differentially private synthetic images in specialized domains.
Authors:Caleb Zheng, Eli Shlizerman
Abstract:
Diffusion models achieve realistic outcomes across a wide range of generative tasks, but their high computational cost remains a major barrier to deployment. Model pruning has emerged as a promising strategy to reduce inference cost and enable lightweight diffusion models. While effective, pruned diffusion models are prone to quality degradation due to limited capacity. A key limitation of current pruning approaches is that pruned models are finetuned using the same objective as the dense model, typically denoising score matching (DSM). Since the dense model is accessible during finetuning, it warrants a more effective approach for knowledge transfer from the dense to the pruned model. Motivated by this aim, we revisit the finetuning stage and propose IGSM (\textbf{I}mproved \textbf{G}eometric and \textbf{S}ensitivity \textbf{M}atching), a general-purpose finetuning framework that introduces a second-order Jacobian projection loss inspired by Finite-Time Lyapunov Exponents (FTLE). IGSM efficiently captures and aligns the geometric and the temporal dynamics of pruned models with their dense teachers using scalable second-order projections. Our approach is architecture-agnostic and applies to both U-Net- and Transformer-based diffusion models. Experiments on CIFAR-10, CelebA, LSUN-Church, and LSUN-Bedroom show that IGSM consistently narrows the performance gap between pruned and dense models, substantially improving sample quality. Code is available on GitHub: https://github.com/FATE4869/IGSM-Official
中文: 扩散模型计算成本高,剪枝虽能降低成本但易导致质量下降;本文提出IGSM微调框架,利用二阶雅可比投影对齐剪枝模型与原始密集模型的几何和时间动态,在多种数据集和架构上显著提升性能。
English: Diffusion models face high computational costs, and while pruning reduces this, it often degrades quality; this paper introduces IGSM, a finetuning framework that uses second-order Jacobian projections to align pruned models with dense ones, improving performance across various datasets and architectures.
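A Jacobian projection loss can be prototyped with `torch.autograd.functional.jvp`. The sketch below is a simplified reading of the idea, not the paper's FTLE-inspired formulation: match student and teacher Jacobian-vector products along random probe directions, which aligns local geometry without ever materializing a full Jacobian. The probe count and toy networks are assumptions.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jvp

def jacobian_projection_loss(student, teacher, x, n_probes=2):
    """Sketch of a second-order Jacobian projection loss: penalize the gap
    between student and teacher JVPs along random unit probe directions."""
    loss = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(x)
        v = v / v.flatten().norm()
        _, jvp_s = jvp(student, (x,), (v,), create_graph=True)
        _, jvp_t = jvp(teacher, (x,), (v,))
        loss = loss + (jvp_s - jvp_t.detach()).pow(2).mean()
    return loss / n_probes

teacher = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 8))
student = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 8))
x = torch.randn(4, 8)
print(jacobian_projection_loss(student, teacher, x).item())
```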
Authors:Luka Vetoshkin, Dmitry Yudin
Abstract:
Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high-quality variant SAM-HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual guidance to improve segmentation of such challenging objects. The method uses CLIP-based embeddings derived from user-provided text prompts to identify relevant semantic regions, which are then projected into the DINO feature space. These features serve as additional prompts for SAM-HQ, enhancing its ability to focus on the target object. Beyond improving segmentation accuracy, Talk2SAM allows user-controllable segmentation, enabling disambiguation of objects within a single bounding box based on textual input. We evaluate our approach on three benchmarks: BIG, ThinObject5K, and DIS5K. Talk2SAM consistently outperforms SAM-HQ, achieving up to +5.9\% IoU and +8.3\% boundary IoU improvements. Our results demonstrate that incorporating natural language guidance provides a flexible and effective means for precise object segmentation, particularly in cases where traditional prompt-based methods fail. The source code is available on GitHub: https://github.com/richlukich/Talk2SAM
中文摘要:Talk2SAM通过引入文本引导改进了复杂形状物体的分割精度,相比SAM-HQ模型实现了显著的性能提升。
English Summary: Talk2SAM enhances object segmentation by integrating textual guidance to improve accuracy on complex shapes, achieving significant performance gains over existing models like SAM-HQ.
Authors:Hondamunige Prasanna Silva, Federico Becattini, Lorenzo Seidenari
Abstract:
Foundation models represent the most prominent and recent paradigm shift in artificial intelligence. Foundation models are large models, trained on broad data, that deliver high accuracy in many downstream tasks, often without fine-tuning. For this reason, models such as CLIP, DINO, or Vision Transformers (ViT) are becoming the bedrock of many industrial AI-powered applications. However, the reliance on pre-trained foundation models also introduces significant security concerns, as these models are vulnerable to adversarial attacks. Such attacks involve deliberately crafted inputs designed to deceive AI systems, jeopardizing their reliability. This paper studies the vulnerabilities of vision foundation models, focusing specifically on CLIP and ViTs, and explores the transferability of adversarial attacks to downstream tasks. We introduce a novel attack, targeting the structure of transformer-based architectures in a task-agnostic fashion. We demonstrate the effectiveness of our attack on several downstream tasks: classification, captioning, image/text retrieval, segmentation and depth estimation. Code available at: https://github.com/HondamunigePrasannaSilva/attack-attention
Chinese: 基础模型通过为多种任务提供通用且高精度的解决方案正在革新人工智能领域，然而它们对对抗性攻击的脆弱性带来了严重的安全隐患，本研究聚焦CLIP和视觉变换器，通过一种新型的针对变换器结构的攻击验证了这一点。
English: Foundation models are revolutionizing AI by providing versatile, high-accuracy solutions across various tasks, yet their vulnerability to adversarial attacks poses serious security risks, as demonstrated in this study focusing on CLIP and ViTs with a novel transformer-targeted attack.
Authors:Shenyang Huang, Ali Parviz, Emma Kondrup, Zachary Yang, Zifeng Ding, Michael Bronstein, Reihaneh Rabbany, Guillaume Rabusseau
Abstract:
Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. While a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs -- real world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at https://github.com/shenyangHuang/TGTalker.
中文: 本文提出的TGTalker创新框架使大语言模型能在真实时序图上实现具有竞争力的链接预测,并通过生成文本解释显著提升了模型的可解释性。
English: This paper introduces TGTalker, a novel framework that enables Large Language Models to perform competitive link prediction on real-world temporal graphs while generating textual explanations for enhanced interpretability.
Authors:Abu Sufian, Marco Leo, Cosimo Distante, Anirudha Ghosh, Debaditya Barman
Abstract:
Biometric face authentication is crucial in computer vision, but ensuring fairness and generalization across demographic groups remains a big challenge. Therefore, we investigated whether Vision Transformer (ViT) and ResNet, leveraging pre-trained global features, can fairly authenticate different demographic faces while relying minimally on local features. In this investigation, we used three pre-trained state-of-the-art (SOTA) ViT foundation models from Facebook, Google, and Microsoft for global features as well as ResNet-18. We concatenated the features from ViT and ResNet, passed them through two fully connected layers, and trained on customized face image datasets to capture the local features. Then, we designed a novel few-shot prototype network with backbone features embedding. We also developed new demographic face image support and query datasets for this empirical study. The network's testing was conducted on this dataset in one-shot, three-shot, and five-shot scenarios to assess how performance improves as the size of the support set increases. We observed results across datasets with varying races/ethnicities, genders, and age groups. The Microsoft Swin Transformer backbone performed better among the three SOTA ViT for this task. The code and data are available at: https://github.com/Sufianlab/FairVitBio.
中文摘要:本研究通过结合预训练模型的全局特征与局部特征,采用新型小样本原型网络评估了Vision Transformer和ResNet在不同人口群体间的公平人脸认证效果,其中微软Swin Transformer表现最佳。
English Summary: This study evaluates Vision Transformers and ResNet for fair biometric face authentication across demographic groups by combining global features from pre-trained models with local features through a novel few-shot prototype network, with the Microsoft Swin Transformer showing superior performance.
Authors:George Stoica, Vivek Ramanujan, Xiang Fan, Ali Farhadi, Ranjay Krishna, Judy Hoffman
Abstract:
Unconditional flow-matching trains diffusion models to transport samples from a source distribution to a target distribution by enforcing that the flows between sample pairs are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed--flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying model architectures on both class-conditioned (ImageNet-1k) and text-to-image (CC3M) benchmarks. Notably, we find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We release our code at: https://github.com/gstoica27/DeltaFM.git.
Chinese: 对比流匹配通过强制不同条件流之间的独特性,显著提升了条件扩散模型的训练速度、减少了去噪步骤,并降低了FID分数。
English: Contrastive Flow Matching enhances conditional diffusion models by enforcing uniqueness across flows, significantly improving training speed, reducing denoising steps, and lowering FID scores compared to standard flow matching.
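The objective change is small enough to state in a few lines. The sketch below is an illustrative form (the toy velocity network, λ, and the exact repulsion term are our assumptions): standard flow matching regresses toward the target velocity, while an added term pushes each predicted flow away from the target flow of a shuffled batch partner, keeping conditional flows distinguishable.

```python
import torch
import torch.nn.functional as F

def contrastive_flow_matching_loss(model, x1, cond, lam=0.1):
    """Sketch of flow matching plus a contrastive repulsion term that
    maximizes dissimilarity to other samples' target flows."""
    x0 = torch.randn_like(x1)                       # source (noise) samples
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1                      # linear interpolation path
    target_v = x1 - x0                              # target velocity field
    pred_v = model(xt, t, cond)
    fm = F.mse_loss(pred_v, target_v)
    # Negative pairs: target velocities of a random permutation of the batch.
    perm = torch.randperm(x1.shape[0])
    contrast = -F.mse_loss(pred_v, target_v[perm])  # maximize dissimilarity
    return fm + lam * contrast

class TinyVelocityNet(torch.nn.Module):
    def __init__(self, dim, n_classes):
        super().__init__()
        self.emb = torch.nn.Embedding(n_classes, dim)
        self.net = torch.nn.Sequential(torch.nn.Linear(2 * dim + 1, 64),
                                       torch.nn.SiLU(), torch.nn.Linear(64, dim))
    def forward(self, x, t, c):
        return self.net(torch.cat([x, self.emb(c), t], dim=-1))

model = TinyVelocityNet(dim=8, n_classes=10)
x1, cond = torch.randn(16, 8), torch.randint(0, 10, (16,))
print(contrastive_flow_matching_loss(model, x1, cond).item())
```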
Authors:Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu
Abstract:
Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual tokens, SparseMM prioritizes preserving visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity on efficiency tests. Our project is open sourced at https://github.com/CR400AF-A/SparseMM.
Chinese: 本研究揭示了多模态大语言模型中仅少量注意力头对视觉处理至关重要,并提出了SparseMM这一无需训练的优化策略,通过集中计算资源于关键视觉头,在保持性能的同时显著提升推理速度。
English: This study reveals that only a small fraction of attention heads in multimodal large language models are crucial for visual processing and introduces SparseMM, a training-free optimization strategy that accelerates inference while preserving performance by focusing computation on these key visual heads.
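Head-level visual scoring and asymmetric budgeting reduce to simple accounting. The sketch below is illustrative only, not the released SparseMM kernels: it scores each head by the attention mass its queries place on visual tokens and splits a KV-cache budget proportionally, with a small guaranteed floor per head. The floor heuristic is our assumption.

```python
import torch

def allocate_kv_budget(attn, visual_slice, total_budget):
    """Sketch of head-level visual scoring and asymmetric KV-cache budgeting.
    `attn` holds per-head attention maps of shape (H, Q, K); a head's visual
    score is the attention mass on the visual-token positions."""
    H = attn.shape[0]
    visual_mass = attn[:, :, visual_slice].sum(dim=(1, 2))     # (H,)
    scores = visual_mass / visual_mass.sum()
    floor = total_budget // (4 * H)           # small guaranteed floor per head
    spare = total_budget - floor * H
    budgets = floor + (scores * spare).round().long()
    return scores, budgets

H, Q, K = 8, 16, 128
attn = torch.rand(H, Q, K).softmax(dim=-1)
attn[0, :, 32:96] += 0.5                      # head 0 behaves like a visual head
attn = attn / attn.sum(dim=-1, keepdim=True)
scores, budgets = allocate_kv_budget(attn, slice(32, 96), total_budget=1024)
print(scores, budgets)                        # head 0 gets the largest budget
```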
Authors:Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, Mark Yatskar
Abstract:
Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in >60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.
中文: 语言模型在偏好评估中表现出系统性偏差,过度依赖长度、结构等表面特征,但通过反事实数据增强的微调方法,可将平均误校准率从39.4%降至32.5%,同时保持整体性能,有效提升模型可靠性。
English: Language models exhibit miscalibration by favoring superficial features like length and style over substantive qualities, but this can be mitigated through counterfactual data augmentation, reducing miscalibration by nearly 7% and skew by over 10% while preserving performance.
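Counterfactual data augmentation for preference debiasing can be demonstrated end to end with toy pieces. Everything in the sketch below is a stand-in (the paper synthesizes contrastive examples with an LLM, not string padding, and uses a real reward model): each item is duplicated with the rejected answer made longer and more "structured", and a Bradley-Terry loss teaches the scorer that the label must not flip.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyReward(nn.Module):
    """Stand-in reward model: a linear score over a hashed bag-of-words plus
    an explicit length feature (a feature the model could learn to abuse)."""
    def __init__(self, n_buckets=64):
        super().__init__()
        self.n = n_buckets
        self.w = nn.Linear(n_buckets + 1, 1)
    def forward(self, text):
        f = torch.zeros(self.n + 1)
        for tok in text.split():
            f[hash(tok) % self.n] += 1.0
        f[-1] = float(len(text.split()))      # length feature
        return self.w(f)

def cda_pairs(prompt, good, bad):
    # Counterfactual: pad the rejected answer so it "looks better" (longer,
    # structured) while staying substantively worse; the label must not flip.
    verbose_bad = bad + " " + "- furthermore, see above." * 10
    return [(prompt, good, bad), (prompt, good, verbose_bad)]

model = ToyReward()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
data = [("Q: capital of France?", "Paris.", "It might be Lyon.")]
for _ in range(50):
    for prompt, good, bad in data:
        for p, chosen, rejected in cda_pairs(prompt, good, bad):
            # Bradley-Terry preference loss on the (chosen, rejected) pair.
            loss = -F.logsigmoid(model(p + " " + chosen)
                                 - model(p + " " + rejected))
            opt.zero_grad(); loss.backward(); opt.step()
```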
Authors:Ghazi Shazan Ahmad, Ahmed Heakl, Hanan Gani, Abdelrahman Shaker, Zhiqiang Shen, Fahad Shahbaz Khan, Salman Khan
Abstract:
Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at https://github.com/mbzuai-oryx/VideoMolmo.
中文: VideoMolmo是一种新型多模态模型,通过结合时序注意力和掩码融合技术,显著提升了视频分析中的时空定位精度与推理能力。
English: VideoMolmo is a novel multimodal model that enhances spatio-temporal localization by integrating temporal attention and mask fusion for improved accuracy and reasoning in video analysis.
Authors:Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Abstract:
Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.
中文: 本研究推出了Search Arena这一大规模人类偏好数据集,用于评估搜索增强语言模型,发现用户偏好受引用数量和来源类型影响,交叉分析表明网络搜索能提升非搜索场景性能,但仅依赖参数知识会显著降低搜索密集型任务的质量。
English: This study introduces Search Arena, a large-scale human-preference dataset for evaluating search-augmented language models, revealing that user preferences are influenced by citation quantity and source type, while cross-analysis shows web search enhances performance in non-search settings but reliance on parametric knowledge alone degrades search-intensive results.
Authors:Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
Abstract:
We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold size than on smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes in problem-solving accuracy on AIME, including evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential, and increasingly important with more computing invested, for realizing the full potential of test-time scaling where, unlike training, accuracy has yet to saturate as a function of computation, and continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.
中文摘要:本研究提出动力学缩放定律,揭示因忽略内存瓶颈而高估了小模型效能,并通过稀疏注意力机制降低单令牌成本、支持生成长文本,从而优化资源分配。
English Summary: This study introduces the Kinetics Scaling Law, which demonstrates that smaller models' effectiveness is overestimated due to overlooked memory bottlenecks, and proposes sparse attention to optimize resource allocation by reducing per-token costs and enabling longer generations.
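The shift from pure compute-optimality to compute-plus-memory accounting can be seen with a back-of-envelope cost model. All constants and model shapes below are illustrative assumptions, not the paper's exact accounting: per token, compute scales with parameter count, while attention must read the whole KV cache, and converting that traffic into FLOP-equivalents shows attention dominating at long generation lengths.

```python
def generation_cost(n_params, n_layers, d_model, seq_len, n_samples,
                    flops_per_byte=300.0):
    """Back-of-envelope test-time cost model. Per token: compute
    ~ 2 * n_params FLOPs; attention reads the KV cache, whose size is
    ~ 2 (K and V) * n_layers * d_model * seq_len * 2 bytes (fp16). Memory
    traffic is converted to FLOP-equivalents via the hardware FLOP/byte
    ratio so the two costs can be added on one axis."""
    compute = 2.0 * n_params                              # FLOPs / token
    kv_bytes = 2 * n_layers * d_model * seq_len * 2       # fp16 KV read
    mem_equiv = kv_bytes * flops_per_byte                 # FLOP-equivalents
    per_token = compute + mem_equiv
    return n_samples * seq_len * per_token

# Small model with many samples vs. a larger model with one sample: at long
# generation lengths the KV-read term, shared by both, dwarfs the parameter
# term, eroding the small model's apparent compute advantage.
small = generation_cost(0.6e9, 28, 1024, seq_len=16384, n_samples=8)
large = generation_cost(14e9, 40, 5120, seq_len=16384, n_samples=1)
print(f"small x8: {small:.3e}  large x1: {large:.3e}")
```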
Authors:Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li
Abstract:
Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but it still remains challenging for extending it to multimodal domains. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shapes within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for effective visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT
中文: MINT-CoT通过自适应地将视觉标记嵌入文本推理步骤,显著提升了多模态数学推理能力,并在多个基准测试中取得了卓越的性能提升。
English: MINT-CoT enhances multimodal mathematical reasoning by adaptively interleaving visual tokens into textual reasoning steps, achieving significant performance improvements across benchmarks.
Authors:Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, Gokul Swamy
Abstract:
The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake which takes them out of the support of the demonstrations, they often don't know how to recover from it. In this sense, BC is akin to giving the agent the fish -- giving them dense supervision across a narrow set of states -- rather than teaching them to fish: to be able to reason independently about achieving the expert's outcome even when faced with unseen situations at test-time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach $\texttt{SAILOR}$ consistently out-performs state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10$\times$ still leaves a performance gap. We find that $\texttt{SAILOR}$ can identify nuanced failures and is robust to reward hacking. Our code is available at https://github.com/arnavkj1995/SAILOR .
Chinese: 行为克隆(BC)仅能模仿专家在已演示状态下的行为,缺乏错误恢复能力,而我们提出的L2S方法通过习得世界模型和奖励模型,使智能体具备独立规划与纠错能力,在多项基准测试中显著优于BC方法。
English: Behavioral cloning (BC) is limited to imitating expert actions only within demonstrated states, lacking recovery strategies for errors, whereas our proposed L2S approach enables agents to plan and recover independently using learned world and reward models, significantly outperforming BC methods across multiple benchmarks.
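A minimal sketch of the test-time planning that L2S implies, under stated assumptions: the abstract names a learned world model and a reward model but not a specific planner, so the random-shooting search, the toy dynamics, and the reward shape below are illustrative stand-ins rather than SAILOR's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two learned components.
def world_model(state, action):
    return state + 0.1 * action            # placeholder dynamics

def reward_model(state):
    return -np.linalg.norm(state - 1.0)    # placeholder "match expert outcome" score

def plan(state, horizon=5, n_candidates=64, action_dim=2):
    """Random-shooting search: roll candidate action sequences through the
    world model and return the first action of the highest-reward rollout."""
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in seq:
            s = world_model(s, a)
            total += reward_model(s)
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action

print(plan(np.zeros(2)))
```

A BC-only baseline, by contrast, would map the state directly to an action with no such search, which is exactly what leaves it unable to recover off-distribution.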
Authors:Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Abstract:
Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during the compression process, which hinders the effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which utilizes a causal decoder to establish unidirectional dependencies among encoded tokens, thereby aligning the token modeling approach between the tokenizer and autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves great reconstruction performance while being generation-friendly. On ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
Chinese: 本文提出对齐分词器(AliTok),通过因果解码器建立图像令牌间的单向依赖关系,与自回归模型保持一致,在实现优异重建效果的同时大幅提升了生成速度。
English: This paper introduces Aligned Tokenizer (AliTok), which uses a causal decoder to create unidirectional dependencies in image tokens, aligning with autoregressive models and achieving superior reconstruction and generation performance with faster sampling speeds.
Authors:Nan Wang, Yuantao Chen, Lixing Xiao, Weiqing Xiao, Bohan Li, Zhaoxi Chen, Chongjie Ye, Shaocong Xu, Saining Zhang, Ziyang Yan, Pierre Merriaux, Lei Lei, Tianfan Xue, Hao Zhao
Abstract:
Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.
中文摘要:本文提出了一种多尺度双边网格,将外观编码与双边网格相结合,显著提升了动态自动驾驶场景重建的几何精度,通过有效减少光度不一致导致的漂浮物,在多个数据集上优于现有方法。
English Summary: This paper introduces a multi-scale bilateral grid that combines appearance codes and bilateral grids to enhance geometric accuracy in dynamic autonomous driving scene reconstruction, outperforming existing methods across multiple datasets by reducing floaters from photometric inconsistencies.
Authors:Xiaodong Wang, Jinfa Huang, Li Yuan, Peixi Peng
Abstract:
Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO~\citep{rafailov2024dpo}, to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta(y_w\mid x)$ and $\log \pi_\theta(y_l\mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose \emph{Lean Preference Optimization} (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs.
中文摘要:本文提出LeanPO方法,通过重新定义隐式奖励并采用自生成偏好数据和动态标签平滑策略,有效缓解视频大语言模型中非目标响应概率意外下降的问题,以最小训练开销显著提升模型性能。
English Summary: This paper introduces LeanPO, a reference-free preference optimization method that addresses the unintended probability drop in Video-LLMs by redefining implicit rewards and using self-generated preference data with dynamic label smoothing, significantly improving model performance with minimal training overhead.
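The reformulated implicit reward lends itself to a compact sketch. Assuming per-token logits from the policy model (the shapes are illustrative), the reward is just the average log-likelihood of the response tokens; the pairwise margin loss at the end is one plausible way to use it and is an assumption, not LeanPO's exact objective.

```python
import torch
import torch.nn.functional as F

def lean_reward(logits, response_ids):
    """Reference-free implicit reward: average log-likelihood of the
    response tokens under the policy model.
    logits: [T, vocab] next-token logits at the response positions.
    response_ids: [T] response token ids."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    return token_logp.mean()

# Toy check with random logits (shapes are assumptions).
T, V = 6, 100
ids = torch.randint(V, (T,))
r_w = lean_reward(torch.randn(T, V), ids)   # "winning" response reward
r_l = lean_reward(torch.randn(T, V), ids)   # "losing" response reward
loss = -F.logsigmoid(r_w - r_l)             # assumed DPO-style margin on the new reward
print(loss.item())
```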
Authors:Jianghao Wu, Yicheng Wu, Yutong Xie, Wenjia Bai, You Zhang, Feilong Tang, Yulong Li, Yasmeen George, Imran Razzak
Abstract:
Universal medical image segmentation using the Segment Anything Model (SAM) remains challenging due to its limited adaptability to medical domains. Existing adaptations, such as MedSAM, enhance SAM's performance in medical imaging but at the cost of reduced generalization to unseen data. Therefore, in this paper, we propose SAM-aware Test-Time Adaptation (SAM-TTA), a fundamentally different pipeline that preserves the generalization of SAM while improving its segmentation performance in medical imaging via a test-time framework. SAM-TTA tackles two key challenges: (1) input-level discrepancies caused by differences in image acquisition between natural and medical images and (2) semantic-level discrepancies due to fundamental differences in object definition between natural and medical domains (e.g., clear boundaries vs. ambiguous structures). Specifically, our SAM-TTA framework comprises (1) Self-adaptive Bezier Curve-based Transformation (SBCT), which adaptively converts single-channel medical images into three-channel SAM-compatible inputs while maintaining structural integrity, to mitigate the input gap between medical and natural images, and (2) Dual-scale Uncertainty-driven Mean Teacher adaptation (DUMT), which employs consistency learning to align SAM's internal representations to medical semantics, enabling efficient adaptation without auxiliary supervision or expensive retraining. Extensive experiments on five public datasets demonstrate that our SAM-TTA outperforms existing TTA approaches and even surpasses fully fine-tuned models such as MedSAM in certain scenarios, establishing a new paradigm for universal medical image segmentation. Code can be found at https://github.com/JianghaoWu/SAM-TTA.
Chinese Summary: 针对通用医学图像分割中SAM模型适应性不足的问题,本文提出SAM-TTA测试时适应框架,通过自适应图像转换和双尺度一致性学习,在保持模型泛化能力的同时显著提升医学图像分割性能。
English Summary: The Segment Anything Model (SAM) struggles with medical image segmentation due to domain gaps, but the proposed SAM-TTA framework overcomes this by adaptively transforming inputs and aligning representations at test time, achieving superior performance without compromising generalization.
Authors:Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai
Abstract:
We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU's modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce the MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.
Chinese: MonkeyOCR采用创新的结构-识别-关系三元范式,将文档解析分解为布局分析、内容识别和逻辑排序,以仅30亿参数的紧凑模型在精度和速度上超越更大模型,实现了最先进的性能。
English: MonkeyOCR introduces a novel Structure-Recognition-Relation triplet paradigm that simplifies document parsing by decomposing it into layout analysis, content identification, and logical ordering, achieving state-of-the-art performance with a compact 3B-parameter model that surpasses larger models in both accuracy and speed.
Authors:Nathan Herr, Tim Rocktäschel, Roberta Raileanu
Abstract:
Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While methods like Monte Carlo Tree Search (MCTS) have proven effective in some domains, their reliance on fixed exploration hyperparameters limits their adaptability across tasks of varying difficulty, rendering them impractical or expensive in certain settings. In this paper, we propose \textbf{LLM-First Search (LFS)}, a novel \textit{LLM Self-Guided Search} method that removes the need for pre-defined search strategies by empowering the LLM to autonomously control the search process via self-guided exploration. Rather than relying on external heuristics or hardcoded policies, the LLM evaluates whether to pursue the current search path or explore alternative branches based on its internal scoring mechanisms. This enables more flexible and context-sensitive reasoning without requiring manual tuning or task-specific adaptation. We evaluate LFS on Countdown and Sudoku against three classic, widely used search algorithms, Tree-of-Thoughts' Breadth First Search (ToT-BFS), Best First Search (BestFS), and MCTS, each of which has been used to achieve SotA results on a range of challenging reasoning tasks. We found that LFS (1) performs better on more challenging tasks without additional tuning, (2) is more computationally efficient compared to the other methods, especially when powered by a stronger model, (3) scales better with stronger models, due to its LLM-First design, and (4) scales better with increased compute budget. Our code is publicly available at \href{https://github.com/NathanHerr/LLM-First-Search}{LLM-First-Search}.
Chinese Summary: LLM优先搜索(LFS)是一种新型的自引导搜索方法,它通过大语言模型的内部评分机制自主控制搜索过程,无需预定义策略,并在复杂推理任务中展现出更优的性能、效率和可扩展性。
English Summary: LLM-First Search (LFS) is a novel self-guided search method that enables large language models to autonomously control the search process through internal scoring, eliminating the need for predefined strategies and demonstrating superior performance, efficiency, and scalability on challenging reasoning tasks.
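A rough sketch of self-guided search under stated assumptions: `llm_score` and `llm_expand` stand in for prompts to an actual LLM (lower score = more promising here), and the pursue-or-switch rule is a simplification of how the method lets the model arbitrate between the current path and stored alternatives.

```python
# Hypothetical stand-ins for LLM calls; a real system would prompt a model
# for both the candidate expansions and their scores.
def llm_score(state):
    return -len(state)

def llm_expand(state):
    return [state + c for c in "ab"] if len(state) < 3 else []

def llm_first_search(root, budget=20):
    path, alternatives = [root], []
    while budget > 0:
        budget -= 1
        children = llm_expand(path[-1])
        if not children:
            return path[-1]                      # reached a candidate solution
        children.sort(key=llm_score)
        best = children[0]
        alternatives += [(llm_score(s), s) for s in children[1:]]
        # The model decides: keep pursuing this path, or jump to a branch
        # it previously judged more promising.
        if alternatives and min(alternatives)[0] < llm_score(best):
            alternatives.sort()
            _, best = alternatives.pop(0)
        path.append(best)
    return path[-1]

print(llm_first_search(""))
```

Unlike MCTS, there is no exploration constant to tune: what to expand next is itself a model judgment, which is the property the abstract credits for LFS's adaptability.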
Authors:Yani Zhang, Dongming Wu, Hao Shi, Yingfei Liu, Tiancai Wang, Haoqiang Fan, Xingping Dong
Abstract:
Embodied 3D grounding aims to localize target objects described in human instructions from an ego-centric viewpoint. Most methods typically follow a two-stage paradigm where a trained 3D detector's optimized backbone parameters are used to initialize a grounding model. In this study, we explore a fundamental question: Does embodied 3D grounding benefit enough from detection? To answer this question, we assess the grounding performance of detection models using predicted boxes filtered by the target category. Surprisingly, these detection models without any instruction-specific training outperform the grounding models explicitly trained with language instructions. This indicates that even category-level embodied 3D grounding may not be well resolved, let alone more fine-grained context-aware grounding. Motivated by this finding, we propose DEGround, which shares DETR queries as object representation for both DEtection and Grounding and enables the grounding to benefit from basic category classification and box detection. Based on this framework, we further introduce a regional activation grounding module that highlights instruction-related regions and a query-wise modulation module that incorporates sentence-level semantics into the query representation, strengthening the context-aware understanding of language instructions. Remarkably, DEGround outperforms the state-of-the-art model BIP3D by 7.52% in overall accuracy on the EmbodiedScan validation set. The source code will be publicly available at https://github.com/zyn213/DEGround.
中文: 本研究发现在具身三维物体定位中,检测模型无需专门训练即可超越语言指导的定位模型,据此提出DEGround框架,通过共享检测与定位的查询表示及语言增强模块,在验证集上以7.52%的准确率优势刷新了最佳性能。
English: This study reveals that detection models can outperform specialized grounding models in embodied 3D object localization, leading to the development of DEGround—a unified framework that enhances grounding through shared object queries and language-aware modules, achieving a 7.52% accuracy improvement over state-of-the-art methods.
Authors:Moritz Miller, Bernhard Schölkopf, Siyuan Guo
Abstract:
Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn from and reason about the input context on the fly without parameter updates. This work studies in-context counterfactual reasoning in language models, that is, predicting the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under https://github.com/moXmiller/counterfactual-reasoning.git .
中文: 研究表明大规模语言模型能够通过噪声溯因在线性回归任务中进行上下文反事实推理,其性能受自注意力机制、模型深度和预训练数据多样性影响,并显示出在反事实故事生成等领域的应用潜力。
English: This research demonstrates that large-scale language models can perform in-context counterfactual reasoning through noise abduction in linear regression tasks, with performance driven by self-attention mechanisms, model depth, and pre-training data diversity, showing potential for applications like counterfactual story generation.
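The synthetic setup is easy to make concrete. Below is a worked numeric sketch of noise abduction in linear regression, assuming the weights w are known for clarity; in the paper's setting the model must instead infer them in-context from factual observations.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)

# Factual observation: y_f = w @ x_f + eps, with unobserved noise eps.
x_f = rng.normal(size=3)
eps = rng.normal()
y_f = w @ x_f + eps

# Abduction: recover the contextual noise from the factual pair,
# then copy it into the hypothetical (counterfactual) input.
eps_hat = y_f - w @ x_f
x_cf = rng.normal(size=3)         # hypothetical intervention on x
y_cf = w @ x_cf + eps_hat         # counterfactual prediction

print(np.isclose(eps_hat, eps), y_cf)
```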
Authors:Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, Jing Tang
Abstract:
Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce \textbf{TreeRPO}, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, TreeRPO directly estimates these rewards through this sampling process. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling. This advancement allows TreeRPO to produce fine-grained and dense reward signals, significantly enhancing the learning process and overall performance of LLMs. Experimental results demonstrate that our TreeRPO algorithm substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, increasing it from 19.0\% to 35.5\%. Furthermore, TreeRPO significantly outperforms GRPO by 2.9\% in performance while simultaneously reducing the average response length by 18.1\%, showcasing its effectiveness and efficiency. Our code will be available at \href{https://github.com/yangzhch6/TreeRPO}{https://github.com/yangzhch6/TreeRPO}.
中文: 提出的TreeRPO方法通过树采样估计步骤级奖励来增强大语言模型的推理能力,在准确性和效率上相比现有方法均有显著提升。
English: The proposed TreeRPO method enhances LLM reasoning by estimating step-level rewards through tree sampling, significantly improving accuracy and efficiency over existing approaches.
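A toy sketch of the tree-sampling idea under stated assumptions: `sample_step` and `is_correct` are placeholders for LLM decoding and a verifiable reward, a step's value is estimated as the mean outcome of sampled completions from that prefix, and GRPO-style group-relative advantages are computed across sibling steps.

```python
import random
random.seed(0)

def sample_step(prefix):
    return prefix + [random.random()]          # stand-in reasoning step

def is_correct(trajectory):
    return sum(trajectory) > 1.0               # stand-in verifiable (0/1) reward

def step_value(prefix, rollouts=8, depth=3):
    """Estimate a step's reward as the mean outcome of its completions."""
    outcomes = []
    for _ in range(rollouts):
        t = list(prefix)
        while len(t) < depth:
            t = sample_step(t)
        outcomes.append(float(is_correct(t)))
    return sum(outcomes) / rollouts

# Group-relative advantage across sibling steps sampled from the same node.
siblings = [sample_step([]) for _ in range(4)]
values = [step_value(s) for s in siblings]
mean_v = sum(values) / len(values)
advantages = [v - mean_v for v in values]
print(advantages)
```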
Authors:Shivani Upadhyay, Messiah Ataey, Syed Shariyar Murtaza, Yifan Nie, Jimmy Lin
Abstract:
The proliferation of complex structured data in hybrid sources, such as PDF documents and web pages, presents unique challenges for current Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) in providing accurate answers. Despite the recent advancements of MLLMs, they still often falter when interpreting intricately structured information, such as nested tables and multi-dimensional plots, leading to hallucinations and erroneous outputs. This paper explores the capabilities of LLMs and MLLMs in understanding and answering questions from complex data structures found in PDF documents by leveraging industrial and open-source tools as part of a pre-processing pipeline. Our findings indicate that GPT-4o, a popular MLLM, achieves an accuracy of 56% on multi-structured documents when fed documents directly, and that integrating pre-processing tools raises the accuracy of LLMs to 61.3% for GPT-4o and 76% for GPT-4, at a lower overall cost. The code is publicly available at https://github.com/OGCDS/FinancialQA.
中文: 本研究探讨了大型语言模型和多模态大语言模型处理PDF中复杂结构化数据的能力,发现预处理工具能显著提升准确率并降低成本,其中GPT-4的准确率可达76%。
English: This study examines how LLMs and MLLMs handle complex structured data from PDFs, finding that pre-processing tools significantly boost accuracy and reduce costs, with GPT-4 reaching 76% accuracy.
Authors:Weicheng Gao
Abstract:
After a few years of research in the field of through-the-wall radar (TWR) human activity recognition (HAR), I found that we seem to be stuck in the mindset of training neural network models on radar image data. The earliest related works in this field, based on template matching, did not require a training process, and I believe they have never died, because these methods possess strong physical interpretability and are closer to the basis of theoretical signal processing research. In this paper, I try to return to this original path by eschewing neural networks for the TWR HAR task, while still aiming to match the intelligent recognition of neural network models. In detail, the range-time map and Doppler-time map of TWR are first generated. Then, the initial regions of the human target foreground and noise background on the maps are determined using a corner detection method, and the micro-Doppler signature is segmented using a multiphase active contour model. The micro-Doppler segmentation feature is discretized into a two-dimensional point cloud. Finally, the topological similarity between the resulting point cloud and the point clouds of the template data is calculated using the Mapper algorithm to obtain the recognition results. The effectiveness of the proposed method is demonstrated by numerically simulated and measured experiments. The open-source code of this work is released at: https://github.com/JoeyBGOfficial/Through-the-Wall-Radar-Human-Activity-Recognition-Without-Using-Neural-Networks.
中文摘要:本文提出一种不使用神经网络的穿墙雷达人体活动识别方法,通过信号处理技术提取微多普勒特征,并利用拓扑相似性与模板数据进行比对实现智能识别。
English Summary: This paper proposes a neural network-free approach for through-the-wall radar human activity recognition by extracting micro-Doppler signatures through signal processing techniques and comparing them with template data using topological similarity analysis.
Authors:Yeonseok Jeong, Jinsu Kim, Dohyeon Lee, Seung-won Hwang
Abstract:
Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the overhead that longer contexts impose on RAG, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits performance in LLM-based RAG. We thus propose Evidentiality-guided RAG, or the ECoRAG framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring that answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects on whether the compressed content provides sufficient evidence, and if not, retrieves more until it is sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the information necessary to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.
Chinese: ECoRAG框架通过基于证据性压缩检索文档并在证据不足时迭代补充内容,显著提升大语言模型在开放域问答中的性能,相比现有方法在保证准确性的同时实现了更高的成本效益。
English: The ECoRAG framework enhances LLM performance in Open-Domain Question Answering by compressing retrieved documents based on evidentiality and iteratively retrieving additional content if evidence is insufficient, achieving both higher accuracy and cost efficiency compared to existing methods.
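A minimal sketch of the two-phase loop with hypothetical scorers: `evidentiality` here is a crude lexical-overlap proxy and `sufficient` a fixed-size check, whereas ECoRAG learns both signals. Only the control flow (compress by evidentiality, then expand until sufficient) mirrors the abstract.

```python
def evidentiality(sent, q):
    # Stand-in: count query words appearing in the sentence.
    return sum(w in sent.lower() for w in q.lower().split())

def sufficient(ctx, q):
    return len(ctx) >= 2                     # stand-in sufficiency reflection

def ecorag_compress(sentences, q, budget=2):
    ranked = sorted(sentences, key=lambda s: -evidentiality(s, q))
    kept = ranked[:budget]
    while not sufficient(kept, q) and budget < len(ranked):
        budget += 1                          # "retrieve more until sufficient"
        kept = ranked[:budget]
    return kept

docs = ["Paris is the capital of France.",
        "The Eiffel Tower is in Paris.",
        "Bananas are yellow."]
print(ecorag_compress(docs, "What is the capital of France?"))
```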
Authors:Chenyu Lin, Yilin Wen, Du Su, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lv
Abstract:
Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However, current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and overlook the model's inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1, which uses joint sampling and defines multiple policy distributions for knowledge-capability exploration to stimulate large language models' self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy in both parametric-contextual conflict tasks and general RAG tasks, outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code is available at https://github.com/lcy80366872/knowledgeable-r1.
中文摘要:Knowledgeable-R1是一个强化学习框架,通过训练大语言模型利用参数化知识来抵抗误导性检索信息,在知识冲突场景中显著提升了鲁棒性和推理准确性。
English Summary: Knowledgeable-R1 is a reinforcement learning framework that trains large language models to resist misleading retrieved information by leveraging their parametric knowledge, significantly improving robustness and accuracy in knowledge conflict scenarios.
Authors:Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lyu
Abstract:
Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors. We propose Knowledgeable-R1, a reinforcement-learning framework that explicitly trains large language models to use parametric knowledge (PK) to resist contextual interference while still exploiting external context when it is reliably helpful. Knowledgeable-R1 introduces a joint sampling scheme that generates paired responses with and without retrieval, and learns both local advantages (within each decoding regime) and global advantages under the same input to quantify when to ignore misleading context versus adopt it. We employ an asymmetric advantage transformation that amplifies exploratory behaviors toward parametric knowledge. Experiments show that Knowledgeable-R1 significantly improves robustness and reasoning accuracy in knowledge conflict scenarios and general RAG scenarios, outperforming SOTA baselines by 23% in counterfactual scenarios, with no degradation when the retrieved context is fully accurate. Our code is available at https://github.com/lcy80366872/knowledgeable-R1.
中文摘要:Knowledgeable-R1是一个强化学习框架,通过训练大语言模型利用参数化知识来抵抗误导性检索信息,在知识冲突场景中显著提升了鲁棒性和推理准确性。
English Summary: Knowledgeable-R1 is a reinforcement learning framework that trains large language models to resist misleading retrieved information by leveraging their parametric knowledge, significantly improving robustness and accuracy in knowledge conflict scenarios.
Authors:Benedikt Hopf, Radu Timofte
Abstract:
Modern deepfake detection models have achieved strong performance even on the challenging cross-dataset task. However, detection performance under non-ideal conditions remains very unstable, limiting success on some benchmark datasets and making it easy to circumvent detection. Inspired by the move to a more real-world degradation model in the area of image super-resolution, we have developed a Practical Manipulation Model (PMM) that covers a larger set of possible forgeries. We extend the space of pseudo-fakes by using Poisson blending, more diverse masks, generator artifacts, and distractors. Additionally, we improve the detectors' generality and robustness by adding strong degradations to the training images. We demonstrate that these changes not only significantly enhance the model's robustness to common image degradations but also improve performance on standard benchmark datasets. Specifically, we show clear increases of $3.51\%$ and $6.21\%$ AUC on the DFDC and DFDCP datasets, respectively, over the state-of-the-art LAA backbone. Furthermore, we highlight the lack of robustness in previous detectors and our improvements in this regard. Code can be found at https://github.com/BenediktHopf/PMM
中文摘要:本研究提出了一种实用篡改模型(PMM),通过扩展伪造技术并在训练中加入图像退化处理,显著提升了深度伪造检测的鲁棒性及在基准数据集上的性能表现。
English Summary: The study introduces a Practical Manipulation Model (PMM) that expands forgery techniques and incorporates image degradations during training, significantly boosting deepfake detection robustness and performance on benchmark datasets.
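The training-time degradation idea can be sketched as a random corruption chain. The three operations below are generic illustrations only; PMM's actual pseudo-fake space (Poisson blending, diverse masks, generator artifacts, distractors) is richer than this.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(img):
    return np.clip(img + rng.normal(0, 0.05, img.shape), 0, 1)

def box_blur(img):
    # Crude blur along one axis, purely illustrative.
    for _ in range(2):
        img = (np.roll(img, 1, 0) + img + np.roll(img, -1, 0)) / 3
    return img

def downscale_upscale(img):
    small = img[::2, ::2]
    return np.repeat(np.repeat(small, 2, 0), 2, 1)

def degrade(img):
    """Apply a random subset of degradations in random order."""
    ops = [gaussian_noise, box_blur, downscale_upscale]
    for op in rng.choice(ops, size=rng.integers(1, 3), replace=False):
        img = op(img)
    return img

print(degrade(rng.uniform(size=(8, 8))).shape)
```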
Authors:Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim
Abstract:
We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.
Chinese: 我们提出辩论演讲评估作为测试大语言模型评判能力的新基准,发现先进模型虽能在某些方面接近人类判断水平,但其整体评估行为与人类存在显著差异。
English: We propose Debate Speech Evaluation as a new benchmark to test LLM judges' ability to assess complex aspects like argument strength and coherence, revealing that while advanced models can match human judgments in certain areas, their overall evaluation behavior differs significantly.
Authors:Viet Nguyen, Changjian Shui, Vijay Giri, Siddarth Arya, Michael Cooper, Amol Verma, Fahad Razak, Rahul G. Krishnan
Abstract:
The distribution of data changes over time; models operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge, since some, but not all, shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models, achieving low false positive rates under non-deteriorating shifts and providing sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both standard benchmarks and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.
Chinese: 本文提出D3M监控算法,通过预测模型间的分歧来检测动态环境中机器学习模型的性能退化,在非退化偏移下保持低误报率,在退化偏移下实现高检测率,经实验验证可作为高风险机器学习流程的有效预警机制。
English: This paper introduces D3M, a monitoring algorithm that detects when machine learning models need retraining due to performance degradation in dynamic environments, effectively reducing false alarms while ensuring high detection rates for actual deterioration.
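A simplified sketch of disagreement-based monitoring, assuming two fixed predictive models and a naive threshold rule in place of D3M's calibrated sample-complexity bounds:

```python
import numpy as np

rng = np.random.default_rng(0)

def monitor(model_a, model_b, stream, baseline, tol=0.05):
    """Track how often two models disagree on unlabeled inputs and alarm
    when the rate exceeds its deployment-time baseline by a tolerance."""
    rate = np.mean([model_a(x) != model_b(x) for x in stream])
    return rate, rate > baseline + tol

# Two classifiers that agree on most in-distribution inputs.
model_a = lambda x: int(x @ np.ones(3) > 0)
model_b = lambda x: int(x @ np.array([1.0, 1.0, 0.5]) > 0)

reference = rng.normal(size=(500, 3))                   # validation-time data
baseline, _ = monitor(model_a, model_b, reference, baseline=0.0, tol=np.inf)

shifted = rng.normal(loc=[0, 0, 2.0], size=(500, 3))    # deteriorating shift
rate, alarm = monitor(model_a, model_b, shifted, baseline)
print(f"baseline={baseline:.3f} shifted={rate:.3f} alarm={alarm}")
```

The key property, per the abstract, is that benign shifts leave the disagreement rate near its baseline while performance-degrading shifts inflate it, so no labels are needed to raise the alert.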
Authors:Nicolas Lell, Ansgar Scherp
Abstract:
Shallow node embeddings like node2vec (N2V) can be used for nodes without features or to supplement existing features with structure-based information. Embedding methods like N2V are limited in their application to new nodes, which restricts them to the transductive setting where the entire graph, including the test nodes, is available during training. We propose inductive node2vec (iN2V), which combines a post-hoc procedure to compute embeddings for nodes unseen during training and modifications to the original N2V training procedure to prepare the embeddings for this post-hoc procedure. We conduct experiments on several benchmark datasets and demonstrate that iN2V is an effective approach to bringing transductive embeddings to an inductive setting. Using iN2V embeddings improves node classification by 1 point on average, with up to 6 points of improvement depending on the dataset and the number of unseen nodes. Our iN2V is a plug-in approach to create new or enrich existing embeddings. It can also be combined with other embedding methods, making it a versatile approach for inductive node representation learning. Code to reproduce the results is available at https://github.com/Foisunt/iN2V .
Chinese: 提出的归纳式node2vec(iN2V)方法通过结合改进的训练和后处理程序,将直推式嵌入扩展到归纳式场景,使节点分类提升最高达6个百分点,并可作为现有嵌入技术的通用插件使用。
English: The proposed inductive node2vec (iN2V) method extends transductive embeddings to an inductive setting by combining modified training with a post-hoc procedure, improving node classification by up to 6 points and serving as a versatile plug-in for existing embedding techniques.
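A minimal sketch of the post-hoc step, assuming mean aggregation over trained neighbors; the abstract does not specify the aggregator, and iN2V additionally modifies N2V training so embeddings suit this step.

```python
import numpy as np

def embed_unseen(node, adjacency, embeddings):
    """Derive an embedding for a node unseen at training time from the
    trained embeddings of its neighbors (mean aggregation assumed)."""
    neighbors = [v for v in adjacency[node] if v in embeddings]
    if not neighbors:
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean([embeddings[v] for v in neighbors], axis=0)

trained = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
adj = {2: [0, 1], 3: []}               # nodes 2 and 3 were unseen in training
print(embed_unseen(2, adj, trained))   # -> [0.5 0.5]
print(embed_unseen(3, adj, trained))   # -> [0. 0.] (isolated node fallback)
```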
Authors:Hyeongwon Jang, Changhun Kim, Eunho Yang
Abstract:
Recent explainable artificial intelligence (XAI) methods for time series primarily estimate point-wise attribution magnitudes, while overlooking the directional impact on predictions, leading to suboptimal identification of significant points. Our analysis shows that conventional Integrated Gradients (IG) effectively capture critical points with both positive and negative impacts on predictions. However, current evaluation metrics fail to assess this capability, as they inadvertently cancel out opposing feature contributions. To address this limitation, we propose novel evaluation metrics, Cumulative Prediction Difference (CPD) and Cumulative Prediction Preservation (CPP), to systematically assess whether attribution methods accurately identify significant positive and negative points in time series XAI. Under these metrics, conventional IG outperforms recent counterparts. However, directly applying IG to time series data may lead to suboptimal outcomes, as generated paths ignore temporal relationships and introduce out-of-distribution samples. To overcome these challenges, we introduce TIMING, which enhances IG by incorporating temporal awareness while maintaining its theoretical properties. Extensive experiments on synthetic and real-world time series benchmarks demonstrate that TIMING outperforms existing time series XAI baselines. Our code is available at https://github.com/drumpt/TIMING.
中文:近期时间序列可解释人工智能方法常忽略对预测的方向性影响,但传统积分梯度能有效识别关键点;然而现有评估指标存在缺陷,因此我们提出新指标证明其优越性,并引入具有时间感知能力的TIMING方法,在实验中超越现有基准。
English: Recent time series explainable AI methods often miss capturing directional impacts on predictions, but conventional Integrated Gradients (IG) effectively identify critical points; however, their evaluation is flawed, so we propose new metrics showing IG's superiority and introduce TIMING, a temporally-aware IG enhancement that outperforms existing methods.
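A rough sketch of what a CPD-style metric could look like, under stated assumptions: points are masked to a baseline value in order of attribution magnitude, and the absolute prediction changes are accumulated so that positive and negative contributions no longer cancel. The paper's exact definition may differ.

```python
import numpy as np

def cumulative_prediction_difference(model, x, attribution, baseline=0.0):
    """Mask the most-attributed time points first and accumulate the
    absolute change in the prediction after each masking step."""
    order = np.argsort(-np.abs(attribution))   # rank by magnitude, signed attrs
    prev, cpd, x_pert = model(x), 0.0, x.copy()
    for t in order:
        x_pert[t] = baseline
        cur = model(x_pert)
        cpd += abs(cur - prev)
        prev = cur
    return cpd

model = lambda x: float(np.tanh(x @ np.array([1.0, -2.0, 0.5, 0.0])))
x = np.array([0.5, -0.3, 1.0, 2.0])
attr = np.array([0.5, 0.6, 0.5, 0.0])          # stand-in attribution scores
print(cumulative_prediction_difference(model, x, attr))
```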
Authors:Zeming Wei, Yiwen Guo, Yisen Wang
Abstract:
Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we illustrate theoretically through a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at https://github.com/PKU-ML/Cross-Class-Features-AT.
中文: 本文通过分析跨类别特征提出了对抗训练的新视角,揭示了这类特征在实现鲁棒性中的关键作用,并通过类别特征归因解释了鲁棒过拟合等现象。
English: This paper introduces a novel perspective on adversarial training by analyzing cross-class features, revealing their crucial role in achieving robustness and explaining phenomena like robust overfitting through class-wise feature attribution.
Authors:Kuang He, Wei Tang, Tong Wei, Min-Ling Zhang
Abstract:
Partial label learning (PLL) seeks to train generalizable classifiers from datasets with inexact supervision, a common challenge in real-world applications. Existing studies have developed numerous approaches to progressively refine and recover ground-truth labels by training convolutional neural networks. However, limited attention has been given to foundation models that offer transferable representations. In this work, we empirically conduct comprehensive evaluations of 11 foundation models across 13 PLL approaches on 8 benchmark datasets under 3 PLL scenarios. We further propose PartialCLIP, an efficient fine-tuning framework for foundation models in PLL. Our findings reveal that current PLL approaches tend to 1) achieve significant performance gains when using foundation models, 2) exhibit remarkably similar performance to each other, 3) maintain stable performance across varying ambiguity levels, while 4) are susceptible to foundation model selection and adaptation strategies. Additionally, we demonstrate the efficacy of text-embedding classifier initialization and effective candidate label filtering using zero-shot CLIP. Our experimental results and analysis underscore the limitations of current PLL approaches and provide valuable insights for developing more generalizable PLL models. The source code can be found at https://github.com/SEU-hk/PartialCLIP.
中文: 本研究评估了基础模型在偏标签学习中的应用,揭示了其显著的性能优势和局限性,并提出了PartialCLIP这一高效微调框架,以提升模型适应性和标签筛选能力。
English: This study evaluates foundation models in partial label learning, revealing their significant performance benefits and limitations, and introduces PartialCLIP, an efficient fine-tuning framework that enhances model adaptability and label filtering.
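The zero-shot candidate filtering is straightforward to sketch. Here `image_emb` and `text_embs` stand in for CLIP encoder outputs (a plain dot product replaces the usual normalized cosine similarity), and top-k retention within the candidate set is an assumed filtering rule:

```python
import numpy as np

def filter_candidates(image_emb, text_embs, candidates, k=2):
    """Keep the k candidate labels whose text embeddings are most similar
    to the image embedding, pruning the ambiguous partial-label set."""
    sims = {c: float(image_emb @ text_embs[c]) for c in candidates}
    return sorted(candidates, key=lambda c: -sims[c])[:k]

rng = np.random.default_rng(0)
classes = ["cat", "dog", "car", "plane"]
text_embs = {c: rng.normal(size=8) for c in classes}
image_emb = text_embs["cat"] + 0.1 * rng.normal(size=8)  # image close to "cat"

candidate_set = ["cat", "car", "plane"]        # ambiguous partial labels
print(filter_candidates(image_emb, text_embs, candidate_set))
```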
Authors:Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Abstract:
We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.
中文: ComfyUI-Copilot是一款基于大语言模型的插件,通过智能节点推荐和自动化工作流构建提升ComfyUI的易用性,经实践验证能有效降低新手门槛并提高工作效率。
English: ComfyUI-Copilot is an AI-powered plugin that enhances ComfyUI's usability by providing intelligent recommendations and automated workflow construction, validated through evaluations to lower entry barriers and boost efficiency.
Authors:Fuyi Zhang, Zhu Yu, Chunhao Li, Runmin Zhang, Xiaokai Bai, Zili Zhou, Si-Yuan Cao, Fang Wang, Hui-Liang Shen
Abstract:
Radar has gained much attention in autonomous driving due to its accessibility and robustness. However, its standalone application for depth perception is constrained by issues of sparsity and noise. Radar-camera depth estimation offers a more promising complementary solution. Despite significant progress, current approaches fail to produce satisfactory dense depth maps, due to the unsatisfactory processing of the sparse and noisy radar data. They constrain the regions of interest for radar points to rigid rectangular regions, which may introduce unexpected errors and confusion. To address these issues, we develop a structure-aware strategy for radar depth enhancement, which provides more targeted regions of interest by leveraging the structural priors of RGB images. Furthermore, we design a Multi-Scale Structure Guided Network to enhance radar features and preserve detailed structures, achieving accurate and structure-detailed dense metric depth estimation. Building on these, we propose a structure-aware radar-camera depth estimation framework, named SA-RCD. Extensive experiments demonstrate that our SA-RCD achieves state-of-the-art performance on the nuScenes dataset. Our code will be available at https://github.com/FreyZhangYeh/SA-RCD.
Chinese: 提出的SA-RCD框架通过结构感知策略和多尺度网络改进雷达-相机深度估计,有效处理雷达稀疏性和噪声问题,在nuScenes数据集上取得了最优性能。
English: The proposed SA-RCD framework enhances radar-camera depth estimation by using structure-aware strategies and a multi-scale network to address radar sparsity and noise, achieving state-of-the-art results on the nuScenes dataset.
Authors:Enrique Sanchez, Isma Hadji, Adrian Bulat, Christos Tzelepis, Brais Martinez, Georgios Tzimiropoulos
Abstract:
In this paper we tackle Image Super Resolution (ISR), using recent advances in Visual Auto-Regressive (VAR) modeling. VAR iteratively estimates the residual in latent space between gradually increasing image scales, a process referred to as next-scale prediction. Thus, the strong priors learned during pre-training align well with the downstream task (ISR). To our knowledge, only VARSR has exploited this synergy so far, showing promising results. However, due to the limitations of existing residual quantizers, VARSR works only at a fixed resolution, i.e. it fails to map intermediate outputs to the corresponding image scales. Additionally, it relies on a 1B transformer architecture (VAR-d24), and leverages a large-scale private dataset to achieve state-of-the-art results. We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former. To the best of our knowledge, this is the first time a quantizer is trained to force semantically consistent residuals at different scales, and the first time that preference-based optimization is used to train a VAR. Using these two components, our model can denoise the LR image and super-resolve at half and full target upscale factors in a single forward pass. Additionally, we achieve \textit{state-of-the-art results on ISR}, while using a small model (300M params vs ~1B params of VARSR), and without using external training data.
中文: 本文提出了一种新颖的图像超分辨率方法,通过结合多尺度处理的分层图像标记化和直接偏好优化来提升变换器性能,使用更小的模型且无需外部数据即实现了最先进的结果。
English: This paper introduces a novel approach to Image Super Resolution by combining a Hierarchical Image Tokenization method for multi-scale processing and Direct Preference Optimization to enhance transformer performance, achieving state-of-the-art results with a smaller model and no external data.
Authors:Huihan Wang, Zhiwen Yang, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu
Abstract:
Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23\% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.
中文摘要:提出的FEAT模型通过引入高效注意力机制和细粒度引导,克服了现有基于Transformer的动态医学视频合成方法的局限性,以更少的参数实现了更优的性能。
English Summary: The proposed FEAT model overcomes limitations in existing Transformer-based methods for dynamic medical video synthesis by introducing efficient attention mechanisms and fine-grained guidance, achieving superior performance with fewer parameters.
Authors:Shiyi Xu, Yiwen Hu, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Abstract:
With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose \textbf{ICPC-Eval}, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge in evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
中文:现有基准和指标难以在竞赛环境中充分评估大型语言模型的编程能力,因此我们提出了ICPC-Eval这一顶级编程竞赛基准,通过真实题目和新颖的Refine@K指标揭示模型需依赖迭代反馈才能接近人类水平的表现。
English: Existing benchmarks and metrics inadequately assess LLMs' coding skills in competitive settings, prompting the introduction of ICPC-Eval, a challenging benchmark with realistic problems and a novel Refine@K metric that reveals models' reliance on iterative feedback to approach human-level performance.
Authors:Andrew Hamara, Greg Hamerly, Pablo Rivas, Andrew C. Freeman
Abstract:
Modern chess engines achieve superhuman performance through deep tree search and regressive evaluation, while human players rely on intuition to select candidate moves followed by a shallow search to validate them. To model this intuition-driven planning process, we train a transformer encoder using supervised contrastive learning to embed board states into a latent space structured by positional evaluation. In this space, distance reflects evaluative similarity, and visualized trajectories display interpretable transitions between game states. We demonstrate that move selection can occur entirely within this embedding space by advancing toward favorable regions, without relying on deep search. Despite using only a 6-ply beam search, our model achieves an estimated Elo rating of 2593. Performance improves with both model size and embedding dimensionality, suggesting that latent planning may offer a viable alternative to traditional search. Although we focus on chess, the proposed embedding-based planning method can be generalized to other perfect-information games where state evaluations are learnable. All source code is available at https://github.com/andrewhamara/SOLIS.
中文摘要:本研究开发了一种基于Transformer的模型,通过监督对比学习将棋盘状态嵌入结构化潜空间,无需深度搜索即可通过轨迹规划实现直觉式走子选择,并达到大师级棋力水平。
English Summary: This research develops a transformer-based model that uses supervised contrastive learning to embed chess positions into a structured latent space, enabling intuitive move selection through trajectory planning without deep search while achieving master-level performance.
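A toy sketch of search-free move selection in the latent space, where `embed`, the state transition, and the favorable direction are all placeholders for the trained contrastive encoder and its evaluation-structured geometry; the paper additionally applies a shallow beam search on top of this scoring.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(state):
    return np.tanh(state)                      # placeholder encoder

favorable_direction = np.ones(4) / 2.0         # placeholder "winning" axis

def select_move(state, legal_moves):
    """Pick the move whose resulting state advances furthest toward the
    favorable region of the embedding space."""
    def score(move):
        child = state + move                   # placeholder transition
        return float(embed(child) @ favorable_direction)
    return max(legal_moves, key=score)

state = rng.normal(size=4)
moves = [rng.normal(size=4) for _ in range(5)]
print(select_move(state, moves))
```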
Authors:Yu-Feng Chen, Tzuhsuan Huang, Pin-Yen Chiu, Jun-Cheng Chen
Abstract:
Diffusion models have achieved remarkable progress in both image generation and editing. However, recent studies have revealed their vulnerability to backdoor attacks, in which specific patterns embedded in the input can manipulate the model's behavior. Most existing research in this area has proposed attack frameworks focused on the image generation pipeline, leaving backdoor attacks in image editing relatively unexplored. Among the few studies targeting image editing, most utilize visible triggers, which are impractical because they introduce noticeable alterations to the input image before editing. In this paper, we propose a novel attack framework that embeds invisible triggers into the image editing process via poisoned training data. We leverage off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers. Our goal is to make the model produce the predefined backdoor target when it receives watermarked inputs, while editing clean images normally according to the given prompt. With extensive experiments across different watermarking models, the proposed method achieves promising attack success rates. In addition, the analysis of watermark characteristics in terms of backdoor attacks further supports the effectiveness of our approach. The code is available at: https://github.com/aiiu-lab/BackdoorImageEditing
中文: 本文提出了一种针对扩散模型图像编辑的新型后门攻击框架,通过投毒训练数据嵌入隐形触发器,利用不可见水印操控输出,同时保持对干净图像的正常编辑,实验验证了该方法的高攻击成功率。
English: This paper introduces a novel backdoor attack framework for diffusion models in image editing by embedding invisible triggers through poisoned training data, utilizing imperceptible watermarks to manipulate outputs while maintaining normal editing for clean inputs, with experiments confirming high attack success rates.
Authors:Jônata Tyska Carvalho, Stefano Nolfi
Abstract:
We propose a method that enables large language models (LLMs) to control embodied agents by directly mapping continuous observation vectors to continuous action vectors. At the outset, the LLMs generate a control strategy based on a textual description of the agent, its environment, and the intended goal. This strategy is then iteratively refined through a learning process in which the LLMs are repeatedly prompted to improve the current strategy, using performance feedback and sensory-motor data collected during its evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library. The approach proves effective with relatively compact models such as Gpt-oss:120b and Qwen2.5:72b. In most cases, it successfully identifies optimal or near-optimal solutions by integrating symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered as the agent interacts with its environment.
中文: 该方法通过迭代学习生成并优化控制策略,使大型语言模型能够操控具身智能体,并在Gymnasium和MuJoCo任务中通过GPT-oss:120b等模型验证了其有效性。
English: This method enables large language models to control embodied agents by generating and refining control strategies through iterative learning, validated on Gymnasium and MuJoCo tasks using models like GPT-oss:120b and Qwen2.5:72b.
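A condensed sketch of the refinement loop using the real Gymnasium API; the `llm` callable, the prompts, and the single-candidate update rule are illustrative assumptions (the paper also feeds back sensory-motor data collected during evaluation, omitted here).

```python
import gymnasium as gym

def evaluate(policy, episodes=3):
    """Average episode return of a policy on CartPole."""
    env = gym.make("CartPole-v1")
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset(seed=0)
        done = False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
    return total / episodes

def refine_loop(llm, iterations=5):
    prompt = "Write policy(obs)->action (0/1) for CartPole."
    best_code, best_score = None, float("-inf")
    for _ in range(iterations):
        code = llm(prompt)
        scope = {}
        exec(code, scope)                     # assumes the LLM output is trusted
        score = evaluate(scope["policy"])
        if score > best_score:
            best_code, best_score = code, score
        prompt = f"Current policy scored {score:.1f}. Improve it.\n{code}"
    return best_code, best_score

# Stand-in "LLM" that always returns a simple reactive controller.
fake_llm = lambda p: "def policy(obs):\n    return 0 if obs[2] < 0 else 1"
print(refine_loop(fake_llm, iterations=2)[1])
```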
Authors:Mario Malizia, Charles Hamesse, Ken Hasselmann, Geert De Cubber, Nikolaos Tsiogkas, Eric Demeester, Rob Haelterman
Abstract:
The use of robotics in humanitarian demining increasingly involves computer vision techniques to improve landmine detection capabilities. However, in the absence of diverse and realistic datasets, the reliable validation of algorithms remains a challenge for the research community. In this paper, we introduce MineInsight, a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection. The dataset features 35 different targets (15 landmines and 20 commonly found objects) distributed along three distinct tracks, providing a diverse and realistic testing environment. MineInsight is, to the best of our knowledge, the first dataset to integrate dual-view sensor scans from both an Unmanned Ground Vehicle and its robotic arm, offering multiple viewpoints to mitigate occlusions and improve spatial awareness. It features two LiDARs, as well as images captured at diverse spectral ranges, including visible (RGB, monochrome), visible short-wave infrared (VIS-SWIR), and long-wave infrared (LWIR). Additionally, the dataset comes with an estimation of the location of the targets, offering a benchmark for evaluating detection algorithms. We recorded approximately one hour of data in both daylight and nighttime conditions, resulting in around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames. MineInsight serves as a benchmark for developing and evaluating landmine detection algorithms. Our dataset is available at https://github.com/mariomlz99/MineInsight.
中文: 本文介绍了MineInsight,这是首个公开的多传感器数据集,集成了双视角机器人扫描技术,用于多样化和真实的地雷检测算法验证,包含多光谱范围和目标位置信息。
English: This paper introduces MineInsight, the first publicly available multi-sensor dataset integrating dual-view robotic scans for diverse and realistic landmine detection algorithm validation, featuring multiple spectral ranges and target locations.
Authors:Kunshen Zhang
Abstract:
Although perception systems have made remarkable advancements in recent years, particularly in 2D reasoning segmentation, these systems still rely on explicit human instructions or pre-defined categories to identify target objects before executing visual recognition tasks. Such systems have matured significantly, demonstrating the ability to reason about and comprehend implicit user intentions in two-dimensional contexts, producing accurate segmentation masks based on complex and implicit query text. However, a comparable framework and structure for 3D reasoning segmentation remain absent. This paper introduces OpenMaskDINO3D, an LLM designed for comprehensive 3D understanding and segmentation. OpenMaskDINO3D processes point cloud data and text prompts to produce instance segmentation masks, excelling in many 3D tasks. By introducing a SEG token and object identifier, we achieve high-precision 3D segmentation mask generation, enabling the model to directly produce accurate point cloud segmentation results from natural language instructions. Experimental results on the large-scale ScanNet dataset validate the effectiveness of our OpenMaskDINO3D across various tasks.
Chinese: 当前二维推理分割系统依赖显式人工指令或预定义类别,而三维推理分割尚缺乏类似框架,因此本文提出OpenMaskDINO3D大语言模型,通过处理点云数据和文本提示生成精确的三维实例分割掩码,并在ScanNet数据集上验证了其有效性。
English: Current 2D reasoning segmentation systems require explicit human instructions or predefined categories, but a similar framework for 3D reasoning segmentation is lacking, leading to the introduction of OpenMaskDINO3D, an LLM that processes point clouds and text prompts to generate precise 3D instance segmentation masks validated on ScanNet datasets.
Authors:Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
Abstract:
Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination that is difficult to detect. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model's reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model's decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. This joint analysis enables fine-grained hallucination detection even when the final answer appears correct. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. Our code is available at: https://github.com/bebr2/RACE.
中文摘要:RACE框架通过评估推理轨迹的一致性和答案的语义对齐,专门用于检测大型推理模型中的幻觉问题,即使在最终答案正确的情况下也能有效识别逻辑不一致性,优于现有检测方法。
English Summary: The RACE framework is introduced to detect hallucinations in Large Reasoning Models by evaluating reasoning trace consistency and answer alignment, outperforming existing methods in identifying logical inconsistencies even when final answers are correct.
Authors:Svetlana Pavlitska, Jamie Robb, Nikolai Polley, Melih Yazgan, J. Marius Zöllner
Abstract:
Realistic adversarial attacks on various camera-based perception tasks of autonomous vehicles have been successfully demonstrated so far. However, only a few works considered attacks on traffic light detectors. This work shows how CNNs for traffic light detection can be attacked with printed patches. We propose a threat model, where each instance of a traffic light is attacked with a patch placed under it, and describe a training strategy. We demonstrate successful adversarial patch attacks in universal settings. Our experiments show realistic targeted red-to-green label-flipping attacks and attacks on pictogram classification. Finally, we perform a real-world evaluation with printed patches and demonstrate attacks in the lab settings with a mobile traffic light for construction sites and in a test area with stationary traffic lights. Our code is available at https://github.com/KASTEL-MobilityLab/attacks-on-traffic-light-detection.
中文摘要:本研究展示了使用印刷补丁对交通信号灯检测系统实施现实对抗攻击的方法,在模拟和真实环境中均成功实现了针对性标签翻转和分类攻击。
English Summary: This study demonstrates realistic adversarial attacks on traffic light detection systems using printed patches, successfully achieving targeted label-flipping and classification attacks in both simulated and real-world environments.
Authors:Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, Xiangliang Zhang
Abstract:
Logical reasoning is a core capability for large language models (LLMs), yet existing benchmarks that rely solely on final-answer accuracy fail to capture the quality of the reasoning process. To address this, we introduce FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall accuracy, stepwise soundness, and representation-level probing. Leveraging this framework, we conduct a comprehensive study on how different supervision formats in fine-tuning shape reasoning abilities. We fine-tune LLMs on four supervision styles: one in natural language and three symbolic variants. We find a key trade-off: natural language supervision excels at generalization to out-of-distribution and long-chain problems, whereas symbolic supervision is superior at instilling structurally sound, atomic reasoning steps. Furthermore, our probing analysis indicates that fine-tuning primarily refines the model's step-by-step generation process, rather than improving its ability to converge on an answer early. Together, our framework and analysis provide a more rigorous lens for evaluating and improving logical reasoning in LLMs. The code is available at https://github.com/YujunZhou/FineLogic.
中文摘要:FineLogic提出了一种细粒度评估框架,超越最终答案准确性来评估大语言模型的逻辑推理能力,研究发现自然语言监督增强泛化能力而符号监督优化推理结构,且微调主要改进逐步生成过程而非早期答案收敛。
English Summary: FineLogic introduces a fine-grained framework to evaluate LLMs' logical reasoning beyond final-answer accuracy, revealing that natural language supervision enhances generalization while symbolic supervision improves structural reasoning, with fine-tuning primarily refining step-by-step generation processes.
Authors:Yuyi Zhang, Yongxin Shi, Peirong Zhang, Yixin Zhao, Zhenhua Yang, Lianwen Jin
Abstract:
Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) It effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical and synthetic subsets; (3) Comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes, not only in the field of OCR but possibly also in the broader domain of pattern recognition. The dataset is available at https://github.com/SCUT-DLVCLab/MegaHan97K.
中文摘要:MegaHan97K数据集填补了超大类中文字符识别的关键空白,提供97,455个字符类别并涵盖手写、古籍和合成三个子集,完整支持最新GB18030-2022标准,在揭示新研究挑战的同时为文化遗产保护与数字应用开辟了新前景。
English Summary: The MegaHan97K dataset addresses the critical gap in mega-category Chinese character recognition by providing 97,455 character categories with balanced samples across handwritten, historical, and synthetic subsets, fully supporting the latest GB18030-2022 standard while revealing new research challenges and opportunities.
Authors:Daniel Barath
Abstract:
Robust estimation is a cornerstone in computer vision, particularly for tasks like Structure-from-Motion and Simultaneous Localization and Mapping. RANSAC and its variants are the gold standard for estimating geometric models (e.g., homographies, relative/absolute poses) from outlier-contaminated data. Despite RANSAC's apparent simplicity, achieving consistently high performance across different problems is challenging. While recent research often focuses on improving specific RANSAC components (e.g., sampling, scoring), overall performance is frequently more influenced by the "bells and whistles" (i.e., the implementation details and problem-specific optimizations) within a given library. Popular frameworks like OpenCV and PoseLib demonstrate varying performance, excelling in some tasks but lagging in others. We introduce SupeRANSAC, a novel unified RANSAC pipeline, and provide a detailed analysis of the techniques that make RANSAC effective for specific vision tasks, including homography, fundamental/essential matrix, and absolute/rigid pose estimation. SupeRANSAC is designed for consistent accuracy across these tasks, improving upon the best existing methods by, for example, 6 AUC points on average for fundamental matrix estimation. We demonstrate significant performance improvements over the state-of-the-art on multiple problems and datasets. Code: https://github.com/danini/superansac
中文: SupeRANSAC是一种新颖的统一RANSAC流程,在多种计算机视觉任务中实现了一致的精度,通过将基础矩阵估计的平均AUC值提高6个点,显著超越了现有方法。
English: SupeRANSAC is a novel unified RANSAC pipeline that achieves consistent accuracy across various computer vision tasks, significantly outperforming existing methods by improving fundamental matrix estimation by an average of 6 AUC points.
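For orientation, the loop that all of these libraries refine looks roughly like the toy line-fitting sketch below. It shows plain RANSAC only, with none of the sampling, scoring, or local-optimization refinements that SupeRANSAC actually unifies.

```python
# Minimal vanilla RANSAC for 2D line fitting: a toy illustration of the base
# loop that robust estimators build on, not the paper's unified pipeline.
import numpy as np

def ransac_line(points: np.ndarray, iters: int = 1000, thresh: float = 0.02,
                rng=np.random.default_rng(0)):
    best_inliers, best_model = None, None
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        d = q - p
        n = np.array([-d[1], d[0]])              # normal to the sampled line
        n_norm = np.linalg.norm(n)
        if n_norm < 1e-12:
            continue                              # degenerate sample
        n = n / n_norm
        residuals = np.abs((points - p) @ n)      # point-to-line distances
        inliers = residuals < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (p, n)
    return best_model, best_inliers

# Usage: 200 noisy points on a line plus 50 uniform outliers.
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, (200, 1))
line_pts = np.hstack([t, 0.5 * t + 0.1]) + rng.normal(0, 0.005, (200, 2))
outliers = rng.uniform(-1, 1, (50, 2))
model, mask = ransac_line(np.vstack([line_pts, outliers]))
print("inliers found:", mask.sum())
```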
Authors:Athanasios C. Antoulas, Ion Victor Gosea, Charles Poussot-Vassal, Pierre Vuillemin
Abstract:
In this note, we evaluate the performance, features, and user experience of several methods (and their implementations) designed for tensor- (or data-) based multivariate function construction and approximation. To this aim, we suggest a collection of multivariate functions extracted from contributions from different communities. First, these functions, of varying complexity (e.g., number and degree of the variables) and nature (e.g., rational, irrational, differentiable or not, symmetric, etc.), are used to construct tensors, each of different dimension and size on disk. Second, grounded on these tensors, we inspect the performance of each considered method (e.g., accuracy, computational time, impact of parameter tuning, etc.). Finally, considering the "best" parameter tuning set, we compare the methods using multiple evaluation criteria. The purpose of this note is not to rank the methods but rather to evaluate the available strategies as fairly as possible, with the aim of guiding users to understand the process, the possibilities, the advantages, and the limits of each tool. The claimed contribution is a complete benchmark collection of available tools for tensor approximation by surrogate models (e.g., rational functions, networks, etc.). In addition, as contributors of the multivariate Loewner Framework (mLF) approach (and its side implementation in MDSPACK), we give more explicit attention and detail to the latter, in order to provide readers with a digest of this work and some details with simple examples.
中文: 本文通过一个多样化的基准函数集评估了多种基于张量的多元函数逼近方法的性能和用户体验,旨在公平比较各工具的优势与局限而非进行排名,并特别介绍了多元Loewner框架方法的细节。
English: This note evaluates the performance and user experience of various tensor-based multivariate function approximation methods using a diverse benchmark collection to fairly compare their capabilities and limitations without ranking them.
Authors:Yusuke Matsui
Abstract:
Approximate nearest neighbor search (ANNS) is an essential building block for applications like RAG but can sometimes yield results that are overly similar to each other. In certain scenarios, search results should be similar to the query and yet diverse. We propose LotusFilter, a post-processing module to diversify ANNS results. We precompute a cutoff table summarizing vectors that are close to each other. During filtering, LotusFilter greedily looks up the table to delete redundant vectors from the candidates. We demonstrate that LotusFilter runs fast (0.02 ms/query) in settings resembling real-world RAG applications, utilizing features such as OpenAI embeddings. Our code is publicly available at https://github.com/matsui528/lotf.
Chinese: LotusFilter是一个后处理模块,通过预计算邻近向量表并贪婪地剔除候选结果中的冗余向量,有效提升近似最近邻搜索结果的多样性,在真实RAG应用中展现出高速处理性能。
English: LotusFilter is a post-processing module that efficiently diversifies approximate nearest neighbor search results by greedily removing redundant vectors using a precomputed cutoff table, operating at high speed in real-world RAG applications.
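A minimal sketch of the cutoff-table idea is given below. The real library precomputes the table with an ANNS index rather than the brute-force distances used here, and its API differs; threshold and sizes are toy choices.

```python
# Sketch of cutoff-table-based greedy diversification in the spirit of
# LotusFilter; the library's actual data structures and API differ.
import numpy as np

def build_cutoff_table(db: np.ndarray, eps: float) -> list[set[int]]:
    """Precompute, for each vector, the set of neighbors closer than eps."""
    d2 = ((db[:, None, :] - db[None, :, :]) ** 2).sum(-1)  # O(N^2): toy scale only
    return [set(np.flatnonzero(row < eps**2)) - {i} for i, row in enumerate(d2)]

def lotus_filter(candidates: list[int], table: list[set[int]], k: int) -> list[int]:
    """Greedily keep candidates (assumed sorted by similarity to the query),
    skipping any later candidate that lies within eps of one already kept."""
    kept, banned = [], set()
    for c in candidates:
        if c in banned:
            continue
        kept.append(c)
        banned |= table[c]           # table lookup bans this candidate's neighbors
        if len(kept) == k:
            break
    return kept

db = np.random.default_rng(0).normal(size=(300, 32)).astype(np.float32)
table = build_cutoff_table(db, eps=6.0)          # toy threshold
diverse = lotus_filter(candidates=list(range(100)), table=table, k=10)
print(diverse)
```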
Authors:Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng
Abstract:
Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench.
中文: 语音蕴含超越文本的丰富声学信息,为此我们推出了MMSU基准,旨在全面评估并推动多模态语音大语言模型在包含多种语言特征的语音理解与推理能力上的发展。
English: Speech conveys rich acoustic information beyond text, and the MMSU benchmark is introduced to evaluate and advance multimodal SpeechLLMs' understanding and reasoning in spoken language across diverse linguistic features.
Authors:Jiachen Tang, Zhonghao Wang, Sirui Chen, Sheng Zhou, Jiawei Chen, Jiajun Bu
Abstract:
Graph Transformers (GTs) have recently demonstrated remarkable performance across diverse domains. By leveraging attention mechanisms, GTs are capable of modeling long-range dependencies and complex structural relationships beyond local neighborhoods. However, their applicable scenarios remain underexplored, which highlights the need to identify when and why they excel. Furthermore, unlike GNNs, which predominantly rely on message-passing mechanisms, GTs exhibit a diverse design space in areas such as positional encoding, attention mechanisms, and graph-specific adaptations. Yet, it remains unclear which of these design choices are truly effective and under what conditions. As a result, the community currently lacks a comprehensive benchmark and library to promote a deeper understanding and further development of GTs. To address this gap, this paper introduces OpenGT, a comprehensive benchmark for Graph Transformers. OpenGT enables fair comparisons and multidimensional analysis by establishing standardized experimental settings and incorporating a broad selection of state-of-the-art GNNs and GTs. Our benchmark evaluates GTs from multiple perspectives, encompassing diverse tasks and datasets with varying properties. Through extensive experiments, our benchmark has uncovered several critical insights, including the difficulty of transferring models across task levels, the limitations of local attention, the efficiency trade-offs in several models, the application scenarios of specific positional encodings, and the preprocessing overhead of some positional encodings. We aspire for this work to establish a foundation for future graph transformer research emphasizing fairness, reproducibility, and generalizability. We have developed an easy-to-use library OpenGT for training and evaluating existing GTs. The benchmark code is available at https://github.com/eaglelab-zju/OpenGT.
中文: 图变换器在建模复杂依赖关系方面表现出色,但缺乏全面的基准测试,因此推出OpenGT以进行公平比较和多维分析,推动该领域研究发展。
English: Graph Transformers have shown strong performance in modeling complex dependencies but lack comprehensive benchmarks, prompting the introduction of OpenGT for fair comparisons and multidimensional analysis to advance research.
Authors:Suhan Woo, Seongwon Lee, Jinwoo Jang, Euntai Kim
Abstract:
When applying Visual Place Recognition (VPR) to real-world mobile robots and similar applications, perspective-to-equirectangular (P2E) formulation naturally emerges as a suitable approach to accommodate diverse query images captured from various viewpoints. In this paper, we introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space, designed to address the unique challenges of P2E VPR. The key idea behind HypeVPR is that visual environments captured by panoramic views exhibit inherent hierarchical structures. To leverage this property, we employ hyperbolic space to represent hierarchical feature relationships and preserve distance properties within the feature space. To achieve this, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Additionally, HypeVPR adopts an efficient coarse-to-fine search strategy to enable flexible control over accuracy-efficiency trade-offs and ensure robust matching even between descriptors from different image types. This approach allows HypeVPR to outperform existing methods while significantly accelerating retrieval and reducing database storage requirements. The code and models will be released at https://github.com/suhan-woo/HypeVPR.git.
中文: HypeVPR提出了一种双曲空间中的分层嵌入框架,通过利用全景视图固有的层次结构解决透视到等距柱面投影的视觉位置识别难题,实现了高效粗到精搜索和更低存储需求的优越性能。
English: HypeVPR introduces a hierarchical embedding framework in hyperbolic space to address perspective-to-equirectangular visual place recognition challenges by leveraging inherent hierarchical structures in panoramic views, achieving superior performance with efficient coarse-to-fine search and reduced storage.
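As background, hyperbolic distances grow rapidly toward the boundary, which is what makes hierarchies embeddable with low distortion. The sketch below computes distances in the Poincaré ball, one common model of hyperbolic space; the abstract does not specify which model or curvature HypeVPR uses, so this is illustrative only.

```python
# Geodesic distance in the Poincare ball (curvature -1), a common model of
# hyperbolic space; illustrative background, not HypeVPR's implementation.
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Distance between points strictly inside the unit ball."""
    uu, vv = u @ u, v @ v
    assert uu < 1.0 and vv < 1.0, "points must lie strictly inside the unit ball"
    diff = u - v
    arg = 1.0 + 2.0 * (diff @ diff) / ((1.0 - uu) * (1.0 - vv))
    return float(np.arccosh(arg))

# Points near the boundary are exponentially "far" apart, the property that
# lets tree-like (hierarchical) structure embed with low distortion.
a = np.array([0.0, 0.0]); b = np.array([0.9, 0.0]); c = np.array([0.99, 0.0])
print(poincare_distance(a, b), poincare_distance(a, c))
```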
Authors:Shenshen Li, Kaiyuan Deng, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Heng Tao Shen, Xing Xu
Abstract:
While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning by two complementary estimators: 1) Causal Discrepancy Estimator (CDE) based on the potential outcome model principle, eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) to substitute trivial instances with cognitively challenging ones, thereby ensuring complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.
Chinese: 本研究提出推理激活潜力(RAP)方法,通过筛选关键"认知样本"来提升多模态大语言模型的推理能力,仅用9.3%的训练数据即可实现更优性能,同时降低超过43%的计算成本。
English: This study introduces Reasoning Activation Potential (RAP), a novel data selection method that identifies sparse "cognitive samples" to enhance multi-modal reasoning in MLLMs, achieving superior performance with only 9.3% of training data while cutting computational costs by over 43%.
Authors:Niki Martinel, Rita Pucci
Abstract:
We present a novel dual-stream architecture that achieves state-of-the-art underwater image enhancement by explicitly integrating the Jaffe-McGlamery physical model with capsule clustering-based feature representation learning. Our method simultaneously estimates transmission maps and spatially-varying background light through a dedicated physics estimator while extracting entity-level features via capsule clustering in a parallel stream. This physics-guided approach enables parameter-free enhancement that respects underwater formation constraints while preserving semantic structures and fine-grained details. Our approach also features a novel optimization objective ensuring both physical adherence and perceptual quality across multiple spatial frequencies. To validate our approach, we conducted extensive experiments across six challenging benchmarks. Results demonstrate consistent improvements of +0.5dB PSNR over the best existing methods while requiring only one-third of their computational complexity (FLOPs), or alternatively, more than +1dB PSNR improvement when compared to methods with similar computational budgets. Code and data will be available at https://github.com/iN1k1/.
中文: 该新颖双流架构通过融合物理成像模型与胶囊聚类,实现了无需参数的水下图像增强,在仅需现有方法三分之一计算量的情况下将PSNR提升0.5分贝。
English: This novel dual-stream architecture integrates a physical underwater imaging model with capsule clustering to achieve state-of-the-art enhancement, improving PSNR by +0.5dB with one-third the computational cost of existing methods.
Authors:Zelu Qi, Ping Shi, Chaoyang Zhang, Shuqi Wang, Fei Zhao, Da Pan, Zefeng Ying
Abstract:
The development of AI-Generated Video (AIGV) technology has been remarkable in recent years, significantly transforming the paradigm of video content production. However, AIGVs still suffer from noticeable visual quality defects, such as noise, blurriness, frame jitter and low dynamic degree, which severely impact the user's viewing experience. Therefore, an effective automatic visual quality assessment is of great importance for AIGV content regulation and generative model improvement. In this work, we decompose the visual quality of AIGVs into three dimensions: technical quality, motion quality, and video semantics. For each dimension, we design a corresponding encoder to achieve effective feature representation. Moreover, considering the outstanding performance of large language models (LLMs) in various vision and language tasks, we introduce a LLM as the quality regression module. To better enable the LLM to establish reasoning associations between multi-dimensional features and visual quality, we propose a specially designed multi-modal prompt engineering framework. Additionally, we incorporate LoRA fine-tuning technology during the training phase, allowing the LLM to better adapt to specific tasks. Our proposed method achieved second place in the NTIRE 2025 Quality Assessment of AI-Generated Content Challenge: Track 2 AI Generated video, demonstrating its effectiveness. Codes can be obtained at https://github.com/QiZelu/AIGVEval.
Chinese: 针对AI生成视频的视觉质量问题,本研究提出了一种结合多维度特征和大语言模型的质量评估方法,通过专门设计的提示框架,在NTIRE 2025比赛中荣获第二名。
English: AI-generated videos often suffer from visual defects, so this study introduces a multi-dimensional quality assessment method using large language models and specialized prompts, which achieved second place in the NTIRE 2025 challenge.
Authors:Gio Paik, Geewook Kim, Jinbae Im
Abstract:
This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs' abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at https://github.com/naver-ai/MMRefine.
中文: 本文介绍MMRefine基准,用于评估多模态大语言模型在六种场景和错误类型中的检测与修正能力,揭示了当前性能瓶颈和改进方向。
English: This paper presents MMRefine, a benchmark for evaluating multimodal large language models' ability to detect and correct errors across six scenarios and error types, identifying current performance bottlenecks and improvement areas.
Authors:Osayamen Jonathan Aimuyo, Byungsoo Oh, Rachee Singh
Abstract:
The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashDMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashDMoE enables fine-grained pipelining of dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashDMoE obviates bulk-synchronous collectives in favor of one-sided, device-initiated, inter-GPU (R)DMA transfers, thus unlocking payload efficiency, where we eliminate bloated or redundant network payloads in sparsely activated layers. When evaluated on a single 8-H100 GPU node with MoE models having up to 128 experts and 16K token sequences, FlashDMoE achieves up to 9× higher GPU utilization, 6× lower latency, 5.7× higher throughput, and 4× better overlap efficiency compared to state-of-the-art baselines, despite using FP32 while baselines use FP16. FlashDMoE demonstrates that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML workloads.
中文:FlashDMoE通过实现完全驻留GPU的算子,将计算和通信融合为单一内核,克服了现有混合专家模型的局限性,在GPU利用率、延迟和吞吐量方面实现了显著提升。
English: FlashDMoE overcomes the limitations of existing Mixture-of-Experts models by implementing a fully GPU-resident operator that fuses computation and communication into a single kernel, achieving significant improvements in GPU utilization, latency, and throughput.
Authors:Juhyun Oh, Eunsu Kim, Alice Oh
Abstract:
Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs' planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset (Xie et al., 2024), we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models' performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with constraint prioritization, often incorrectly favoring newly introduced lower-priority preferences over existing higher-priority constraints. These findings highlight the importance of evaluating LLMs in more realistic, dynamic planning scenarios and suggest specific directions for improving model performance on complex planning tasks. The code and dataset for our framework are publicly available at https://github.com/juhyunohh/FlexTravelBench.
Chinese: Flex-TravelPlanner 是一个评估语言模型动态规划能力的新基准,通过多轮交互和竞争性约束测试,揭示了模型在计划调整和约束优先级处理方面的显著不足。
English: Flex-TravelPlanner is a new benchmark that evaluates language models' dynamic planning abilities through multi-turn scenarios and competing constraints, revealing their limitations in adapting plans and prioritizing constraints effectively.
Authors:Zhuoyun Zhong, Seyedali Golestaneh, Constantinos Chamzas
Abstract:
Planning with learned dynamics models offers a promising approach toward versatile real-world manipulation, particularly in nonprehensile settings such as pushing or rolling, where accurate analytical models are difficult to obtain. However, collecting training data for learning-based methods can be costly and inefficient, as it often relies on randomly sampled interactions that are not necessarily the most informative. Furthermore, learned models tend to exhibit high uncertainty in underexplored regions of the skill space, undermining the reliability of long-horizon planning. To address these challenges, we propose ActivePusher, a novel framework that combines residual-physics modeling with uncertainty-based active learning, to focus data acquisition on the most informative skill parameters. Additionally, ActivePusher seamlessly integrates with model-based kinodynamic planners, leveraging uncertainty estimates to bias control sampling toward more reliable actions. We evaluate our approach in both simulation and real-world environments, and demonstrate that it consistently improves data efficiency and achieves higher planning success rates in comparison to baseline methods. The source code is available at https://github.com/elpis-lab/ActivePusher.
Chinese: ActivePusher通过结合残差物理建模和基于不确定性的主动学习,提升了非抓取操作的数据效率和规划可靠性,在仿真和真实环境中均优于基线方法。
English: ActivePusher enhances nonprehensile manipulation by integrating residual-physics modeling with uncertainty-driven active learning, improving data efficiency and planning reliability in both simulated and real-world settings.
Authors:Qiming Hu, Linlong Fan, Yiyan Luo, Yuhang Yu, Xiaojie Guo, Qingnan Fan
Abstract:
The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available at https://github.com/mingcv/TADiSR.
中文:提出的TADiSR框架通过文本感知注意力和联合分割解码器,有效恢复了超分辨率图像中的自然细节和文本结构,在真实场景中实现了卓越的性能和泛化能力。
English: The proposed TADiSR framework uses text-aware attention and joint segmentation decoders to effectively restore both natural details and text structures in super-resolved images, achieving superior performance and generalization in real-world scenarios.
Authors:Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, Ranjay Krishna
Abstract:
Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks like 3D cube net folding and tangram puzzles that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9s) on complex tasks, and speed up significantly (by 7.5 seconds on average) when given intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases like tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.
中文: STARE基准测试通过多步骤视觉模拟任务评估AI模型的空间推理能力,结果表明人类借助视觉辅助表现优异,而模型在复杂3D和综合拼图任务中表现不佳,仅在简单2D变换上表现良好。
English: The STARE benchmark evaluates AI models on spatial reasoning tasks requiring multi-step visual simulations, revealing that while humans excel with visual aids, models struggle with complex 3D and integrated puzzles despite performing well on simpler 2D transformations.
Authors:Li Liu, Heng Yong
Abstract:
Recently, machine learning methods have gained significant traction in scientific computing, particularly for solving Partial Differential Equations (PDEs). However, methods based on deep neural networks (DNNs) often lack convergence guarantees and computational efficiency compared to traditional numerical schemes. This work introduces DeePoly, a novel framework that transforms the solution paradigm from pure non-convex parameter optimization to a two-stage approach: first employing a DNN to capture complex global features, followed by linear space optimization with combined DNN-extracted features (Spotter) and polynomial basis functions (Sniper). This strategic combination leverages the complementary strengths of both methods -- DNNs excel at approximating complex global features (i.e., high-gradient features) and stabilize the polynomial approximation while polynomial bases provide high-precision local corrections with convergence guarantees. Theoretical analysis and numerical experiments demonstrate that this approach significantly enhances both high-order accuracy and efficiency across diverse problem types while maintaining mesh-free and scheme-free properties. This paper also serves as a theoretical exposition for the open-source project DeePoly.
中文摘要:DeePoly框架通过将深度神经网络用于全局特征提取与多项式基函数进行局部修正相结合,显著提升了求解偏微分方程的精度和效率,同时保持了理论收敛性和无网格特性。
English Summary: The DeePoly framework enhances PDE solving by combining deep neural networks for global feature approximation with polynomial basis functions for precise local corrections, achieving improved accuracy and efficiency while maintaining theoretical guarantees.
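The two-stage structure can be sketched in a few lines: nonlinear features capture the global shape, then a linear least-squares solve over those features plus a polynomial basis supplies high-precision correction. In the sketch below, fixed random tanh features stand in for the trained Spotter network, which is a simplification of the actual method.

```python
# Two-stage sketch in the spirit of DeePoly: nonlinear "Spotter" features plus
# a polynomial "Sniper" basis, combined by linear least squares. The paper
# trains a DNN for stage one; random tanh features stand in for it here.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 400)[:, None]
y = np.tanh(8 * x[:, 0]) + 0.1 * np.sin(20 * x[:, 0])   # sharp-gradient target

# Stage 1 stand-in: fixed random tanh features approximate the global shape.
W, b = rng.normal(0, 4, (1, 32)), rng.normal(0, 2, 32)
spotter = np.tanh(x @ W + b)

# Stage 2: augment with a polynomial basis and solve a linear least-squares
# problem, the step that carries classical convergence guarantees.
sniper = np.hstack([x ** k for k in range(6)])
features = np.hstack([spotter, sniper])
coef, *_ = np.linalg.lstsq(features, y, rcond=None)
print("max abs error:", np.abs(features @ coef - y).max())
```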
Authors:Marianna Nezhurina, Tomer Porian, Giovanni Pucceti, Tommie Kerssies, Romain Beaumont, Mehdi Cherti, Jenia Jitsev
Abstract:
In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing one to decide which procedure is to be preferred for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples-seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, which use either contrastive-only or contrastive-and-captioning text generative loss. Ensuring sufficient prediction accuracy for held-out points, we use the derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen the validity of the comparison, we show scaling laws for various downstream tasks, classification, retrieval, and segmentation, and for different open datasets, DataComp, DFN and Re-LAION, observing consistently the same trends. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws thus provides the means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, paving the way for systematic comparison and improvement of open foundation models and datasets for their creation. We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves 80.3% zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing experiments in the paper and raw experiments data can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.
中文: 本研究展示了如何利用扩展定律比较模型与数据集,发现MaMMUT在多任务和数据集上比CLIP具有更优的扩展性和样本效率,同时开源了模型与代码以支持复现。
English: This study demonstrates how scaling laws can be used to compare models and datasets, revealing MaMMUT's superior scalability and sample efficiency over CLIP across multiple tasks and datasets, while providing open-source models and code for reproducibility.
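The comparison rests on fitting scaling-law curves to measurements at many scales. Below is a minimal sketch of such a fit with a saturating power law; the functional form and the toy numbers are assumptions for illustration, not the paper's measurements.

```python
# Sketch of a scaling-law fit of the kind used for model comparison: a
# saturating power law L(s) = a * s**(-alpha) + c fit to (samples seen, loss)
# pairs. Functional form and data are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def power_law(s, a, alpha, c):
    # s is samples seen in units of 1e8, which keeps the fit well-conditioned
    return a * s ** (-alpha) + c

s = np.array([1, 3, 10, 30, 100, 300], dtype=float)      # samples seen / 1e8 (toy)
loss = np.array([3.10, 2.71, 2.40, 2.18, 2.01, 1.90])    # held-out loss (toy)

(a, alpha, c), _ = curve_fit(power_law, s, loss, p0=(1.5, 0.3, 1.8), maxfev=20000)
print(f"alpha={alpha:.3f}, asymptote c={c:.3f}")
print("predicted loss at 1.28e10 samples:", power_law(128.0, a, alpha, c))
```

Comparing two procedures then amounts to fitting one such curve per procedure and reading off which attains lower predicted loss across the scale span, rather than at a single reference scale.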
Authors:Xun Li, Qiong Wu, Pingyi Fan, Kezhi Wang, Nan Cheng, Khaled B. Letaief
Abstract:
Edge caching is an emerging technology that empowers caching units at edge nodes, allowing users to fetch contents of interest that have been pre-cached at the edge nodes. The key to pre-caching is to maximize the cache hit percentage for cached content without compromising users' privacy. In this letter, we propose a federated learning (FL) assisted edge caching scheme based on a lightweight-architecture denoising diffusion probabilistic model (LDPM). Our simulation results verify that our proposed scheme achieves a higher cache hit percentage compared to existing FL-based methods and baseline methods.
中文摘要:本文提出了一种基于轻量级去噪扩散概率模型的联邦学习辅助边缘缓存方案,相比现有方法在保护用户隐私的同时实现了更高的缓存命中率。
English Summary: This letter introduces a federated learning-assisted edge caching scheme using a lightweight denoising diffusion probabilistic model, which enhances cache hit rates while preserving user privacy compared to existing methods.
Authors:Qingchuan Li, Jiatong Li, Zirui Liu, Mingyue Cheng, Yuting Zeng, Qi Liu, Tongxuan Liu
Abstract:
Neuro-symbolic approaches that combine large language models (LLMs) with solvers excel at logical reasoning problems that need long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas. Then reliable symbolic solvers return correct solutions. Despite their success, we find that LLMs, as translators, struggle to handle lexical diversification, a common linguistic phenomenon, indicating that LLMs as logic translators are unreliable in real-world scenarios. Moreover, existing logical reasoning benchmarks lack lexical diversity, failing to challenge LLMs' ability to translate such text and thus obscuring this issue. In this work, we propose SCALe, a benchmark designed to address this significant gap through logic-invariant lexical diversification. By using LLMs to transform original benchmark datasets into lexically diversified but logically equivalent versions, we evaluate LLMs' ability to consistently map diverse expressions to uniform logical symbols on these new datasets. Experiments using SCALe further confirm that current LLMs exhibit deficiencies in this capability. Building directly on the deficiencies identified through our benchmark, we propose a new method, MenTaL, to address this limitation. This method guides LLMs to first construct a table unifying diverse expressions before performing translation. Applying MenTaL through in-context learning and supervised fine-tuning (SFT) significantly improves the performance of LLM translators on lexically diversified text. Our code is now available at https://github.com/wufeiwuwoshihua/LexicalDiver.
中文: 神经符号方法中,大型语言模型作为逻辑翻译器难以处理词汇多样化问题,导致实际应用不可靠,因此我们提出了SCALe基准和MenTaL方法,以提升模型将多样化表达映射到统一逻辑符号的一致性能力。
English: Neuro-symbolic approaches using LLMs as logic translators struggle with lexical diversification, leading to unreliable performance in real-world scenarios, prompting the creation of the SCALe benchmark and MenTaL method to enhance LLMs' consistency in mapping varied expressions to logical symbols.
Authors:K. O. T. Erziev
Abstract:
We propose that benchmarking LLMs on questions which have no reasonable answer actually isn't as silly as it sounds. We also present a benchmark that allows such testing and a method to modify the existing datasets, and discover that existing models demonstrate performance far from perfect on such questions. Our code and data artifacts are available at https://github.com/L3G5/impossible-bench
中文摘要:对无法回答的问题进行大语言模型基准测试并非无意义,研究发现现有模型在此类问题上表现远未完善。
English Summary: Benchmarking LLMs on unanswerable questions proves insightful, revealing significant performance gaps despite seeming counterintuitive.
Authors:Nikita Oskolkov, Huzhenyu Zhang, Dmitry Makarov, Dmitry Yudin, Aleksandr Panov
Abstract:
The 3D scene graph models spatial relationships between objects, enabling the agent to efficiently navigate in a partially observable environment and predict the location of the target object. This paper proposes an original framework named SGN-CIRL (3D Scene Graph-Based Reinforcement Learning Navigation) for mapless reinforcement learning-based robot navigation with a learnable representation of an open-vocabulary 3D scene graph. To accelerate and stabilize the training of reinforcement learning-based algorithms, the framework also employs imitation learning and curriculum learning. The first one enables the agent to learn from demonstrations, while the second one structures the training process by gradually increasing task complexity from simple to more advanced scenarios. Numerical experiments conducted in the Isaac Sim environment showed that using a 3D scene graph for reinforcement learning significantly increased the success rate in difficult navigation cases. The code is open-sourced and available at: https://github.com/Xisonik/Aloha_graph.
中文: 本文提出SGN-CIRL框架,通过三维场景图实现无地图机器人导航,结合模仿学习与课程学习提升训练效果,在复杂环境中显著提高了导航成功率。
English: This paper introduces SGN-CIRL, a framework using 3D scene graphs for mapless robot navigation that enhances training with imitation and curriculum learning, significantly improving success rates in complex environments.
Authors:C. Evans Hedges
Abstract:
We provide evidence that orthogonalizing gradients during training improves model calibration without sacrificing accuracy. On CIFAR-10 with 10% labeled data, $\perp$Grad matches SGD in accuracy but yields consistently improved calibration metrics such as lower test loss, reduced softmax overconfidence, and higher predictive entropy. These benefits persist under input corruption (CIFAR-10C) and extended training, where $\perp$Grad models degrade more gracefully than SGD-trained counterparts. $\perp$Grad is optimizer-agnostic, incurs minimal overhead, and works well with post-hoc calibration techniques like temperature scaling.
Theoretically, we prove convergence of a simplified version of $\perp$Grad under mild assumptions and characterize its stationary points in positive homogeneous networks: $\perp$Grad converges to solutions where further loss reduction requires confidence scaling rather than decision boundary improvement. Code for this paper can be found at: https://github.com/evanshedges2/orthograd_improves_calibration.
中文: 训练过程中正交化梯度可在保持准确性的同时提升模型校准度,具体表现为各项指标优化及在多种条件下的稳健性增强。
English: Orthogonalizing gradients during training enhances model calibration while maintaining accuracy, as demonstrated by improved metrics and robustness under various conditions.
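A minimal sketch of the orthogonalization step is shown below, under the assumption (one natural reading of the abstract, not a quotation of the paper's code) that the component of each gradient parallel to its weight tensor is projected out between backward() and the optimizer step.

```python
# Sketch of per-tensor gradient orthogonalization, assuming the gradient
# component parallel to each weight tensor is projected out before stepping.
import torch

@torch.no_grad()
def orthogonalize_gradients(model: torch.nn.Module, eps: float = 1e-30):
    for p in model.parameters():
        if p.grad is None:
            continue
        w, g = p.view(-1), p.grad.view(-1)
        coeff = (w @ g) / (w @ w + eps)   # component of g along w
        g.sub_(coeff * w)                  # keep only the orthogonal part (in place)

model = torch.nn.Linear(10, 2)
x, target = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), target)
loss.backward()
orthogonalize_gradients(model)             # call between backward() and step()
torch.optim.SGD(model.parameters(), lr=0.1).step()
```

Because the projection only edits `p.grad`, the hook-free wrapper works with any optimizer, consistent with the abstract's claim that the method is optimizer-agnostic.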
Authors:Apurv Verma, NhatHai Phan, Shubhendu Trivedi
Abstract:
Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives.
To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
中文摘要:大型语言模型的水印技术会削弱其真实性、安全性和实用性,但提出的对齐重采样方法能在保持水印可检测性的同时有效恢复这些对齐特性。
English Summary: Watermarking techniques in large language models can degrade their truthfulness, safety, and helpfulness, but the proposed Alignment Resampling method effectively restores these alignment properties while maintaining watermark detectability.
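At its core, Alignment Resampling is best-of-n sampling under a reward model. The sketch below shows that skeleton with placeholder `generate` and `reward` callables standing in for a watermarked sampler and an external reward model.

```python
# Best-of-n skeleton of Alignment Resampling; `generate` and `reward` are
# placeholders for a watermarked LLM sampler and an external reward model.
from typing import Callable
import random

def alignment_resample(prompt: str,
                       generate: Callable[[str], str],
                       reward: Callable[[str, str], float],
                       n: int = 4) -> str:
    """Draw n watermarked generations and keep the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy demo with stand-ins; the abstract reports that n = 2..4 already recovers
# or surpasses unwatermarked alignment scores.
random.seed(0)
best = alignment_resample(
    "hello",
    generate=lambda p: p + "!" * random.randint(0, 3),  # stand-in sampler
    reward=lambda p, c: -abs(len(c) - 7),               # stand-in reward model
)
print(best)
```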
Authors:Hasin Us Sami, Swapneel Sen, Amit K. Roy-Chowdhury, Srikanth V. Krishnamurthy, Basak Guler
Abstract:
Federated learning (FL) allows multiple data-owners to collaboratively train machine learning models by exchanging local gradients, while keeping their private data on-device. To simultaneously enhance privacy and training efficiency, recently parameter-efficient fine-tuning (PEFT) of large-scale pretrained models has gained substantial attention in FL. While keeping a pretrained (backbone) model frozen, each user fine-tunes only a few lightweight modules to be used in conjunction, to fit specific downstream applications. Accordingly, only the gradients with respect to these lightweight modules are shared with the server. In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. We demonstrate gradient inversion attacks on a popular PEFT mechanism, the adapter, which allow an attacker to reconstruct local data samples of a target user, using only the accessible adapter gradients. Via extensive experiments, we demonstrate that a large batch of fine-tuning images can be retrieved with high fidelity. Our attack highlights the need for privacy-preserving mechanisms for PEFT, while opening up several future directions. Our code is available at https://github.com/info-ucr/PEFTLeak.
Chinese: 联邦学习中的参数高效微调(PEFT)虽能在保护数据隐私的同时实现协同训练,但本研究发现恶意设计的预训练模型和适配器模块可通过梯度反演攻击,利用共享梯度重构用户的本地数据样本。
English: Federated learning with parameter-efficient fine-tuning (PEFT) enables collaborative model training while preserving data privacy, but this study reveals that maliciously designed pre-trained models and adapters can exploit shared gradients to reconstruct users' local data through gradient inversion attacks.
Authors:Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Pavel Plyusnin, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov
Abstract:
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose JAGUAR SignSGD, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose JAGUAR Muon, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM
中文: 传统优化器微调大语言模型成本高昂,因此我们提出如JAGUAR SignSGD和JAGUAR Muon等内存高效的零阶方法,在降低资源需求的同时达到一阶方法性能。
English: Fine-tuning LLMs with traditional optimizers is costly, so we propose memory-efficient zero-order methods like JAGUAR SignSGD and JAGUAR Muon, which match first-order performance while reducing resource demands.
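To illustrate the family of updates involved, here is a toy zero-order momentum signSGD step using one random probe direction and two function evaluations per iteration; the precise JAGUAR update rule may differ from this sketch.

```python
# Toy zero-order momentum signSGD: one random direction, two function
# evaluations (O(1) per iteration), momentum on the directional estimate,
# update by the sign. Illustrative of the ZO-signSGD family, not JAGUAR itself.
import numpy as np

def zo_sign_sgd(f, x0, lr=0.01, mu=1e-3, beta=0.9, iters=500,
                rng=np.random.default_rng(0)):
    x, m = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        u = rng.normal(size=x.shape)                        # random probe direction
        g_dir = (f(x + mu * u) - f(x - mu * u)) / (2 * mu)  # directional derivative
        m = beta * m + (1 - beta) * g_dir * u               # momentum on ZO estimate
        x -= lr * np.sign(m)                                # sign step
    return x

quad = lambda z: ((z - 3.0) ** 2).sum()                     # toy objective
print(zo_sign_sgd(quad, np.zeros(5))[:3])                   # approaches 3.0
```

The memory appeal is visible even in the toy: only the iterate and one momentum buffer are stored, and no backward pass is ever taken.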
Authors:Zihao Dong, Alan Papalia, Leonard Jung, Alenna Spiro, Philip R. Osteen, Christa S. Robison, Michael Everett
Abstract:
A key open challenge in off-road autonomy is that the traversability of terrain often depends on the vehicle's state. In particular, some obstacles are only traversable from some orientations. However, learning this interaction by encoding the angle of approach as a model input demands a large and diverse training dataset and is computationally inefficient during planning due to repeated model inference. To address these challenges, we present SPARTA, a method for estimating approach-angle-conditioned traversability from point clouds. Specifically, we impose geometric structure in our network by outputting a smooth analytical function over the 1-Sphere that predicts a risk distribution for any angle of approach with minimal overhead and can be reused for subsequent queries. The function is composed of Fourier basis functions, which have important advantages for generalization due to their periodic nature and smoothness. We demonstrate SPARTA both in a high-fidelity simulation platform, where our model achieves a 91% success rate crossing a 40m boulder field (compared to 73% for the baseline), and on hardware, illustrating the generalization ability of the model to real-world settings. Our code will be available at https://github.com/neu-autonomy/SPARTA.
中文:SPARTA是一种创新方法,通过从点云输出基于傅里叶的平滑风险函数来评估基于接近角的地形可通行性,在模拟中实现了91%的成功率,并展示了实际应用中的泛化能力。
English: SPARTA is a novel method that estimates terrain traversability based on approach angles by outputting a smooth Fourier-based risk function from point clouds, achieving a 91% success rate in simulations and demonstrating real-world generalization.
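The analytic output head can be sketched directly: fit Fourier coefficients over the 1-Sphere once, then evaluate risk for any approach angle without further network inference. Basis order, fitting procedure, and the toy labels below are illustrative assumptions.

```python
# Sketch of an angle-conditioned risk head in the spirit of SPARTA: fit
# Fourier coefficients by least squares, then reuse the analytic function
# for any query angle. Illustrative assumptions, not the paper's network.
import numpy as np

def fourier_basis(theta: np.ndarray, order: int = 3) -> np.ndarray:
    cols = [np.ones_like(theta)]
    for k in range(1, order + 1):
        cols += [np.cos(k * theta), np.sin(k * theta)]
    return np.stack(cols, axis=-1)              # smooth and periodic by design

rng = np.random.default_rng(0)
theta_train = rng.uniform(0, 2 * np.pi, 200)
risk_train = 0.5 + 0.4 * np.cos(theta_train) + rng.normal(0, 0.05, 200)  # toy labels

coef, *_ = np.linalg.lstsq(fourier_basis(theta_train), risk_train, rcond=None)

def risk(theta: float) -> float:
    """Risk for any approach angle: no network inference per query."""
    return (fourier_basis(np.atleast_1d(np.asarray(theta))) @ coef).item()

print(risk(0.0), risk(np.pi))                   # high vs low predicted risk
```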
Authors:Philippe Chlenski, Itsik Pe'er
Abstract:
Decision trees and models that use them as primitives are workhorses of machine learning in Euclidean spaces. Recent work has further extended these models to the Lorentz model of hyperbolic space by replacing axis-parallel hyperplanes with homogeneous hyperplanes when partitioning the input space. In this paper, we show how the hyperDT algorithm can be elegantly reexpressed in the Beltrami-Klein model of hyperbolic spaces. This preserves the thresholding operation used in Euclidean decision trees, enabling us to further rewrite hyperDT as simple pre- and post-processing steps that form a wrapper around existing tree-based models designed for Euclidean spaces. The wrapper approach unlocks many optimizations already available in Euclidean space models, improving flexibility, speed, and accuracy while offering a simpler, more maintainable, and extensible codebase. Our implementation is available at https://github.com/pchlenski/hyperdt.
中文: hyperDT算法在Beltrami-Klein双曲模型中被优雅重构,使其能够作为现有欧几里得决策树模型的封装器运行,从而提升效率、灵活性及代码可维护性。
English: The hyperDT algorithm is elegantly reformulated in the Beltrami-Klein hyperbolic model, enabling it to operate as a wrapper around existing Euclidean decision tree models with improved efficiency, flexibility, and code maintainability.
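The wrapper construction the abstract describes can be illustrated in a few lines: map Lorentz-model points to Beltrami-Klein coordinates (where hyperbolic geodesics become straight chords, so Euclidean thresholds remain meaningful), then reuse a stock scikit-learn tree unchanged. This is a hedged sketch of the idea, not the repo's full hyperDT implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def lorentz_to_klein(X):
    """Map points on the Lorentz hyperboloid (x0^2 - x1^2 - ... = 1, x0 > 0)
    to the Beltrami-Klein ball by dividing the space-like coordinates by x0."""
    return X[:, 1:] / X[:, :1]

class KleinTreeWrapper:
    """Pre-process to Klein coordinates, then reuse any Euclidean tree model."""
    def __init__(self, **tree_kwargs):
        self.tree = DecisionTreeClassifier(**tree_kwargs)
    def fit(self, X_lorentz, y):
        self.tree.fit(lorentz_to_klein(X_lorentz), y)
        return self
    def predict(self, X_lorentz):
        return self.tree.predict(lorentz_to_klein(X_lorentz))

# toy usage: lift Euclidean points onto the hyperboloid, then fit
rng = np.random.default_rng(0)
V = rng.standard_normal((100, 2))
X = np.hstack([np.sqrt(1 + (V ** 2).sum(1, keepdims=True)), V])
y = (V[:, 0] > 0).astype(int)
print(KleinTreeWrapper(max_depth=3).fit(X, y).predict(X[:5]))
```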
Authors:Xiang Zheng, Xingjun Ma, Wei-Bin Lee, Cong Wang
Abstract:
Red teaming has proven to be an effective method for identifying and mitigating vulnerabilities in Large Language Models (LLMs). Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy among existing red teaming techniques. However, a lack of a unified benchmark hinders current RFT-based red teaming methods. Implementation details, especially in Proximal Policy Optimization (PPO)-based RFT, significantly affect outcome stability and reproducibility. To address this issue, we introduce RedRFT, a lightweight benchmark designed to simplify and standardize the implementation and evaluation of RFT-based red teaming. RedRFT combines the design strengths of both single-file CleanRL and highly modularized Tianshou, offering high-quality single-file red teaming implementations and modular PPO core components, such as the General Advantage Estimator. It supports a variety of token and sentence diversity metrics, featuring modularized intrinsic reward computation that facilitates plug-and-play experimentation. To clarify their influence on RFT performance, we conducted an extensive ablation study on key components, including Low-Rank Adaptation (LoRA), Kullback-Leibler (KL) divergence, and Lagrange Multiplier. We hope this work contributes to 1) gaining a comprehensive understanding of the implementation nuances of RFT-based red teaming algorithms, and 2) enabling rapid prototyping of innovative features for RFT-based red teaming. Code for the benchmark can be accessed at https://github.com/x-zheng16/RedRFT.git.
中文:RedRFT基准旨在标准化基于强化微调的红队测试方法,通过模块化设计和关键组件消融研究解决了大语言模型安全评估中的实现差异问题。
English: The RedRFT benchmark is introduced to standardize Reinforcement Fine-Tuning (RFT)-based red teaming for Large Language Models, addressing implementation inconsistencies and enabling modular experimentation with comprehensive component analysis.
Authors:Kunal Pai, Parth Shah, Harshil Patel
Abstract:
Rapid Large Language Model (LLM) advancements are fueling autonomous Multi-Agent System (MAS) development. However, current frameworks often lack flexibility, resource awareness, model diversity, and autonomous tool creation. This paper introduces HASHIRU (Hierarchical Agent System for Hybrid Intelligent Resource Utilization), a novel MAS framework enhancing flexibility, resource efficiency, and adaptability. HASHIRU features a "CEO" agent dynamically managing specialized "employee" agents, instantiated based on task needs and resource constraints (cost, memory). Its hybrid intelligence prioritizes smaller, local LLMs (via Ollama) while flexibly using external APIs and larger models when necessary. An economic model with hiring/firing costs promotes team stability and efficient resource allocation. The system also includes autonomous API tool creation and a memory function. Evaluations on tasks like academic paper review (58% success), safety assessments (100% on a JailbreakBench subset), and complex reasoning (outperforming Gemini 2.0 Flash on GSM8K: 96% vs. 61%; JEEBench: 80% vs. 68.3%; SVAMP: 92% vs. 84%) demonstrate HASHIRU's capabilities. Case studies illustrate its self-improvement via autonomous cost model generation, tool integration, and budget management. HASHIRU offers a promising approach for more robust, efficient, and adaptable MAS through dynamic hierarchical control, resource-aware hybrid intelligence, and autonomous functional extension. Source code and benchmarks are available at https://github.com/HASHIRU-AI/HASHIRU and https://github.com/HASHIRU-AI/HASHIRUBench respectively, and a live demo is available at https://hashiruagentx-hashiruai.hf.space upon request.
中文: 本文提出HASHIRU分层多智能体系统,通过动态管理专业代理、混合智能和自主工具创建来提升灵活性与资源效率,在推理和安全任务中展现出卓越性能。
English: This paper introduces HASHIRU, a hierarchical multi-agent system that enhances flexibility and resource efficiency by dynamically managing specialized agents with hybrid intelligence and autonomous tool creation, demonstrating strong performance across reasoning and safety tasks.
Authors:Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang
Abstract:
Unlocking spatial reasoning in Large Multimodal Models (LMMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can LMMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source LMMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source LMM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in LMMs, without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.
Chinese: Struct2D框架通过结构化二维输入使大型多模态模型展现出强大的空间推理能力,无需显式三维表征即可在多种3D任务中取得优异表现。
English: The Struct2D framework enables large multimodal models to perform strong spatial reasoning using only structured 2D inputs, achieving competitive results on 3D tasks without requiring explicit 3D representations.
Authors:Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Abstract:
Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV's likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations ($R^2=0.8$) than the best existing open-loop approach ($R^2=0.7$). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at https://github.com/autonomousvision/navsim.
中文: 本文提出伪仿真这一自动驾驶汽车评估新范式,通过3D高斯溅射技术融合真实数据与合成观测,无需完整仿真即可评估错误恢复能力,且与闭环基准的相关性(R²=0.8)优于现有方法。
English: This paper introduces pseudo-simulation, a novel evaluation paradigm for autonomous vehicles that combines real-world data efficiency with synthetic observations generated via 3D Gaussian Splatting, enabling error recovery assessment without full simulation while achieving higher correlation (R²=0.8) with closed-loop benchmarks than existing methods.
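The "proximity-based weighting scheme" can be illustrated with a short, hedged sketch: synthetic observations whose perturbed state (position, heading, speed) lies closest to the state the AV is predicted to reach receive the largest weight. The exponential kernel and state distance below are assumptions; the paper's exact weighting may differ.

```python
import numpy as np

def proximity_weights(synthetic_states, predicted_state, tau=1.0):
    """Weight each synthetic observation by its proximity to the AV's
    likely state, so evaluation emphasizes the futures the AV would
    actually encounter (a hedged stand-in for the paper's scheme)."""
    d = np.linalg.norm(synthetic_states - predicted_state, axis=1)
    w = np.exp(-d / tau)              # soft, monotone in proximity
    return w / w.sum()                # normalize so scores are comparable

# synthetic future states as (x, y, heading, speed) perturbations
states = np.array([[1.0, 0.0, 0.1, 5.0],
                   [0.9, 0.1, 0.0, 5.2],
                   [3.0, 2.0, 1.0, 2.0]])
print(proximity_weights(states, np.array([1.0, 0.0, 0.0, 5.0])).round(3))
```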
Authors:Junting Chen, Haotian Liang, Lingxiao Du, Weiyun Wang, Mengkang Hu, Yao Mu, Wenhai Wang, Jifeng Dai, Ping Luo, Wenqi Shao, Lin Shao
Abstract:
The rapid progress of navigation, manipulation, and vision models has made mobile manipulators capable of many specialized tasks. However, the open-world mobile manipulation (OWMM) task remains a challenge due to the need for generalization to open-ended instructions and environments, as well as the systemic complexity of integrating high-level decision making with low-level robot control based on both global scene understanding and the current agent state. To address this complexity, we propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. A second challenge is hallucination arising from domain shift. To enhance agent performance, we further introduce an agentic data synthesis pipeline for the OWMM task to adapt the VLM model to our task domain with instruction fine-tuning. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model. Through experiments, we demonstrate that our model achieves SOTA performance compared to other foundation models including GPT-4o and strong zero-shot generalization in the real world. The project page is at https://github.com/HHYHRHY/OWMM-Agent
Chinese Summary: 本研究提出了一种新型多模态智能体架构和智能数据合成流程,以解决开放世界移动操作任务的挑战,在真实世界实验中实现了最先进的性能和强大的泛化能力。
English Summary: The study introduces a novel multi-modal agent architecture and an agentic data synthesis pipeline to address the challenges of open-world mobile manipulation, achieving state-of-the-art performance and strong generalization in real-world experiments.
Authors:Boyong He, Yuxiang Ji, Zhuoyue Tan, Liaoni Wu
Abstract:
Object detectors often suffer a decrease in performance due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown remarkable abilities in generating high-quality and diverse images, suggesting their potential for extracting valuable features from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper, we train a detector with a frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which are used to guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). By employing this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising the inference speed. Our method achieves an average mAP improvement of 21.2% compared to the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic), surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even in more powerful and complex models, highlighting the broadly applicable and effective domain adaptation capability of our DDT. The code is available at https://github.com/heboyong/Diffusion-Domain-Teacher.
中文: 本文提出扩散域教师(DDT)方法,通过冻结权重的扩散模型作为教师模型生成目标域伪标签,在不影响推理速度的情况下显著提升了跨域目标检测性能。
English: This paper introduces Diffusion Domain Teacher (DDT), a method that uses a frozen-weight diffusion model as a teacher to generate pseudo labels for unlabeled target domains, significantly enhancing cross-domain object detection performance without affecting inference speed.
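A hedged sketch of the teacher-student loop described above: the frozen teacher labels unlabeled target-domain images, confident detections become pseudo targets, and the student is trained on them. The torchvision-style detector interface (list-of-dicts outputs in eval mode, loss dict in train mode) and the confidence threshold are assumptions.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, images, score_thresh=0.5):
    # assumed torchvision-style detector: eval mode returns, per image,
    # a dict with "boxes", "labels", and "scores"
    teacher.eval()
    pseudo = []
    for out in teacher(images):
        keep = out["scores"] >= score_thresh
        pseudo.append({"boxes": out["boxes"][keep], "labels": out["labels"][keep]})
    return pseudo

def student_step(student, optimizer, images, pseudo_targets):
    # train mode returns a dict of detection losses for the given targets
    losses = student(images, pseudo_targets)
    loss = sum(losses.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# per batch of unlabeled target-domain images:
# targets = generate_pseudo_labels(teacher, images)
# student_step(student, optimizer, images, targets)
```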
Authors:Yanting Wang, Wei Zou, Runpeng Geng, Jinyuan Jia
Abstract:
Long context large language models (LLMs) are deployed in many real-world applications such as RAG, agents, and broader LLM-integrated applications. Given an instruction and a long context (e.g., documents, PDF files, webpages), a long context LLM can generate an output grounded in the provided context, aiming to provide more accurate, up-to-date, and verifiable outputs while reducing hallucinations and unsupported claims. This raises a research question: how to pinpoint the texts (e.g., sentences, passages, or paragraphs) in the context that contribute most to or are responsible for the generated output by an LLM? This process, which we call context traceback, has various real-world applications, such as 1) debugging LLM-based systems, 2) conducting post-attack forensic analysis of attacks (e.g., prompt injection attacks, knowledge corruption attacks) on an LLM, and 3) highlighting knowledge sources to enhance the trust of users towards outputs generated by LLMs. When applied to context traceback for long context LLMs, existing feature attribution methods such as Shapley have sub-optimal performance and/or incur a large computational cost. In this work, we develop TracLLM, the first generic context traceback framework tailored to long context LLMs. Our framework can improve the effectiveness and efficiency of existing feature attribution methods. To improve the efficiency, we develop an informed search based algorithm in TracLLM. We also develop contribution score ensemble/denoising techniques to improve the accuracy of TracLLM. Our evaluation results show TracLLM can effectively identify texts in a long context that lead to the output of an LLM. Our code and data are at: https://github.com/Wang-Yanting/TracLLM.
中文: 长上下文大语言模型广泛应用于RAG和智能体等场景,能基于文档生成准确输出,但现有方法在追溯输出来源的上下文回溯中效率低下。本文提出TracLLM框架,通过智能搜索算法和贡献值优化技术,显著提升了回溯的准确性和效率,实验验证了其有效性。
English: Long context LLMs are widely used in applications like RAG and agents to generate accurate outputs from extensive documents, but identifying the specific text segments responsible for these outputs—a process called context traceback—remains challenging due to inefficiencies in existing methods. This paper introduces TracLLM, a novel framework that enhances the effectiveness and efficiency of context traceback through informed search algorithms and contribution score techniques, as validated by evaluation results.
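The "informed search based algorithm" can be evoked with a small divide-and-conquer sketch: rather than scoring every sentence of a very long context, recursively split the segments and descend only into the half whose removal changes the output more. The `contribution` callable (e.g., the drop in the output's log-probability when a subset is removed) and the single-branch descent are simplifying assumptions; TracLLM's actual search and its ensemble/denoising steps are richer.

```python
def trace_context(segments, contribution, budget=8):
    # segments: list of context chunks (e.g., sentences or passages)
    # contribution(subset) -> float, e.g. drop in output log-prob when removed
    frontier = [segments]
    leaves = []
    while frontier and budget > 0:
        group = frontier.pop()
        if len(group) == 1:            # narrowed down to a single segment
            leaves.append(group[0])
            continue
        mid = len(group) // 2
        left, right = group[:mid], group[mid:]
        budget -= 2                    # two scoring calls instead of len(group)
        # descend only into the more influential half, pruning the other early
        frontier.append(left if contribution(left) >= contribution(right) else right)
    return leaves

# toy usage: the "influential" text is whichever subset contains segment 5
segs = list(range(8))
print(trace_context(segs, contribution=lambda sub: float(5 in sub)))
```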
Authors:Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, Limin Liu
Abstract:
Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework for Reasoning-Search integration, designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction, and learn optimal reasoning-search interaction trajectories via multi-reward signals, improving response quality in complex logic- and knowledge-intensive tasks. R-Search guides the LLM to dynamically decide when to retrieve or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R-Search provides multi-stage, multi-type rewards to jointly optimize the reasoning-search trajectory. Experiments on seven datasets show that R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). The code and data are available at https://github.com/QingFei1/R-Search.
中文: R-Search是一种强化学习框架,通过多奖励信号优化让大语言模型自主执行多步推理与深度搜索交互,在复杂任务中相比现有方法性能提升最高达32.2%。
English: R-Search is a reinforcement learning framework that enables large language models to autonomously integrate multi-step reasoning with deep search interactions through multi-reward optimization, significantly improving performance on complex tasks by up to 32.2% compared to existing methods.
Authors:Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Zhiwei Fei, Hui Su, Xiaoyu Shen
Abstract:
Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs due to their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies, but conventional static pruning methods overlook two critical dynamics inherent to LLM inference: (1) horizontal dynamics, where token-level heterogeneity demands context-aware pruning decisions, and (2) vertical dynamics, where the distinct functional roles of MLP and self-attention layers necessitate component-specific pruning policies. We introduce SkipGPT, a dynamic layer pruning framework designed to optimize computational resource allocation through two core innovations: (1) global token-aware routing to prioritize critical tokens, and (2) decoupled pruning policies for MLP and self-attention components. To mitigate training instability, we propose a two-stage optimization paradigm: first, a disentangled training phase that learns routing strategies via soft parameterization to avoid premature pruning decisions, followed by parameter-efficient LoRA fine-tuning to restore performance impacted by layer removal. Extensive experiments demonstrate that SkipGPT reduces over 40% of model parameters while matching or exceeding the performance of the original dense model across benchmarks. By harmonizing dynamic efficiency with preserved expressivity, SkipGPT advances the practical deployment of scalable, resource-aware LLMs. Our code is publicly available at: https://github.com/EIT-NLP/SkipGPT.
Chinese: SkipGPT提出了一种动态层剪枝框架,通过全局令牌感知路由和解耦剪枝策略,在减少超过40%参数的同时保持或超越原始模型的性能表现。
English: SkipGPT introduces a dynamic layer pruning framework that reduces computational costs by over 40% while maintaining or improving performance through token-aware routing and component-specific policies.
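A minimal sketch of token-aware routing for a per-token sub-layer such as the MLP: a tiny linear router scores each token, only the top-scoring tokens pass through the sub-layer, and the rest ride the residual stream unchanged. SkipGPT's decoupled attention/MLP policies and two-stage training are omitted; names and the keep-ratio heuristic are assumptions.

```python
import torch
import torch.nn as nn

class TokenRouter(nn.Module):
    """Scores tokens so that only the most important ones pay for a sub-layer."""
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, x, sublayer, keep_ratio=0.5):
        # x: (batch, seq, d_model); sublayer: a per-token module, e.g. an MLP
        s = self.score(x).squeeze(-1)          # (batch, seq) importance scores
        k = max(1, int(keep_ratio * x.size(1)))
        idx = s.topk(k, dim=1).indices         # tokens routed through the sub-layer
        out = x.clone()                        # skipped tokens keep the residual
        for b in range(x.size(0)):
            out[b, idx[b]] = x[b, idx[b]] + sublayer(x[b, idx[b]])
        return out

# toy usage with a 2-layer MLP as the prunable sub-layer
mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
print(TokenRouter(16)(torch.randn(2, 10, 16), mlp).shape)
```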
Authors:Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, Sergey Levine
Abstract:
In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000x larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL. Code: https://github.com/seohongpark/horizon-reduction
中文: 本研究发现在强化学习中仅增加数据无法提升离线学习性能,任务周期过长是主因,但通过引入如SHARSA等周期缩减技术可有效突破扩展性瓶颈。
English: This study finds that scaling data alone fails to improve offline reinforcement learning performance due to long task horizons, but introducing horizon reduction techniques like the proposed SHARSA method effectively unlocks scalability.
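One standard way to reduce the effective horizon, shown below as a hedged illustration rather than SHARSA itself, is n-step bootstrapped targets: each value target looks n steps ahead before bootstrapping, so value information propagates roughly n times faster across long trajectories.

```python
import numpy as np

def n_step_targets(rewards, values, gamma=0.99, n=5):
    """n-step bootstrapped value targets: sum n discounted rewards, then
    bootstrap from the value estimate n steps ahead, shrinking the
    effective horizon by a factor of n (illustrative, not SHARSA's
    exact construction)."""
    T = len(rewards)
    targets = np.empty(T)
    for t in range(T):
        end = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if end < T:                      # bootstrap from the value n steps on
            g += gamma ** (end - t) * values[end]
        targets[t] = g
    return targets

print(n_step_targets(np.ones(8), np.zeros(8), gamma=0.9, n=3).round(2))
```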
Authors:Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang
Abstract:
While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To mitigate this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Codes are available at https://github.com/YujiaHu1109/IEAP.
中文: 本研究提出IEAP框架,通过将复杂编辑指令分解为原子操作并共享扩散Transformer骨干网络,有效解决了结构不一致的图像编辑难题,在各项评估中显著优于现有方法。
English: This research introduces the Image Editing As Programs (IEAP) framework, which decomposes complex editing instructions into atomic operations using a shared Diffusion Transformer backbone to effectively handle structurally inconsistent image edits and outperform existing methods.
Authors:Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
Abstract:
The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient ($\rho$) exceeding 0.95. This high correlation indicates that our method closely reveals true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: https://github.com/GaryStack/Trustworthy-Evaluation
中文摘要:本研究通过比较和因果分析识别捷径神经元,提出了一种捷径神经元修补方法,有效缓解大语言模型评估中的数据污染问题,并与可信基准显示出高度相关性。
English Summary: This study addresses data contamination in large language model evaluations by identifying shortcut neurons through comparative and causal analysis, proposing a shortcut neuron patching method that effectively mitigates contamination and demonstrates strong correlation with trustworthy benchmarks.
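Mechanically, "patching" a neuron for evaluation can be as simple as a forward hook that suppresses its activation, as in the hedged sketch below; identifying which neurons are shortcuts (the comparative and causal analysis above) is the substantive step and is assumed already done here.

```python
import torch

def patch_neurons(layer, neuron_idx):
    """Register a forward hook that zeroes the outputs of suspected
    shortcut neurons, so memorized benchmark answers can no longer flow
    through them during evaluation (a sketch of the mechanism, not the
    paper's full pipeline)."""
    def hook(_module, _inp, out):
        out[..., neuron_idx] = 0.0       # suppress the shortcut neurons
        return out
    return layer.register_forward_hook(hook)

# usage sketch with hypothetical names: patch, evaluate, then restore
# handle = patch_neurons(model.transformer.h[10].mlp, [17, 403])
# score = evaluate(model, benchmark)   # hypothetical evaluation call
# handle.remove()                      # the model is unchanged afterwards
```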
Authors:Pei Yang, Hai Ci, Mike Zheng Shou
Abstract:
Graphical User Interface (GUI) agents show promising capabilities for automating computer-use tasks and facilitating accessibility, but existing interactive benchmarks are mostly English-only, covering web-use or Windows, Linux, and Android environments, but not macOS. macOS is a major OS with distinctive GUI patterns and exclusive applications. To bridge the gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). As GUI agents are shown to be vulnerable to deception attacks, macOSWorld also includes a dedicated safety benchmarking subset. Our evaluation on six GUI agents reveals a dramatic gap: proprietary computer-use agents lead at above 30% success rate, while open-source lightweight research models lag at below 5%, highlighting the need for macOS domain adaptation. Multilingual benchmarks also expose common weaknesses, especially in Arabic, with a 28.8% average degradation compared to English. Results from safety benchmarking also highlight that deception attacks are more general and demand immediate attention. macOSWorld is available at https://github.com/showlab/macosworld.
中文摘要:macOSWorld是首个针对macOS图形用户界面代理的综合性基准测试,包含跨30个应用程序的202项多语言交互任务及安全评估,揭示了专有与开源代理间的显著性能差距,并凸显了多语言处理挑战和安全漏洞问题。
English Summary: macOSWorld is the first comprehensive benchmark for evaluating GUI agents on macOS, featuring 202 multilingual interactive tasks across 30 applications with safety benchmarking, revealing significant performance gaps between proprietary and open-source agents and highlighting multilingual challenges and security vulnerabilities.
Authors:Disha Sheshanarayana, Tanishka Magar, Ayushi Mittal, Neelam Chaplot
Abstract:
Courtrooms are places where lives are determined and fates are sealed, yet they are not impervious to manipulation. Strategic use of manipulation in legal jargon can sway the opinions of judges and affect the decisions. Despite the growing advancements in NLP, its application in detecting and analyzing manipulation within the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. Furthermore, we propose CLAIM, a two-stage, Intent-driven Multi-agent framework designed to enhance manipulation analysis by enabling context-aware and informed decision-making. Our results highlight the potential of incorporating agentic frameworks to improve fairness and transparency in judicial processes. We hope that this contributes to the broader application of NLP in legal discourse analysis and the development of robust tools to support fairness in legal decision-making. Our code and data are available at https://github.com/Disha1001/CLAIM.
中文摘要:本研究提出了LegalCon数据集用于检测法庭对话中的操纵行为,并开发了CLAIM多智能体框架,通过先进自然语言处理技术提升司法过程的公平性与透明度。
English Summary: This research introduces LegalCon, a dataset for detecting manipulation in courtroom conversations, and proposes CLAIM, a multi-agent framework to enhance legal fairness and transparency through advanced NLP analysis.
Authors:Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis
Abstract:
We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $π_S(y\mid x)$. We provably approximate both the optimal tilted policy $π_{β,B}(y\mid x) \propto π_B(y\mid x)\exp(β\,r(x,y))$ of soft best-of-$n$ under the base model $π_B$, as well as the expected reward under the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K), our method achieves higher accuracy than standard soft best-of-$n$ with $π_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $π_B$. The code is available at https://github.com/j-geuter/GSI .
Chinese: 引导推测推理(GSI)是一种新颖算法,通过结合软性最优-n缩放、奖励模型和推测样本,有效提升大型语言模型的奖励引导解码效率,在推理基准测试中比现有方法获得了更高准确率。
English: Guided Speculative Inference (GSI) is a novel algorithm that enhances reward-guided decoding in large language models by combining soft best-of-n scaling with a reward model and speculative samples, achieving higher accuracy on reasoning benchmarks than existing methods.
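The soft best-of-n core is easy to state concretely: draw n speculative candidates from the small model, then sample one with probability proportional to exp(β · reward). The sketch below shows only this tilting step; GSI's fallback to the base model when speculative samples score poorly is omitted, and the reward function is a stand-in.

```python
import math
import random

def soft_best_of_n(candidates, reward_fn, beta=2.0):
    """Sample one candidate with probability proportional to
    exp(beta * reward), i.e. the soft best-of-n tilt (only the core
    step of GSI; the base-model check is omitted here)."""
    rewards = [reward_fn(y) for y in candidates]
    m = max(rewards)                                   # stabilize the softmax
    weights = [math.exp(beta * (r - m)) for r in rewards]
    return random.choices(candidates, weights=weights, k=1)[0]

# toy usage with a hypothetical reward function
cands = ["answer A", "answer B", "answer C"]
print(soft_best_of_n(cands, reward_fn=lambda y: len(y), beta=1.0))
```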
Authors:Robin Bruneau, Baptiste Brument, Yvain Quéau, Jean Mélou, François Bernard Lauze, Jean-Denis Durou, Lilian Calvet
Abstract:
Achieving high-fidelity 3D surface reconstruction while preserving fine details remains challenging, especially in the presence of materials with complex reflectance properties and without a dense-view setup. In this paper, we introduce a versatile framework that incorporates multi-view normal and optionally reflectance maps into radiance-based surface reconstruction. Our approach employs a pixel-wise joint re-parametrization of reflectance and surface normals, representing them as a vector of radiances under simulated, varying illumination. This formulation enables seamless incorporation into standard surface reconstruction pipelines, such as traditional multi-view stereo (MVS) frameworks or modern neural volume rendering (NVR) ones. Combined with the latter, our approach achieves state-of-the-art performance on multi-view photometric stereo (MVPS) benchmark datasets, including DiLiGenT-MV, LUCES-MV and Skoltech3D. In particular, our method excels in reconstructing fine-grained details and handling challenging visibility conditions. The present paper is an extended version of the earlier conference paper by Brument et al. (in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024), featuring an accelerated and more robust algorithm as well as a broader empirical evaluation. The code and data related to this article are available at https://github.com/RobinBruneau/RNb-NeuS2.
中文: 本文提出了一种通用框架,将多视角法线和反射率图融入基于辐射度的表面重建,在复杂可见性条件下实现了捕捉细微结构的最先进性能。
English: This paper presents a versatile framework that integrates multi-view normal and reflectance maps into radiance-based surface reconstruction, achieving state-of-the-art performance in capturing fine details under challenging visibility conditions.
Authors:Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin
Abstract:
We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning. Our code is available at https://github.com/Lww007/Text-Atari-Agents.
中文: TextAtari将经典Atari游戏转化为文本环境,构建了评估语言智能体在长跨度决策任务中表现的基准,揭示了AI模型与人类在序列推理和战略规划方面存在的显著差距。
English: TextAtari is a benchmark that converts Atari games into text-based environments to evaluate language agents on long-horizon decision-making tasks, revealing significant performance gaps between AI models and humans in sequential reasoning and strategic planning.
Authors:Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov
Abstract:
As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.
中文摘要:AmbiK数据集被提出作为一个通用基准,旨在解决在具身智能体中比较大型语言模型模糊指令检测方法的难题,该数据集包含1000对经过人工验证的厨房环境模糊与非模糊任务。
English Summary: The AmbiK dataset is introduced as a universal benchmark to address the challenge of comparing ambiguity detection methods for LLMs in embodied agents, featuring 1000 pairs of ambiguous and unambiguous kitchen tasks with human-validated annotations.
Authors:Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract:
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released in https://github.com/llmeval/LLMEval-Med.
中文: 本文提出LLMEval-Med基准,基于真实电子病历构建了覆盖五大医疗领域的评估体系,通过自动化专家核对框架对13种大语言模型进行测试,为医疗领域安全应用提供关键洞见。
English: This paper introduces LLMEval-Med, a comprehensive medical benchmark developed from real clinical data to address limitations in existing evaluations by incorporating automated scoring with expert-validated checklists, assessing 13 LLMs across five medical domains for safer deployment.
Authors:Yi Zhao, Siqi Wang, Jing Li
Abstract:
Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study, hence, focuses on producing precise, in-situ, step-by-step navigation instructions that are practically usable by VI users. Concretely, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to generate rewards guiding the Vision-Language Model (VLM) post-training. This enhances instruction usability while reducing costly real-world data needs. To facilitate training and testing, we introduce NIG4VI, a 27k-sample open-sourced benchmark. It provides diverse navigation scenarios with accurate spatial coordinates, supporting detailed, open-ended in-situ instruction generation. Experiments on NIG4VI show the effectiveness of LaF-GRPO by quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU +14%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o's 0.323), and the method yields more intuitive, safer instructions. Code and benchmark are available at https://github.com/YiyiyiZhao/NIG4VI.
中文: 本研究提出LaF-GRPO方法,通过大语言模型模拟视障用户反馈来优化视觉语言模型,生成精确的实时导航指引,并在NIG4VI基准测试中验证了该方法在实用性和安全性指标上的显著提升。
English: This study introduces LaF-GRPO, a method using LLM-simulated visually impaired user feedback to enhance vision-language models for generating precise, in-situ navigation instructions, validated by the new NIG4VI benchmark showing significant improvements in usability and safety metrics.
Authors:Paul Fuchs, Weilong Chen, Stephan Thaler, Julija Zavadlav
Abstract:
Machine learning potentials (MLPs) have advanced rapidly and show great promise to transform molecular dynamics (MD) simulations. However, most existing software tools are tied to specific MLP architectures, lack integration with standard MD packages, or are not parallelizable across GPUs. To address these challenges, we present chemtrain-deploy, a framework that enables model-agnostic deployment of MLPs in LAMMPS. chemtrain-deploy supports any JAX-defined semi-local potential, allowing users to exploit the functionality of LAMMPS and perform large-scale MLP-based MD simulations on multiple GPUs. It achieves state-of-the-art efficiency and scales to systems containing millions of atoms. We validate its performance and scalability using graph neural network architectures, including MACE, Allegro, and PaiNN, applied to a variety of systems, such as liquid-vapor interfaces, crystalline materials, and solvated peptides. Our results highlight the practical utility of chemtrain-deploy for real-world, high-performance simulations and provide guidance for MLP architecture selection and future design.
Chinese: chemtrain-deploy框架实现了机器学习势在LAMMPS中的模型无关部署,支持任何JAX定义的半局域势,可在多GPU上实现高效的大规模分子动力学模拟,具备最先进的可扩展性。
English: The chemtrain-deploy framework enables model-agnostic deployment of machine learning potentials in LAMMPS, supporting any JAX-defined semi-local potential for efficient large-scale molecular dynamics simulations across multiple GPUs with state-of-the-art scalability.
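To indicate what "any JAX-defined semi-local potential" means in practice, here is a minimal differentiable energy function (a cutoff Lennard-Jones, chosen for brevity rather than taken from the library) whose forces an MD engine can obtain by autodiff; neighbor lists and periodic boundaries are omitted.

```python
import jax
import jax.numpy as jnp

def lj_energy(positions, sigma=1.0, epsilon=1.0, cutoff=2.5):
    """Plain Lennard-Jones energy with a finite cutoff: a minimal example
    of the kind of JAX-defined, finite-range potential such a framework
    could deploy (not code from chemtrain-deploy itself)."""
    diff = positions[:, None, :] - positions[None, :, :]
    r2 = (diff ** 2).sum(-1)
    n = positions.shape[0]
    mask = (~jnp.eye(n, dtype=bool)) & (r2 < cutoff ** 2)
    # guard the masked-out entries so the gradient stays finite
    inv6 = jnp.where(mask, (sigma ** 2 / jnp.where(mask, r2, 1.0)) ** 3, 0.0)
    return 2.0 * epsilon * jnp.sum(inv6 ** 2 - inv6)   # 4*eps*sum over pairs / 2

# forces come for free via autodiff, which is what an MD engine consumes
forces_fn = jax.grad(lambda pos: -lj_energy(pos))
print(forces_fn(jnp.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])))
```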
Authors:Dan Oneata, Leanne Nortje, Yevgen Matusevych, Herman Kamper
Abstract:
Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explore this pattern computationally using bilingual VGS models trained on combinations of English, French, and Dutch. We find that bilingual models generally exhibit a weaker ME bias than monolingual models, though exceptions exist. Analyses show that the combined visual embeddings of bilingual models have a smaller variance for familiar data, partly explaining the increase in confusion between novel and familiar concepts. We also provide new insights into why the ME bias exists in VGS models in the first place. Code and data: https://github.com/danoneata/me-vgs
中文: 双语视觉语音模型比单语模型表现出更弱的互斥性偏好,部分原因是熟悉数据的视觉嵌入方差减小,导致新概念与熟悉概念之间的混淆增加。
English: Bilingual visually grounded speech models exhibit a weaker mutual exclusivity bias than monolingual ones, partly due to reduced variance in visual embeddings for familiar data, which increases confusion between novel and familiar concepts.
Authors:An Quang Tang, Xiuzhen Zhang, Minh Ngoc Dinh, Zhuang Li
Abstract:
Review-based Product Question Answering (PQA) allows e-commerce platforms to automatically address customer queries by leveraging insights from user reviews. However, existing PQA systems generate answers with only a single perspective, failing to capture the diversity of customer opinions. In this paper we introduce a novel task Quantitative Query-Focused Summarization (QQSUM), which aims to summarize diverse customer opinions into representative Key Points (KPs) and quantify their prevalence to effectively answer user queries. While Retrieval-Augmented Generation (RAG) shows promise for PQA, its generated answers still fall short of capturing the full diversity of viewpoints. To tackle this challenge, our model QQSUM-RAG, which extends RAG, employs few-shot learning to jointly train a KP-oriented retriever and a KP summary generator, enabling KP-based summaries that capture diverse and representative opinions. Experimental results demonstrate that QQSUM-RAG achieves superior performance compared to state-of-the-art RAG baselines in both textual quality and quantification accuracy of opinions. Our source code is available at: https://github.com/antangrocket1312/QQSUMM
中文:本文提出了QQSUM这一新任务,通过将多样化的用户意见总结为代表性关键点并量化其普遍性,以改进基于评论的产品问答系统,所提出的QQSUM-RAG模型在文本质量和意见量化准确性方面均优于现有方法。
English: This paper introduces QQSUM, a novel task that enhances Review-based Product Question Answering by summarizing diverse customer opinions into representative Key Points and quantifying their prevalence, with the proposed QQSUM-RAG model outperforming existing methods in both textual quality and quantification accuracy.
Authors:Tiehua Mei, Hengrui Chen, Peng Yu, Jiaqing Liang, Deqing Yang
Abstract:
Although large language models (LLMs) have shown great potential in recommender systems, the prohibitive computational costs for fine-tuning LLMs on entire datasets hinder their successful deployment in real-world scenarios. To develop affordable and effective LLM-based recommender systems, we focus on the task of coreset selection which identifies a small subset of fine-tuning data to optimize the test loss, thereby facilitating efficient LLMs' fine-tuning. Although there exist some intuitive solutions for subset selection, including distribution-based and importance-based approaches, they often lead to suboptimal performance due to the misalignment with downstream fine-tuning objectives or weak generalization ability caused by individual-level sample selection. To overcome these challenges, we propose GORACS, which is a novel Group-level Optimal tRAnsport-guided Coreset Selection framework for LLM-based recommender systems. GORACS is designed based on two key principles for coreset selection: 1) selecting the subsets that minimize the test loss to align with fine-tuning objectives, and 2) enhancing model generalization through group-level data selection. Corresponding to these two principles, GORACS has two key components: 1) a Proxy Optimization Objective (POO) leveraging optimal transport and gradient information to bound the intractable test loss, thus reducing computational costs by avoiding repeated LLM retraining, and 2) a two-stage Initialization-Then-Refinement Algorithm (ITRA) for efficient group-level selection. Our extensive experiments across diverse recommendation datasets and tasks validate that GORACS significantly reduces fine-tuning costs of LLMs while achieving superior performance over the state-of-the-art baselines and full data training. The source code of GORACS is available at https://github.com/Mithas-114/GORACS.
中文: 为解决大型语言模型在推荐系统中微调成本高昂的问题,本研究提出了GORACS框架,通过基于最优传输的群组级核心集选择来优化测试损失并增强泛化能力,从而以更低成本实现卓越性能。
English: To address the high computational costs of fine-tuning large language models (LLMs) for recommender systems, the study introduces GORACS, a group-level optimal transport-based coreset selection framework that efficiently minimizes test loss and enhances generalization, achieving superior performance with reduced resources.
Authors:Maxime Zanella, Clément Fuchs, Ismail Ben Ayed, Christophe De Vleeschouwer
Abstract:
Recent advances in few-shot adaptation for Vision-Language Models (VLMs) have greatly expanded their ability to generalize across tasks using only a few labeled examples. However, existing approaches primarily build upon the strong zero-shot priors of these models by leveraging carefully designed, task-specific prompts. This dependence on predefined class names can restrict their applicability, especially in scenarios where exact class names are unavailable or difficult to specify. To address this limitation, we introduce vocabulary-free few-shot learning for VLMs, a setting where target class instances - that is, images - are available but their corresponding names are not. We propose Similarity Mapping (SiM), a simple yet effective baseline that classifies target instances solely based on similarity scores with a set of generic prompts (textual or visual), eliminating the need for carefully handcrafted prompts. Although conceptually straightforward, SiM demonstrates strong performance, operates with high computational efficiency (learning the mapping typically takes less than one second), and provides interpretability by linking target classes to generic prompts. We believe that our approach could serve as an important baseline for future research in vocabulary-free few-shot learning. Code is available at https://github.com/MaxZanella/vocabulary-free-FSL.
中文: 本文提出针对视觉语言模型的词汇表无关小样本学习方法,通过相似性映射(SiM)这一高效基线方案,仅使用通用提示对图像进行分类而无需预定义类别名称,在保持优异性能的同时具备高计算效率。
English: This paper introduces vocabulary-free few-shot learning for Vision-Language Models, proposing Similarity Mapping (SiM) as an efficient baseline that classifies images using generic prompts without requiring predefined class names, achieving strong performance and high computational efficiency.
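Since SiM classifies purely from similarity scores to generic prompts, a tiny linear mapping suffices to convey the idea: fit a ridge-regularized map from the few-shot similarity vectors to one-hot classes, then classify queries with it. Whether SiM uses exactly this solver is an assumption; the point is that the mapping is tiny, interpretable (each class links to prompts via its column of W), and fits in well under a second.

```python
import numpy as np

def learn_similarity_mapping(S_support, y_support, n_classes, lam=1e-3):
    """Fit a linear map from similarity-score vectors (image vs. generic
    prompts) to unnamed target classes via ridge-regularized least squares
    (a hedged stand-in for SiM's mapping step)."""
    Y = np.eye(n_classes)[y_support]                     # one-hot targets
    A = S_support.T @ S_support + lam * np.eye(S_support.shape[1])
    W = np.linalg.solve(A, S_support.T @ Y)              # (n_prompts, n_classes)
    return W

def predict(S_query, W):
    return (S_query @ W).argmax(axis=1)

# toy run: 4 labeled shots, 3 generic prompts, 2 nameless classes
rng = np.random.default_rng(1)
S = rng.random((4, 3))
y = np.array([0, 0, 1, 1])
print(predict(S, learn_similarity_mapping(S, y, n_classes=2)))
```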
Authors:Alex Laitenberger, Christopher D. Manning, Nelson F. Liu
Abstract:
With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single pass, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, pairing it with emerging embedding and language models to assess trade-offs between complexity and effectiveness as model capabilities evolve.
中文:尽管长上下文语言模型兴起,但简单的DOS RAG方法在问答任务中始终与复杂多阶段流程持平或更优,建议将其作为未来检索增强生成评估的强基准。
English: Despite the emergence of long-context language models, the simple DOS RAG method consistently matches or surpasses complex multi-stage pipelines in QA tasks, establishing it as a strong baseline for future evaluations.
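The DOS RAG recipe, as the abstract describes it, fits in a few lines: retrieve the top-k passages by relevance, then restore the document's original order before concatenation so the model reads the evidence in its source sequence. The (position, text) passage representation below is an assumption.

```python
def dos_rag_context(passages, scores, k=8):
    """Retrieve-then-read while preserving the document's original
    structure: keep the k highest-scoring passages, but concatenate
    them in their original document order (a sketch of the recipe the
    abstract describes). `passages` is a list of (position, text) pairs."""
    ranked = sorted(zip(scores, passages), key=lambda t: t[0], reverse=True)[:k]
    kept = sorted((p for _, p in ranked), key=lambda p: p[0])   # original order
    return "\n\n".join(text for _, text in kept)

# toy usage: the two most relevant passages, read in document order
passages = [(0, "Intro ..."), (1, "Method ..."), (2, "Results ...")]
print(dos_rag_context(passages, scores=[0.2, 0.9, 0.7], k=2))
```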
Authors:Hicham Eddoubi, Jonas Ricker, Federico Cocchi, Lorenzo Baraldi, Angelo Sotgiu, Maura Pintor, Marcella Cornia, Lorenzo Baraldi, Asja Fischer, Rita Cucchiara, Battista Biggio
Abstract:
AI-generated images have reached a quality level at which humans are incapable of reliably distinguishing them from real images. To counteract the inherent risk of fraud and disinformation, the detection of AI-generated images is a pressing challenge and an active research topic. While many of the presented methods claim to achieve high detection accuracy, they are usually evaluated under idealized conditions. In particular, the adversarial robustness is often neglected, potentially due to a lack of awareness or the substantial effort required to conduct a comprehensive robustness analysis. In this work, we tackle this problem by providing a simpler means to assess the robustness of AI-generated image detectors. We present RAID (Robust evaluation of AI-generated image Detectors), a dataset of 72k diverse and highly transferable adversarial examples. The dataset is created by running attacks against an ensemble of seven state-of-the-art detectors and images generated by four different text-to-image models. Extensive experiments show that our methodology generates adversarial images that transfer with a high success rate to unseen detectors, which can be used to quickly provide an approximate yet still reliable estimate of a detector's adversarial robustness. Our findings indicate that current state-of-the-art AI-generated image detectors can be easily deceived by adversarial examples, highlighting the critical need for the development of more robust methods. We release our dataset at https://huggingface.co/datasets/aimagelab/RAID and evaluation code at https://github.com/pralab/RAID.
中文摘要:当前最先进的人工智能生成图像检测器极易受到对抗性样本的攻击,RAID数据集揭示了现有检测方法的严重脆弱性,凸显了开发更强健检测技术的迫切需求。
English Summary: AI-generated image detectors are highly vulnerable to adversarial attacks, as demonstrated by the RAID dataset, which exposes significant weaknesses in current state-of-the-art detection methods and underscores the urgent need for more robust solutions.
Authors:Takeshi Saga, Catherine Pelachaud
Abstract:
Turn-taking management is crucial for any social interaction. Still, it is challenging to model in human-machine interaction due to the complexity of the social context and its multimodal nature. Unlike conventional systems based on silence duration, existing voice activity projection (VAP) models have successfully utilized a unified representation of turn-taking behaviors as prediction targets, which improved turn-taking prediction performance. Recently, a multimodal VAP model outperformed the previous state-of-the-art model by a significant margin. In this paper, we propose a multimodal model enhanced with pre-trained audio and face encoders to improve performance by capturing subtle expressions. Our model performed competitively, and in some cases, even better than state-of-the-art models on turn-taking metrics. All the source codes and pretrained models are available at https://github.com/sagatake/VAPwithAudioFaceEncoders.
Chinese: 本文提出了一种结合预训练音频和面部编码器的多模态模型,通过捕捉细微表情来提升对话轮次预测性能,在关键指标上表现优异,甚至超越了现有最优模型。
English: This paper introduces a multimodal model enhanced with pre-trained audio and face encoders that captures subtle expressions to improve turn-taking prediction, performing competitively or even surpassing state-of-the-art models on key metrics.
Authors:Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao
Abstract:
The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora. Data, models and codes will be available at https://github.com/Ignoramus0817/SynthQuestions.
中文: 本文提出SynthQuestions数据集,通过结合自上而下的用户归因和自下而上的网络文档合成的新颖归因基础框架,生成百万条指令以改进大语言模型对齐,在多个基准测试中取得了领先性能。
English: This paper introduces SynthQuestions, a dataset of one million instructions generated through a novel attributed grounding framework that combines top-down user-based attribution with bottom-up synthesis from web documents to enhance large language model alignment, achieving state-of-the-art performance on benchmarks.
Authors:HyunGi Kim, Jisoo Mok, Dongjun Lee, Jaihyun Lew, Sungjae Kim, Sungroh Yoon
Abstract:
Utilizing the complex inter-variable causal relationships within multivariate time-series provides a promising avenue toward more robust and reliable multivariate time-series anomaly detection (MTSAD) but remains an underexplored area of research. This paper proposes Causality-Aware contrastive learning for RObust multivariate Time-Series (CAROTS), a novel MTSAD pipeline that incorporates the notion of causality into contrastive learning. CAROTS employs two data augmentors to obtain causality-preserving and -disturbing samples that serve as a wide range of normal variations and synthetic anomalies, respectively. With causality-preserving and -disturbing samples as positives and negatives, CAROTS performs contrastive learning to train an encoder whose latent space separates normal and abnormal samples based on causality. Moreover, CAROTS introduces a similarity-filtered one-class contrastive loss that encourages the contrastive learning process to gradually incorporate more semantically diverse samples with common causal relationships. Extensive experiments on five real-world and two synthetic datasets validate that the integration of causal relationships endows CAROTS with improved MTSAD capabilities. The code is available at https://github.com/kimanki/CAROTS.
中文摘要:本文提出CAROTS方法,通过引入因果关系的对比学习框架,利用保持和干扰因果关系的样本增强技术,有效提升了多元时间序列异常检测的性能。
English Summary: This paper introduces CAROTS, a novel multivariate time-series anomaly detection method that integrates causality into contrastive learning through causality-preserving and -disturbing data augmentations to better distinguish normal and abnormal patterns.
Authors:Aojun Lu, Tao Feng, Hangjie Yuan, Chunhui Ding, Yanan Sun
Abstract:
Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). Although pre-trained models (PTMs) have provided a strong foundation for CL, existing approaches face a fundamental challenge in balancing these two competing objectives. Current methods typically address stability by freezing the PTM backbone, which severely limits the model's plasticity, particularly when the incoming data distribution diverges significantly from the pre-training data. Alternatively, sequentially fine-tuning the entire PTM can adapt to new knowledge but often leads to catastrophic forgetting, highlighting the critical stability-plasticity trade-off in PTM-based CL. To address this limitation, we propose Adapting PTMs before the core CL process (ACL), a novel framework that introduces a plug-and-play adaptation phase prior to learning each new task. During this phase, ACL refines the PTM backbone by aligning embeddings with their original class prototypes while distancing them from irrelevant classes. This mechanism theoretically and empirically demonstrates a desirable balance between stability and plasticity, significantly improving CL performance across benchmarks and integrated methods. Code is available at https://github.com/byyx666/ACL_code.
中文: 提出的ACL框架在持续学习任务前引入即插即用的适应阶段,通过将嵌入向量与原始类别原型对齐来优化预训练模型,有效平衡稳定性与可塑性以提升性能。
English: The proposed ACL framework introduces a plug-and-play adaptation phase before continual learning tasks to refine pre-trained models by aligning embeddings with original class prototypes, effectively balancing stability and plasticity to enhance performance.
Authors:Jianqing Zhang, Xinghao Wu, Yanbing Zhou, Xiaoting Sun, Qiqi Cai, Yang Liu, Yang Hua, Zhenzhe Zheng, Jian Cao, Qiang Yang
Abstract:
As AI evolves, collaboration among heterogeneous models helps overcome data scarcity by enabling knowledge transfer across institutions and devices. Traditional Federated Learning (FL) only supports homogeneous models, limiting collaboration among clients with heterogeneous model architectures. To address this, Heterogeneous Federated Learning (HtFL) methods are developed to enable collaboration across diverse heterogeneous models while tackling the data heterogeneity issue at the same time. However, a comprehensive benchmark for standardized evaluation and analysis of the rapidly growing HtFL methods is lacking. Firstly, the highly varied datasets, model heterogeneity scenarios, and different method implementations become hurdles to making easy and fair comparisons among HtFL methods. Secondly, the effectiveness and robustness of HtFL methods are under-explored in various scenarios, such as the medical domain and sensor signal modality. To fill this gap, we introduce the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible framework that integrates multiple datasets and model heterogeneity scenarios, offering a robust benchmark for research and practical applications. Specifically, HtFLlib integrates (1) 12 datasets spanning various domains, modalities, and data heterogeneity scenarios; (2) 40 model architectures, ranging from small to large, across three modalities; (3) a modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods; and (4) systematic evaluations in terms of accuracy, convergence, computation costs, and communication costs. We emphasize the advantages and potential of state-of-the-art HtFL methods and hope that HtFLlib will catalyze advancing HtFL research and enable its broader applications. The code is released at https://github.com/TsingZ0/HtFLlib.
Chinese: 为解决传统联邦学习无法支持异构模型及缺乏标准化基准的问题,异构联邦学习库(HtFLlib)被提出作为一个可扩展框架,整合了多样化数据集、模型架构和方法,以推动该领域的研究与应用。
English: To address the limitations of traditional Federated Learning in supporting heterogeneous models and the lack of a standardized benchmark, the Heterogeneous Federated Learning Library (HtFLlib) is introduced as an extensible framework integrating diverse datasets, model architectures, and methods to facilitate research and applications in this field.
Authors:Aojun Lu, Hangjie Yuan, Tao Feng, Yanan Sun
Abstract:
The quest for Continual Learning (CL) seeks to empower neural networks with the ability to learn and adapt incrementally. Central to this pursuit is addressing the stability-plasticity dilemma, which involves striking a balance between two conflicting objectives: preserving previously learned knowledge and acquiring new knowledge. While numerous CL methods aim to achieve this trade-off, they often overlook the impact of network architecture on stability and plasticity, restricting the trade-off to the parameter level. In this paper, we delve into the conflict between stability and plasticity at the architectural level. We reveal that under an equal parameter constraint, deeper networks exhibit better plasticity, while wider networks are characterized by superior stability. To address this architectural-level dilemma, we introduce a novel framework denoted Dual-Arch, which serves as a plug-in component for CL. This framework leverages the complementary strengths of two distinct and independent networks: one dedicated to plasticity and the other to stability. Each network is designed with a specialized and lightweight architecture, tailored to its respective objective. Extensive experiments demonstrate that Dual-Arch enhances the performance of existing CL methods while being up to 87% more compact in terms of parameters. Code: https://github.com/byyx666/Dual-Arch.
Chinese: 本文提出Dual-Arch框架,通过分别针对可塑性和稳定性设计的两个独立网络,在架构层面解决持续学习中的稳定性-可塑性平衡问题,在提升性能的同时将参数量减少高达87%。
English: The paper introduces Dual-Arch, a plug-in framework for continual learning that utilizes two specialized networks—one for plasticity and one for stability—to address the stability-plasticity dilemma at the architectural level, improving performance while reducing parameters by up to 87%.
Authors:Junnan Zhu, Jingyi Wang, Bohan Yu, Xiaoyu Wu, Junbo Li, Lei Wang, Nan Xu
Abstract:
LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). Besides, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements. We make our dataset available here: https://github.com/wenge-research/TableEval.
Chinese: 作者提出了TableEval基准,通过整合多样化表格结构、多语言数据和真实领域内容来弥补现有TableQA系统的不足,并开发了与人类判断高度一致的SEAT评估框架。
English: The authors introduce TableEval, a comprehensive benchmark addressing limitations in existing TableQA systems by incorporating diverse table structures, multilingual data, and real-world domains, along with a new evaluation framework SEAT that better aligns with human judgment.
Authors:Theodore Barfoot, Luis C. Garcia-Peraza-Herrera, Samet Akcay, Ben Glocker, Tom Vercauteren
Abstract:
Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration, over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: https://github.com/cai4cai/Average-Calibration-Losses
Chinese: 本研究提出了一种可微分的边际L1平均校准误差(mL1-ACE)损失函数,通过硬分箱和软分箱方法有效提升医学图像分割网络的校准性能,在保持分割精度的同时显著降低了校准误差,增强了深度学习模型在临床应用中的可靠性。
English: This study introduces a differentiable marginal L1 Average Calibration Error (mL1-ACE) loss function to improve the calibration of medical image segmentation networks, significantly reducing calibration errors while largely preserving segmentation accuracy across multiple datasets.
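To make the binning idea concrete, below is a minimal per-image sketch of a soft-binned marginal calibration term in PyTorch. The function name, tensor layout, and Gaussian-style soft assignment (with a `temperature` width) are our assumptions for illustration, not the authors' reference implementation (which is in the repository above).

```python
import torch

def soft_binned_ml1_ace(probs, targets, n_bins=10, temperature=0.01):
    """Soft-binned marginal L1 Average Calibration Error for one image.

    probs:   (C, H, W) softmax outputs; targets: (C, H, W) one-hot labels.
    Soft bin assignments keep the term differentiable, so it can serve as
    an auxiliary loss alongside Dice + cross-entropy.
    """
    C = probs.shape[0]
    p = probs.reshape(C, -1)                 # per-class pixel confidences
    y = targets.reshape(C, -1).float()
    centers = torch.linspace(0.5 / n_bins, 1 - 0.5 / n_bins, n_bins, device=p.device)
    # Differentiable (soft) assignment of every pixel to every bin.
    w = torch.softmax(-(p.unsqueeze(-1) - centers) ** 2 / temperature, dim=-1)  # (C, N, B)
    mass = w.sum(dim=1) + 1e-8               # soft pixel count per class/bin
    conf = (w * p.unsqueeze(-1)).sum(dim=1) / mass   # mean confidence per bin
    acc = (w * y.unsqueeze(-1)).sum(dim=1) / mass    # mean accuracy per bin
    # ACE averages |confidence - accuracy| uniformly over bins, then classes.
    return (conf - acc).abs().mean()
```

In training, such a term would be weighted against the segmentation objective, e.g. `loss = dice_ce + lam * soft_binned_ml1_ace(probs, onehot)`, with `lam` a tuning weight.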
Authors:Junqi Gao, Xiang Zou, Ying Ai, Dong Li, Yichen Niu, Biqing Qi, Jianxing Liu
Abstract:
Graph Retrieval Augmented Generation (GraphRAG) effectively enhances external knowledge integration capabilities by explicitly modeling knowledge relationships, thereby improving the factual accuracy and generation quality of Large Language Models (LLMs) in specialized domains. However, existing methods suffer from two inherent limitations: 1) Inefficient Information Aggregation: They rely on a single agent and fixed iterative patterns, making it difficult to adaptively capture multi-level textual, structural, and degree information within graph data. 2) Rigid Reasoning Mechanism: They employ preset reasoning schemes, which cannot dynamically adjust reasoning depth nor achieve precise semantic correction. To overcome these limitations, we propose Graph Counselor, a GraphRAG method based on multi-agent collaboration. This method uses the Adaptive Graph Information Extraction Module (AGIEM), where Planning, Thought, and Execution Agents work together to precisely model complex graph structures and dynamically adjust information extraction strategies, addressing the challenges of multi-level dependency modeling and adaptive reasoning depth. Additionally, the Self-Reflection with Multiple Perspectives (SR) module improves the accuracy and semantic consistency of reasoning results through self-reflection and backward reasoning mechanisms. Experiments demonstrate that Graph Counselor outperforms existing methods in multiple graph reasoning tasks, exhibiting higher reasoning accuracy and generalization ability. Our code is available at https://github.com/gjq100/Graph-Counselor.git.
中文摘要:Graph Counselor通过多智能体协作和自反思机制,克服了现有GraphRAG方法在信息聚合和推理机制上的局限性,实现了自适应信息提取和动态推理深度调整,在图推理任务中表现出更优性能。
English Summary: Graph Counselor overcomes the limitations of existing GraphRAG methods by employing multi-agent collaboration and self-reflection mechanisms to achieve adaptive information extraction and dynamic reasoning depth adjustment, demonstrating superior performance in graph reasoning tasks.
Authors:Cédric Léonard, Dirk Stober, Martin Schulz
Abstract:
New UAV technologies and the NewSpace era are transforming Earth Observation missions and data acquisition. Numerous small platforms generate large data volumes, straining bandwidth and requiring onboard decision-making to transmit high-quality information in time. While Machine Learning allows real-time autonomous processing, FPGAs balance performance with adaptability to mission-specific requirements, enabling onboard deployment. This review systematically analyzes 66 experiments deploying ML models on FPGAs for Remote Sensing applications. We introduce two distinct taxonomies to capture both efficient model architectures and FPGA implementation strategies. For transparency and reproducibility, we follow PRISMA 2020 guidelines and share all data and code at https://github.com/CedricLeon/Survey_RS-ML-FPGA.
中文: 新型无人机和太空技术通过部署在FPGA上的机器学习模型实现了地球观测的实时数据处理,本文系统分析了66个相关实验并公开所有数据以确保透明度。
English: New UAV and NewSpace technologies are revolutionizing Earth Observation by enabling real-time data processing through machine learning models deployed on FPGAs, as systematically reviewed in 66 experiments with shared data for transparency.
Authors:Marcin Kowalczyk, Kamil Jeziorek, Tomasz Kryjak
Abstract:
Event-based sensors offer significant advantages over traditional frame-based cameras, especially in scenarios involving rapid motion or challenging lighting conditions. However, event data frequently suffers from considerable noise, negatively impacting the performance and robustness of deep learning models. Traditionally, this problem has been addressed by applying filtering algorithms to the event stream, but these may also remove some of the relevant data. In this paper, we propose a novel noise-injection training methodology designed to enhance the robustness of neural networks against varying levels of event noise. Our approach introduces controlled noise directly into the training data, enabling models to learn noise-resilient representations. We have conducted extensive evaluations of the proposed method using multiple benchmark datasets (N-Caltech101, N-Cars, and Mini N-ImageNet) and various network architectures, including Convolutional Neural Networks, Vision Transformers, Spiking Neural Networks, and Graph Convolutional Networks. Experimental results show that our noise-injection training strategy achieves stable performance over a range of noise intensities, consistently outperforms event-filtering techniques, and achieves the highest average classification accuracy, making it a viable alternative to traditional event-data filtering methods in an object classification system. Code: https://github.com/vision-agh/DVS_Filtering
中文摘要:本文提出一种噪声注入训练方法,通过在训练过程中注入受控噪声来增强神经网络对事件数据噪声的鲁棒性,在多个数据集和网络架构上的实验表明,该方法比传统滤波技术获得了更高的分类准确率。
English Summary: This paper introduces a noise-injection training method that enhances neural network robustness against event data noise by injecting controlled noise during training, achieving superior classification accuracy across multiple datasets and architectures compared to traditional filtering techniques.
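A minimal sketch of what controlled noise injection for event streams might look like, assuming events arrive as an (N, 4) array of [x, y, t, polarity]; the array layout, sensor size, and uniform spatio-temporal noise model are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def inject_event_noise(events, noise_ratio, sensor_size=(240, 180)):
    """Mix uniformly distributed synthetic noise events into an event stream.

    events: (N, 4) array of [x, y, t, polarity] (an assumed layout).
    noise_ratio: fraction of additional noise events relative to N.
    """
    n_noise = int(len(events) * noise_ratio)
    t_min, t_max = events[:, 2].min(), events[:, 2].max()
    noise = np.stack([
        np.random.randint(0, sensor_size[0], n_noise),  # x coordinate
        np.random.randint(0, sensor_size[1], n_noise),  # y coordinate
        np.random.uniform(t_min, t_max, n_noise),       # timestamp
        np.random.choice([0, 1], n_noise),              # polarity (0/1 assumed)
    ], axis=1).astype(events.dtype)
    mixed = np.concatenate([events, noise], axis=0)
    return mixed[np.argsort(mixed[:, 2])]               # restore temporal order
```

Resampling `noise_ratio` per training batch (e.g., uniformly from [0, 0.5]) is one way to expose the model to a range of noise intensities.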
Authors:Junyi Chen, Shihao Bai, Zaijun Wang, Siyu Wu, Chuheng Du, Hailong Yang, Ruihao Gong, Shengzhong Liu, Fan Wu, Guihai Chen
Abstract:
Extensive LLM applications demand efficient structured generations, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), leading to runtime execution overhead for context-dependent token processing, especially inefficient under large inference batches. To address these issues, we propose Pre$^3$ that exploits deterministic pushdown automata (DPDA) to optimize the constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during the preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at https://github.com/ModelTC/lightllm.
中文: 提出的Pre$^3$方法通过将LR(1)文法转换为具有预计算前缀条件边的确定性下推自动机,优化了大型语言模型的解码效率,实现并行转换并降低运行时开销,使令牌生成速度提升最高达40%,吞吐量提高最高达36%。
English: The proposed Pre$^3$ method optimizes LLM decoding efficiency by transforming LR(1) grammars into deterministic pushdown automata with precomputed prefix-conditioned edges, enabling parallel transitions and reducing runtime overhead to achieve up to 40% faster token generation and 36% higher throughput.
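The runtime benefit comes from moving all path exploration offline: once every (state, stack-top) pair maps deterministically to its legal tokens, a decoding step reduces to a mask-and-transition lookup. A hypothetical sketch follows; the `transitions` table layout is our own simplification, not the paper's data structure.

```python
import torch

def constrained_decode_step(logits, state, stack_top, transitions):
    """One constrained decoding step over a precomputed DPDA-style table.

    Assumed layout: `transitions` maps (state, stack_top) ->
    {token_id: (next_state, stack_op)}, built offline so no path
    exploration is needed at runtime.
    """
    allowed = transitions[(state, stack_top)]  # valid prefixes always have legal tokens
    mask = torch.full_like(logits, float("-inf"))
    ids = torch.tensor(list(allowed), dtype=torch.long, device=logits.device)
    mask[ids] = 0.0                            # only legal continuations survive
    token = int(torch.argmax(logits + mask))   # greedy for illustration; sampling also works
    next_state, stack_op = allowed[token]
    return token, next_state, stack_op
```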
Authors:Sam Pollard, Michael Wray
Abstract:
Video transformer models require huge amounts of compute resources due to the spatio-temporal scaling of the input. To tackle this, recent methods have proposed dropping or merging tokens in image models, whether randomly or via learned methods. Merging tokens has many benefits: it can be plugged into any vision transformer, does not require model re-training, and it propagates information that would otherwise be dropped through the model. Before now, video token merging has not been evaluated on temporally complex datasets for video understanding. In this work, we explore training-free token merging for video to provide comprehensive experiments and find best practices across four video transformers on three datasets that exhibit coarse and fine-grained action recognition. Our results showcase the benefits of video token merging with a speedup of around $2.5$X while maintaining accuracy (avg. $-0.55\%$ for ViViT). Code available at https://github.com/sjpollard/video-how-do-your-tokens-merge.
Chinese: 本研究探索了视频变换器的免训练令牌合并方法,在多个模型和数据集上实现了约2.5倍加速,同时保持了接近原始的准确率。
English: This study explores training-free token merging for video transformers, achieving a 2.5X speedup while maintaining near-original accuracy across multiple models and datasets.
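For intuition, here is a compact sketch of the ToMe-style bipartite merging step such work builds on: tokens are split into two alternating sets, each token in one set is matched to its most similar partner in the other, and the r highest-similarity pairs are averaged. This omits details the paper studies, such as tracking merged-token sizes for proportional attention.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Training-free bipartite token merging (ToMe-style sketch).

    x: (B, N, C) tokens inside a transformer block; merges the r most
    similar alternating-token pairs by averaging, shrinking N to N - r.
    """
    B, N, C = x.shape
    a, b = x[:, ::2], x[:, 1::2]                         # split into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)
    score, match = sim.max(dim=-1)                       # best partner in b per a-token
    order = score.argsort(dim=-1)
    keep, drop = order[:, :-r], order[:, -r:]            # merge only the top-r pairs
    src = a.gather(1, drop.unsqueeze(-1).expand(-1, -1, C))
    idx = match.gather(1, drop).unsqueeze(-1).expand(-1, -1, C)
    b = b.scatter_reduce(1, idx, src, reduce="mean", include_self=True)
    a_kept = a.gather(1, keep.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([a_kept, b], dim=1)                 # (B, N - r, C)
```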
Authors:Mingxuan Xia, Haobo Wang, Yixuan Li, Zewei Yu, Jindong Wang, Junbo Zhao, Runze Wu
Abstract:
Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when incurring uncertainty. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
中文摘要:本文提出一种候选标注新范式,通过让大语言模型对不确定样本输出所有可能的标签,并设计名为CanDist的师生框架,利用小语言模型蒸馏这些候选标注,从而为下游任务提供更高质量的数据保障。
English Summary: This paper introduces a candidate annotation paradigm that leverages LLMs to generate multiple possible labels for uncertain samples, coupled with a teacher-student framework called CanDist that distills these annotations using a smaller language model to ensure data quality for downstream tasks.
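One natural way to train a student on candidate sets is a partial-label objective that maximizes the probability mass the student places on the teacher's candidates. The sketch below is our illustration of that idea, not CanDist's exact distillation loss.

```python
import torch
import torch.nn.functional as F

def candidate_distillation_loss(logits, candidate_mask):
    """Partial-label style objective for a student SLM on candidate sets.

    logits:         (B, C) student outputs.
    candidate_mask: (B, C) with 1 where the teacher LLM listed the class
    as plausible. Maximizes P(y in candidate set) under the student.
    """
    log_p = F.log_softmax(logits, dim=-1)
    masked = log_p.masked_fill(candidate_mask == 0, float("-inf"))
    # log P(candidate set) = logsumexp over the candidate entries only.
    return -torch.logsumexp(masked, dim=-1).mean()
```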
Authors:Xiao-Qi Han, Ze-Feng Gao, Xin-De Wang, Zhenfeng Ouyang, Peng-Jie Guo, Zhong-Yi Lu
Abstract:
The discovery of high-temperature superconducting materials holds great significance for human industry and daily life. In recent years, research on predicting superconducting transition temperatures using artificial intelligence~(AI) has gained popularity, with most of these tools claiming to achieve remarkable accuracy. However, the lack of widely accepted benchmark datasets in this field has severely hindered fair comparisons between different AI algorithms and impeded further advancement of these methods. In this work, we present the HTSC-2025, an ambient-pressure high-temperature superconducting benchmark dataset. This comprehensive compilation encompasses theoretically predicted superconducting materials discovered by theoretical physicists from 2023 to 2025 based on BCS superconductivity theory, including the renowned X$_2$YH$_6$ system, perovskite MXH$_3$ system, M$_3$XH$_8$ system, cage-like BCN-doped metal atomic systems derived from LaH$_{10}$ structural evolution, and two-dimensional honeycomb-structured systems evolving from MgB$_2$. The HTSC-2025 benchmark has been open-sourced at https://github.com/xqh19970407/HTSC-2025 and will be continuously updated. This benchmark holds significant importance for accelerating the discovery of superconducting materials using AI-based methods.
中文: HTSC-2025基准数据集通过汇编2023至2025年基于BCS理论预测的超导材料,解决了该领域缺乏标准化数据的问题,为AI算法提供公平比较基准,将加速超导材料的发现进程。
English: The HTSC-2025 benchmark dataset addresses the lack of standardized data in AI-driven high-temperature superconductor research by compiling theoretical predictions from 2023-2025, enabling fair algorithm comparisons and accelerating material discovery.
Authors:Dhaval Patel, Shuxin Lin, James Rayfield, Nianjun Zhou, Roman Vaculin, Natalia Martinez, Fearghal O'donncha, Jayant Kalagnanam
Abstract:
AI for Industrial Asset Lifecycle Management aims to automate complex operational workflows -- such as condition monitoring, maintenance planning, and intervention scheduling -- to reduce human workload and minimize system downtime. Traditional AI/ML approaches have primarily tackled these problems in isolation, solving narrow tasks within the broader operational pipeline. In contrast, the emergence of AI agents and large language models (LLMs) introduces a next-generation opportunity: enabling end-to-end automation across the entire asset lifecycle. This paper envisions a future where AI agents autonomously manage tasks that previously required distinct expertise and manual coordination. To this end, we introduce AssetOpsBench -- a unified framework and environment designed to guide the development, orchestration, and evaluation of domain-specific agents tailored for Industry 4.0 applications. We outline the key requirements for such holistic systems and provide actionable insights into building agents that integrate perception, reasoning, and control for real-world industrial operations. The software is available at https://github.com/IBM/AssetOpsBench.
中文: 本文提出AssetOpsBench统一框架,利用AI智能体与大语言模型实现工业资产全生命周期端到端自动化,旨在以集成化系统替代传统孤立解决方案,降低停机时间与人力成本。
English: This paper introduces AssetOpsBench, a unified framework leveraging AI agents and LLMs to enable end-to-end automation of industrial asset lifecycle management, aiming to replace isolated traditional approaches with integrated systems that reduce downtime and human effort.
Authors:Fabian Karl, Ansgar Scherp
Abstract:
Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation. For evaluating CRAWLDoc, we have created a new, manually labeled dataset of 600 publications from six top publishers in computer science. Our method CRAWLDoc demonstrates a robust and layout-independent ranking of relevant documents across publishers and data formats. It lays the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset can be accessed at https://github.com/FKarl/CRAWLDoc.
中文: CRAWLDoc是一种新颖的方法,通过出版物URL对链接网页文档进行上下文排序,实现了跨不同格式和出版商的、独立于网页布局的稳健元数据提取。
English: CRAWLDoc is a novel method that contextually ranks linked web documents from publication URLs, enabling robust and layout-independent metadata extraction across diverse formats and publishers.
Authors:Fei Zhang, Pei Zhang, Baosong Yang, Fei Huang, Yanfeng Wang, Ya Zhang
Abstract:
This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they use a straightforward image-label compositor as the prompt and query input, and then mask the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at https://github.com/Ferenas/ConText.
中文: 本研究首次将视觉上下文学习应用于光学字符识别,通过任务链组合器和上下文感知聚合增强推理能力,提出的ConText模型在多项基准测试中达到了最先进的性能。
English: This study introduces ConText, the first model to adapt visual in-context learning for optical character recognition by employing a task-chaining compositor and context-aware aggregation to enhance reasoning, achieving state-of-the-art results across benchmarks.
Authors:Hao Yu, Tangyu Jiang, Shuning Jia, Shannan Yan, Shunning Liu, Haolong Qian, Guanghao Li, Shuting Dong, Huaisong Zhang, Chun Yuan
Abstract:
The Transformer architecture has revolutionized numerous domains since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to a lack of robustness and positional flexibility. Therefore, Rotary Positional Encoding (RoPE) was proposed to alleviate these issues, which integrates positional information by rotating the embeddings in the attention mechanism. However, RoPE requires manually defined rotation matrices with limited transformation space, constraining the model's capacity. In this work, we propose ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, which is an essential condition that ensures consistent performance with position offsets. Based on the theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research. To ensure reproducibility, the source code and instructions are available at https://github.com/Longin-Yu/ComRoPE
中文: 本文提出的ComRoPE通过可训练的交换角度矩阵改进了旋转位置编码,解决了原有方法的局限性,在ImageNet-1K数据集上取得了超越现有最佳方法1.6%-2.9%的性能提升。
English: This paper introduces ComRoPE, a trainable extension of Rotary Positional Encoding that uses commuting angle matrices to overcome RoPE's limitations, achieving state-of-the-art performance improvements on ImageNet-1K.
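To see why commutativity is easy to satisfy with block structure: 2x2 rotations acting on disjoint coordinate pairs always commute, so making their angles trainable already yields one valid family of solutions. Below is a sketch of that construction; the parameterization and initialization scale are our illustrative choices, not necessarily either of the paper's two proposed matrix types.

```python
import torch
import torch.nn as nn

class TrainableBlockRoPE(nn.Module):
    """Rotary encoding with trainable per-block angle parameters.

    2x2 rotation blocks acting on disjoint coordinate pairs commute
    pairwise, which is one sufficient way to satisfy the commutativity
    requirement discussed above.
    """
    def __init__(self, dim):
        super().__init__()
        assert dim % 2 == 0
        self.freqs = nn.Parameter(torch.randn(dim // 2) * 0.02)  # trainable angles

    def forward(self, x, pos):
        # x: (..., seq, dim); pos: (seq,) integer positions
        angles = pos[:, None].float() * self.freqs[None, :]      # (seq, dim/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin   # rotate each coordinate pair
        out[..., 1::2] = x1 * sin + x2 * cos
        return out
```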
Authors:Shuai Liu, Mingyue Cui, Boyang Li, Quanmin Liang, Tinghe Hong, Kai Huang, Yunxiao Shan, Kai Huang
Abstract:
Fully sparse 3D detectors have recently gained significant attention due to their efficiency in long-range detection. However, sparse 3D detectors extract features only from non-empty voxels, which impairs long-range interactions and leaves center features missing. The former weakens the feature extraction capability, while the latter hinders network optimization. To address these challenges, we introduce the Fully Sparse Hybrid Network (FSHNet). FSHNet incorporates a proposed SlotFormer block to enhance the long-range feature extraction capability of existing sparse encoders. The SlotFormer divides sparse voxels using a slot partition approach, which, compared to traditional window partition, provides a larger receptive field. Additionally, we propose a dynamic sparse label assignment strategy to deeply optimize the network by providing more high-quality positive samples. To further enhance performance, we introduce a sparse upsampling module to refine downsampled voxels, preserving fine-grained details crucial for detecting small objects. Extensive experiments on the Waymo, nuScenes, and Argoverse2 benchmarks demonstrate the effectiveness of FSHNet. The code is available at https://github.com/Say2L/FSHNet.
中文: FSHNet通过引入SlotFormer模块和动态稀疏标签分配策略,增强了全稀疏3D检测器的长距离特征提取能力和网络优化,在多个主流基准测试中表现出色。
English: FSHNet introduces a SlotFormer block and dynamic sparse label assignment to enhance long-range feature extraction and network optimization in fully sparse 3D detectors, achieving superior performance on major benchmarks.
Authors:Yisen Feng, Haoyu Zhang, Qiaohui Chu, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Abstract:
In this report, we present our champion solutions for the three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025. All tracks require precise localization of the interval within an untrimmed egocentric video. Previous unified video localization approaches often rely on late fusion strategies, which tend to yield suboptimal results. To address this, we adopt an early fusion-based video localization model to tackle all three tasks, aiming to enhance localization accuracy. Ultimately, our method achieved first place in the Natural Language Queries, Goal Step, and Moment Queries tracks, demonstrating its effectiveness. Our code can be found at https://github.com/Yisen-Feng/OSGNet.
中文摘要:我们基于早期融合的视频定位模型在Ego4D情景记忆挑战赛的三个赛道中均获冠军,通过精确识别未剪辑自中心视频中的时间区间,证明了该方法的卓越性能。
English Summary: Our early fusion-based video localization model won first place across all three tracks of the Ego4D Episodic Memory Challenge by precisely localizing intervals in untrimmed egocentric videos.
Authors:Aditya Gandhamal, Aniruddh Sikdar, Suresh Sundaram
Abstract:
Open-vocabulary semantic segmentation (OVSS) entails assigning semantic labels to each pixel in an image using textual descriptions, typically leveraging world models such as CLIP. To enhance out-of-domain generalization, we propose Cost Aggregation with Optimal Transport (OV-COAST) for open-vocabulary semantic segmentation. To align visual-language features within the framework of optimal transport theory, we employ cost volume to construct a cost matrix, which quantifies the distance between two distributions. Our approach adopts a two-stage optimization strategy: in the first stage, the optimal transport problem is solved using cost volume via Sinkhorn distance to obtain an alignment solution; in the second stage, this solution is used to guide the training of the CAT-Seg model. We evaluate state-of-the-art OVSS models on the MESS benchmark, where our approach notably improves the performance of the cost-aggregation model CAT-Seg with ViT-B backbone, achieving superior results, surpassing CAT-Seg by 1.72% and SAN-B by 4.9% mIoU. The code is available at https://github.com/adityagandhamal/OV-COAST/.
Chinese: 本文提出OV-COAST方法,通过最优传输的成本聚合来对齐视觉-语言特征,从而提升开放词汇语义分割的性能,在MESS基准测试中取得了领先结果。
English: This paper introduces OV-COAST, a method that enhances open-vocabulary semantic segmentation by applying cost aggregation with optimal transport to align visual-language features, achieving superior performance on the MESS benchmark.
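The first-stage alignment relies on entropy-regularized optimal transport, which can be solved with standard Sinkhorn iterations over the cost matrix. A generic sketch with uniform marginals is given below; the paper's cost volume construction and second-stage guidance are not shown.

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (N, M) cost matrix between visual and language features.
    Returns the transport plan; uniform marginals assumed for simplicity.
    """
    K = torch.exp(-cost / eps)                             # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    b = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):                               # alternating scaling
        u = a / (K @ v + 1e-9)
        v = b / (K.t() @ u + 1e-9)
    return u[:, None] * K * v[None, :]                     # transport plan
```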
Authors:Pei-Yun Lin, Yen-lung Tsai
Abstract:
This research introduces ScoreRAG, an approach to enhance the quality of automated news generation. Despite advancements in Natural Language Processing and large language models, current news generation methods often struggle with hallucinations, factual inconsistencies, and lack of domain-specific expertise when producing news articles. ScoreRAG addresses these challenges through a multi-stage framework combining retrieval-augmented generation, consistency relevance evaluation, and structured summarization. The system first retrieves relevant news documents from a vector database, maps them to complete news items, and assigns consistency relevance scores based on large language model evaluations. These documents are then reranked according to relevance, with low-quality items filtered out. The framework proceeds to generate graded summaries based on relevance scores, which guide the large language model in producing complete news articles following professional journalistic standards. Through this methodical approach, ScoreRAG aims to significantly improve the accuracy, coherence, informativeness, and professionalism of generated news articles while maintaining stability and consistency throughout the generation process. The code and demo are available at: https://github.com/peiyun2260/ScoreRAG.
Chinese: ScoreRAG通过结合检索增强生成、相关性评分和结构化摘要的多阶段框架,旨在减少自动新闻生成中的幻觉问题,提升准确性、连贯性和专业性。
English: ScoreRAG is a multi-stage framework that enhances automated news generation by integrating retrieval-augmented generation, relevance scoring, and structured summarization to reduce hallucinations and improve accuracy, coherence, and professionalism.
Authors:Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng
Abstract:
Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential token generation process, where each token must be generated before the next can be processed. This sequential dependency restricts the ability to fully leverage modern hardware's parallel processing capabilities. Existing methods like speculative decoding and layer skipping offer potential speedups but have notable drawbacks: speculative decoding relies on an auxiliary "drafter" model, which can be challenging to acquire and increases memory overhead, while layer skipping may introduce discrepancies in the outputs due to the missing key-value cache at skipped layers. In this work, we propose AdaDecode, which accelerates LLM decoding without requiring auxiliary models or changes to the original model parameters, while ensuring output consistency. AdaDecode leverages the insight that many tokens can be accurately generated at intermediate layers, as further layers often do not significantly alter predictions once the model reaches a certain confidence. By adaptively generating tokens at intermediate layers when confidence is high, AdaDecode enables the next token's computation to begin immediately. The remaining layer computations for early-predicted tokens are deferred and executed in parallel with subsequent tokens when needed, maximizing hardware utilization and reducing decoding latency. A final verification step ensures that early predictions match the results of standard autoregressive decoding, preserving output parity. Experiments across diverse generation tasks show that AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup, while guaranteeing output parity with standard autoregressive decoding.
中文摘要:AdaDecode通过在高置信度时自适应地在中间层生成令牌,实现并行计算,在保证与标准自回归解码输出一致性的同时,将解码吞吐量最高提升1.73倍。
English Summary: AdaDecode accelerates LLM decoding by adaptively generating tokens at intermediate layers when confidence is high, enabling parallel computation and achieving up to 1.73x speedup while maintaining output consistency with standard autoregressive decoding.
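Schematically, the core early-exit decision looks like the following sketch (batch size 1 assumed). The deferred completion of skipped layers and the final verification pass, which are essential to the method's output parity guarantee, are only indicated in comments.

```python
import torch

@torch.no_grad()
def adaptive_layer_decode(hidden, layers, lm_head, threshold=0.9):
    """Emit a token early when an intermediate layer is confident.

    Sketch only: the real method defers the remaining layers for early
    tokens, runs them in parallel with later steps to keep the KV cache
    complete, and verifies early predictions against full-depth outputs.
    """
    token = None
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        probs = torch.softmax(lm_head(hidden[:, -1]), dim=-1)
        conf, token = probs.max(dim=-1)
        if conf.item() >= threshold:
            return token, i        # layers i+1.. are deferred, not skipped
    return token, len(layers) - 1  # no early exit: standard full-depth step
```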
Authors:Chunqi Wang, Bingchao Wu, Zheng Chen, Lei Shen, Bing Wang, Xiaoyi Zeng
Abstract:
Discriminative recommendation tasks, such as CTR (click-through rate) and CVR (conversion rate) prediction, play critical roles in the ranking stage of large-scale industrial recommender systems. However, training a discriminative model encounters a significant overfitting issue induced by data sparsity. Moreover, this overfitting issue worsens with larger models, causing them to underperform smaller ones. To address the overfitting issue and enhance model scalability, we propose a framework named GPSD (\textbf{G}enerative \textbf{P}retraining for \textbf{S}calable \textbf{D}iscriminative Recommendation), drawing inspiration from generative training, which exhibits no evident signs of overfitting. GPSD leverages the parameters learned from a pretrained generative model to initialize a discriminative model, and subsequently applies a sparse parameter freezing strategy. Extensive experiments conducted on both industrial-scale and publicly available datasets demonstrate the superior performance of GPSD. Moreover, it delivers remarkable improvements in online A/B tests. GPSD offers two primary advantages: 1) it substantially narrows the generalization gap in model training, resulting in better test performance; and 2) it leverages the scalability of Transformers, delivering consistent performance gains as models are scaled up. Specifically, we observe consistent performance improvements as the model dense parameters scale from 13K to 0.3B, closely adhering to power laws. These findings pave the way for unifying the architectures of recommendation models and language models, enabling the direct application of techniques well-established in large language models to recommendation models. The code is available at https://github.com/chqiwang/gpsd-rec.
中文: GPSD框架通过生成式预训练和稀疏参数冻结策略解决判别式推荐模型中的过拟合问题,实现了模型规模扩展时的持续性能提升。
English: The GPSD framework addresses overfitting in discriminative recommendation models by leveraging generative pretraining and sparse parameter freezing, enabling scalable performance improvements across model sizes.
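In code, the transfer step amounts to loading generative weights into the discriminative model and freezing a subset of them. The prefix-based freezing rule below is a placeholder of ours, since the paper's sparse freezing strategy is its own design.

```python
import torch.nn as nn

def init_from_generative(disc_model: nn.Module, gen_state: dict,
                         freeze_prefixes=("embed", "encoder")):
    """Initialize a discriminative ranker from pretrained generative weights,
    then freeze a sparse subset of the transferred parameters.

    `freeze_prefixes` is an illustrative choice, not the paper's rule.
    """
    disc_model.load_state_dict(gen_state, strict=False)  # share what overlaps
    for name, p in disc_model.named_parameters():
        if name.startswith(freeze_prefixes):
            p.requires_grad = False  # frozen: keeps generative knowledge intact
```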
Authors:Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang
Abstract:
The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by noise. In this work, we propose $\gamma$-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, $\gamma$-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, $\gamma$-PO is a plug-and-play method, compatible with variants of DPO that rely on reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, $\gamma$-PO achieves an average 4.4\% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, $\gamma$-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at \href{https://github.com/sunjie279/gammaPO}{https://github.com/sunjie279/gammaPO}.
中文: 本文提出$\gamma$-PO算法,通过动态调整奖励边界来优先处理高置信度数据对并抑制噪声,在保持训练效率的同时将模型对齐性能平均提升4.4%,为LLM对齐提供了即插即用的强化方案。
English: This paper introduces $\gamma$-PO, a dynamic target margin preference optimization algorithm that enhances LLM alignment by calibrating reward margins to prioritize high-confidence pairs and suppress noise, achieving a 4.4% average improvement across benchmarks with minimal efficiency impact.
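A sketch of how an instance-level target margin can be dropped into the DPO objective. The specific calibration rule used here (margins scaled by a softmax over the batch's detached reward gaps) is our stand-in; the paper's calibration may differ.

```python
import torch
import torch.nn.functional as F

def gamma_po_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                  beta=0.1, gamma_scale=0.5):
    """DPO with instance-specific target margins (illustrative sketch).

    Inputs are summed log-probs of chosen/rejected responses under the
    policy and the frozen reference model, each of shape (batch,).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Give confident pairs (large observed gap) a larger target margin and
    # ambiguous, possibly noisy pairs a smaller one; mean gamma ~ gamma_scale.
    gamma = gamma_scale * margin.numel() * torch.softmax(margin.detach(), dim=0)
    return -F.logsigmoid(margin - gamma).mean()
```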
Authors:Zunhui Xia, Hongxing Li, Libin Lan
Abstract:
In the childbirth process, traditional methods involve invasive vaginal examinations, but research has shown that these methods are both subjective and inaccurate. Ultrasound-assisted diagnosis offers an objective yet effective way to assess fetal head position via two key parameters: Angle of Progression (AoP) and Head-Symphysis Distance (HSD), calculated by segmenting the fetal head (FH) and pubic symphysis (PS), which aids clinicians in ensuring a smooth delivery process. Therefore, accurate segmentation of FH and PS is crucial. In this work, we propose a sparse self-attention network architecture with good performance and high computational efficiency, named DSSAU-Net, for the segmentation of FH and PS. Specifically, we stack varying numbers of Dual Sparse Selection Attention (DSSA) blocks at each stage to form a symmetric U-shaped encoder-decoder network architecture. For a given query, DSSA is designed to explicitly perform one sparse token selection at both the region and pixel levels, respectively, which is beneficial for further reducing computational complexity while extracting the most relevant features. To compensate for the information loss during the upsampling process, skip connections with convolutions are designed. Additionally, multiscale feature fusion is employed to enrich the model's global and local information. The performance of DSSAU-Net has been validated using the Intrapartum Ultrasound Grand Challenge (IUGC) 2024 \textit{test set} provided by the organizer in the MICCAI IUGC 2024 competition\footnote{\href{https://codalab.lisn.upsaclay.fr/competitions/18413\#learn\_the\_details}{https://codalab.lisn.upsaclay.fr/competitions/18413\#learn\_the\_details}}, where we won fourth place in the classification and segmentation tasks, demonstrating its effectiveness. The code will be available at https://github.com/XiaZunhui/DSSAU-Net.
中文摘要:传统分娩检查中的侵入性阴道指检主观性强且不准确,因此本研究提出DSSAU-Net稀疏自注意力网络,通过高效分割胎儿头部与耻骨联合实现超声辅助的客观分娩评估,并在国际竞赛中验证了其有效性。
English Summary: Traditional invasive vaginal exams for childbirth are subjective and inaccurate, so this study introduces DSSAU-Net, a computationally efficient sparse self-attention network that accurately segments fetal head and pubic symphysis in ultrasound images to objectively assess delivery progress.
Authors:Erhang Zhang, Junyi Ma, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang
Abstract:
Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at https://github.com/IRMVLab/EgoLoc.
中文: EgoLoc方法提出了一种零样本时序交互定位技术,通过自适应视觉提示和闭环反馈,结合二维与三维观测数据,精准识别第一人称视频中的抓取动作时刻,在准确性和效率上均优于现有基准方法。
English: The proposed EgoLoc method introduces a zero-shot temporal interaction localization approach for egocentric videos, utilizing adaptive visual prompts and closed-loop feedback to accurately identify grasp action timings by integrating 2D and 3D observations, outperforming existing methods in efficiency and precision.
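The sampling intuition translates directly: contact and separation tend to coincide with low 3D hand speed, so the slowest frames make good initial guesses for VLM prompting. A simplified sketch is below (the closed-loop refinement is omitted, and the frame-selection rule is our simplification of the paper's sampler).

```python
import numpy as np

def sample_candidate_frames(hand_positions, fps, k=5):
    """Pick candidate contact/separation frames from 3D hand velocity.

    hand_positions: (T, 3) per-frame 3D hand locations. Frames with the
    lowest hand speed are sampled as initial guesses for VLM prompting.
    """
    vel = np.linalg.norm(np.diff(hand_positions, axis=0), axis=1) * fps  # (T-1,)
    return np.argsort(vel)[:k]  # indices of the k slowest frames
```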
Authors:Zeyu Gao, Junlin Zhou, Bolun Zhang, Yi He, Chao Zhang, Yuxin Cui, Hao Wang
Abstract:
The quantity and quality of vulnerability datasets are essential for developing deep learning solutions to vulnerability-related tasks. Due to the limited availability of vulnerabilities, a common approach to building such datasets is analyzing security patches in source code. However, existing security patches often suffer from inaccurate labels, insufficient contextual information, and undecidable patches that fail to clearly represent the root causes of vulnerabilities or their fixes. These issues introduce noise into the dataset, which can mislead detection models and undermine their effectiveness. To address these issues, we present mono, a novel LLM-powered framework that simulates human experts' reasoning process to construct reliable vulnerability datasets. mono introduces three key components to improve security patch datasets: (i) semantic-aware patch classification for precise vulnerability labeling, (ii) iterative contextual analysis for comprehensive code understanding, and (iii) systematic root cause analysis to identify and filter undecidable patches. Our comprehensive evaluation on the MegaVul benchmark demonstrates that mono can correct 31.0% of labeling errors, recover 89% of inter-procedural vulnerabilities, and reveals that 16.7% of CVEs contain undecidable patches. Furthermore, mono's enriched context representation improves existing models' vulnerability detection accuracy by 15%. We open source the framework mono and the dataset MonoLens in https://github.com/vul337/mono.
中文:mono框架利用LLM驱动的推理,通过改进补丁分类、上下文分析和过滤不可判定补丁来增强漏洞数据集,使检测模型的准确率提高了15%。
English: The mono framework leverages LLM-powered reasoning to enhance vulnerability datasets by improving patch classification, contextual analysis, and filtering undecidable patches, resulting in a 15% accuracy boost for detection models.
Authors:Yinfan Wang, Jie Gui, Baosheng Yu, Qi Li, Zhenan Sun, Juho Kannala, Guoying Zhao
Abstract:
A major challenge in finger vein recognition is the lack of large-scale public datasets. Existing datasets contain few identities and limited samples per finger, restricting the advancement of deep learning-based methods. To address this, we introduce FVeinSyn, a synthetic generator capable of producing diverse finger vein patterns with rich intra-class variations. Using FVeinSyn, we created FingerVeinSyn-5M -- the largest available finger vein dataset -- containing 5 million samples from 50,000 unique fingers, each with 100 variations including shift, rotation, scale, roll, varying exposure levels, skin scattering blur, optical blur, and motion blur. FingerVeinSyn-5M is also the first to offer fully annotated finger vein images, supporting deep learning applications in this field. Models pretrained on FingerVeinSyn-5M and fine-tuned with minimal real data achieve an average 53.91\% performance gain across multiple benchmarks. The dataset is publicly available at: https://github.com/EvanWang98/FingerVeinSyn-5M.
中文: 本研究推出了FVeinSyn合成生成器,创建了包含500万标注样本的最大指静脉数据集FingerVeinSyn-5M,通过少量真实数据微调后,使识别模型性能平均提升53.91%。
English: The study introduces FVeinSyn, a synthetic generator that created FingerVeinSyn-5M, the largest finger vein dataset with 5 million annotated samples, enabling a 53.91% performance gain in recognition models after fine-tuning with minimal real data.
Authors:Ayuto Tsutsumi, Yuu Jinnai
Abstract:
Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address the problem, evaluation of the cultural awareness of the LLMs and the methods to develop culturally aware LLMs have been investigated. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well. The code and dataset are available at https://github.com/CyberAgentAILab/YokaiEval.
中文: 大语言模型常局限于英语文化知识,因此本研究推出YokaiEval基准,通过809道日本妖怪民俗问题测试发现,使用日语资源训练的模型表现优于以英语为中心的模型。
English: Large Language Models often lack cultural knowledge beyond English-speaking communities, so this study introduces YokaiEval, a benchmark of 809 questions on Japanese Yokai folklore, revealing that models trained with Japanese resources outperform English-centric ones.
Authors:Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu
Abstract:
One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as $\textit{visual stitching}$ -- the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each $(\texttt{image}, \texttt{ID})$ pair into $\{(\texttt{patch}, \texttt{ID})\}$ pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text reference. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like ``safe'' or ``unsafe'', demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
中文总结:视觉语言模型通过视觉拼接能力,可从分散的良性图像碎片中重构有害内容,从而规避训练数据审核并在推理时生成危险回应。
English Summary: Vision-language models can reconstruct harmful content from seemingly safe image patches during training, exploiting their visual stitching ability to bypass data moderation and generate dangerous responses.
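The data manipulation at the heart of the attack is trivial to reproduce, which is part of the concern. The sketch below shows how a single (image, caption) pair becomes many benign-looking (patch, caption) pairs; the grid size is arbitrary.

```python
from PIL import Image

def split_into_patches(image_path, caption, grid=4):
    """Split one (image, caption) pair into grid*grid (patch, caption) pairs.

    Illustrates the attack surface studied in the paper: each patch looks
    benign in isolation, yet all carry the same text, so a VLM trained on
    them may stitch the full image back together.
    """
    img = Image.open(image_path)
    w, h = img.size
    pw, ph = w // grid, h // grid
    samples = []
    for i in range(grid):
        for j in range(grid):
            patch = img.crop((i * pw, j * ph, (i + 1) * pw, (j + 1) * ph))
            samples.append((patch, caption))  # same caption for every patch
    return samples
```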
Authors:Shaoshan Liu, Fan Wang, Hongjun Zhou, Yuanfeng Wang
Abstract:
While theory and practice are often seen as separate domains, this article shows that theoretical insight is essential for overcoming real-world engineering barriers. We begin with a practical challenge: training a cross-morphology embodied AI policy that generalizes across diverse robot morphologies. We formalize this as the Heterogeneous Embodied Agent Training (HEAT) problem and prove it reduces to a structured Partially Observable Markov Decision Process (POMDP) that is PSPACE-complete. This result explains why current reinforcement learning pipelines break down under morphological diversity, due to sequential training constraints, memory-policy coupling, and data incompatibility. We further explore Collective Adaptation, a distributed learning alternative inspired by biological systems. Though NEXP-complete in theory, it offers meaningful scalability and deployment benefits in practice. This work illustrates how computational theory can illuminate system design trade-offs and guide the development of more robust, scalable embodied AI. For practitioners and researchers to explore this problem, the implementation code of this work has been made publicly available at https://github.com/airs-admin/HEAT.
中文: 本文通过将异构具身智能体训练问题形式化并探索分布式学习方案,论证了理论洞察对于解决现实工程挑战的重要性,尤其体现在训练跨形态具身人工智能策略方面。
English: This article demonstrates that theoretical insights are essential for overcoming real-world engineering challenges, particularly in training cross-morphology embodied AI policies, by formalizing the Heterogeneous Embodied Agent Training (HEAT) problem and exploring distributed learning alternatives like Collective Adaptation.
Authors:Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho
Abstract:
Large Language Model (LLM) agents are reshaping the game industry, particularly with more intelligent and human-preferable game characters. However, existing game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets for aligning pre-trained LLMs into gaming agents. To fill these gaps, we present \textbf{Orak}, a foundational benchmark designed to train and evaluate LLM agents across diverse real-world video games. Unlike existing benchmarks, Orak includes 12 popular video games spanning all major genres, enabling comprehensive studies of LLM capabilities and agentic modules essential for intricate game scenarios. To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on Model Context Protocol (MCP) that enables LLMs to seamlessly connect with games and manipulate agentic modules. Additionally, we propose a fine-tuning dataset, consisting of LLM gameplay trajectories across diverse game genres. Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects, establishing a foundation towards building generic gaming agents. Code is available at https://github.com/krafton-ai/Orak.
大型语言模型智能体正通过更智能的角色重塑游戏产业,然而现有基准在评估多样化能力、智能模块及微调数据方面存在不足,为此推出Orak基准——涵盖12款游戏、即插即用接口和微调数据集,以全面评估并推动游戏智能体发展。
Large Language Model agents are revolutionizing the game industry by enabling smarter characters, yet current benchmarks fail to assess diverse capabilities, agentic modules, and lack fine-tuning data, prompting the introduction of Orak—a comprehensive benchmark with 12 games, a plug-and-play interface, and a fine-tuning dataset to evaluate and advance gaming agents.
Authors:Hiroki Shiraishi, Yohei Hayamizu, Tomonori Hashiyama, Keiki Takadama, Hisao Ishibuchi, Masaya Nakata
Abstract:
Rule representations significantly influence the search capabilities and decision boundaries within the search space of Learning Classifier Systems (LCSs), a family of rule-based machine learning systems that evolve interpretable models through evolutionary processes. However, it is very difficult to choose an appropriate rule representation for each problem. Additionally, some problems benefit from using different representations for different subspaces within the input space. Thus, an adaptive mechanism is needed to choose an appropriate rule representation for each rule in LCSs. This article introduces a flexible rule representation using a four-parameter beta distribution and integrates it into a fuzzy-style LCS. The four-parameter beta distribution can form various function shapes, and this flexibility enables our LCS to automatically select appropriate representations for different subspaces. Our rule representation can form crisp/fuzzy decision boundaries in various shapes, such as rectangles and bells, by controlling its four parameters, in contrast to standard representations such as trapezoidal ones. Leveraging this flexibility, our LCS is designed to adapt the appropriate rule representation for each subspace. Moreover, our LCS incorporates a generalization bias favoring crisp rules where feasible, enhancing model interpretability without compromising accuracy. Experimental results on real-world classification tasks show that our LCS achieves significantly superior test accuracy and produces more compact rule sets. Our implementation is available at https://github.com/YNU-NakataLab/Beta4-UCS. An extended abstract related to this work is available at https://doi.org/10.36227/techrxiv.174900805.59801248/v1.
中文: 本研究在分类器学习系统中引入了一种基于四参数贝塔分布的灵活规则表示方法,能够自动适应不同子空间的规则表示需求,从而以更紧凑、可解释的规则集实现更高的分类准确率。
English: This study introduces a flexible rule representation using a four-parameter beta distribution in Learning Classifier Systems, enabling automatic adaptation of rule representations for different subspaces and achieving higher accuracy with more compact, interpretable rule sets.
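As a concrete illustration of how a four-parameter beta distribution can express both crisp and fuzzy boundaries, here is a minimal sketch; rescaling the density so its peak equals 1 is an illustrative choice for obtaining membership grades in [0, 1], not necessarily the paper's exact formulation.

```python
import numpy as np
from scipy.stats import beta

def beta4_membership(xs: np.ndarray, p: float, q: float, a: float, b: float):
    """Grades on a grid: beta(p, q) pdf over support [a, b], peak-scaled to 1."""
    pdf = beta.pdf(xs, p, q, loc=a, scale=b - a)
    peak = pdf.max()
    return pdf / peak if peak > 0 else pdf

xs = np.linspace(0.0, 1.0, 101)
# p = q = 1 gives a uniform density: a rectangle, i.e. a crisp interval rule.
crisp = beta4_membership(xs, 1.0, 1.0, 0.2, 0.6)
# Larger shape parameters give a bell: a fuzzy boundary over the same support.
bell = beta4_membership(xs, 4.0, 4.0, 0.2, 0.6)
print(crisp[20:61].min(), xs[bell.argmax()])  # 1.0 inside [0.2, 0.6]; mode 0.4
```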
Authors:Feng Han, Yang Jiao, Shaoxiang Chen, Junhao Xu, Jingjing Chen, Yu-Gang Jiang
Abstract:
The field of controllable image generation has seen significant advancements, with various architectures improving the layout consistency of generated images with control signals. However, contemporary methods still face challenges in bridging the semantic gap between input text prompts with sparse semantics and the target images, often over-relying on low-level control signals to infer regional details. To address this challenge, we propose ControlThinker, a novel framework that employs a "comprehend-then-generate" paradigm. Firstly, by incentivizing the visual reasoning capability of an MLLM, latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then seamlessly aids in image generation without the need for additional complex modifications. To further tackle the uncertainty arising from the ambiguity of control images, we encourage broader exploration of reasoning trajectories and select the optimal one using a metric-based output reward model (ORM). Extensive experimental results demonstrate that ControlThinker effectively mitigates the semantic gap between raw text prompts and target images, resulting in improved visual quality and semantic consistency across a wide range of benchmarks. The code and models are available at https://github.com/Maplebb/ControlThinker.
中文摘要:ControlThinker是一种创新框架,通过利用视觉推理MLLM从控制图像中挖掘潜在语义来丰富文本提示,有效缩小语义差距,从而提升图像生成的视觉质量和语义一致性。
English Summary: ControlThinker is a novel framework that enhances controllable image generation by using a visual reasoning MLLM to enrich text prompts with latent semantics from control images, thereby reducing the semantic gap and improving visual quality and consistency.
Authors:Shengjie Lin, Jiading Fang, Muhammad Zubair Irshad, Vitor Campagnolo Guizilini, Rares Andrei Ambrus, Greg Shakhnarovich, Matthew R. Walter
Abstract:
Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SplArt, a self-supervised, category-agnostic framework that leverages 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SplArt augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SplArt exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SplArt's state-of-the-art performance and real-world practicality. Code is publicly available at https://github.com/ripl/splart.
Chinese: SplArt是一种自监督框架,通过位姿RGB图像重建铰接物体并推断运动学,无需3D监督即可实现实时逼真渲染。
English: SplArt is a self-supervised framework that reconstructs articulated objects and infers kinematics from posed RGB images, enabling real-time photorealistic rendering without 3D supervision.
Authors:Viktor Hangya, Fabian Küch, Darina Gold
Abstract:
Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35x average reduction in evaluation time. Our project is available at: https://github.com/Fraunhofer-IIS/EvalShortcut.
Chinese: 本研究通过将生成式任务转化为成本更低的选择题形式,显著降低了大型语言模型评估的计算负担,在四项关键能力上实现了强性能相关性,并使评估速度平均提升超过35倍。
English: This study introduces a method to reduce the computational cost of evaluating large language models by converting generative tasks into cheaper multiple-choice formats, achieving strong performance correlation and over 35x faster evaluation across four key capabilities.
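The core reformulation is to rank fixed answer choices by model likelihood instead of decoding an answer token by token. Below is a minimal sketch of that idea using GPT-2 through Hugging Face transformers as a stand-in model; the prompt and answer choices are illustrative, and this is not the paper's exact pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    choice_ids = tok(choice, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, choice_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts ids[0, 1:]
    targets = ids[0, 1:]
    token_logps = logp[torch.arange(targets.numel()), targets]
    # Keep only positions where choice tokens are being predicted.
    return token_logps[prompt_ids.shape[1] - 1:].sum().item()

question = "Q: What is 2 + 3?\nA:"
choices = [" 4", " 5", " 6"]
# One forward pass per choice, no token-by-token decoding loop.
print(max(choices, key=lambda c: choice_logprob(question, c)))
```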
Authors:Hiroki Shiraishi, Hisao Ishibuchi, Masaya Nakata
Abstract:
The decision-making process significantly influences the predictions of machine learning models. This is especially important in rule-based systems such as Learning Fuzzy-Classifier Systems (LFCSs) where the selection and application of rules directly determine prediction accuracy and reliability. LFCSs combine evolutionary algorithms with supervised learning to optimize fuzzy classification rules, offering enhanced interpretability and robustness. Despite these advantages, research on improving decision-making mechanisms (i.e., class inference schemes) in LFCSs remains limited. Most LFCSs use voting-based or single-winner-based inference schemes. These schemes rely on classification performance on training data and may not perform well on unseen data, risking overfitting. To address these limitations, this article introduces a novel class inference scheme for LFCSs based on the Dempster-Shafer Theory of Evidence (DS theory). The proposed scheme handles uncertainty well. By using the DS theory, the scheme calculates belief masses (i.e., measures of belief) for each specific class and the ``I don't know'' state from each fuzzy rule and infers a class from these belief masses. Unlike the conventional schemes, the proposed scheme also considers the ``I don't know'' state that reflects uncertainty, thereby improving the transparency and reliability of LFCSs. Applied to a variant of LFCS (i.e., Fuzzy-UCS), the proposed scheme demonstrates statistically significant improvements in terms of test macro F1 scores across 30 real-world datasets compared to conventional voting-based and single-winner-based fuzzy inference schemes. It forms smoother decision boundaries, provides reliable confidence measures, and enhances the robustness and generalizability of LFCSs in real-world applications. Our implementation is available at https://github.com/YNU-NakataLab/jUCS.
中文: 本文基于Dempster-Shafer理论提出了一种新的学习模糊分类系统类别推理方案,通过引入"不确定"状态改进了对不确定性的处理,提高了系统的透明度和可靠性,并在多个数据集上验证了其优越性能。
English: This article introduces a novel class inference scheme for Learning Fuzzy-Classifier Systems based on Dempster-Shafer Theory, which improves uncertainty handling, transparency, and reliability by incorporating an "I don't know" state and demonstrates enhanced performance across multiple datasets.
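A minimal sketch of the inference idea follows: each rule contributes a mass over the classes plus an explicit "I don't know" state, and masses are fused with Dempster's rule of combination. The toy mass values are illustrative; the paper derives them from rule quality and membership grades.

```python
import numpy as np

def dempster_combine(m1: np.ndarray, m2: np.ndarray) -> np.ndarray:
    """Combine two mass vectors [m(c_0), ..., m(c_{K-1}), m(unknown)]."""
    k = len(m1) - 1                      # number of singleton classes
    out = np.zeros_like(m1)
    conflict = 0.0
    for i in range(k):
        for j in range(k):
            if i == j:
                out[i] += m1[i] * m2[j]          # agreeing singletons
            else:
                conflict += m1[i] * m2[j]        # contradictory singletons
        out[i] += m1[i] * m2[k] + m1[k] * m2[i]  # singleton meets "unknown"
    out[k] = m1[k] * m2[k]               # both rules say "I don't know"
    return out / (1.0 - conflict)        # Dempster's normalization

rule_a = np.array([0.7, 0.1, 0.2])  # leans class 0, some uncertainty
rule_b = np.array([0.5, 0.2, 0.3])
combined = dempster_combine(rule_a, rule_b)
pred = int(np.argmax(combined[:-1]))
print(combined, "->", pred)          # last entry is residual uncertainty
```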
Authors:Zhigang Yang, Huiguang Yao, Linmao Tian, Xuezhi Zhao, Qiang Li, Qi Wang
Abstract:
Referring Remote Sensing Image Segmentation (RRSIS) is a complex and challenging task that integrates the paradigms of computer vision and natural language processing. Existing datasets for RRSIS suffer from critical limitations in resolution, scene diversity, and category coverage, which hinder the generalization and real-world applicability of referring segmentation models. To facilitate the development of this field, we introduce NWPU-Refer, the largest and most diverse RRSIS dataset to date, comprising 15,003 high-resolution images (1024-2048px) spanning 30+ countries with 49,745 annotated targets supporting single-object, multi-object, and non-object segmentation scenarios. Additionally, we propose the Multi-scale Referring Segmentation Network (MRSNet), a novel framework tailored for the unique demands of RRSIS. MRSNet introduces two key innovations: (1) an Intra-scale Feature Interaction Module (IFIM) that captures fine-grained details within each encoder stage, and (2) a Hierarchical Feature Interaction Module (HFIM) to enable seamless cross-scale feature fusion, preserving spatial integrity while enhancing discriminative power. Extensive experiments conducted on the proposed NWPU-Refer dataset demonstrate that MRSNet achieves state-of-the-art performance across multiple evaluation metrics, validating its effectiveness. The dataset and code are publicly available at https://github.com/CVer-Yang/NWPU-Refer.
中文摘要:本文提出了目前最大最全面的遥感指代分割数据集NWPU-Refer,并设计了具有跨尺度特征交互能力的MRSNet网络,在该数据集上取得了最优性能。
English Summary: The NWPU-Refer dataset is introduced as the largest and most diverse referring remote sensing image segmentation dataset to date, while the proposed MRSNet framework with innovative feature interaction modules achieves state-of-the-art performance on this benchmark.
Authors:Rui Yann, Tianshuo Zhang, Xianglei Xing
Abstract:
We present SemiOccam, an image recognition network that leverages semi-supervised learning in a highly efficient manner. Existing works often rely on complex training techniques and architectures, requiring hundreds of GPU hours for training, while their generalization ability with extremely limited labeled data remains to be improved. To address these limitations, we construct a hierarchical mixture density classification mechanism by optimizing mutual information between feature representations and target classes, compressing redundant information while retaining crucial discriminative components. Experimental results demonstrate that our method achieves state-of-the-art performance on three commonly used datasets, with accuracy exceeding 95% on two of them using only 4 labeled samples per class, and its simple architecture keeps training time at the minute level. Notably, this paper reveals a long-overlooked data leakage issue in the STL-10 dataset for semi-supervised learning and removes duplicates to ensure reliable experimental results. We release the deduplicated CleanSTL-10 dataset to facilitate fair and reproducible research. Code available at https://github.com/Shu1L0n9/SemiOccam.
中文: SemiOccam提出了一种高效的半监督图像识别网络,仅用每类四个标注样本就在两个数据集上实现超过95%的准确率,同时揭露并修正了STL-10数据集的数据泄露问题以确保结果可靠性。
English: SemiOccam introduces an efficient semi-supervised image recognition network that achieves over 95% accuracy on two datasets with just four labeled samples per class, while exposing and rectifying data leakage in STL-10 to ensure reliable results.
Authors:Core Team, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia
Abstract:
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
中文:MiMo-VL-7B模型在视觉语言任务中实现顶尖性能,通过四阶段预训练和混合强化学习方法在多项基准测试中超越竞争对手,并公开提供了完整模型和评估套件。
English: The MiMo-VL-7B models achieve state-of-the-art performance in vision-language tasks, outperforming competitors across multiple benchmarks through a four-stage pre-training and mixed reinforcement learning approach, with full model and evaluation suite publicly released.
Authors:Li Zeqiao, Wang Yijing, Wang Haoyu, Li Zheng, Li Peng, Zuo Zhiqiang, Hu Chuan
Abstract:
Autonomous driving promises significant advancements in mobility, road safety and traffic efficiency, yet reinforcement learning and imitation learning face safe-exploration and distribution-shift challenges. Although human-AI collaboration alleviates these issues, it often relies heavily on extensive human intervention, which increases costs and reduces efficiency. This paper develops a confidence-guided human-AI collaboration (C-HAC) strategy to overcome these limitations. First, C-HAC employs a distributional proxy value propagation method within the distributional soft actor-critic (DSAC) framework. By leveraging return distributions to represent human intentions, C-HAC achieves rapid and stable learning of human-guided policies with minimal human interaction. Subsequently, a shared control mechanism is activated to integrate the learned human-guided policy with a self-learning policy that maximizes cumulative rewards. This enables the agent to explore independently and continuously enhance its performance beyond human guidance. Finally, a policy confidence evaluation algorithm capitalizes on DSAC's return distribution networks to facilitate dynamic switching between human-guided and self-learning policies via a confidence-based intervention function. This ensures the agent can pursue optimal policies while maintaining safety and performance guarantees. Extensive experiments across diverse driving scenarios reveal that C-HAC significantly outperforms conventional methods in terms of safety, efficiency, and overall performance, achieving state-of-the-art results. The effectiveness of the proposed method is further validated through real-world road tests in complex traffic conditions. The videos and code are available at: https://github.com/lzqw/C-HAC.
Chinese Summary: 本文提出了一种置信度引导的人机协作策略(C-HAC),通过动态整合人类指导与自主学习来提升自动驾驶的安全性和性能,在多种场景下均取得了最优效果。
English Summary: This paper introduces a confidence-guided human-AI collaboration (C-HAC) strategy that dynamically integrates human guidance with autonomous learning to enhance safety and performance in autonomous driving, achieving state-of-the-art results across various scenarios.
Authors:Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang
Abstract:
Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experiment results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
中文摘要:提出的位置专家(PosS)方法通过为不同位置分配专门的草稿层,有效减少错误累积,显著提高了大语言模型推理中的令牌接受率和加速比。
English Summary: The proposed Position Specialists (PosS) method enhances speculative decoding by using specialized draft layers for different token positions, effectively reducing error accumulation and improving acceptance rates and inference speed in large language models.
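The sketch below illustrates the position-specialist idea: one small head per draft position, so later heads can be trained to tolerate larger feature drift. The shapes and the feature-advance step are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PositionSpecialists(nn.Module):
    def __init__(self, hidden: int, vocab: int, n_positions: int = 4):
        super().__init__()
        # One specialist head per draft position.
        self.specialists = nn.ModuleList(
            nn.Linear(hidden, vocab) for _ in range(n_positions))
        self.feature_step = nn.Linear(hidden, hidden)  # stand-in drafter

    def draft(self, h: torch.Tensor) -> torch.Tensor:
        """h: (batch, hidden) target-model feature; returns draft token ids."""
        tokens = []
        for head in self.specialists:             # position i -> specialist i
            tokens.append(head(h).argmax(-1))     # (batch,)
            h = torch.tanh(self.feature_step(h))  # advance the draft feature
        return torch.stack(tokens, dim=1)         # (batch, n_positions)

model = PositionSpecialists(hidden=64, vocab=1000)
print(model.draft(torch.randn(2, 64)).shape)      # torch.Size([2, 4])
```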
Authors:Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui
Abstract:
Multimodal image fusion effectively aggregates information from diverse modalities, with fused images playing a crucial role in vision systems. However, existing methods often neglect frequency-domain feature exploration and interactive relationships. In this paper, we propose wavelet-aware Intra-inter Frequency Enhancement Fusion (WIFE-Fusion), a multimodal image fusion framework based on frequency-domain components interactions. Its core innovations include: Intra-Frequency Self-Attention (IFSA) that leverages inherent cross-modal correlations and complementarity through interactive self-attention mechanisms to extract enriched frequency-domain features, and Inter-Frequency Interaction (IFI) that enhances enriched features and filters latent features via combinatorial interactions between heterogeneous frequency-domain components across modalities. These processes achieve precise source feature extraction and unified modeling of feature extraction-aggregation. Extensive experiments on five datasets across three multimodal fusion tasks demonstrate WIFE-Fusion's superiority over current specialized and unified fusion methods. Our code is available at https://github.com/Lmmh058/WIFE-Fusion.
中文: 本文提出WIFE-Fusion多模态图像融合框架,通过频率内自注意力和频率间交互机制增强频域特征,在多个数据集和融合任务中展现出优越性能。
English: This paper introduces WIFE-Fusion, a multimodal image fusion framework that enhances frequency-domain features through Intra-Frequency Self-Attention and Inter-Frequency Interaction, demonstrating superior performance across multiple datasets and fusion tasks.
Authors:Chenglong Ye, Gang Xiong, Junyou Shang, Xingyuan Dai, Xiaoyan Gong, Yisheng Lv
Abstract:
Traffic simulation tools, such as SUMO, are essential for urban mobility research. However, such tools remain challenging for users due to complex manual workflows involving network download, demand generation, simulation setup, and result analysis. In this paper, we introduce SUMO-MCP, a novel platform that not only wraps SUMO's core utilities into a unified tool suite but also provides additional auxiliary utilities for common preprocessing and postprocessing tasks. Using SUMO-MCP, users can issue simple natural-language prompts to generate traffic scenarios from OpenStreetMap data, create demand from origin-destination matrices or random patterns, run batch simulations with multiple signal-control strategies, perform comparative analyses with automated reporting, and detect congestion for signal-timing optimization. Furthermore, the platform allows flexible custom workflows by dynamically combining exposed SUMO tools without additional coding. Experiments demonstrate that SUMO-MCP makes traffic simulation significantly more accessible and reliable for researchers. We will release code for SUMO-MCP at https://github.com/ycycycl/SUMO-MCP in the future.
中文: 本文介绍了SUMO-MCP平台,它通过集成SUMO核心工具并支持自然语言处理来自动生成和分析交通场景,显著提高了交通仿真的易用性和可靠性。
English: The paper introduces SUMO-MCP, a platform that simplifies traffic simulation by integrating SUMO's utilities with natural-language processing for automated scenario generation and analysis, making it more accessible and reliable for researchers.
Authors:Yongxiang Tang, Yanhua Cheng, Xiaocheng Liu, Chenchen Jiao, Yanxiang Zeng, Ning Luo, Pengjia Yuan, Xialong Liu, Peng Jiang
Abstract:
In many machine learning tasks, it is often necessary for the relationship between input and output variables to be monotonic, including both strictly monotonic and implicitly monotonic relationships. Traditional methods for maintaining monotonicity mainly rely on construction or regularization techniques, whereas this paper shows that the issue of strict monotonic probability can be viewed as a partial order between an observable revenue variable and a latent cost variable. This perspective enables us to reformulate the monotonicity challenge into modeling the latent cost variable. To tackle this, we introduce a generative network for the latent cost variable, termed the Generative Cost Model (GCM), which inherently addresses the strict monotonic problem, and propose the Implicit Generative Cost Model (IGCM) to address the implicit monotonic problem. We further validate our approach with a numerical simulation of quantile regression and conduct multiple experiments on public datasets, showing that our method significantly outperforms existing monotonic modeling techniques. The code for our experiments can be found at https://github.com/tyxaaron/GCM.
中文摘要:本文通过将严格单调概率问题重构为可观测收益变量与潜在成本变量间的偏序关系,提出了生成式成本模型(GCM与IGCM),在保持严格单调和隐式单调关系方面显著优于现有方法。
English Summary: This paper reframes strict monotonic probability as a partial order between observable revenue and latent cost variables, introducing Generative Cost Models (GCM and IGCM) that outperform existing methods in maintaining both strict and implicit monotonic relationships.
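The monotonicity argument can be made concrete: if $P(y=1 \mid x) = P(c \le r(x))$ for a latent cost $c$ drawn independently of the revenue $r(x)$, the predicted probability is non-decreasing in $r(x)$ by construction. A minimal Monte Carlo sketch follows; the fixed log-normal cost sampler below stands in for the paper's learned generative cost network.

```python
import torch

def monotonic_prob(revenue: torch.Tensor, cost_sampler, n_samples: int = 1024):
    """Monte Carlo estimate of P(cost <= revenue) for each revenue value."""
    costs = cost_sampler(n_samples)                 # (n_samples,)
    # Compare every sampled cost to every revenue value, then average.
    return (costs[None, :] <= revenue[:, None]).float().mean(dim=1)

# Toy stand-in for a generative cost model: a fixed log-normal sampler.
sampler = lambda n: torch.randn(n).exp()
r = torch.tensor([0.5, 1.0, 2.0, 4.0])
print(monotonic_prob(r, sampler))  # non-decreasing in r, as required
```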
Authors:Yunyao Zhang, Zikai Song, Hang Zhou, Wenfeng Ren, Yi-Ping Phoebe Chen, Junqing Yu, Wei Yang
Abstract:
Social network simulation is developed to provide a comprehensive understanding of social networks in the real world, which can be leveraged for a wide range of applications such as group behavior emergence, policy optimization, and business strategy development. However, billions of individuals and their evolving interactions involved in social networks pose challenges in accurately reflecting real-world complexities. In this study, we propose a comprehensive Social Network Simulation System (GA-S3) that leverages newly designed Group Agents to make intelligent decisions regarding various online events. Unlike other intelligent agents that represent an individual entity, our group agents model a collection of individuals exhibiting similar behaviors, facilitating the simulation of large-scale network phenomena with complex interactions at a manageable computational cost. Additionally, we have constructed a social network benchmark from 2024 popular online events that contains fine-grained information on Internet traffic variations. Experiments demonstrate that our approach achieves accurate and highly realistic prediction results. Code is available at https://github.com/AI4SS/GAS-3.
中文: 本研究提出的GA-S3系统采用创新的群体智能体,能高效模拟大规模社交网络,在控制计算成本的同时实现了高度逼真的预测效果。
English: This study introduces the GA-S3 system, which uses innovative Group Agents to simulate large-scale social networks efficiently, achieving highly realistic predictions while managing computational costs.
Authors:Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He
Abstract:
Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14\% and MER for missing text reconstruction by at least 10\% compared to baselines. Code is released at: https://github.com/Guanzhou-Ke/AFM2.
中文: 本研究提出了一种智能代理框架,通过动态挖掘丰富语义特征并采用自我优化机制,显著提升了缺失模态重建的精度和适应性,优于现有基础模型。
English: This study introduces an agentic framework that enhances missing modality reconstruction by dynamically mining rich semantic features and employing self-refinement, significantly improving accuracy and adaptability over existing foundation models.
Authors:Chong Li, Jiajun Zhang, Chengqing Zong
Abstract:
Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, the inefficiency of the tokenizer will slow down the training and generation of LLMs. The mismatch in vocabulary also hinders deep knowledge transfer between LLMs like token-level distillation. To mitigate this gap, we propose an efficient method named TokAlign to replace the vocabulary of LLM from the token co-occurrences view, and further transfer the token-level knowledge between models. It first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix for token IDs. Model parameters, including embeddings, are rearranged and progressively fine-tuned for the new vocabulary. Our method significantly improves multilingual text compression rates and vocabulary initialization for LLMs, decreasing the perplexity from $3.4\times10^{2}$ for strong baseline methods to $1.2\times10^{2}$ after initialization. Experimental results on models across multiple parameter scales demonstrate the effectiveness and generalization of TokAlign, which costs as few as 5k steps to restore the performance of the vanilla model. After unifying vocabularies between LLMs, token-level distillation can remarkably boost (+4.4% over sentence-level distillation) the base model, costing only 235M tokens.
中文摘要:TokAlign是一种通过词汇对齐和参数重组来优化大型语言模型词汇表的高效方法,显著提升了文本压缩率并促进了模型间的知识迁移。
English Summary: TokAlign is an efficient method that aligns vocabularies between Large Language Models by learning token mappings and rearranging parameters, significantly improving text compression and enabling effective token-level knowledge transfer.
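A minimal sketch of the alignment step: represent each token by co-occurrence statistics in a shared feature space and solve for a one-to-one mapping. Using cosine similarity with the Hungarian algorithm is an illustrative choice, not necessarily TokAlign's exact procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_vocabularies(src_cooc: np.ndarray, tgt_cooc: np.ndarray) -> dict:
    """Return a tgt_id -> src_id mapping maximizing cosine similarity.

    src_cooc: (V_src, D) co-occurrence vectors for source-vocab tokens.
    tgt_cooc: (V_tgt, D) co-occurrence vectors for target-vocab tokens,
              with V_tgt <= V_src, in a shared feature space of size D.
    """
    def normalize(m):
        return m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-8)
    sim = normalize(tgt_cooc) @ normalize(src_cooc).T   # (V_tgt, V_src)
    rows, cols = linear_sum_assignment(-sim)            # one-to-one, maximize
    return dict(zip(rows.tolist(), cols.tolist()))

# Toy example: 4 target tokens aligned into a 6-token source vocabulary.
rng = np.random.default_rng(0)
src = rng.random((6, 16))
tgt = src[[3, 0, 5, 1]] + 0.01 * rng.random((4, 16))    # noisy copies
print(align_vocabularies(src, tgt))                     # {0: 3, 1: 0, 2: 5, 3: 1}
```

Given such a mapping, source-model embedding rows can be rearranged to initialize the target vocabulary before progressive fine-tuning.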
Authors:Daniel Campa, Mehdi Saeedi, Ian Colbert, Srinjoy Das
Abstract:
Navigation path traces play a crucial role in video game design, serving as a vital resource for both enhancing player engagement and fine-tuning non-playable character behavior. Generating such paths with human-like realism can enrich the overall gaming experience, and evaluating path traces can provide game designers with insights into player interactions. Despite the impressive recent advancements in deep learning-based generative modeling, the video game industry hesitates to adopt such models for path generation, often citing their complex training requirements and interpretability challenges. To address these problems, we propose a novel path generation and evaluation approach that is grounded in principled nonparametric statistics and provides precise control while offering interpretable insights. Our path generation method fuses two statistical techniques: (1) nonparametric model-free transformations that capture statistical characteristics of path traces through time; and (2) copula models that capture statistical dependencies in space. For path evaluation, we adapt a nonparametric three-sample hypothesis test designed to determine if the generated paths are overfit (mimicking the original data too closely) or underfit (diverging too far from it). We demonstrate the precision and reliability of our proposed methods with empirical analysis on two existing gaming benchmarks to showcase controlled generation of diverse navigation paths. Notably, our novel path generator can be fine-tuned with user-controllable parameters to create navigation paths that exhibit varying levels of human-likeness in contrast to those produced by neural network-based agents. The code is available at https://github.com/daniel-campa/mf-copula.
中文: 作者提出了一种基于非参数统计和Copula模型的新型路径生成与评估方法,能够为视频游戏生成拟人化导航路径,在解决深度学习模型局限性的同时提供了可解释性及对路径真实感的精确控制。
English: The authors introduce a novel path generation and evaluation method using nonparametric statistics and copula models to produce human-like navigation paths for video games, offering interpretability and control over path realism while addressing the limitations of deep learning approaches.
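To illustrate the copula component, the sketch below fits a Gaussian copula to observed 2-D path increments and samples new increments with the same marginals and spatial dependence; the Gaussian copula and empirical marginals are illustrative simplifications of the paper's nonparametric pipeline.

```python
import numpy as np
from scipy import stats

def fit_and_sample_steps(steps: np.ndarray, n: int, seed: int = 0):
    """steps: (N, 2) observed (dx, dy) increments; returns (n, 2) new steps."""
    rng = np.random.default_rng(seed)
    big_n, dims = steps.shape
    # 1. Probability integral transform per marginal:
    #    empirical ranks -> uniforms -> standard normals.
    u = np.column_stack([
        (stats.rankdata(steps[:, d]) - 0.5) / big_n for d in range(dims)
    ])
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)          # Gaussian copula parameter
    # 2. Sample correlated normals, map back through empirical quantiles.
    z_new = rng.multivariate_normal(np.zeros(dims), corr, size=n)
    u_new = stats.norm.cdf(z_new)
    return np.column_stack([
        np.quantile(steps[:, d], u_new[:, d]) for d in range(dims)
    ])

walk = np.cumsum(np.random.default_rng(1).normal(size=(500, 2)), axis=0)
new_steps = fit_and_sample_steps(np.diff(walk, axis=0), n=200)
new_path = np.cumsum(new_steps, axis=0)          # a synthetic path trace
```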
Authors:Yuchen Guo, Zhicheng Dou, Huy H. Nguyen, Ching-Chun Chang, Saku Sugawara, Isao Echizen
Abstract:
Content creation has dramatically progressed with the rapid advancement of large language models like ChatGPT and Claude. While this progress has greatly enhanced various aspects of life and work, it has also negatively affected certain areas of society. A recent survey revealed that nearly 30% of college students use generative AI to help write academic papers and reports. Most countermeasures treat the detection of AI-generated text as a binary classification task and thus lack robustness. This approach overlooks human involvement in the generation of content, even though human-machine collaboration is becoming mainstream. Besides generating entire texts, people may use machines to complete or revise texts. Such human involvement varies case by case, which makes binary classification a less than satisfactory approach. We refer to this situation as participation detection obfuscation. We propose using BERTScore as a metric to measure human involvement in the generation process and a multi-task RoBERTa-based regressor trained on a token classification task to address this problem. To evaluate the effectiveness of this approach, we simulated academic-based scenarios and created a continuous dataset reflecting various levels of human involvement. All of the existing detectors we examined failed to detect the level of human involvement on this dataset. Our method, however, succeeded (F1 score of 0.9423 and a regressor mean squared error of 0.004). Moreover, it demonstrated some generalizability across generative models. Our code is available at https://github.com/gyc-nii/CAS-CS-and-dual-head-detector.
中文: 随着ChatGPT等大型语言模型的快速发展,AI在学术写作中的应用日益普遍,但现有的二元检测方法无法有效评估不同程度的人机协作,因此我们提出了一种基于BERTScore的新方法,能够准确测量人类参与度并取得显著成效。
English: The rapid advancement of large language models like ChatGPT has led to widespread use of AI in academic writing, but current binary detection methods fail to account for varying levels of human-machine collaboration, prompting the development of a novel BERTScore-based approach that successfully measures human involvement with high accuracy.
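A minimal sketch of the measurement idea, assuming access to a machine-generated counterpart of the final text: BERTScore similarity between the two serves as a proxy for how much of the text remains machine-authored. The example texts are illustrative; the paper pairs this signal with a multi-task RoBERTa regressor.

```python
from bert_score import score  # pip install bert-score; downloads a model on first use

machine_draft = ["The experiment demonstrates that the model performs well."]
final_text = ["Our experiments show the model performs well, though "
              "results degrade sharply on out-of-domain inputs."]

# F1 close to 1.0 -> mostly machine text; lower -> heavier human revision.
precision, recall, f1 = score(final_text, machine_draft, lang="en")
print(f"BERTScore F1: {f1.item():.3f}")
```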
Authors:Yuxuan Han, Junfeng Lyu, Kuan Sheng, Minghao Que, Qixuan Zhang, Lan Xu, Feng Xu
Abstract:
Existing facial appearance capture methods can reconstruct plausible facial reflectance from smartphone-recorded videos. However, the reconstruction quality is still far behind the ones based on studio recordings. This paper fills the gap by developing a novel daily-used solution with a co-located smartphone and flashlight video capture setting in a dim room. To enhance the quality, our key observation is to solve facial reflectance maps within the data distribution of studio-scanned ones. Specifically, we first learn a diffusion prior over the Light Stage scans and then steer it to produce the reflectance map that best matches the captured images. We propose to train the diffusion prior at the patch level to improve generalization ability and training stability, as current Light Stage datasets are in ultra-high resolution but limited in data size. Tailored to this prior, we propose a patch-level posterior sampling technique to sample seamless full-resolution reflectance maps from this patch-level diffusion model. Experiments demonstrate our method closes the quality gap between low-cost and studio recordings by a large margin, opening the door for everyday users to clone themselves to the digital world. Our code will be released at https://github.com/yxuhan/DoRA.
中文摘要:本文提出一种新方法,通过利用基于工作室扫描数据训练的扩散先验和分块采样技术,显著提升了智能手机视频的面部外观捕捉质量,大幅缩小了低成本设备与专业工作室录制之间的质量差距。
English Summary: This paper introduces a novel method that significantly improves facial appearance capture quality from smartphone videos by leveraging a diffusion prior trained on studio-scanned data and a patch-level sampling technique, bridging the gap between low-cost and professional studio recordings.
Authors:Xinru Ying, Jiaqi Mo, Jingyang Lin, Canghong Jin, Fangfang Wang, Lina Wei
Abstract:
Partially Relevant Video Retrieval (PRVR) is a challenging task in the domain of multimedia retrieval. It is designed to identify and retrieve untrimmed videos that are partially relevant to the provided query. In this work, we investigate long-sequence video content understanding to address information redundancy issues. Leveraging the outstanding long-term state space modeling capability and linear scalability of the Mamba module, we introduce a multi-Mamba module with temporal fusion framework (MamFusion) tailored for PRVR task. This framework effectively captures the state-relatedness in long-term video content and seamlessly integrates it into text-video relevance understanding, thereby enhancing the retrieval process. Specifically, we introduce Temporal T-to-V Fusion and Temporal V-to-T Fusion to explicitly model temporal relationships between text queries and video moments, improving contextual awareness and retrieval accuracy. Extensive experiments conducted on large-scale datasets demonstrate that MamFusion achieves state-of-the-art performance in retrieval effectiveness. Code is available at the link: https://github.com/Vision-Multimodal-Lab-HZCU/MamFusion.
Chinese: MamFusion框架通过多Mamba模块与时间融合,有效建模长视频内容与文本-视频相关性,提升了部分相关视频检索的性能,在大规模数据集上实现了最优效果。
English: The MamFusion framework leverages multi-Mamba modules with temporal fusion to enhance partially relevant video retrieval by modeling long-term video content and text-video relevance, achieving state-of-the-art performance on large-scale datasets.
Authors:Mahesh Godavarti
Abstract:
We present an empirical validation of the directional non-commutative monoidal embedding framework recently introduced in prior work~\cite{Godavarti2025monoidal}. This framework defines learnable compositional embeddings using distinct non-commutative operators per dimension (axis) that satisfy an interchange law, generalizing classical one-dimensional transforms. Our primary goal is to verify that this framework can effectively model real data by applying it to a controlled, well-understood task: image classification on the MNIST dataset~\cite{lecun1998gradient}. A central hypothesis for why the proposed monoidal embedding works well is that it generalizes the Discrete Fourier Transform (DFT)~\cite{oppenheim1999discrete} by learning task-specific frequency components instead of using fixed basis frequencies. We test this hypothesis by comparing learned monoidal embeddings against fixed DFT-based embeddings on MNIST. The results show that as the embedding dimensionality decreases (e.g., from 32 to 8 to 2), the performance gap between the learned monoidal embeddings and fixed DFT-based embeddings on MNIST grows increasingly large. This comparison is used as an analytic tool to explain why the framework performs well: the learnable embeddings can capture the most discriminative spectral components for the task. Overall, our experiments confirm that directional non-commutative monoidal embeddings are highly effective for representing image data, offering a compact learned representation that retains high task performance. The code used in this work is available at https://github.com/mahesh-godavarti/directional_composition_mnist.
中文: 本研究验证了方向性非交换幺半群嵌入框架,通过实验证明其能学习任务特定的频谱成分,在MNIST分类任务中(尤其在低维情况下)显著优于固定的基于离散傅里叶变换的嵌入方法。
English: This study validates a directional non-commutative monoidal embedding framework, demonstrating its superiority over fixed DFT-based embeddings in MNIST classification, especially in low dimensions, by learning task-specific spectral components.
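The DFT-versus-learned comparison can be sketched directly: both embeddings apply per-axis linear operators to an image, but one uses fixed cosine frequencies while the other learns its components. Shapes and initialization below are illustrative; separable row/column operators are used here because they satisfy the interchange law trivially.

```python
import torch

d, n = 8, 28                       # embedding size per axis, image side
img = torch.randn(n, n)

# Fixed DFT-style embedding: project rows/columns onto the first d cosine
# frequencies.
t = torch.arange(n, dtype=torch.float32)
freqs = torch.arange(d, dtype=torch.float32)
dft = torch.cos(2 * torch.pi * freqs[:, None] * t[None, :] / n)

# Learnable replacement: same shapes, but the "frequency components" are
# free parameters trained for the task.
row_op = torch.nn.Parameter(torch.randn(d, n) / n ** 0.5)
col_op = torch.nn.Parameter(torch.randn(d, n) / n ** 0.5)

fixed_embed = dft @ img @ dft.T          # (d, d), fixed basis
learned_embed = row_op @ img @ col_op.T  # (d, d), task-specific basis
# By associativity, applying the row operator then the column operator
# equals the reverse order, i.e. the interchange law holds for this form.
```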
Authors:Yi Xu, Ruining Yang, Yitian Zhang, Yizhou Wang, Jianglin Lu, Mingyuan Zhang, Lili Su, Yun Fu
Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in integrating language-driven techniques into trajectory prediction. By leveraging their semantic and reasoning capabilities, LLMs are reshaping how autonomous systems perceive, model, and predict trajectories. This survey provides a comprehensive overview of this emerging field, categorizing recent work into five directions: (1) Trajectory prediction via language modeling paradigms, (2) Direct trajectory prediction with pretrained language models, (3) Language-guided scene understanding for trajectory prediction, (4) Language-driven data generation for trajectory prediction, (5) Language-based reasoning and interpretability for trajectory prediction. For each, we analyze representative methods, highlight core design choices, and identify open challenges. This survey bridges natural language processing and trajectory prediction, offering a unified perspective on how language can enrich trajectory prediction.
中文: 本综述探讨了将大型语言模型融入轨迹预测的研究进展,将近期工作归纳为五个方向,分析其方法、核心设计及挑战,旨在连接自然语言处理与轨迹预测领域,提供统一视角。
English: This survey explores the integration of large language models into trajectory prediction, categorizing recent advances into five key directions and analyzing their methods, design choices, and challenges to bridge natural language processing with trajectory prediction.
Authors:Muhammad Shaban, Yuzhou Chang, Huaying Qiu, Yao Yu Yeo, Andrew H. Song, Guillaume Jaume, Yuchen Wang, Luca L. Weishaupt, Tong Ding, Anurag Vaidya, Abdallah Lamane, Daniel Shao, Mohammed Zidane, Yunhao Bai, Paige McCallum, Shuli Luo, Wenrui Wu, Yang Wang, Precious Cramer, Chi Ngai Chan, Pierre Stephan, Johanna Schaffenrath, Jia Le Lee, Hendrik A. Michel, Caiwei Tian, Cristina Almagro-Perez, Sophia J. Wagner, Sharifa Sahai, Ming Y. Lu, Richard J. Chen, Andrew Zhang, Mark Edward M. Gonzales, Ahmad Makky, Jia-Ying Joey Lee, Hao Cheng, Nourhan El Ahmar, Sayed Matar, Maximilian Haist, Darci Phillips, Yuqi Tan, Garry P. Nolan, W. Richard Burack, Jacob D. Estes, Jonathan T. C. Liu, Toni K Choueiri, Neeraj Agarwal, Marc Barry, Scott J. Rodig, Long Phi Le, Georg Gerber, Christian M. Schürch, Fabian J. Theis, Youn H Kim, Joe Yeong, Sabina Signoretti, Brooke E. Howitt, Lit-Hsin Loo, Qin Ma, Sizun Jiang, Faisal Mahmood
Abstract:
Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons, and as an image reverse search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at https://github.com/mahmoodlab/KRONOS.
中文:KRONOS是专为空间蛋白质组学设计的基础模型,通过从多样化成像数据中学习多尺度生物表征,在细胞表型分析和患者分层等任务中实现了顶尖性能,同时支持高效的图像块级分析。
English: KRONOS is a foundation model for spatial proteomics that learns multi-scale biological representations from diverse imaging data, achieving state-of-the-art performance in tasks like cell phenotyping and patient stratification while enabling efficient patch-level analysis.
Authors:Zihui Ma, Lingyao Li, Juan Li, Wenyue Hua, Jingxiao Liu, Qingyuan Feng, Yuki Miura
Abstract:
Rapid, fine-grained disaster damage assessment is essential for effective emergency response, yet remains challenging due to limited ground sensors and delays in official reporting. Social media provides a rich, real-time source of human-centric observations, but its multimodal and unstructured nature presents challenges for traditional analytical methods. In this study, we propose a structured Multimodal, Multilingual, and Multidimensional (3M) pipeline that leverages multimodal large language models (MLLMs) to assess disaster impacts. We evaluate three foundation models across two major earthquake events using both macro- and micro-level analyses. Results show that MLLMs effectively integrate image-text signals and demonstrate a strong correlation with ground-truth seismic data. However, performance varies with language, epicentral distance, and input modality. This work highlights the potential of MLLMs for disaster assessment and provides a foundation for future research in applying MLLMs to real-time crisis contexts. The code and data are released at: https://github.com/missa7481/EMNLP25_earthquake.
中文: 本研究提出结构化3M流程,利用多模态大语言模型整合图文信号评估灾害影响,结果显示其与地震数据高度相关,但性能受语言、震中距离和输入模态影响。
English: This study introduces a structured 3M pipeline using multimodal large language models to effectively assess disaster impacts by integrating image-text signals, showing strong correlation with seismic data while noting performance variations based on language, distance, and modality.
Authors:Aldan Creo, Héctor Cerezo-Costas, Pedro Alonso-Doval, Maximiliano Hormazábal-Lagos
Abstract:
Hallucinations in large language models (LLMs) - instances where models generate plausible but factually incorrect information - present a significant challenge for AI.
We introduce "Ask a Local", a novel hallucination detection method exploiting the intuition that specialized models exhibit greater surprise when encountering domain-specific inaccuracies. Our approach computes divergence between perplexity distributions of language-specialized models to identify potentially hallucinated spans. Our method is particularly well-suited for a multilingual context, as it naturally scales to multiple languages without the need for adaptation, relying on external data sources, or performing training. Moreover, we select computationally efficient models, providing a scalable solution that can be applied to a wide range of languages and domains.
Our results on a human-annotated question-answer dataset spanning 14 languages demonstrate consistent performance across languages, with Intersection-over-Union (IoU) scores around 0.3 and comparable Spearman correlation values. Our model shows particularly strong performance on Italian and Catalan, with IoU scores of 0.42 and 0.38, respectively, while maintaining cross-lingual effectiveness without language-specific adaptations. We release our code and architecture to facilitate further research in multilingual hallucination detection.
中文: “Ask a Local”方法通过比较专业模型的困惑度分布来检测大语言模型中的幻觉,提供了一种无需调整或外部数据即可扩展的多语言解决方案。
English: The "Ask a Local" method detects hallucinations in LLMs by comparing perplexity distributions of specialized models, offering a scalable, multilingual solution without requiring adaptation or external data.
Authors:Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter
Abstract:
The emergence of vision-language-action models (VLAs) for end-to-end control is reshaping the field of robotics by enabling the fusion of multimodal sensory inputs at the billion-parameter scale. The capabilities of VLAs stem primarily from their architectures, which are often based on frontier large language models (LLMs). However, LLMs are known to be susceptible to adversarial misuse, and given the significant physical risks inherent to robotics, questions remain regarding the extent to which VLAs inherit these vulnerabilities. Motivated by these concerns, in this work we initiate the study of adversarial attacks on VLA-controlled robots. Our main algorithmic contribution is the adaptation and application of LLM jailbreaking attacks to obtain complete control authority over VLAs. We find that textual attacks, which are applied once at the beginning of a rollout, facilitate full reachability of the action space of commonly used VLAs and often persist over longer horizons. This differs significantly from LLM jailbreaking literature, as attacks in the real world do not have to be semantically linked to notions of harm. We make all code available at https://github.com/eliotjones1/robogcg.
中文: 视觉语言动作模型在机器人应用中易受基于大语言模型越狱攻击的改编影响,此类攻击无需语义危害即可实现对其动作的完全控制,且具有持续性。
English: Vision-language-action models (VLAs) in robotics are vulnerable to adapted LLM jailbreaking attacks that enable full control over their actions, differing from traditional attacks by not requiring harmful semantics and persisting over time.
Authors:Guillermo Marco, Julio Gonzalo, Víctor Fresno
Abstract:
Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather than by an intrinsic quality of the texts evaluated. Using five public datasets (1,471 stories, 101 annotators including critics, students, and lay readers), we (i) extract 17 reference-less textual features (e.g., coherence, emotional variance, average sentence length...); (ii) model individual reader preferences, deriving feature importance vectors that reflect their textual priorities; and (iii) analyze these vectors in a shared "preference space". Reader vectors cluster into two profiles: 'surface-focused readers' (mainly non-experts), who prioritize readability and textual richness; and 'holistic readers' (mainly experts), who value thematic development, rhetorical variety, and sentiment dynamics. Our results quantitatively explain how measurements of literary quality are a function of how text features align with each reader's preferences. These findings advocate for reader-sensitive evaluation frameworks in the field of creative text generation.
中文摘要:最新研究表明,关于AI与人类文学质量评价的矛盾源于读者偏好差异:表层导向型读者注重可读性与文本丰富性,而整体导向型读者更看重主题发展及情感动态。
English Summary: Recent research reveals that conflicting assessments of AI versus human literary quality stem from distinct reader preferences, with surface-focused readers valuing readability and textual richness, while holistic readers prioritize thematic depth and sentiment dynamics.
Authors:Christodoulos Constantinides, Dhaval Patel, Shuxin Lin, Claudio Guerrero, Sunil Dagajirao Patil, Jayant Kalagnanam
Abstract:
We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs, including GPT-4, Llama, and Mistral, on FailureSensorIQ through different lenses: Perturbation-Uncertainty-Complexity analysis, an expert evaluation study, asset-specific knowledge gap analysis, and a ReAct agent using external knowledge bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals that performance drops significantly and is fragile under perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive the modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) the FailureSensorIQ benchmark and a Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.
中文摘要:FailureSensorIQ是一个专为评估大语言模型在工业4.0领域复杂推理能力而设计的创新多选问答基准系统,通过多维度分析揭示了现有模型在扰动响应和专业知识方面存在的显著不足。
English Summary: FailureSensorIQ is a specialized MCQA benchmark that evaluates LLMs' reasoning capabilities in Industry 4.0 scenarios, revealing performance gaps despite some models nearing expert-level accuracy.
Authors:Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang
Abstract:
The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.
中文: 本研究提出了用于配体结合位点检测的端到端框架UniSite、UniSite-DS数据集及基于交并比的评估指标,实验证明其性能优于现有方法。
English: This study introduces UniSite, an end-to-end framework for ligand binding site detection, along with the UniSite-DS dataset and an IoU-based evaluation metric, demonstrating superior performance over existing methods.
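The proposed IoU-based Average Precision can be illustrated with a small routine that greedily matches predicted binding sites (modeled here as residue-index sets with confidence scores) to ground-truth sites at an IoU threshold. This is a generic set-IoU AP sketch with 11-point interpolation, not necessarily UniSite's exact implementation.

```python
import numpy as np

def average_precision(pred_sites, pred_scores, gt_sites, iou_thr=0.5):
    """AP for set-valued site predictions, greedy-matched by score.
    Sites are sets of residue indices; IoU = |A ∩ B| / |A ∪ B|."""
    order = np.argsort(pred_scores)[::-1]
    matched = set()
    tp = np.zeros(len(order)); fp = np.zeros(len(order))
    for rank, i in enumerate(order):
        ious = [(len(pred_sites[i] & g) / len(pred_sites[i] | g), j)
                for j, g in enumerate(gt_sites) if j not in matched]
        best = max(ious, default=(0.0, -1))
        if best[0] >= iou_thr:
            tp[rank] = 1; matched.add(best[1])
        else:
            fp[rank] = 1
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(len(gt_sites), 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # 11-point interpolated area under the precision-recall curve
    ap = 0.0
    for t in np.linspace(0, 1, 11):
        p = precision[recall >= t].max() if np.any(recall >= t) else 0.0
        ap += p / 11
    return ap

pred = [{1, 2, 3, 4}, {10, 11, 12}]
gt = [{2, 3, 4, 5}]
print(average_precision(pred, [0.9, 0.4], gt))
```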
Authors:Yajie Zhou, Jiajun Ruan, Eric S. Wang, Sadjad Fouladi, Francis Y. Yan, Kevin Hsieh, Zaoxing Liu
Abstract:
Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
中文: NetPress是一个自动化框架,通过动态生成可扩展基准并集成网络模拟器反馈,全面评估LLM代理在网络应用中的正确性、安全性和延迟,弥补静态评估的不足。
English: NetPress is an automated framework that dynamically generates scalable benchmarks for evaluating LLM agents in network applications, integrating emulator feedback to assess correctness, safety, and latency beyond static datasets.
Authors:Selcuk Gurses, Aozhong Zhang, Yanxia Deng, Xun Dong, Xin Li, Naigang Wang, Penghang Yin, Zi Yang
Abstract:
Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Codes are available at https://github.com/ziyangjoy/DiaBlo.
中文: DiaBlo是一种参数高效微调方法,仅更新权重矩阵的对角块,无需低秩近似或特殊初始化即可实现稳定收敛,并保持与LoRA相当的训练效率。
English: DiaBlo is a parameter-efficient fine-tuning method that updates only diagonal blocks of weight matrices, achieving stable convergence and comparable efficiency to LoRA without requiring low-rank approximations or special initialization.
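A minimal PyTorch sketch of the diagonal-block idea follows: the pretrained weight stays frozen, and only square blocks along the diagonal are trainable, initialized at zero so fine-tuning starts from the base model. The block size and the square-layer restriction are simplifications for illustration, not DiaBlo's exact configuration.

```python
import torch
import torch.nn as nn

class DiagonalBlockLinear(nn.Module):
    """Frozen linear layer plus trainable diagonal blocks (a DiaBlo-style
    sketch; block size and placement are illustrative assumptions)."""
    def __init__(self, base: nn.Linear, block_size: int = 64):
        super().__init__()
        d_out, d_in = base.weight.shape
        assert d_out == d_in and d_in % block_size == 0
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # backbone stays frozen
        self.n_blocks = d_in // block_size
        self.blocks = nn.Parameter(          # one (b x b) block per slot
            torch.zeros(self.n_blocks, block_size, block_size))

    def forward(self, x):
        y = self.base(x)
        b = self.blocks.shape[-1]
        # apply each diagonal block to its slice of the input
        xs = x.unflatten(-1, (self.n_blocks, b))              # (..., nb, b)
        delta = torch.einsum('...nb,nbc->...nc', xs, self.blocks)
        return y + delta.flatten(-2)

layer = DiagonalBlockLinear(nn.Linear(256, 256), block_size=64)
out = layer(torch.randn(4, 256))
print(out.shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```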
Authors:Jinwei Zeng, Yu Liu, Guozhen Zhang, Jingtao Ding, Yuming Lin, Jian Yuan, Yong Li
Abstract:
Accurately estimating high-resolution carbon emissions is crucial for effective emission governance and mitigation planning. While conventional methods for precise carbon accounting are hindered by substantial data collection efforts, the rise of open data and advanced learning techniques offers a promising solution. Once an open data-based prediction model is developed and trained, it can easily infer emissions for new areas based on available open data. To this end, we incorporate two modalities of open data, satellite images and point-of-interest (POI) data, to predict high-resolution urban carbon emissions, with satellite images providing macroscopic, static information and POI data offering fine-grained, relatively dynamic functionality information. However, estimating high-resolution carbon emissions presents two significant challenges: the intertwined and implicit effects of various functionalities on carbon emissions, and the complex spatial contiguity correlations that give rise to the agglomeration effect. Our model, OpenCarbon, features two major designs that target these challenges: a cross-modality information extraction and fusion module to extract complementary functionality information from the two modalities and model their interactions, and a neighborhood-informed aggregation module to capture the spatial contiguity correlations. Extensive experiments demonstrate our model's superiority, with a significant performance gain of 26.6% on R2. Further generalizability tests and case studies also show OpenCarbon's capacity to capture the intrinsic relation between urban functionalities and carbon emissions, validating its potential to empower efficient carbon governance and targeted carbon mitigation planning. Codes and data are available at https://github.com/JinweiZzz/OpenCarbon.
中文: OpenCarbon模型创新性地结合卫星影像和兴趣点数据,通过跨模态信息融合和邻域聚合机制解决功能交互与空间连续性问题,在实现碳排放高精度估算方面性能提升26.6%,为碳治理和减排规划提供了有效工具。
English: OpenCarbon is a novel model that utilizes satellite imagery and POI data to accurately estimate high-resolution urban carbon emissions by addressing functionality interactions and spatial correlations, achieving a 26.6% performance gain and demonstrating strong potential for carbon governance and mitigation planning.
Authors:Dania Herzalla, Willian T. Lunardi, Martin Andreoni
Abstract:
Graph-based learning provides a powerful framework for modeling complex relational structures; however, its application within the domain of wireless security remains significantly underexplored. In this work, we introduce the first application of graph-based learning for jamming source localization, addressing the imminent threat of jamming attacks in wireless networks. Unlike geometric optimization techniques that struggle under environmental uncertainties and dense interference, we reformulate the localization as an inductive graph regression task. Our approach integrates structured node representations that encode local and global signal aggregation, ensuring spatial coherence and adaptive signal fusion. To enhance robustness, we incorporate an attention-based graph neural network (GNN) that adaptively refines neighborhood influence and introduces a confidence-guided estimation mechanism that dynamically balances learned predictions with domain-informed priors. We evaluate our approach under complex radio frequency (RF) environments with various sampling densities, network topologies, jammer characteristics, and signal propagation conditions, conducting comprehensive ablation studies on graph construction, feature selection, and pooling strategies. Results demonstrate that our novel graph-based learning framework significantly outperforms established localization baselines, particularly in challenging scenarios with sparse and obfuscated signal information. Our code is available at https://github.com/tiiuae/gnn-jamming-source-localization.
中文: 本研究首次提出基于图学习的干扰源定位框架,通过自适应信号聚合和置信度引导机制,在复杂无线环境中显著优于传统定位方法。
English: This study introduces the first graph-based learning framework for jamming source localization, which outperforms traditional methods by integrating adaptive signal aggregation and confidence-guided mechanisms to handle complex wireless environments.
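The confidence-guided estimation can be pictured as a convex combination of the learned prediction and a domain-informed prior. The sketch below uses an RSSI-weighted centroid as the prior and treats the confidence as a given scalar, whereas the paper learns it; all names and values are illustrative.

```python
import numpy as np

def fuse_estimate(gnn_pred, confidence, node_xy, node_rssi):
    """Blend a learned jammer-position estimate with a domain prior
    (RSSI-weighted centroid). The confidence mechanism in the paper is
    learned; here it is just an input in [0, 1]."""
    w = 10 ** (np.asarray(node_rssi) / 10.0)         # dBm -> linear power
    prior = (w[:, None] * node_xy).sum(0) / w.sum()  # weighted centroid
    return confidence * np.asarray(gnn_pred) + (1 - confidence) * prior

nodes = np.array([[0., 0.], [10., 0.], [5., 8.]])
rssi = np.array([-40., -55., -60.])                  # strongest near the jammer
print(fuse_estimate([4.0, 2.0], 0.7, nodes, rssi))
```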
Authors:Yunqi Hong, Sohyun An, Andrew Bai, Neil Y. C. Lin, Cho-Jui Hsieh
Abstract:
Despite Multimodal Large Language Models (MLLMs) showing promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories, details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image, and boosts classification accuracy. We developed an automatic self-enhancing prompt learning framework called AutoSEP to iteratively improve the description prompt using unlabeled data, based on an instance-level classification scoring function. AutoSEP only requires black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP improves on average by 13 percent over standard zero-shot classification and by 5 percent over the best-performing baselines. Code is available at: https://github.com/yq-hong/AutoSEP
中文: AutoSEP是一种自监督提示学习框架,通过从未标注数据中迭代学习区分性提示来增强多模态大语言模型的细粒度图像分类能力,无需模型训练即可显著提升分类准确率。
English: AutoSEP is a self-supervised prompt learning framework that enhances MLLMs' fine-grained image classification by iteratively learning discriminative prompts from unlabeled data, achieving significant accuracy improvements without model training.
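The black-box optimization loop at the heart of such a framework can be sketched as follows: keep the best-scoring description prompt and repeatedly ask the model for refinements, accepting only improvements under the instance-level score. Both callbacks below are toy stand-ins, not the paper's actual scoring function or MLLM interface.

```python
import random

def autosep_style_loop(prompts, score_fn, propose_fn, rounds=5):
    """Minimal black-box prompt-search loop in the spirit of AutoSEP:
    keep the best-scoring description prompt, ask the (M)LLM to refine it.
    `score_fn` and `propose_fn` stand in for instance-level scoring and
    MLLM-based rewriting; both are assumptions, not the paper's API."""
    best = max(prompts, key=score_fn)
    for _ in range(rounds):
        candidate = propose_fn(best)
        if score_fn(candidate) > score_fn(best):
            best = candidate
    return best

# Toy stand-ins so the sketch runs end to end.
random.seed(0)
target = "focus on beak shape, wing bars, and tail length"
def score_fn(p):    # pretend score: token overlap with useful cues
    return len(set(p.split()) & set(target.split()))
def propose_fn(p):  # pretend refinement: borrow one cue word
    return p + " " + random.choice(target.split())

print(autosep_style_loop(["describe the bird"], score_fn, propose_fn))
```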
Authors:Ekram Alam, Abu Sufian, Paramartha Dutta, Marco Leo
Abstract:
Unintentional or accidental falls are a significant health issue for senior persons. As the population of senior persons increases steadily, there is a need for an automated fall detection and monitoring system. This paper introduces a vision-based fall detection system using a pre-trained 3D CNN. Unlike a 2D CNN, a 3D CNN extracts not only spatial but also temporal features. The proposed model leverages the original learned weights of a 3D CNN model pre-trained on the Sports1M dataset to extract the spatio-temporal features. Only the SVM classifier was trained, which saves the time required to train the 3D CNN. Stratified shuffled five-split cross-validation was used to split the dataset into training and testing data. Extracted features from the proposed 3D CNN model were fed to an SVM classifier to classify the activity as a fall or an activity of daily living (ADL). Two datasets, GMDCSA and CAUCAFall, were utilized to conduct the experiment. The source code for this work can be accessed via the following link: https://github.com/ekramalam/HFD_3DCNN.
中文: 本文提出一种基于视觉的老年人跌倒检测系统,利用预训练的3D卷积神经网络提取时空特征,仅需训练SVM分类器即可有效区分跌倒与日常活动。
English: This paper presents a vision-based fall detection system for seniors using a pre-trained 3D CNN to extract spatiotemporal features, with only the SVM classifier trained on datasets to efficiently distinguish falls from daily activities.
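A compact sketch of the classification stage: spatio-temporal descriptors (synthetic stand-ins for fc-layer features from the Sports1M-pretrained 3D CNN) are fed to an SVM evaluated with stratified shuffled five-split cross-validation. The feature dimensionality and RBF kernel are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in for fc-layer features from a Sports1M-pretrained 3D CNN:
# one 4096-d spatio-temporal descriptor per clip (values are synthetic).
features = rng.normal(size=(120, 4096))
labels = rng.integers(0, 2, size=120)        # 1 = fall, 0 = ADL

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
accs = []
for train_idx, test_idx in splitter.split(features, labels):
    clf = SVC(kernel="rbf").fit(features[train_idx], labels[train_idx])
    accs.append(clf.score(features[test_idx], labels[test_idx]))
print(f"mean accuracy over 5 splits: {np.mean(accs):.3f}")
```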
Authors:Jiaming Yi, Ruirui Pan, Jishen Yang, Xiulong Yang
Abstract:
Improving the generalization ability of Vision-Language Pre-trained Models (VLMs) under test-time data distribution shifts remains a critical challenge. The existing Test-Time Adaptation (TTA) methods fall short in fully leveraging the model's internal knowledge, particularly in dynamically adapting to complex and hierarchical visual semantic information. In this paper, we propose Memory-Infused Prompt Tuning (MINT), a novel framework to address this issue. Inspired by human associative memory theory, MINT introduces a Memory Prompt Bank (MPB), which stores learnable key-value prompt pairs that work as a memory of previously seen samples. During the test time, relevant prompt pairs in the MPB are retrieved by the hierarchical visual features of test images to dynamically assemble Associative Prompts. The associative prompts are then injected into the image encoder for fine-grained, customized visual contextual guidance. MINT also utilizes learnable text prompts. MINT thus enables rapid, precise VLM adaptation at test time by leveraging this MPB-acquired memory, without source data or retraining. The code is available at https://github.com/Jamieyi2004/MINT.
中文摘要:MINT框架通过记忆提示库根据测试图像的层次化视觉特征动态生成关联提示,无需源数据或重新训练即可实现视觉语言模型在测试时的精准自适应。
English Summary: The MINT framework enhances Vision-Language Models' test-time adaptation by using a Memory Prompt Bank to dynamically generate associative prompts from hierarchical visual features, enabling precise adaptation without source data or retraining.
Authors:Liangrui Pan, Xingchen Li, Zhongyi Chen, Ling Chu, Shaoliang Peng
Abstract:
Pathologists' comprehensive evaluation of donor liver biopsies provides crucial information for accepting or discarding potential grafts. However, rapidly and accurately obtaining these assessments intraoperatively poses a significant challenge for pathologists. Features in donor liver biopsies, such as portal tract fibrosis, total steatosis, macrovesicular steatosis, and hepatocellular ballooning are correlated with transplant outcomes, yet quantifying these indicators suffers from substantial inter- and intra-observer variability. To address this, we introduce DLiPath, the first benchmark for comprehensive donor liver assessment based on a histopathology image dataset. We collected and publicly released 636 whole slide images from 304 donor liver patients at the Department of Pathology, the Third Xiangya Hospital, with expert annotations for key pathological features (including cholestasis, portal tract fibrosis, portal inflammation, total steatosis, macrovesicular steatosis, and hepatocellular ballooning). We selected nine state-of-the-art multiple-instance learning (MIL) models based on the DLiPath dataset as baselines for extensive comparative analysis. The experimental results demonstrate that several MIL models achieve high accuracy across donor liver assessment indicators on DLiPath, charting a clear course for future automated and intelligent donor liver assessment research. Data and code are available at https://github.com/panliangrui/ACM_MM_2025.
中文: DLiPath基于组织病理学图像数据集首次建立了供肝全面评估的基准,实验证明多种多示例学习模型在关键病理特征评估中表现优异,为自动化供肝评估研究指明了方向。
English: DLiPath introduces the first benchmark for comprehensive donor liver assessment using a histopathology image dataset, demonstrating that multiple-instance learning models achieve high accuracy in evaluating key pathological features to advance automated liver graft evaluation.
Authors:Bin Wang, Yongqi Han, Minbo Ma, Tianrui Li, Junbo Zhang, Feng Hong, Yanwei Yu
Abstract:
Deep learning-based approaches have demonstrated significant advancements in time series forecasting. Despite these ongoing developments, the complex dynamics of time series make it challenging to establish a rule of thumb for designing an ideal model architecture. In this study, we argue that refining existing advanced models through a universal calibrating strategy can deliver substantial benefits with minimal resource costs, as opposed to designing and training a new model from scratch. We first identify a multi-target learning conflict in the calibrating process, which arises when optimizing variables across time steps, leading to the underutilization of the model's learning capabilities. To address this issue, we propose an innovative calibrating strategy called Socket+Plug (SoP). This approach retains an exclusive optimizer and early-stopping monitor for each predicted target within each Plug while keeping the fully trained Socket backbone frozen. The model-agnostic nature of SoP allows it to directly calibrate the performance of any trained deep forecasting models, regardless of their specific architectures. Extensive experiments on various time series benchmarks and a spatio-temporal meteorological ERA5 dataset demonstrate the effectiveness of SoP, achieving up to a 22% improvement even when employing a simple MLP as the Plug (highlighted in Figure 1). Code is available at https://github.com/hanyuki23/SoP.
中文摘要:本研究提出了一种名为Socket+Plug(SoP)的通用校准策略,通过解决多目标学习冲突来提升现有深度学习模型在时间序列预测中的性能,无需开发新模型即可实现高达22%的性能提升。
English Summary: This study introduces a universal calibration strategy called Socket+Plug (SoP) that enhances existing deep learning models for time series forecasting by addressing multi-target learning conflicts, achieving up to 22% improvement without requiring new model development.
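The Socket+Plug idea can be sketched in a few lines: the fully trained Socket stays frozen, while each predicted target step gets its own Plug, optimizer, and stopping monitor. The linear Socket/Plug modules below and the training-loss monitor are deliberate simplifications (a real monitor would track validation loss).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
horizon, d = 4, 16
socket = nn.Linear(d, horizon)                   # stands in for the trained backbone
for p in socket.parameters():
    p.requires_grad_(False)                      # Socket stays frozen

plugs = [nn.Linear(horizon, 1) for _ in range(horizon)]  # one Plug per target step
opts = [torch.optim.Adam(p.parameters(), lr=1e-2) for p in plugs]
best = [float("inf")] * horizon                  # per-target early-stop monitor

x, y = torch.randn(64, d), torch.randn(64, horizon)
for epoch in range(50):
    with torch.no_grad():
        base = socket(x)                         # frozen forecast
    for t in range(horizon):
        pred = plugs[t](base).squeeze(-1)        # calibrate one target step
        loss = nn.functional.mse_loss(pred, y[:, t])
        opts[t].zero_grad(); loss.backward(); opts[t].step()
        best[t] = min(best[t], loss.item())      # real SoP would use val loss
print([f"{b:.3f}" for b in best])
```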
Authors:Ayush Shrivastava, Andrew Owens
Abstract:
We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting model is simple and has no explicit photo-consistency assumptions. It can be trained entirely using unlabeled data, without the need for any spatially aligned multimodal image pairs. We evaluate our method on both geometric and semantic correspondence tasks. For geometric matching, we consider challenging tasks such as RGB-to-depth and RGB-to-thermal matching (and vice versa); for semantic matching, we evaluate on photo-sketch and cross-style image alignment. Our method achieves strong performance across all benchmarks.
Chinese: 本研究提出了一种无需对齐训练数据即可学习不同视觉模态图像间跨模态对应关系的方法,在几何和语义匹配任务中均表现出色。
English: This study introduces a model that learns cross-modal correspondences between images from different visual modalities without requiring aligned training data, achieving robust performance in both geometric and semantic matching tasks.
Authors:Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, Yongliang Shen, Weiming Lu, Yueting Zhuang
Abstract:
Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches; however, reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications. Appendix and supplementary materials (including all data and code) are available at https://zju-real.github.io/SVGenius.
中文摘要:SVGenius作为首个系统性评估SVG处理的基准,发现专有模型虽优于开源模型,但所有模型均随复杂度增加而性能下降,其中推理增强训练比单纯扩大规模更有效,而风格转换仍是各类模型的最大挑战。
English Summary: SVGenius is a comprehensive benchmark for evaluating SVG processing in Large Language Models, revealing that proprietary models outperform open-source ones but all struggle with complexity and style transfer, with reasoning-enhanced training proving more effective than scaling alone.
Authors:Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
Abstract:
We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE
中文: CURE框架通过基于交互的奖励机制协同进化代码生成与单元测试能力,在Qwen2.5模型上实现代码生成准确率提升5.3%,Best-of-N准确率提升9.0%,并能有效扩展至下游任务。
English: CURE is a reinforcement learning framework that co-evolves coding and unit test generation through interaction-based rewards, improving code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% while enabling effective downstream applications.
Authors:Weiqing Xiao, Hao Huang, Chonghao Zhong, Yujie Lin, Nan Wang, Xiaoxue Chen, Zhaoxi Chen, Saining Zhang, Shuocheng Yang, Pierre Merriaux, Lei Lei, Hao Zhao
Abstract:
We present SA-Radar (Simulate Any Radar), a radar simulation approach that enables controllable and efficient generation of radar cubes conditioned on customizable radar attributes. Unlike prior generative or physics-based simulators, SA-Radar integrates both paradigms through a waveform-parameterized attribute embedding. We design ICFAR-Net, a 3D U-Net conditioned on radar attributes encoded via waveform parameters, which captures signal variations induced by different radar configurations. This formulation bypasses the need for detailed radar hardware specifications and allows efficient simulation of range-azimuth-Doppler (RAD) tensors across diverse sensor settings. We further construct a mixed real-simulated dataset with attribute annotations to robustly train the network. Extensive evaluations on multiple downstream tasks, including 2D/3D object detection and radar semantic segmentation, demonstrate that SA-Radar's simulated data is both realistic and effective, consistently improving model performance when used standalone or in combination with real data. Our framework also supports simulation in novel sensor viewpoints and edited scenes, showcasing its potential as a general-purpose radar data engine for autonomous driving applications. Code and additional materials are available at https://zhuxing0.github.io/projects/SA-Radar.
中文: SA-Radar是一种创新的雷达模拟方法,通过波形参数化嵌入融合生成式与物理建模方法,无需详细硬件规格即可高效生成可定制的雷达数据。
English: SA-Radar is a novel radar simulation method that combines generative and physics-based approaches through waveform-parameterized embeddings, enabling efficient generation of customizable radar data without requiring detailed hardware specifications.
Authors:Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, Ziwei Liu
Abstract:
Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflicting learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at https://github.com/Vchitect/DCM.
Chinese: 本文提出双专家一致性模型(DCM),通过分别优化语义布局与细节增强的专家模块,结合时序一致性损失与特征匹配损失,在显著减少采样步数的同时实现了视频生成中时间连贯性与画面细节的同步提升。
English: This paper introduces the Dual-Expert Consistency Model (DCM), which addresses temporal inconsistency and detail degradation in video diffusion models by employing specialized semantic and detail experts with tailored loss functions, achieving state-of-the-art visual quality with fewer sampling steps.
Authors:Michelle Chen, David Russell, Amritha Pallavoor, Derek Young, Jane Wu
Abstract:
Large-scale delineation of individual trees from remote sensing imagery is crucial to the advancement of ecological research, particularly as climate change and other environmental factors rapidly transform forest landscapes across the world. Current RGB tree segmentation methods rely on training specialized machine learning models with labeled tree datasets. While these learning-based approaches can outperform manual data collection when accurate, the existing models still depend on training data that is hard to scale. In this paper, we investigate the efficacy of using a state-of-the-art image segmentation model, Segment Anything Model 2 (SAM2), in a zero-shot manner for individual tree detection and segmentation. We evaluate a pretrained SAM2 model on two tasks in this domain: (1) zero-shot segmentation and (2) zero-shot transfer by using predictions from an existing tree detection model as prompts. Our results suggest that SAM2 not only has impressive generalization capabilities, but also can form a natural synergy with specialized methods trained on in-domain labeled data. We find that applying large pretrained models to problems in remote sensing is a promising avenue for future progress. We make our code available at: https://github.com/open-forest-observatory/tree-detection-framework.
中文摘要:本研究证明Segment Anything Model 2 (SAM2)能够通过零样本学习有效实现遥感影像中的单木分割,展现出卓越的泛化能力与专业方法的协同效应,为突破训练数据限制提供了可行方案。
English Summary: This study demonstrates that the Segment Anything Model 2 (SAM2) effectively performs zero-shot individual tree segmentation from remote sensing imagery, showing strong generalization and synergy with specialized methods while offering a scalable alternative to data-dependent approaches.
Authors:Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Abstract:
Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applying it to downstream distant domains that have only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention's insensitivity to token orders, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels not smoothly transited across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image tokens' continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon for an interpretation. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies that only smaller patterns within each patch could be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Codes and models are available at https://github.com/shuaiyi308/ReCIT.
中文: 通过破坏图像标记的连续性,Vision Transformer在跨域小样本学习中减少了对难以迁移的大空间模式的依赖,转而更有效地利用小模式,从而显著缩小了域间差距。
English: Vision Transformer's performance in cross-domain few-shot learning is improved by disrupting image token continuity, which reduces reliance on hard-to-transfer large patterns and enhances smaller pattern utilization, effectively narrowing domain gaps.
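One simple way to disrupt token continuity, in the spirit of the paper though not necessarily its exact recipe, is to permute non-overlapping patches so that pixels no longer transition smoothly across patch borders:

```python
import torch

def disrupt_token_continuity(images, patch=16, generator=None):
    """Permute non-overlapping patches so pixels no longer transition
    smoothly across patch borders (an illustrative augmentation in the
    spirit of the paper, not its exact method)."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)  # B,C,gh,gw,p,p
    x = x.reshape(b, c, gh * gw, patch, patch)
    perm = torch.randperm(gh * gw, generator=generator)
    x = x[:, :, perm]
    x = x.reshape(b, c, gh, gw, patch, patch).permute(0, 1, 2, 4, 3, 5)
    return x.reshape(b, c, h, w)

imgs = torch.randn(2, 3, 224, 224)
print(disrupt_token_continuity(imgs).shape)  # torch.Size([2, 3, 224, 224])
```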
Authors:Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang
Abstract:
Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains.
Chinese Summary: 本文提出了ByteMorph框架,通过包含600万图像对的大规模数据集和基于扩散变换器的基线模型,解决了基于指令的图像编辑中非刚性运动处理这一研究不足的挑战。
English Summary: The paper introduces ByteMorph, a comprehensive framework for instruction-based image editing that addresses the underexplored challenge of handling non-rigid motions through a large-scale dataset and a Diffusion Transformer-based model.
Authors:Ashwin Vinod, Shrey Pandit, Aditya Vavre, Linshen Liu
Abstract:
Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero's approach, we directly tune using RL without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively. By explicitly generating reasoning traces, EgoVLM enhances interpretability, making it well-suited for downstream applications. Furthermore, we introduce a novel keyframe-based reward that incorporates salient frame selection to guide reinforcement learning optimization. This reward formulation opens a promising avenue for future exploration in temporally grounded egocentric reasoning.
中文:EgoVLM是一种专为第一人称视频设计的视觉语言模型,通过强化学习微调融合视觉理解与时空推理,在基准测试中显著优于通用模型且无需监督训练。
English: EgoVLM is a specialized vision-language model that integrates visual understanding and spatiotemporal reasoning for egocentric videos, achieving superior performance on benchmarks through reinforcement learning fine-tuning without supervised training.
Authors:Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein
Abstract:
Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.
中文摘要:FuseLIP采用早期融合架构,通过单一Transformer处理文本与图像标记的组合,在多模态任务中表现卓越,同时在单模态任务上保持基准竞争力。
English Summary: FuseLIP introduces an early fusion architecture using a single transformer to process combined text and image tokens, achieving superior multimodal task performance while maintaining competitive unimodal capabilities.
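A minimal sketch of the early-fusion design: discrete image-token ids are shifted into an extended vocabulary and concatenated with text ids, so a single transformer mixes the modalities at every layer. Vocabulary sizes, depth, and the CLS pooling below are illustrative assumptions, not FuseLIP's actual configuration.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Single-transformer early fusion over an extended vocabulary:
    text ids live in [0, text_vocab) and discrete image-token ids are
    shifted into [text_vocab, text_vocab + image_vocab)."""
    def __init__(self, text_vocab=32000, image_vocab=8192, dim=256):
        super().__init__()
        self.text_vocab = text_vocab
        self.embed = nn.Embedding(text_vocab + image_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, text_ids, image_ids):
        tokens = torch.cat([text_ids, image_ids + self.text_vocab], dim=1)
        h = self.embed(tokens)
        h = torch.cat([self.cls.expand(h.size(0), -1, -1), h], dim=1)
        return self.encoder(h)[:, 0]            # pooled multimodal embedding

model = EarlyFusionEncoder()
text = torch.randint(0, 32000, (2, 12))
img = torch.randint(0, 8192, (2, 64))           # e.g. ids from a VQ tokenizer
print(model(text, img).shape)                   # torch.Size([2, 256])
```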
Authors:Bin Ma, Yuyuan Feng, Minhua Lin, Enyan Dai
Abstract:
Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to growing demands for model transparency. Recent advances in explainable GNNs have addressed this need by revealing important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose models to security risks. This paper investigates how such explanations potentially leak critical decision logic that can be exploited for model stealing. We propose EGSteal, a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate that our approach shows advantages over conventional methods in model stealing. This work highlights important security considerations for the deployment of explainable GNNs in sensitive domains and suggests the need for protective measures against explanation-based attacks. Our code is available at https://github.com/beanmah/EGSteal.
Chinese: 本文提出一种新颖的模型窃取框架,通过利用可解释图神经网络的解释机制来复制目标模型的预测行为和推理模式,揭示了在敏感应用领域存在的安全隐患。
English: This paper introduces a novel model stealing framework that exploits explanations from explainable Graph Neural Networks to replicate both the predictive behavior and reasoning patterns of target models, highlighting security vulnerabilities in sensitive applications.
Authors:Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Xin Jin, Hang Zhao, Hao Zhao
Abstract:
Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. Recently, action-driven generative models have gained widespread adoption in robot learning and simulation, as they eliminate safety concerns and reduce maintenance efforts. However, the action sequences used in these methods often result in limited control precision and poor generalization due to their globally coarse alignment. To address these limitations, we propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation to provide more accurate semantic and geometric guidance for video generation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability. Furthermore, our framework supports the simultaneous generation of multi-view videos of robot gripping operations, an important capability for downstream robotic learning tasks. Extensive experimental results demonstrate that ORV consistently outperforms existing baseline methods across various datasets and sub-tasks. Demo, Code and Model: https://orangesodahub.github.io/ORV
中文摘要:提出的ORV框架采用4D语义占据序列生成具有更高精度和时间一致性的逼真机器人视频,在机器人仿真任务中优于现有方法。
English Summary: The proposed ORV framework uses 4D semantic occupancy sequences to generate photorealistic robot videos with enhanced precision and temporal consistency, outperforming existing methods in robotic simulation.
Authors:Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li
Abstract:
Training language models on long sequence data is a demanding requirement for enhancing the model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost for storing activation values becomes huge during the Backpropagation (BP) process, even with the application of the gradient checkpointing technique. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP requires fewer computational FLOPs and achieves faster BP speed by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5 times, while using comparable or even less BP time. Note that StreamBP's sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer models and is available at https://github.com/Ledzy/StreamBP.
中文: StreamBP是一种内存高效且精确的反向传播方法,通过逐层分解链式法则显著降低激活值内存成本,相比梯度检查点技术可将训练序列长度提升2.8-5.5倍。
English: StreamBP is a memory-efficient and exact backpropagation method that decomposes the chain rule layer-wise to significantly reduce activation memory costs while enabling 2.8-5.5 times longer sequence training compared to gradient checkpointing.
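The core decomposition can be illustrated on the output head alone: because the summed loss is linear over sequence positions, logits can be materialized and backpropagated chunk by chunk with gradients accumulating in place, never holding all seq x vocab logits at once. StreamBP applies this idea layer-wise through the whole model; the sketch below covers only the final projection, with toy sizes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d, seq = 1000, 64, 512
hidden = torch.randn(seq, d, requires_grad=True)   # stands in for final hidden states
head = nn.Linear(d, vocab)
targets = torch.randint(0, vocab, (seq,))

# Naive BP materializes all logits at once: O(seq * vocab) activation memory.
# The summed loss decomposes linearly over positions, so we can backpropagate
# chunk by chunk and let gradients accumulate in-place.
chunk = 128
for s in range(0, seq, chunk):
    logits = head(hidden[s:s + chunk])             # only this chunk's logits live
    loss = nn.functional.cross_entropy(
        logits, targets[s:s + chunk], reduction="sum") / seq
    loss.backward()                                # grads accumulate across chunks

print(hidden.grad.shape, head.weight.grad.shape)
```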
Authors:Jiarui Wang, Huiyu Duan, Juntong Wang, Ziheng Jia, Woo Yi Yang, Xiaorong Zhu, Yu Zhao, Jiaying Qian, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Abstract:
With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present DFBench, a large-scale DeepFake Benchmark featuring (i) broad diversity, including 540,000 images across real, AI-edited, and AI-generated content, (ii) up-to-date generators, with fake images produced by 12 state-of-the-art generation models, and (iii) bidirectional benchmarking, evaluating both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose MoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. Database and codes are publicly available at https://github.com/IntMeGroup/DFBench.
Chinese: 本研究提出了DFBench这一涵盖多种先进生成模型内容的深度伪造检测基准,并开发了MoA-DF方法,通过整合多个大型多模态模型实现了最先进的AI生成图像检测性能。
English: The study introduces DFBench, a comprehensive deepfake detection benchmark featuring diverse content from advanced generative models, and proposes MoA-DF, a novel method using multiple large multimodal models that achieves state-of-the-art performance in detecting AI-generated images.
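A combined-probability strategy over several agents can be sketched as weighted log-probability fusion; the specific combination rule used by MoA-DF may differ, so treat this as one plausible instantiation.

```python
import numpy as np

def mixture_of_agents_decision(agent_probs, weights=None):
    """Combine per-agent P(fake) estimates into one decision via
    (weighted) product-of-probabilities fusion in log space."""
    p = np.clip(np.asarray(agent_probs, dtype=float), 1e-6, 1 - 1e-6)
    w = np.ones_like(p) if weights is None else np.asarray(weights, float)
    log_fake = (w * np.log(p)).sum()
    log_real = (w * np.log(1 - p)).sum()
    m = max(log_fake, log_real)                    # stable normalization
    p_fake = np.exp(log_fake - m) / (np.exp(log_fake - m) + np.exp(log_real - m))
    return p_fake, p_fake > 0.5

probs = [0.8, 0.65, 0.3]    # three LMM "agents" scoring one image
print(mixture_of_agents_decision(probs))
```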
Authors:Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, Ying Shan
Abstract:
With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy utilizing prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed technologies, we present the HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks over advanced unified models. All codes will be made public at https://github.com/Tencent/HaploVLM.
中文摘要:本文提出HaploOmni单一多模态变换器,通过多模态预热策略和跨模态兼容技术实现高效训练,在图像视频理解与生成任务中取得优异性能。
English Summary: This paper introduces HaploOmni, a single multimodal transformer trained efficiently with a multimodal warmup strategy and cross-modal compatibility techniques to achieve competitive performance in image and video understanding and generation tasks.
Authors:Junyi Fang, Yuxun Chen, Yuxin Chen, Chen Zhang
Abstract:
The Multi-Armed Bandit (MAB) problem is challenging in non-stationary environments where reward distributions evolve dynamically. We introduce RAVEN-UCB, a novel algorithm that combines theoretical rigor with practical efficiency via variance-aware adaptation. It achieves tighter regret bounds than UCB1 and UCB-V, with gap-dependent regret of order $K \sigma_{\max}^2 \log T / \Delta$ and gap-independent regret of order $\sqrt{K T \log T}$. RAVEN-UCB incorporates three innovations: (1) variance-driven exploration using $\sqrt{\hat{\sigma}_k^2 / (N_k + 1)}$ in confidence bounds, (2) adaptive control via $\alpha_t = \alpha_0 / \log(t + \varepsilon)$, and (3) constant-time recursive updates for efficiency. Experiments across non-stationary patterns (distributional changes, periodic shifts, and temporary fluctuations) in synthetic and logistics scenarios demonstrate its superiority over state-of-the-art baselines, confirming theoretical and practical robustness.
中文: RAVEN-UCB是一种针对非平稳多臂老虎机问题的新算法,通过方差感知自适应实现了更严格的遗憾界,并在多种动态环境中展现出优越性能。
English: RAVEN-UCB is a novel algorithm for non-stationary multi-armed bandit problems that achieves tighter regret bounds through variance-aware adaptation and demonstrates superior performance in various dynamic environments.
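Putting the three ingredients together, the index can be sketched as the empirical mean plus an $\alpha_t$-scaled exploration bonus plus the variance term $\sqrt{\hat{\sigma}_k^2 / (N_k + 1)}$, with Welford-style constant-time updates. How the paper combines these terms exactly may differ; the sketch below is one plausible reading of the abstract.

```python
import numpy as np

def raven_ucb(pull, K, T, alpha0=1.0, eps=1e-8):
    """Variance-aware UCB sketch following the abstract:
    index = mean + alpha_t * sqrt(log t / N_k) + sqrt(var_k / (N_k + 1)),
    with alpha_t = alpha0 / log(t + eps)."""
    n = np.zeros(K); mean = np.zeros(K); m2 = np.zeros(K)  # Welford stats
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                                # play each arm once
        else:
            var = np.where(n > 1, m2 / np.maximum(n - 1, 1), 1.0)
            alpha_t = alpha0 / np.log(t + eps)
            idx = mean + alpha_t * np.sqrt(np.log(t) / n) + np.sqrt(var / (n + 1))
            k = int(np.argmax(idx))
        r = pull(k, t)
        n[k] += 1                                    # constant-time recursive update
        d = r - mean[k]; mean[k] += d / n[k]; m2[k] += d * (r - mean[k])
    return mean, n

rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.4])
mean_est, counts = raven_ucb(lambda k, t: rng.normal(means[k], 0.1), K=3, T=2000)
print(counts, np.round(mean_est, 2))
```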
Authors:Praneet Sai Madhu Surabhi, Dheeraj Reddy Mudireddy, Jian Tao
Abstract:
This paper presents ThinkTank, a comprehensive and scalable framework designed to transform specialized AI agent systems into versatile collaborative intelligence platforms capable of supporting complex problem-solving across diverse domains. ThinkTank systematically generalizes agent roles, meeting structures, and knowledge integration mechanisms by adapting proven scientific collaboration methodologies. Through role abstraction, generalization of meeting types for iterative collaboration, and the integration of Retrieval-Augmented Generation with advanced knowledge storage, the framework facilitates expertise creation and robust knowledge sharing. ThinkTank enables organizations to leverage collaborative AI for knowledge-intensive tasks while ensuring data privacy and security through local deployment, utilizing frameworks like Ollama with models such as Llama3.1. The ThinkTank framework is designed to deliver significant advantages in cost-effectiveness, data security, scalability, and competitive positioning compared to cloud-based alternatives, establishing it as a universal platform for AI-driven collaborative problem-solving. The ThinkTank code is available at https://github.com/taugroup/ThinkTank
中文: ThinkTank是一个可扩展框架,通过泛化角色、会议结构和知识整合机制,将专业AI代理系统转变为支持跨领域复杂问题解决的协作平台,并利用本地部署确保数据隐私和安全。
English: ThinkTank is a scalable framework that transforms specialized AI agents into collaborative platforms for complex problem-solving by generalizing roles, meeting structures, and knowledge integration while ensuring data privacy through local deployment.
Authors:Yin Fang, Qiao Jin, Guangzhi Xiong, Bowen Jin, Xianrui Zhong, Siru Ouyang, Aidong Zhang, Jiawei Han, Zhiyong Lu
Abstract:
Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI's o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at https://github.com/ncbi-nlp/cell-o1.
中文: 本研究提出了CellPuzzles基准任务,要求通过批次级推理在不同条件下分配唯一细胞类型,并开发了Cell-o1模型——一个通过监督微调和批次级奖励强化学习训练的70亿参数大语言模型,实现了最先进的性能表现。
English: The study introduces CellPuzzles, a benchmark task requiring batch-level reasoning to assign unique cell types across diverse conditions, and proposes Cell-o1, a 7B LLM that achieves state-of-the-art performance by leveraging supervised fine-tuning and reinforcement learning with batch-level rewards.
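The batch-level uniqueness constraint is what distinguishes CellPuzzles from independent annotation. Given per-cell type probabilities (synthetic in the sketch below), the constraint can be enforced exactly with the Hungarian algorithm; note that Cell-o1 itself reasons over the batch in text rather than solving an explicit assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Each cell (row) must receive a *distinct* cell type (column).
# Independent argmax can repeat types; the Hungarian algorithm maximizes
# total probability under the uniqueness constraint. Scores are synthetic.
types = ["T cell", "B cell", "NK cell", "monocyte"]
probs = np.array([
    [0.60, 0.25, 0.10, 0.05],
    [0.55, 0.30, 0.10, 0.05],   # argmax would also pick "T cell" here
    [0.10, 0.20, 0.60, 0.10],
    [0.20, 0.20, 0.15, 0.45],
])
rows, cols = linear_sum_assignment(-np.log(probs))  # maximize product of probs
for r, c in zip(rows, cols):
    print(f"cell {r} -> {types[c]}")
```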
Authors:Ahmad AlMughrabi, Umair Haroon, Ricardo Marques, Petia Radeva
Abstract:
Accurate food volume estimation is crucial for dietary monitoring, medical nutrition management, and food intake analysis. Existing 3D food volume estimation methods compute the food volume accurately but lack support for food portion selection. We present VolTex, a framework that improves the food object selection in food volume estimation. By allowing users to specify a target food item via text input for segmentation, our method enables the precise selection of specific food objects in real-world scenes. The segmented object is then reconstructed using the Neural Surface Reconstruction method to generate high-fidelity 3D meshes for volume computation. Extensive evaluations on the MetaFood3D dataset demonstrate the effectiveness of our approach in isolating and reconstructing food items for accurate volume estimation. The source code is accessible at https://github.com/GCVCG/VolTex.
Chinese: VolTex框架通过文本输入实现精准食物分割,并利用神经表面重建方法生成高保真三维网格,在MetaFood3D数据集上的广泛评估验证了其在食物体积估算中的有效性。
English: VolTex is a framework that enables precise food item selection through text input for segmentation and reconstructs them into high-fidelity 3D meshes using Neural Surface Reconstruction, achieving accurate volume estimation as validated on the MetaFood3D dataset.
Authors:Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, Jing Shao
Abstract:
Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the reasoning trajectories of LRMs from an information-theoretic perspective. By tracking how mutual information (MI) between intermediate representations and the correct answer evolves during LRM reasoning, we observe an interesting MI peaks phenomenon: the MI at specific generative steps exhibits a sudden and significant increase during the LRM's reasoning process. We theoretically analyze this phenomenon and show that as MI increases, the probability of the model's prediction error decreases. Furthermore, these MI peaks often correspond to tokens expressing reflection or transition, such as "Hmm", "Wait", and "Therefore," which we term thinking tokens. We then demonstrate that these thinking tokens are crucial for LRM's reasoning performance, while other tokens have minimal impact. Building on these analyses, we propose two simple yet effective methods to improve LRM's reasoning performance, by delicately leveraging these thinking tokens. Overall, our work provides novel insights into the reasoning mechanisms of LRMs and offers practical ways to improve their reasoning capabilities. The code is available at https://github.com/ChnQ/MI-Peaks.
中文: 研究表明大型推理模型在解题过程中会出现互信息峰值现象,尤其在"思考标记"(如"嗯"、"因此"等)处最为显著,这些标记对模型推理性能至关重要,可被巧妙利用来提升其推理能力。
English: This study reveals that large reasoning models exhibit sudden mutual information peaks during problem-solving, particularly at "thinking tokens" like "Hmm" or "Therefore," which are crucial for accurate predictions and can be leveraged to enhance reasoning performance.
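Given a per-step MI trace, peak steps can be flagged with a simple outlier rule and inspected for thinking tokens. The MI values below are made up for illustration, and the paper's MI estimation procedure is not reproduced here.

```python
import numpy as np

def find_mi_peaks(mi_trace, z=2.0):
    """Flag generation steps whose MI estimate is a z-score outlier —
    a simple peak detector, not the paper's exact criterion."""
    mi = np.asarray(mi_trace, float)
    return np.where(mi > mi.mean() + z * mi.std())[0]

tokens = ["The", "sum", "Hmm", "wait", "so", "Therefore", "42"]
mi = [0.1, 0.12, 0.55, 0.5, 0.15, 0.6, 0.2]   # synthetic MI trace
for i in find_mi_peaks(mi, z=1.0):
    print(i, tokens[i])                        # peaks land on thinking tokens
```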
Authors:Ahsan Baidar Bakht, Muhayy Ud Din, Sajid Javed, Irfan Hussain
Abstract:
Visual Object Tracking (VOT) is a fundamental task with widespread applications in autonomous navigation, surveillance, and maritime robotics. Despite significant advances in generic object tracking, maritime environments continue to present unique challenges, including specular water reflections, low-contrast targets, dynamically changing backgrounds, and frequent occlusions. These complexities significantly degrade the performance of state-of-the-art tracking algorithms, highlighting the need for domain-specific datasets. To address this gap, we introduce the Maritime Visual Tracking Dataset (MVTD), a comprehensive and publicly available benchmark specifically designed for maritime VOT. MVTD comprises 182 high-resolution video sequences, totaling approximately 150,000 frames, and includes four representative object classes: boat, ship, sailboat, and unmanned surface vehicle (USV). The dataset captures a diverse range of operational conditions and maritime scenarios, reflecting the real-world complexities of maritime environments. We evaluated 14 recent SOTA tracking algorithms on the MVTD benchmark and observed substantial performance degradation compared to their performance on general-purpose datasets. However, when fine-tuned on MVTD, these models demonstrate significant performance gains, underscoring the effectiveness of domain adaptation and the importance of transfer learning in specialized tracking contexts. The MVTD dataset fills a critical gap in the visual tracking community by providing a realistic and challenging benchmark for maritime scenarios. The dataset and source code can be accessed at https://github.com/AhsanBaidar/MVTD.
中文摘要:针对海上环境中视觉目标跟踪面临的特殊挑战,如水面反光、低对比度和动态背景等,我们推出了海上视觉跟踪数据集(MVTD),该数据集不仅能有效评估现有算法的性能,还能通过微调显著提升它们在专业场景中的跟踪效果。
English Summary: The Maritime Visual Tracking Dataset (MVTD) is introduced to address the unique challenges of visual object tracking in maritime environments, where existing algorithms struggle with reflections, low contrast, and dynamic backgrounds, but show significant improvement when fine-tuned on this specialized dataset.
Authors:Changyi Xiao, Mengdi Zhang, Yixin Cao
Abstract:
Recent studies, including DeepSeek-R1 and Kimi-k1.5, have demonstrated that reinforcement learning with rule-based, binary-valued reward functions can significantly enhance the reasoning capabilities of large language models. These models primarily utilize REINFORCE-based policy optimization techniques, such as REINFORCE with baseline and group relative policy optimization (GRPO). However, a key limitation remains: current policy optimization methods either neglect reward normalization or employ static normalization strategies, which fail to adapt to the dynamic nature of policy updates during training. This may result in unstable gradient estimates and hinder training stability. To address this issue, we propose Beta Normalization Policy Optimization (BNPO), a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated parameters. BNPO aligns the normalization with the changing policy distribution, enabling more precise and lower-variance gradient estimation, which in turn promotes stable training dynamics. We provide theoretical analysis demonstrating BNPO's variance-reducing properties and show that it generalizes both REINFORCE and GRPO under binary-valued reward settings. Furthermore, we introduce an advantage decomposition mechanism to extend BNPO's applicability to more complex reward systems. Experimental results confirm that BNPO achieves state-of-the-art performance among policy optimization methods on reasoning tasks. The code is available at https://github.com/changyi7231/BNPO.
Chinese: 本文提出Beta归一化策略优化(BNPO),这是一种通过Beta分布自适应归一化二元奖励的新方法,旨在稳定训练并增强大语言模型的推理能力,实现了最先进的性能。
English: This paper introduces Beta Normalization Policy Optimization (BNPO), a novel method that adaptively normalizes binary rewards using a Beta distribution to stabilize training and improve reasoning in large language models, achieving state-of-the-art performance.
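To make the adaptive normalization concrete, here is a minimal, illustrative sketch of normalizing binary rewards with a Beta distribution whose parameters track training; the decayed-count update rule and the names (`BetaNormalizer`, `decay`) are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

class BetaNormalizer:
    """Sketch: adaptively normalize binary rewards with a Beta distribution
    whose parameters are updated as the policy evolves (illustrative only)."""
    def __init__(self, alpha=1.0, beta=1.0, decay=0.99):
        self.alpha, self.beta, self.decay = alpha, beta, decay

    def update(self, rewards):
        # Decay old success/failure counts, then add the new batch's counts.
        r = np.asarray(rewards, dtype=float)
        self.alpha = self.decay * self.alpha + r.sum()
        self.beta = self.decay * self.beta + (1.0 - r).sum()

    def advantage(self, rewards):
        # Standardize rewards by the current Beta distribution's mean/std.
        a, b = self.alpha, self.beta
        mean = a / (a + b)
        std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
        return (np.asarray(rewards, dtype=float) - mean) / (std + 1e-8)

norm = BetaNormalizer()
batch = [1, 0, 1, 1, 0, 0, 0, 1]   # binary rule-based rewards
norm.update(batch)
print(norm.advantage(batch))
```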
Authors:Mingjie Wei, Xuemei Xie, Yutong Zhong, Guangming Shi
Abstract:
Action coordination in human structure is indispensable for the spatial constraints of 2D joints to recover 3D pose. Usually, action coordination is represented as a long-range dependence among body parts. However, there are two main challenges in modeling long-range dependencies. First, joints should not only be constrained by other individual joints but also be modulated by the body parts. Second, existing methods make networks deeper to learn dependencies between non-linked parts. They introduce uncorrelated noise and increase the model size. In this paper, we utilize a pyramid structure to better learn potential long-range dependencies. It can capture the correlation across joints and groups, which complements the context of the human sub-structure. In an effective cross-scale way, it captures the pyramid-structured long-range dependence. Specifically, we propose a novel Pyramid Graph Attention (PGA) module to capture long-range cross-scale dependencies. It concatenates information from various scales into a compact sequence, and then computes the correlation between scales in parallel. Combining PGA with graph convolution modules, we develop a Pyramid Graph Transformer (PGFormer) for 3D human pose estimation, which is a lightweight multi-scale transformer architecture. It encapsulates human sub-structures into self-attention by pooling. Extensive experiments show that our approach achieves lower error and smaller model size than state-of-the-art methods on Human3.6M and MPI-INF-3DHP datasets. The code is available at https://github.com/MingjieWe/PGFormer.
中文摘要:本文提出的金字塔图变换器(PGFormer)通过新型金字塔图注意力模块有效捕捉跨关节和身体部位的长程依赖关系,在3D人体姿态估计任务中以更低的误差和更小的模型尺寸优于现有方法。
English Summary: The paper introduces a Pyramid Graph Transformer (PGFormer) that uses a novel Pyramid Graph Attention module to efficiently capture long-range dependencies across body joints and parts, achieving superior 3D pose estimation with lower error and smaller model size than existing methods.
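A rough sketch of the cross-scale idea: pool joint features to coarser "group" and "body" scales, concatenate them into one compact sequence, and let each joint attend over all scales. This single-head, weight-free version only illustrates the data flow; the actual PGA module adds learned projections and graph structure, and the scale sizes here are assumptions.

```python
import torch
import torch.nn.functional as F

def pyramid_attention_sketch(x, scales=(17, 5, 1)):
    """Pool joints to several scales, concatenate into one compact sequence,
    and attend from each joint to all scales (simplified, no learned weights)."""
    B, J, D = x.shape
    seq = torch.cat([F.adaptive_avg_pool1d(x.transpose(1, 2), s).transpose(1, 2)
                     for s in scales], dim=1)        # [B, sum(scales), D]
    attn = torch.softmax(x @ seq.transpose(1, 2) / D ** 0.5, dim=-1)
    return attn @ seq                                # [B, J, D]

x = torch.randn(2, 17, 64)                 # 17 Human3.6M joints
print(pyramid_attention_sketch(x).shape)   # torch.Size([2, 17, 64])
```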
Authors:Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, Ao Luo, Jia Fu, Yufan Chen, Ruiping Liu, Yitian Shi, M. Saquib Sarfraz, Rainer Stiefelhagen
Abstract:
Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.
中文摘要:本文提出了首个微重力环境视频理解基准数据集MicroG-4M,通过4,759个标注视频片段填补了地球重力条件数据集的空白,支持动作识别、视频描述和视觉问答三大核心任务。
English Summary: This paper introduces MicroG-4M, the first benchmark dataset for video understanding in microgravity environments, addressing the gap in existing Earth-centric datasets by providing 4,759 annotated clips for action recognition, video captioning, and visual question answering tasks.
Authors:Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Abstract:
Cross-domain few-shot learning (CDFSL) aims to transfer knowledge from a data-sufficient source domain to data-scarce target domains. Although Vision Transformer (ViT) has shown superior capability in many vision tasks, its transferability against huge domain gaps in CDFSL is still under-explored. In this paper, we find an intriguing phenomenon: during the source-domain training, prompt tuning, as a common way to train ViT, could be harmful for the generalization of ViT in target domains, but setting them to random noises (i.e., random registers) could consistently improve target-domain performance. We then delve into this phenomenon for an interpretation. We find that learnable prompts capture domain information during the training on the source dataset, which views irrelevant visual patterns as vital cues for recognition. This can be viewed as a kind of overfitting and increases the sharpness of the loss landscapes. In contrast, random registers are essentially a novel way of perturbing attention for the sharpness-aware minimization, which helps the model find a flattened minimum in loss landscapes, increasing the transferability. Based on this phenomenon and interpretation, we further propose a simple but effective approach for CDFSL to enhance the perturbation on attention maps by adding random registers on the semantic regions of image tokens, improving the effectiveness and efficiency of random registers. Extensive experiments on four benchmarks validate our rationale and state-of-the-art performance. Codes and models are available at https://github.com/shuaiyi308/REAP.
Chinese: 本研究发现,在跨领域小样本学习中,使用随机寄存器替代可学习提示能通过平滑损失景观提升Vision Transformer的泛化能力,并提出在图像令牌语义区域添加随机扰动的新方法,实现了最优性能。
English: This study reveals that using random registers instead of learnable prompts in Vision Transformers enhances cross-domain few-shot learning by promoting flatter loss landscapes, and proposes an improved method that applies random perturbations to image token regions for state-of-the-art performance.
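A minimal sketch of the random-register idea, assuming registers are sampled fresh at each forward pass and simply concatenated to the ViT token sequence; the register count and noise scale are illustrative placeholders.

```python
import torch

def add_random_registers(tokens, num_registers=4, sigma=1.0):
    """Concatenate freshly sampled noise tokens (random registers) to the
    ViT input sequence instead of learnable prompts (sketch; count and
    scale are assumptions)."""
    B, N, D = tokens.shape
    registers = sigma * torch.randn(B, num_registers, D, device=tokens.device)
    return torch.cat([tokens, registers], dim=1)

x = torch.randn(2, 197, 768)           # [CLS] + 196 patch tokens
x_aug = add_random_registers(x)
print(x_aug.shape)                     # torch.Size([2, 201, 768])
```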
Authors:Ekaterina Grishina, Mikhail Gorbunov, Maxim Rakhuba
Abstract:
Large language models (LLMs) demonstrate impressive results in natural language processing tasks but require a significant amount of computational and memory resources. Structured matrix representations are a promising way for reducing the number of parameters of these models. However, it seems unrealistic to expect that weight matrices of pretrained models can be accurately represented by structured matrices without any fine-tuning. To overcome this issue, we utilize the fact that LLM output is invariant under certain orthogonal transformations of weight matrices. This insight can be leveraged to identify transformations that significantly improve the compressibility of weights within structured classes. The proposed approach is applicable to various types of structured matrices that support efficient projection operations. Code is available at https://github.com/GrishKate/ProcrustesGPT
中文: 大语言模型可通过利用保持输出不变的正交变换,采用结构化矩阵进行压缩,从而实现无需微调的高效参数削减。
English: Large language models can be compressed using structured matrices by leveraging orthogonal transformations that maintain output invariance, enabling efficient parameter reduction without fine-tuning.
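The invariance being exploited is easy to verify numerically for a pair of consecutive linear maps: inserting an orthogonal matrix Q between them leaves the composition unchanged, so Q can be chosen to make each factor more compressible. This toy check ignores the nonlinearities and normalizations of a real LLM, which the paper handles explicitly.

```python
import numpy as np

# Toy check of output invariance under an inserted orthogonal transform.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 32))   # first weight matrix
W2 = rng.standard_normal((16, 64))   # second weight matrix
x = rng.standard_normal(32)

Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))  # random orthogonal Q
W1_t, W2_t = Q.T @ W1, W2 @ Q                       # transformed weights

# W2 @ Q @ Q.T @ W1 == W2 @ W1, so the output is unchanged.
print(np.allclose(W2 @ (W1 @ x), W2_t @ (W1_t @ x)))  # True
```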
Authors:Peiding Wang, Li Zhang, Fang Liu, Yinghao Zhu, Wang Xu, Lin Shi, Xiaoli Lian, Minxiao Li, Bo Shen, An Fu
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in code editing, substantially enhancing software development productivity. However, the inherent complexity of code editing tasks forces existing approaches to rely on LLMs' autoregressive end-to-end generation, where decoding speed plays a critical role in efficiency. While inference acceleration techniques like speculative decoding are applied to improve the decoding efficiency, these methods fail to account for the unique characteristics of code editing tasks where changes are typically localized and existing code segments are reused. To address this limitation, we propose EfficientEdit, a novel method that improves LLM-based code editing efficiency through two key mechanisms based on speculative decoding: (1) effective reuse of original code segments while identifying potential edit locations, and (2) efficient generation of edit content via high-quality drafts from edit-oriented draft models and a dynamic verification mechanism that balances quality and acceleration. Experimental results show that EfficientEdit can achieve up to 10.38$\times$ and 13.09$\times$ speedup compared to standard autoregressive decoding in CanItEdit and CodeIF-Bench, respectively, outperforming state-of-the-art inference acceleration approaches by up to 90.6%. The code and data are available at https://github.com/zhu-zhu-ding/EfficientEdit.
中文摘要:EfficientEdit是一种新颖方法,通过基于推测解码重用原始代码段并生成高质量草稿,显著提升了基于大语言模型的代码编辑效率,相比现有方法实现了大幅加速。
English Summary: EfficientEdit is a novel method that enhances the efficiency of LLM-based code editing by leveraging speculative decoding to reuse original code segments and generate high-quality drafts, achieving significant speed improvements over existing approaches.
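A toy sketch of the reuse mechanism, under the assumption that most output tokens can be copied from the original code and only edit points need fresh decoding; `target_step` and `accept` stand in for the target model's decode and verification steps and are hypothetical names, not EfficientEdit's API.

```python
def speculative_edit(target_step, original_tokens, accept):
    """Propose tokens of the ORIGINAL code as drafts and keep them while the
    target model agrees, falling back to normal decoding at edit points."""
    out = []
    for tok in original_tokens:
        if accept(out, tok):               # verification: draft token accepted
            out.append(tok)                # reuse, no autoregressive step
        else:
            out.append(target_step(out))   # true edit: decode a new token
    return out

# Toy target model: keep the original code except 'b', which becomes 'B'.
print(speculative_edit(lambda ctx: "B", list("abc"), lambda ctx, t: t != "b"))
# -> ['a', 'B', 'c']
```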
Authors:Chunwei Tian, Kai Liu, Bob Zhang, Zhixiang Huang, Chia-Wen Lin, David Zhang
Abstract:
Stable consumer electronic systems can better assist traffic. Good traffic consumer electronic systems require collaborative work between traffic algorithms and hardware. However, the performance of popular traffic algorithms that contain deep-network-based vehicle detection methods is limited, because these networks learn relations in the data rather than differences across lighting conditions and occlusions. In this paper, we present a dynamic Transformer network for vehicle detection (DTNet). DTNet utilizes a dynamic convolution to guide a deep network to dynamically generate weights, enhancing the adaptability of the obtained detector. Taking relations among different kinds of information into account, a mixed attention mechanism based on channel attention and a Transformer is exploited to strengthen relations of channels and pixels to extract more salient information for vehicle detection. To account for differences within an image, a translation-variant convolution relies on spatial location information to refine the obtained structural information for vehicle detection. Experimental results illustrate that our DTNet is competitive for vehicle detection. Code of the proposed DTNet can be obtained at https://github.com/hellloxiaotian/DTNet.
中文摘要:本文提出了一种用于车辆检测的动态Transformer网络DTNet,通过动态卷积和混合注意力机制提升检测器的适应性并提取显著特征,实验结果表明该方法具有竞争优势。
English Summary: This paper introduces DTNet, a dynamic Transformer network for vehicle detection that uses dynamic convolution and a mixed attention mechanism to enhance adaptability and extract salient features, demonstrating competitive performance in experiments.
Authors:Renyang Liu, Wenjie Feng, Tianwei Zhang, Wei Zhou, Xueqi Cheng, See-Kiong Ng
Abstract:
With the surge and widespread application of image generation models, data privacy and content safety have become major concerns and attracted great attention from users, service providers, and policymakers. Machine unlearning (MU) is recognized as a cost-effective and promising means to address these challenges. Despite some advancements, image generation model unlearning (IGMU) still faces remarkable gaps in practice, e.g., unclear task discrimination and unlearning guidelines, lack of an effective evaluation framework, and unreliable evaluation metrics. These can hinder the understanding of unlearning mechanisms and the design of practical unlearning algorithms. We perform exhaustive assessments over existing state-of-the-art unlearning algorithms and evaluation standards, and discover several critical flaws and challenges in IGMU tasks. Driven by these limitations, we make several core contributions, to facilitate the comprehensive understanding, standardized categorization, and reliable evaluation of IGMU. Specifically, (1) We design CatIGMU, a novel hierarchical task categorization framework. It provides detailed implementation guidance for IGMU, assisting in the design of unlearning algorithms and the construction of testbeds. (2) We introduce EvalIGMU, a comprehensive evaluation framework. It includes reliable quantitative metrics across five critical aspects. (3) We construct DataIGM, a high-quality unlearning dataset, which can be used for extensive evaluations of IGMU, training content detectors for judgment, and benchmarking the state-of-the-art unlearning algorithms. With EvalIGMU and DataIGM, we discover that most existing IGMU algorithms cannot handle the unlearning well across different evaluation dimensions, especially for preservation and robustness. Code and models are available at https://github.com/ryliu68/IGMU.
中文: 本研究针对图像生成模型遗忘中的关键问题,提出了CatIGMU任务分类框架、EvalIGMU评估体系和DataIGM数据集,发现现有算法在保持性和鲁棒性方面存在明显不足。
English: The study addresses critical gaps in image generation model unlearning by introducing CatIGMU for task categorization, EvalIGMU for evaluation, and DataIGM for benchmarking, revealing that current algorithms struggle with preservation and robustness.
Authors:Yankai Chen, Yue Que, Xinni Zhang, Chen Ma, Irwin King
Abstract:
Learning vectorized embeddings is fundamental to many recommender systems for user-item matching. To enable efficient online inference, representation binarization, which embeds latent features into compact binary sequences, has recently shown significant promise in optimizing both memory usage and computational overhead. However, existing approaches primarily focus on numerical quantization, neglecting the associated information loss, which often results in noticeable performance degradation. To address these issues, we study the problem of graph representation binarization for efficient collaborative filtering. Our findings indicate that explicitly mitigating information loss at various stages of embedding binarization has a significant positive impact on performance. Building on these insights, we propose an enhanced framework, BiGeaR++, which specifically leverages supervisory signals from pseudo-positive samples, incorporating both real item data and latent embedding samples. Compared to its predecessor BiGeaR, BiGeaR++ introduces a fine-grained inference distillation mechanism and an effective embedding sample synthesis approach. Empirical evaluations across five real-world datasets demonstrate that the new designs in BiGeaR++ work seamlessly well with other modules, delivering substantial improvements of around 1%-10% over BiGeaR and thus achieving state-of-the-art performance compared to the competing methods. Our implementation is available at https://github.com/QueYork/BiGeaR-SS.
Chinese Summary: 本研究提出了BiGeaR++增强框架,通过伪正样本和蒸馏机制减少嵌入二值化过程中的信息损失,在五个真实数据集上相比现有方法实现了1%-10%的性能提升,达到了最先进的推荐系统性能。
English Summary: This study introduces BiGeaR++, an enhanced framework for graph representation binarization that reduces information loss through pseudo-positive samples and distillation mechanisms, achieving state-of-the-art performance with 1%-10% improvements over existing methods.
Authors:Changyi Xiao, Yixin Cao
Abstract:
Knowledge graph completion (KGC) can be framed as a 3-order binary tensor completion task. Tensor decomposition-based (TDB) models have demonstrated strong performance in KGC. In this paper, we provide a summary of existing TDB models and derive a general form for them, serving as a foundation for further exploration of TDB models. Despite the expressiveness of TDB models, they are prone to overfitting. Existing regularization methods merely minimize the norms of embeddings to regularize the model, leading to suboptimal performance. Therefore, we propose a novel regularization method for TDB models that addresses this limitation. The regularization is applicable to most TDB models and ensures tractable computation. Our method minimizes the norms of intermediate variables involved in the different ways of computing the predicted tensor. To support our regularization method, we provide a theoretical analysis that proves its effect in promoting low trace norm of the predicted tensor to reduce overfitting. Finally, we conduct experiments to verify the effectiveness of our regularization technique as well as the reliability of our theoretical analysis. The code is available at https://github.com/changyi7231/IVR.
中文: 本文提出了一种新的基于张量分解的知识图谱补全模型的正则化方法,通过最小化中间变量的范数来有效减少过拟合,并辅以理论分析和实验验证。
English: This paper introduces a novel regularization method for tensor decomposition-based knowledge graph completion models that minimizes the norms of intermediate variables to effectively reduce overfitting, supported by theoretical analysis and experimental validation.
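A hedged sketch of the regularization style described above for a CP-like scorer: penalize the norms of the intermediate elementwise products that arise from the different orders of computing the predicted tensor. The exact terms and weighting in the paper differ; this only illustrates the shape of such a loss.

```python
import torch

def ivr_style_reg(h, r, t):
    """Penalize norms of intermediate products from head-first,
    relation-first, and tail-first computations of a CP-style score
    (illustrative, not the paper's exact formula)."""
    return ((h * r).norm(p=2) ** 2 + (r * t).norm(p=2) ** 2 +
            (h * t).norm(p=2) ** 2) / h.shape[0]

h, r, t = torch.randn(128, 64), torch.randn(128, 64), torch.randn(128, 64)
print(ivr_style_reg(h, r, t).item())   # add to the task loss with a weight
```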
Authors:Shufan Qing, Anzhen Li, Qiandi Wang, Yuefeng Niu, Mingchen Feng, Guoliang Hu, Jinqiao Wu, Fengtao Nan, Yingchun Fan
Abstract:
Existing semantic SLAM systems in dynamic environments mainly identify dynamic regions through object detection or semantic segmentation methods. However, in certain highly dynamic scenarios, the detection boxes or segmentation masks cannot fully cover dynamic regions. Therefore, this paper proposes a robust and efficient GeneA-SLAM2 system that leverages depth variance constraints to handle dynamic scenes. Our method extracts dynamic pixels via depth variance and creates precise depth masks to guide the removal of dynamic objects. Simultaneously, an autoencoder is used to reconstruct keypoints, improving the genetic resampling keypoint algorithm to obtain more uniformly distributed keypoints and enhance the accuracy of pose estimation. Our system was evaluated on multiple highly dynamic sequences. The results demonstrate that GeneA-SLAM2 maintains high accuracy in dynamic scenes compared to current methods. Code is available at: https://github.com/qingshufan/GeneA-SLAM2.
中文: 本文提出GeneA-SLAM2系统,通过深度方差约束消除动态物体,并采用改进的关键点算法,在高度动态环境中实现更精确的位姿估计。
English: This paper introduces GeneA-SLAM2, a system that uses depth variance constraints to remove dynamic objects and an improved keypoint algorithm for more accurate pose estimation in highly dynamic environments.
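A minimal sketch of a depth-variance dynamic-pixel mask: flag pixels whose local depth variance exceeds a threshold. The window size and threshold here are arbitrary placeholders, not the system's tuned values.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def depth_variance_mask(depth, win=7, thresh=0.05):
    """Flag pixels with high local depth variance as dynamic (sketch)."""
    mean = uniform_filter(depth, win)            # local mean of depth
    mean_sq = uniform_filter(depth * depth, win) # local mean of depth^2
    var = np.maximum(mean_sq - mean * mean, 0.0) # Var = E[d^2] - E[d]^2
    return var > thresh

depth = np.random.rand(480, 640).astype(np.float32)
mask = depth_variance_mask(depth)
print(mask.mean())   # fraction of pixels flagged as dynamic
```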
Authors:Tibor Kubík, François Guibault, Michal Španěl, Hervé Lombaert
Abstract:
We introduce ToothForge, a spectral approach for automatically generating novel 3D teeth, effectively addressing the sparsity of dental shape datasets. By operating in the spectral domain, our method enables compact machine learning modeling, allowing the generation of high-resolution tooth meshes in milliseconds. However, generating shape spectra comes with the instability of the decomposed harmonics. To address this, we propose modeling the latent manifold on synchronized frequential embeddings. Spectra of all data samples are aligned to a common basis prior to the training procedure, effectively eliminating biases introduced by the decomposition instability. Furthermore, synchronized modeling removes the limiting factor imposed by previous methods, which require all shapes to share a common fixed connectivity. Using a private dataset of real dental crowns, we observe a greater reconstruction quality of the synthesized shapes, exceeding those of models trained on unaligned embeddings. We also explore additional applications of spectral analysis in digital dentistry, such as shape compression and interpolation. ToothForge facilitates a range of approaches at the intersection of spectral analysis and machine learning, with fewer restrictions on mesh structure. This makes it applicable for shape analysis not only in dentistry, but also in broader medical applications, where guaranteeing consistent connectivity across shapes from various clinics is unrealistic. The code is available at https://github.com/tiborkubik/toothForge.
中文摘要:ToothForge提出了一种光谱方法,通过同步频域嵌入有效解决牙齿形状数据集稀缺问题,无需固定网格连接即可快速生成高精度3D牙齿模型,在保持优异重建质量的同时拓展了在数字牙科及其他医疗领域的应用潜力。
English Summary: ToothForge introduces a spectral method for rapid 3D tooth generation that overcomes dataset limitations and mesh connectivity constraints by aligning spectral embeddings, achieving superior reconstruction quality and enabling broader medical applications.
Authors:Jan Robine, Marc Höftmann, Stefan Harmeling
Abstract:
What are the essential components of world models? How far do we get with world models that are not employing RNNs, transformers, discrete representations, and image reconstructions? This paper introduces SGF, a Simple, Good, and Fast world model that uses self-supervised representation learning, captures short-time dependencies through frame and action stacking, and enhances robustness against model errors through data augmentation. We extensively discuss SGF's connections to established world models, evaluate the building blocks in ablation studies, and demonstrate good performance through quantitative comparisons on the Atari 100k benchmark.
Chinese: 本文提出SGF这一简洁高效的世界模型,通过自监督学习和数据增强技术,在Atari 100k基准测试中无需复杂架构即实现了优异性能。
English: This paper introduces SGF, a simple and efficient world model that uses self-supervised learning and data augmentation to achieve strong performance on the Atari 100k benchmark without complex architectures.
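A minimal sketch of frame-and-action stacking: keep the last k frames and actions in a buffer and concatenate them into one model input, capturing short-time dependencies without recurrence. Buffer size and shapes are illustrative assumptions.

```python
import numpy as np
from collections import deque

class FrameActionStack:
    """Keep the last k frames and actions; concatenate them into a single
    world-model input (sketch of the stacking idea, not SGF's exact I/O)."""
    def __init__(self, k=4):
        self.frames, self.actions = deque(maxlen=k), deque(maxlen=k)

    def push(self, frame, action):
        self.frames.append(frame)
        self.actions.append(action)

    def state(self):
        return np.concatenate(list(self.frames)), np.array(self.actions)

s = FrameActionStack()
for t in range(6):
    s.push(np.random.rand(84 * 84), t % 3)   # flattened frame + discrete action
obs, acts = s.state()
print(obs.shape, acts)   # (28224,) and the last 4 actions
```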
Authors:Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
Abstract:
Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
Chinese: 本研究提出DBG评分法,通过比较模型自评分与黄金标准判断来量化大语言模型的自偏好偏差,揭示了响应风格和训练数据等影响因素,并从注意力机制角度探讨了其潜在成因。
English: This study introduces the DBG score to measure self-preference bias in LLMs by comparing a model's self-assigned scores against gold-standard judgments, revealing how factors like response style and training data influence bias while exploring its attention-based mechanisms.
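The DBG idea reduces to a simple difference; a sketch, assuming a mean aggregation over responses (the paper's exact aggregation may differ):

```python
def dbg_score(self_scores, gold_scores):
    """Self-preference bias as the mean gap between the judge's scores for
    its own responses and the gold judgments of those same responses."""
    return sum(s - g for s, g in zip(self_scores, gold_scores)) / len(self_scores)

# The judge rates its own answers higher than the gold judgments do:
print(dbg_score([8, 9, 7, 8], [7, 8, 7, 7]))   # 0.75 -> positive bias
```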
Authors:Chunwei Tian, Mingjian Song, Xiaopeng Fan, Xiangtao Zheng, Bob Zhang, David Zhang
Abstract:
Deep convolutional neural networks can extract more accurate structural information via deep architectures to obtain good performance in image super-resolution. However, within a single network architecture it is not easy to identify the effect of important layers, which can limit super-resolution performance. In this paper, we design a tree-guided CNN for image super-resolution (TSRNet). It uses a tree architecture to guide a deep network, enhancing the effect of key nodes and amplifying the relation of hierarchical information to improve the ability of recovering images. To prevent the obtained structural information from being insufficient, cosine transform techniques are used in the TSRNet to extract cross-domain information and improve the performance of image super-resolution. The adaptive Nesterov momentum optimizer (Adan) is applied to optimize parameters and boost the effectiveness of training a super-resolution model. Extensive experiments verify the superiority of the proposed TSRNet in restoring high-quality images. Its code can be obtained at https://github.com/hellloxiaotian/TSRNet.
中文: 提出的TSRNet采用树引导的CNN架构,结合余弦变换技术和自适应优化器,通过强化关键节点和跨域信息提取来提升图像超分辨率效果。
English: The proposed TSRNet employs a tree-guided CNN architecture with cosine transform techniques and an adaptive optimizer to enhance image super-resolution by strengthening key nodes and cross-domain information extraction.
Authors:Xuewen Luo, Fengze Yang, Fan Ding, Xiangbo Gao, Shuo Xing, Yang Zhou, Zhengzhong Tu, Chenxi Liu
Abstract:
Autonomous driving (AD) has achieved significant progress, yet single-vehicle perception remains constrained by sensing range and occlusions. Vehicle-to-Everything (V2X) communication addresses these limits by enabling collaboration across vehicles and infrastructure, but it also faces heterogeneity, synchronization, and latency constraints. Language models offer strong knowledge-driven reasoning and decision-making capabilities, but they are not inherently designed to process raw sensor streams and are prone to hallucination. We propose V2X-UniPool, the first framework that unifies V2X perception with language-based reasoning for knowledge-driven AD. It transforms multimodal V2X data into structured, language-based knowledge, organizes it in a time-indexed knowledge pool for temporally consistent reasoning, and employs Retrieval-Augmented Generation (RAG) to ground decisions in real-time context. Experiments on the real-world DAIR-V2X dataset show that V2X-UniPool achieves state-of-the-art planning accuracy and safety while reducing communication cost by more than 80\%, achieving the lowest overhead among evaluated methods. These results highlight the promise of bridging V2X perception and language reasoning to advance scalable and trustworthy driving. Our code is available at: https://github.com/Xuewen2025/V2X-UniPool
中文: V2X-UniPool通过将多模态车联网数据整合为结构化知识池,解决了自动驾驶中的感知局限和幻觉问题,在实现最先进的规划精度与安全性的同时,将通信成本降低超过80%。
English: V2X-UniPool is a unified framework that addresses perception limitations and hallucination in autonomous driving by integrating multimodal V2X data into a structured knowledge pool, achieving state-of-the-art planning accuracy and safety while cutting communication costs by more than 80%.
Authors:Haichen Wang, Liu Yang, Xinyuan Zhang, Haomin Yu, Ming Li, Jilin Hu
Abstract:
Passenger demand forecasting helps optimize vehicle scheduling, thereby improving urban efficiency. Recently, attention-based methods have been used to adequately capture the dynamic nature of spatio-temporal data. However, existing methods that rely on heuristic masking strategies cannot fully adapt to the complex spatio-temporal correlations, hindering the model from focusing on the right context. These works also overlook the high-level correlations that exist in the real world. Effectively integrating these high-level correlations with the original correlations is crucial. To fill this gap, we propose the Aggregation Differential Transformer (ADFormer), which offers new insights to demand forecasting promotion. Specifically, we utilize Differential Attention to capture the original spatial correlations and achieve attention denoising. Meanwhile, we design distinct aggregation strategies based on the nature of space and time. Then, the original correlations are unified with the high-level correlations, enabling the model to capture holistic spatio-temporal relations. Experiments conducted on taxi and bike datasets confirm the effectiveness and efficiency of our model, demonstrating its practical value. The code is available at https://github.com/decisionintelligence/ADFormer.
中文: 提出的聚合差分变换器(ADFormer)通过差分注意力和定制聚合策略整合原始与高层时空相关性,有效提升了乘客需求预测的准确性,并在实际数据集中验证了其优越性能。
English: The proposed Aggregation Differential Transformer (ADFormer) enhances passenger demand forecasting by integrating original and high-level spatio-temporal correlations through differential attention and tailored aggregation strategies, demonstrating superior performance on real-world datasets.
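A single-head sketch of the differential-attention component used for denoising: two softmax attention maps are computed from separate projections and subtracted so common-mode noise cancels. The projections and λ here are placeholders; ADFormer additionally unifies this with aggregated high-level correlations.

```python
import torch
import torch.nn.functional as F

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Subtract two attention maps so shared (noise) structure cancels
    (sketch; weights and lambda are illustrative)."""
    d = Wq1.shape[1]
    a1 = F.softmax((x @ Wq1) @ (x @ Wk1).transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = F.softmax((x @ Wq2) @ (x @ Wk2).transpose(-2, -1) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ (x @ Wv)

N, d = 32, 64
x = torch.randn(N, d)
Ws = [torch.randn(d, d) / d ** 0.5 for _ in range(5)]
print(differential_attention(x, *Ws).shape)   # torch.Size([32, 64])
```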
Authors:Ping Gong, Jiawei Yi, Shengnan Wang, Juncheng Zhang, Zewen Jin, Ouxiang Zhou, Ruibo Liu, Guanbin Xu, Youhui Bai, Bowen Ye, Kun Yuan, Tong Yang, Gong Zhang, Renhai Chen, Feng Wu, Cheng Li
Abstract:
Large Language Models (LLMs) have emerged as a pivotal research area, yet the attention module remains a critical bottleneck in LLM inference, even with techniques like KVCache to mitigate redundant computations. While various top-$k$ attention mechanisms have been proposed to accelerate LLM inference by exploiting the inherent sparsity of attention, they often struggle to strike a balance between efficiency and accuracy. In this paper, we introduce HATA (Hash-Aware Top-$k$ Attention), a novel approach that systematically integrates low-overhead learning-to-hash techniques into the Top-$k$ attention process. Unlike existing top-$k$ attention methods, which seek an absolute estimate of the qk score, typically at great cost, HATA maps queries and keys into binary hash codes and acquires the relative qk score order at a much lower cost, which is sufficient for realizing top-$k$ attention. Extensive experiments demonstrate that HATA achieves up to 7.2$\times$ speedup compared to vanilla full attention while maintaining model accuracy. In addition, HATA outperforms the state-of-the-art top-$k$ attention methods in both accuracy and efficiency across multiple mainstream LLM models and diverse tasks. HATA is open source at https://github.com/gpzlx1/HATA.
中文摘要:HATA提出了一种哈希感知的Top-k注意力机制,通过将查询和键高效映射为二进制编码,在保持模型精度的同时实现了高达7.2倍的加速效果。
English Summary: HATA introduces a hash-aware top-k attention mechanism that accelerates LLM inference by efficiently mapping queries and keys to binary codes, achieving up to 7.2× speedup while maintaining accuracy.
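A rough sketch of hash-based top-k selection, assuming a fixed random projection for hashing (HATA learns its hash functions): queries and keys become binary codes, keys are ranked by Hamming agreement, and full attention runs only over the selected top-k.

```python
import torch

def hash_topk_attention(q, k, v, proj, topk=8):
    """Rank keys by agreement of binary hash codes, then attend over the
    top-k only (sketch; the projection here is random, not learned)."""
    q_code = (q @ proj > 0).float()            # [1, bits]
    k_code = (k @ proj > 0).float()            # [N, bits]
    # Hamming similarity = number of agreeing bits.
    sim = (q_code * k_code + (1 - q_code) * (1 - k_code)).sum(-1)
    idx = sim.topk(topk).indices
    attn = torch.softmax(q @ k[idx].T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v[idx]

d, bits, N = 64, 128, 1024
proj = torch.randn(d, bits)
q, k, v = torch.randn(1, d), torch.randn(N, d), torch.randn(N, d)
print(hash_topk_attention(q, k, v, proj).shape)   # torch.Size([1, 64])
```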
Authors:Timo Osterburg, Franz Albers, Christopher Diehl, Rajesh Pushparaj, Torsten Bertram
Abstract:
The fusion of sensor data is essential for a robust perception of the environment in autonomous driving. Learning-based fusion approaches mainly use feature-level fusion to achieve high performance, but their complexity and hardware requirements limit their applicability in near-production vehicles. High-level fusion methods offer robustness with lower computational requirements. Traditional methods, such as the Kalman filter, dominate this area. This paper modifies the Adapted Kalman Filter (AKF) and proposes a novel transformer-based high-level object fusion method called HiLO. Experimental results demonstrate improvements of $25.9$ percentage points in $\textrm{F}_1$ score and $6.1$ percentage points in mean IoU. Evaluation on a new large-scale real-world dataset demonstrates the effectiveness of the proposed approaches. Their generalizability is further validated by cross-domain evaluation between urban and highway scenarios. Code, data, and models are available at https://github.com/rst-tu-dortmund/HiLO .
中文: 本文提出了一种名为HiLO的基于Transformer的高层目标融合方法,通过将F1分数提升25.9个百分点、平均交并比提升6.1个百分点,显著增强了自动驾驶的环境感知能力,并在城市与高速公路场景中验证了其通用性。
English: This paper introduces HiLO, a transformer-based high-level object fusion method that enhances autonomous driving perception by improving the F1 score by 25.9 percentage points and mean IoU by 6.1 percentage points, validated across urban and highway scenarios.
Authors:Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Abstract:
In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder. The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction. Our code will be released at https://github.com/CorrineQiu/Ego4D-LTA-Challenge-2025.
Chinese Summary: 本报告提出了一种新颖的三阶段框架,用于Ego4D长期动作预测任务,通过特征提取、动作识别和长期预测三个步骤,结合基础模型和大语言模型,在CVPR 2025挑战赛中取得第一名并创下最新技术水准。
English Summary: This report introduces a novel three-stage framework for the Ego4D Long-Term Action Anticipation task, which leverages foundation models for feature extraction, action recognition, and anticipation, achieving state-of-the-art results and winning first place at CVPR 2025.
Authors:Niklas Kormann, Masoud Ramuz, Zeeshan Nisar, Nadine S. Schaadt, Hendrik Annuth, Benjamin Doerr, Friedrich Feuerhake, Thomas Lampert, Johannes F. Lutzeyer
Abstract:
Graph Neural Networks (GNNs) have recently been found to excel in histopathology. However, an important histopathological task, where GNNs have not been extensively explored, is the classification of glomeruli health as an important indicator in nephropathology. This task presents unique difficulties, particularly for the graph construction, i.e., the identification of nodes, edges, and informative features. In this work, we propose a pipeline composed of different traditional and machine learning-based computer vision techniques to identify nodes, edges, and their corresponding features to form a heterogeneous graph. We then proceed to propose a novel heterogeneous GNN architecture for glomeruli classification, called HIEGNet, that integrates both glomeruli and their surrounding immune cells. Hence, HIEGNet is able to consider the immune environment of each glomerulus in its classification. Our HIEGNet was trained and tested on a dataset of Whole Slide Images from kidney transplant patients. Experimental results demonstrate that HIEGNet outperforms several baseline models and generalises best between patients among all baseline models. Our implementation is publicly available at https://github.com/nklsKrmnn/HIEGNet.git.
中文: 图神经网络在组织病理学中表现优异,提出的HIEGNet模型通过整合免疫细胞数据有效分类肾小球健康状况,在肾移植患者数据分析中优于基线模型。
English: Graph Neural Networks (GNNs) show strong performance in histopathology, and the proposed HIEGNet model effectively classifies glomeruli health by integrating immune cell data, outperforming baseline models in kidney transplant patient analysis.
Authors:Hao Yan, Handong Zheng, Hao Wang, Liang Yin, Xingchen Liu, Zhenbiao Cao, Xinxing Su, Zihao Chen, Jihao Wu, Minghui Liao, Chao Weng, Wei Chen, Yuliang Liu, Xiang Bai
Abstract:
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles
Chinese: 多模态大语言模型在推理任务上取得进展,但抽象视觉推理仍面临感知局限的挑战,为此我们开发了VisuRiddles基准和感知谜题合成器,通过细粒度感知训练数据显著提升了模型性能。
English: Recent advances in multimodal large language models have improved reasoning tasks, but abstract visual reasoning remains challenging due to perception limitations, prompting the development of the VisuRiddles benchmark and the Perceptual Riddle Synthesizer to enhance model performance through fine-grained perceptual training data.
Authors:Sining Chen, Yilei Shi, Xiao Xiang Zhu
Abstract:
Monocular height estimation is considered the most efficient and cost-effective means of 3D perception in remote sensing, and it has attracted much attention since the emergence of deep learning. While training neural networks requires a large amount of data, data with perfect labels are scarce and only available within developed regions. The trained models therefore lack generalizability, which limits the potential for large-scale application of existing methods. We tackle this problem for the first time, by introducing data with imperfect labels into training pixel-wise height estimation networks, including labels that are incomplete, inexact, and inaccurate compared to high-quality labels. We propose an ensemble-based pipeline compatible with any monocular height estimation network. Taking the challenges of noisy labels, domain shift, and long-tailed distribution of height values into consideration, we carefully design the architecture and loss functions to leverage the information concealed in imperfect labels using weak supervision through balanced soft losses and ordinal constraints. We conduct extensive experiments on two datasets with different resolutions, DFC23 (0.5 to 1 m) and GBH (3 m). The results indicate that the proposed pipeline outperforms baselines by achieving more balanced performance across various domains, leading to improvements in average root mean square error of up to 22.94% and 18.62% on DFC23 and GBH, respectively. The efficacy of each design component is validated through ablation studies. Code is available at https://github.com/zhu-xlab/weakim2h.
Chinese: 本研究提出了一种利用不完美标签训练单目高度估计网络的新方法,通过平衡软损失和有序约束解决数据稀缺问题并提升模型泛化能力,在多个数据集上实现了显著的误差降低。
English: This study introduces a novel pipeline that leverages imperfect labels to train monocular height estimation networks, addressing data scarcity and enhancing model generalizability through balanced soft losses and ordinal constraints, achieving significant error reductions on diverse datasets.
Authors:Sohan Patnaik, Milan Aggarwal, Sumit Bhatia, Balaji Krishnamurthy
Abstract:
LLMs such as GPT-4 have shown a remarkable ability to solve complex questions by generating step-by-step rationales. Prior works have utilized this capability to improve smaller and cheaper LMs (say, with 7B parameters). However, various practical constraints, such as copyright and legal issues, owing to lack of transparency in the pre-training data of large (often closed) models, prevent their use in commercial settings. Little focus has been given to improving the innate reasoning ability of smaller models without distilling information from larger LLMs. To address this, we propose COLLATE, a trainable framework that tunes a (small) LLM to generate those outputs from a pool of diverse rationales that selectively improves the downstream task. COLLATE enforces multiple instances of the same LLM to exhibit distinct behavior and employs them to generate rationales to obtain diverse outputs. The LLM is then tuned via preference optimization to choose the candidate rationale which maximizes the likelihood of the ground-truth answer. COLLATE outperforms several trainable and prompting baselines on 5 datasets across 3 domains: maths problem solving, natural language inference, and commonsense reasoning. We show the efficacy of COLLATE on LLMs from different model families across varying parameter scales (1B to 8B) and demonstrate the benefit of multiple rationale providers guided by the end task through ablations. Code is released here (https://github.com/Sohanpatnaik106/collate).
中文摘要:COLLATE框架通过训练小型语言模型从多样化输出中选择最优推理路径,在不依赖大模型的情况下显著提升了数学解题、自然语言推理和常识推理等多个领域的性能表现。
English Summary: The COLLATE framework enhances smaller language models' reasoning abilities by training them to select optimal rationales from diverse outputs, achieving superior performance across multiple domains without relying on larger models.
Authors:Gaoyang Dong, Zhicheng Zhang, Ping Sun, Minghui Zhang
Abstract:
Automated respiratory sound classification faces practical challenges from background noise and insufficient denoising in existing systems.
We propose an Adaptive Differential Denoising network that integrates noise suppression and pathological feature preservation via three innovations:
1) an Adaptive Frequency Filter with learnable spectral masks and soft shrinkage to eliminate noise while retaining diagnostic high-frequency components;
2) a Differential Denoise Layer using differential attention to reduce noise-induced variations through augmented sample comparisons;
3) a bias denoising loss jointly optimizing classification and robustness without clean labels.
Experiments on the ICBHI2017 dataset show that our method achieves a Score of 65.53\%, a 1.99\% improvement over the previous state-of-the-art method.
The code is available at https://github.com/deegy666/ADD-RSC
Chinese: 提出的自适应差分去噪网络通过结合自适应噪声抑制与病理特征保留技术,在ICBHI2017数据集上取得了65.53%的评分,较先前最优方法提升了1.99%。
English: The proposed Adaptive Differential Denoising network enhances respiratory sound classification by integrating adaptive noise suppression with pathological feature preservation, achieving a 65.53% score on the ICBHI2017 dataset and outperforming previous methods by 1.99%.
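A minimal sketch of innovation 1), an adaptive frequency filter with soft shrinkage: transform to the frequency domain, scale by a learnable spectral mask, soft-threshold small (noisy) coefficients, and invert. Shapes and the threshold λ are illustrative assumptions.

```python
import torch

def adaptive_frequency_filter(x, mask, lam=0.1):
    """FFT -> learnable spectral mask -> soft shrinkage -> inverse FFT
    (sketch of the filtering idea, not the paper's exact module)."""
    X = torch.fft.rfft(x, dim=-1)
    X = X * mask                                              # spectral mask
    X = torch.polar((X.abs() - lam).clamp(min=0.0), X.angle())  # soft shrink
    return torch.fft.irfft(X, n=x.shape[-1], dim=-1)

x = torch.randn(4, 16000)                          # batch of audio frames
mask = torch.nn.Parameter(torch.ones(8001))        # rfft bins = n // 2 + 1
print(adaptive_frequency_filter(x, mask).shape)    # torch.Size([4, 16000])
```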
Authors:Jiachen Liu, Rui Yu, Sili Chen, Sharon X. Huang, Hengkai Guo
Abstract:
3D plane reconstruction from a single image is a crucial yet challenging topic in 3D computer vision. Previous state-of-the-art (SOTA) methods have focused on training their system on a single dataset from either indoor or outdoor domain, limiting their generalizability across diverse testing data. In this work, we introduce a novel framework dubbed ZeroPlane, a Transformer-based model targeting zero-shot 3D plane detection and reconstruction from a single image, over diverse domains and environments. To enable data-driven models across multiple domains, we have curated a large-scale planar benchmark, comprising over 14 datasets and 560,000 high-resolution, dense planar annotations for diverse indoor and outdoor scenes. To address the challenge of achieving desirable planar geometry on multi-dataset training, we propose to disentangle the representation of plane normal and offset, and employ an exemplar-guided, classification-then-regression paradigm to learn plane and offset respectively. Additionally, we employ advanced backbones as image encoder, and present an effective pixel-geometry-enhanced plane embedding module to further facilitate planar reconstruction. Extensive experiments across multiple zero-shot evaluation datasets have demonstrated that our approach significantly outperforms previous methods on both reconstruction accuracy and generalizability, especially over in-the-wild data. Our code and data are available at: https://github.com/jcliu0428/ZeroPlane.
中文摘要:本文提出ZeroPlane这一基于Transformer的零样本三维平面重建框架,通过多数据集训练和创新的几何解耦方法,在跨领域场景中实现了卓越的重建性能。
English Summary: This paper introduces ZeroPlane, a Transformer-based framework for zero-shot 3D plane reconstruction from single images, which achieves superior performance across diverse domains through multi-dataset training and novel geometric disentanglement techniques.
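A hedged sketch of the exemplar-guided classification-then-regression idea for plane normals: classify over a bank of exemplar normals, then refine with a small regressed residual. The exemplars here are random parameters and the residual scale is arbitrary; ZeroPlane's actual heads and disentangled offset branch are more elaborate.

```python
import torch
import torch.nn as nn

class ClsThenReg(nn.Module):
    """Classify over K exemplar normals, then regress a residual (sketch)."""
    def __init__(self, d, K=32):
        super().__init__()
        self.exemplars = nn.Parameter(torch.randn(K, 3))
        self.cls_head = nn.Linear(d, K)
        self.res_head = nn.Linear(d, 3)

    def forward(self, feat):
        probs = self.cls_head(feat).softmax(-1)     # [B, K] class weights
        base = probs @ self.exemplars               # soft exemplar pick
        n = base + 0.1 * self.res_head(feat)        # refine with residual
        return nn.functional.normalize(n, dim=-1)   # unit plane normal

feat = torch.randn(4, 256)
print(ClsThenReg(256)(feat).shape)   # torch.Size([4, 3])
```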
Authors:Minghao Liu, Catherine Zhao, Nathan Zhou
Abstract:
This project develops an online, inductive recommendation system for newly listed products on e-commerce platforms, focusing on suggesting relevant new items to customers as they purchase other products. Using the Amazon Product Co-Purchasing Network Metadata dataset, we construct a co-purchasing graph where nodes represent products and edges capture co-purchasing relationships. To address the challenge of recommending new products with limited information, we apply a modified GraphSAGE method for link prediction. This inductive approach leverages both product features and the existing co-purchasing graph structure to predict potential co-purchasing relationships, enabling the model to generalize to unseen products. As an online method, it updates in real time, making it scalable and adaptive to evolving product catalogs. Experimental results demonstrate that our approach outperforms baseline algorithms in predicting relevant product links, offering a promising solution for enhancing the relevance of new product recommendations in e-commerce environments. All code is available at https://github.com/cse416a-fl24/final-project-l-minghao_z-catherine_z-nathan.git.
Chinese: 该项目开发了一种在线归纳推荐系统,采用改进的GraphSAGE方法分析共购数据,能在信息有限的情况下有效推荐新品,在电子商务环境中优于基线算法。
English: This project creates an online inductive recommendation system using a modified GraphSAGE method on co-purchasing data to effectively suggest new products with limited information, outperforming baseline approaches in e-commerce settings.
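A compact sketch of the inductive building block: a mean-aggregator GraphSAGE layer plus a dot-product link score. The dense adjacency and single layer are simplifications of the project's modified GraphSAGE.

```python
import torch
import torch.nn as nn

class SAGELayer(nn.Module):
    """Mean-aggregator GraphSAGE layer for inductive link prediction on a
    co-purchasing graph (simplified sketch)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(2 * d_in, d_out)

    def forward(self, h, adj):
        # Mean of neighbor features; clamp guards isolated nodes.
        neigh = adj @ h / adj.sum(1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(torch.cat([h, neigh], dim=-1)))

N, d = 100, 32
h, adj = torch.randn(N, d), (torch.rand(N, N) < 0.05).float()
z = SAGELayer(d, d)(h, adj)
score = (z[3] * z[7]).sum()   # dot-product co-purchase score, products 3 & 7
print(score.item())
```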
Authors:Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
Abstract:
Misdiagnosis causes significant harm to healthcare systems worldwide, leading to increased costs and patient risks. MedRAG is a smart multimodal healthcare copilot equipped with powerful large language model (LLM) reasoning, designed to enhance medical decision-making. It supports multiple input modalities, including non-intrusive voice monitoring, general medical queries, and electronic health records. MedRAG provides recommendations on diagnosis, treatment, medication, and follow-up questioning. Leveraging retrieval-augmented generation enhanced by knowledge graph-elicited reasoning, MedRAG retrieves and integrates critical diagnostic insights, reducing the risk of misdiagnosis. It has been evaluated on both public and private datasets, outperforming existing models and offering more specific and accurate healthcare assistance. A demonstration video of MedRAG is available at: https://www.youtube.com/watch?v=PNIBDMYRfDM. The source code is available at: https://github.com/SNOWTEAM2023/MedRAG.
Chinese Summary: MedRAG是一款多模态医疗助手,通过增强推理能力整合语音监控和电子病历等输入信息,有效降低误诊风险,在公开和私有数据集上均表现优异。
English Summary: MedRAG is a multimodal healthcare copilot using advanced reasoning to reduce misdiagnosis risks by integrating diagnostic insights from various inputs like voice monitoring and medical records, demonstrating superior performance on datasets.
Authors:Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng, Lin Lu, Nay Oo, Shuicheng Yan, Bryan Hooi
Abstract:
Computer-Use Agents (CUAs) with full system access enable powerful task automation but pose significant security and privacy risks due to their ability to manipulate files, access user data, and execute arbitrary commands. While prior work has focused on browser-based agents and HTML-level attacks, the vulnerabilities of CUAs remain underexplored. In this paper, we investigate Visual Prompt Injection (VPI) attacks, where malicious instructions are visually embedded within rendered user interfaces, and examine their impact on both CUAs and Browser-Use Agents (BUAs). We propose VPI-Bench, a benchmark of 306 test cases across five widely used platforms, to evaluate agent robustness under VPI threats. Each test case is a variant of a web platform, designed to be interactive, deployed in a realistic environment, and containing a visually embedded malicious prompt. Our empirical study shows that current CUAs and BUAs can be deceived at rates of up to 51% and 100%, respectively, on certain platforms. The experimental results also indicate that system prompt defenses offer only limited improvements. These findings highlight the need for robust, context-aware defenses to ensure the safe deployment of multimodal AI agents in real-world environments. The code and dataset are available at: https://github.com/cua-framework/agents
中文: 具备全系统访问权限的计算机使用代理易受视觉提示注入攻击,用户界面中嵌入的恶意指令可使代理受骗率高达51%(计算机代理)和100%(浏览器代理),尽管现有系统提示防御效果有限,仍需开发强健的防护机制。
English: Computer-Use Agents with full system access are vulnerable to Visual Prompt Injection attacks, where malicious instructions embedded in user interfaces can deceive agents at rates up to 51% for CUAs and 100% for Browser-Use Agents, highlighting the need for robust defenses despite limited protection from current system prompts.
Authors:Lingwei Dang, Ruizhi Shao, Hongwen Zhang, Wei Min, Yebin Liu, Qingyao Wu
Abstract:
Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature aligning, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at https://github.com/Droliven/SViMo_project.
中文: 本文提出了一种创新框架,通过同步扩散过程融合视觉先验与动态约束,无需依赖预定义模型即可同步生成手物交互视频与运动,显著提升了物理合理性与泛化能力。
English: This paper introduces a novel framework that integrates visual priors and dynamic constraints through synchronized diffusion to simultaneously generate hand-object interaction videos and motions, enhancing physical plausibility and generalization without relying on predefined models.
Authors:Shuang Li, Jiaxu Leng, Changjiang Kuang, Mingpi Tan, Xinbo Gao
Abstract:
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP employs a joint fine-tuning strategy for the visual encoder and the prompt learner to effectively generate modality-shared text prompts and align them with visual features from different modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at https://github.com/Visuang/VLD.
中文摘要:VLD框架通过模态不变语言提示和时空提示模块,生成共享文本提示并整合时空特征,在跨模态视频行人重识别中弥合模态差异,实现了最先进的性能。
English Summary: The VLD framework introduces invariant-modality language prompting and spatial-temporal prompting modules to generate shared text prompts and integrate spatiotemporal features, achieving state-of-the-art performance in cross-modality video person re-identification by bridging modality gaps.
Authors:Maryam Berijanian, Kuldeep Singh, Amin Sehati
Abstract:
Entity relationship classification remains a challenging task in information extraction, especially in scenarios with limited labeled data and complex relational structures. In this study, we conduct a comparative analysis of three distinct AI agent architectures designed to perform relation classification using large language models (LLMs). The agentic architectures explored include (1) reflective self-evaluation, (2) hierarchical task decomposition, and (3) a novel multi-agent dynamic example generation mechanism, each leveraging different modes of reasoning and prompt adaptation. In particular, our dynamic example generation approach introduces real-time cooperative and adversarial prompting. We systematically compare their performance across multiple domains and model backends. Our experiments demonstrate that multi-agent coordination consistently outperforms standard few-shot prompting and approaches the performance of fine-tuned models. These findings offer practical guidance for the design of modular, generalizable LLM-based systems for structured relation extraction. The source codes and dataset are available at https://github.com/maryambrj/ALIEN.git.
中文: 本研究比较了三种基于大语言模型的关系分类智能体架构,发现多智能体协作方法持续优于标准少样本提示,并接近微调模型性能,为模块化关系提取系统提供了实用指导。
English: This study compares three AI agent architectures for entity relationship classification using large language models, finding that multi-agent coordination outperforms standard few-shot prompting and approaches fine-tuned model performance, offering practical guidance for modular LLM systems.
Authors:Nurislam Tursynbek, Hastings Greer, Basar Demir, Marc Niethammer
Abstract:
Diffusion models, while trained for image generation, have emerged as powerful foundational feature extractors for downstream tasks. We find that off-the-shelf diffusion models, trained exclusively to generate natural RGB images, can identify semantically meaningful correspondences in medical images. Building on this observation, we propose to leverage diffusion model features as a similarity measure to guide deformable image registration networks. We show that common intensity-based similarity losses often fail in challenging scenarios, such as when certain anatomies are visible in one image but absent in another, leading to anatomically inaccurate alignments. In contrast, our method identifies true semantic correspondences, aligning meaningful structures while disregarding those not present across images. We demonstrate superior performance of our approach on two tasks: multimodal 2D registration (DXA to X-Ray) and monomodal 3D registration (brain-extracted to non-brain-extracted MRI). Code: https://github.com/uncbiag/dgir
中文摘要:原本用于图像生成的扩散模型能够有效识别医学图像中的语义对应关系,通过关注有意义的解剖结构并忽略无关特征,实现了更精准的可变形图像配准。
English Summary: Diffusion models trained for image generation can effectively identify semantic correspondences in medical images, enabling more accurate deformable image registration by focusing on meaningful anatomical structures while ignoring irrelevant features.
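The core substitution is a similarity loss computed in feature space rather than on raw intensities; a minimal sketch, assuming feature maps have already been extracted from both images by a frozen diffusion model:

```python
import torch
import torch.nn.functional as F

def feature_similarity_loss(feat_moving, feat_fixed):
    """One minus mean cosine similarity between warped-moving and fixed
    feature maps (sketch of a feature-space registration loss)."""
    cos = F.cosine_similarity(feat_moving, feat_fixed, dim=1)  # [B, H, W]
    return 1.0 - cos.mean()

f1, f2 = torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32)
print(feature_similarity_loss(f1, f2).item())
```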
Authors:Wenhao Tang, Rong Qin, Heng Fang, Fengtao Zhou, Hao Chen, Xiang Li, Ming-Ming Cheng
Abstract:
Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL method called ABMILX. It mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (<10 RTX3090 hours). We show the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at https://github.com/DearCaat/E2E-WSI-ABMILX.
中文: 本文提出ABMILX这一新型多示例学习方法,解决了计算病理学中端到端学习的优化难题,在多个基准测试中高效超越现有最优模型。
English: This paper introduces ABMILX, a novel multiple instance learning method that addresses optimization challenges in end-to-end learning for computational pathology, outperforming state-of-the-art models efficiently across multiple benchmarks.
Authors:Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang
Abstract:
Recent advanced large reasoning models (LRMs) leverage extended chain-of-thought (CoT) reasoning to solve complex tasks, achieving state-of-the-art performance. Despite their success, we identify a critical issue: a substantial portion of simple tasks solved by LRMs can also be addressed by non-reasoning LLMs using significantly fewer tokens, indicating that complex reasoning may not always be necessary. To address this, we systematically analyze the reasoning trajectories of LRMs and present a method that utilizes the identified paradigms and an LLM judge to classify these trajectories as either Redundant Reasoning or Essential Reasoning. We then introduce OThink-R1, a method that prunes redundant reasoning steps while preserving logical validity. OThink-R1 dynamically employs the non-thinking mode (fast-thinking) for straightforward problems while engaging in deliberate thinking (slow-thinking) for complex problems. Experiments across mathematical and question-answering tasks demonstrate that OThink-R1 reduces reasoning redundancy by almost 23\% on average without compromising accuracy, offering practical guidelines for efficient reasoning models. The code is available at https://github.com/AgenticIR-Lab/OThink-R1.
中文: 近期大型推理模型常对简单任务进行不必要的复杂推理,因此开发了OThink-R1方法,通过修剪冗余推理步骤并动态切换快慢思考模式,在保持准确性的同时将推理冗余度平均降低近23%。
English: Recent large reasoning models often use unnecessary complex reasoning for simple tasks, so OThink-R1 was developed to prune redundant steps and switch between fast and slow thinking modes, reducing reasoning redundancy by nearly 23% without losing accuracy.
Authors:Yongxian Liu, Boyang Li, Ting Liu, Zaiping Lin, Wei An
Abstract:
Infrared small target detection is a challenging task due to its unique characteristics (e.g., small, dim, shapeless and changeable). Recently published CNN-based methods have achieved promising performance with heavy feature extraction and fusion modules. To achieve efficient and effective detection, we propose a recurrent reusable-convolution attention network (RRCA-Net) for infrared small target detection. Specifically, RRCA-Net incorporates a reusable-convolution block (RuCB) in a recurrent manner without introducing extra parameters. With the help of the repetitive iteration in RuCB, the high-level information of small targets in the deep layers can be well maintained and further refined. Then, a dual interactive attention aggregation module (DIAAM) is proposed to promote the mutual enhancement and fusion of refined information. In this way, RRCA-Net can both achieve high-level feature refinement and enhance the correlation of contextual information between adjacent layers. Moreover, to achieve steady convergence, we design a target characteristic inspired loss function (DpT-k loss) by integrating physical and mathematical constraints. Experimental results on three benchmark datasets (e.g., NUAA-SIRST, IRSTD-1k, DenseSIRST) demonstrate that our RRCA-Net achieves performance comparable to state-of-the-art methods while maintaining a small number of parameters, and acts as a plug-and-play module that introduces consistent performance improvements for several popular IRSTD methods. Our code will be available soon at https://github.com/yongxianLiu/.
中文: 提出的RRCA-Net通过循环使用卷积块和双交互注意力模块,在保持参数极少的同时有效检测红外小目标,并达到先进性能水平。
English: The proposed RRCA-Net utilizes reusable convolution blocks and a dual interactive attention module to efficiently detect infrared small targets while maintaining minimal parameters and achieving state-of-the-art performance.
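The core trick of reusing one convolution block recurrently, so that refinement depth grows without new parameters, can be sketched in a few lines of PyTorch. The block layout and residual form below are assumptions for illustration; the paper's RuCB may differ in detail.

```python
import torch
import torch.nn as nn

class ReusableConvBlock(nn.Module):
    """One conv block whose weights are reused across recurrent iterations,
    so extra refinement steps add no extra parameters (a sketch of the
    RuCB idea, not the paper's exact block)."""

    def __init__(self, channels: int, iterations: int = 3):
        super().__init__()
        self.iterations = iterations
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.iterations):
            x = x + self.block(x)  # residual reuse preserves small-target features
        return x
```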
Authors:Seulgi Kim, Ghazal Kaviani, Mohit Prabhushankar, Ghassan AlRegib
Abstract:
Action anticipation, the task of predicting future actions from partially observed videos, is crucial for advancing intelligent systems. Unlike action recognition, which operates on fully observed videos, action anticipation must handle incomplete information. Hence, it requires temporal reasoning, and inherent uncertainty handling. While recent advances have been made, traditional methods often focus solely on visual modalities, neglecting the potential of integrating multiple sources of information. Drawing inspiration from human behavior, we introduce \textit{Multi-level and Multi-modal Action Anticipation (m\&m-Ant)}, a novel multi-modal action anticipation approach that combines both visual and textual cues, while explicitly modeling hierarchical semantic information for more accurate predictions. To address the challenge of inaccurate coarse action labels, we propose a fine-grained label generator paired with a specialized temporal consistency loss function to optimize performance. Extensive experiments on widely used datasets, including Breakfast, 50 Salads, and DARai, demonstrate the effectiveness of our approach, achieving state-of-the-art results with an average anticipation accuracy improvement of 3.08\% over existing methods. This work underscores the potential of multi-modal and hierarchical modeling in advancing action anticipation and establishes a new benchmark for future research in the field. Our code is available at: https://github.com/olivesgatech/mM-ant.
中文摘要:本文提出m&m-Ant多模态方法,通过融合视觉文本线索与分层语义建模,将动作预测准确率较现有方法平均提升3.08%。
English Summary: The paper introduces m&m-Ant, a novel multi-modal approach that integrates visual and textual cues with hierarchical modeling to improve action anticipation accuracy by 3.08% over existing methods.
Authors:Xiaoyan Zhao, Juntao You, Yang Zhang, Wenjie Wang, Hong Cheng, Fuli Feng, See-Kiong Ng, Tat-Seng Chua
Abstract:
Personalizing large language models (LLMs) for individual users has become increasingly important as they are progressively integrated into real-world applications to support users' daily lives. However, existing personalization approaches often fail to distinguish which components of model predictions and training data truly reflect user preferences, leading to superficial personalization alignment. In this paper, we introduce NextQuill, a novel LLM personalization alignment framework grounded in causal preference modeling. We approach personalization from a causal perspective, treating both model predictions and ground-truth data generation as outcomes influenced by user preferences, along with other factors. We define the true preference effect as the causal impact of user history (which reflects preferences) on each token prediction or data generation instance, estimated through causal intervention techniques. Building on this insight, NextQuill introduces two complementary alignment strategies: (1) aligning model-internal causal preference effects on predictions with those reflected in ground-truth data, rather than indiscriminately fitting predictions, and (2) focusing on fitting preference-bearing tokens identified via ground-truth data preference effects, rather than treating all tokens uniformly. By integrating these strategies, NextQuill shifts the alignment process toward learning from causal preference effects, facilitating more effective and personalized adaptation. Experiments across multiple personalization benchmarks demonstrate that NextQuill significantly improves personalization quality, offering a principled, causal foundation for LLM personalization. Our codes are available on https://github.com/juntaoyou/NextQuill.
中文摘要:NextQuill通过因果偏好建模框架,在个性化对齐中聚焦真实用户偏好影响的关键词元,实现了更精准的大语言模型个性化适配,显著提升了定制化效果。
English Summary: NextQuill introduces a causal preference modeling framework that enhances LLM personalization by aligning model predictions with true user preferences through targeted token-level interventions, significantly improving adaptation quality.
Authors:Qin Xie, Qinghua Zhang, Shuyin Xia
Abstract:
Data sampling enhances classifier efficiency and robustness through data compression and quality improvement. Recently, sampling methods based on granular-balls (GBs) have shown promising performance in generality and noisy classification tasks. However, some limitations remain, including the absence of borderline sampling strategies and issues with class boundary blurring or shrinking due to overlap between GBs. In this paper, an approximate borderline sampling method using GBs is proposed for classification tasks. First, a restricted diffusion-based GB generation (RD-GBG) method is proposed, which prevents overlap between GBs through constrained expansion and preserves their precise geometric representation via redefined GBs. Second, based on the concept of the heterogeneous nearest neighbor, a GB-based approximate borderline sampling (GBABS) method is proposed, which is the first general sampling method capable of both borderline sampling and improving the quality of class-noise datasets. Additionally, since RD-GBG incorporates noise detection and GBABS focuses on borderline samples, GBABS performs outstandingly on class-noise datasets without the need for an optimal purity threshold. Experimental results demonstrate that the proposed methods outperform the GB-based sampling method and several representative sampling methods. Our source code is publicly available at https://github.com/CherylTse/GBABS.
中文:本文提出一种基于粒球的近似边界采样方法,通过受限扩散避免粒球重叠,无需纯度阈值即可提升噪声数据集的分类性能。
English: This paper introduces a granular-ball-based approximate borderline sampling method that prevents overlaps through restricted diffusion and enhances classification performance on noisy datasets without requiring purity thresholds.
Authors:Liang Li, Jianli Zhao, Sheng Fang, Siyu Chen, Hui Sun
Abstract:
Hyperspectral images (HSIs) are often degraded by complex mixed noise during acquisition and transmission, making effective denoising essential for subsequent analysis. Recent hybrid approaches that bridge model-driven and data-driven paradigms have shown great promise. However, most of these approaches lack effective alternation between different priors or modules, resulting in loosely coupled regularization and insufficient exploitation of their complementary strengths. Inspired by tensor robust principal component analysis (TRPCA), we propose a novel deep unfolding network (DU-TRPCA) that enforces stage-wise alternation between two tightly integrated modules: low-rank and sparse. The low-rank module employs thresholded tensor singular value decomposition (t-SVD), providing a widely adopted convex surrogate for tensor low-rankness and has been demonstrated to effectively capture the global spatial-spectral structure of HSIs. The Top-K sparse transformer module adaptively imposes sparse constraints, directly matching the sparse regularization in TRPCA and enabling effective removal of localized outliers and complex noise. This tightly coupled architecture preserves the stage-wise alternation between low-rank approximation and sparse refinement inherent in TRPCA, while enhancing representational capacity through attention mechanisms. Extensive experiments on synthetic and real-world HSIs demonstrate that DU-TRPCA surpasses state-of-the-art methods under severe mixed noise, while offering interpretability benefits and stable denoising dynamics inspired by iterative optimization. Code is available at https://github.com/liangli97/TRPCA-Deep-Unfolding-HSI-Denoising.
中文: 提出的DU-TRPCA深度展开网络通过阶段性交替紧密整合低秩与稀疏模块,在有效去除高光谱图像混合噪声的同时超越了现有最优方法,并保持了可解释性优势。
English: The proposed DU-TRPCA deep unfolding network tightly integrates low-rank and sparse modules through stage-wise alternation, effectively removing mixed noise in hyperspectral images while surpassing state-of-the-art methods and maintaining interpretability.
Authors:Xueqi Cheng, Minxing Zheng, Shixiang Zhu, Yushun Dong
Abstract:
Model extraction attacks aim to replicate the functionality of a black-box model through query access, threatening the intellectual property (IP) of machine-learning-as-a-service (MLaaS) providers. Defending against such attacks is challenging, as it must balance efficiency, robustness, and utility preservation in real-world scenarios. Despite recent advances, most existing defenses presume that attacker queries contain out-of-distribution (OOD) samples, enabling them to detect and disrupt suspicious inputs. However, this assumption is increasingly unreliable, as modern models are trained on diverse datasets and attackers often operate under limited query budgets. As a result, the effectiveness of these defenses is significantly compromised in realistic deployment scenarios. To address this gap, we propose MISLEADER (enseMbles of dIStiLled modEls Against moDel ExtRaction), a novel defense strategy that does not rely on OOD assumptions. MISLEADER formulates model protection as a bilevel optimization problem that simultaneously preserves predictive fidelity on benign inputs and reduces extractability by potential clone models. Our framework combines data augmentation to simulate attacker queries with an ensemble of heterogeneous distilled models to improve robustness and diversity. We further provide a tractable approximation algorithm and derive theoretical error bounds to characterize defense effectiveness. Extensive experiments across various settings validate the utility-preserving and extraction-resistant properties of our proposed defense strategy. Our code is available at https://github.com/LabRAI/MISLEADER.
Chinese: MISLEADER是一种新型模型提取攻击防御策略,通过双层优化问题将保护建模为在保持良性输入预测精度的同时降低克隆模型可提取性,结合数据增强和异构蒸馏模型集成,无需依赖分布外假设即可实现有效防护。
English: MISLEADER is a novel defense strategy against model extraction attacks that formulates protection as a bilevel optimization problem, combining data augmentation with an ensemble of distilled models to preserve utility while resisting cloning without relying on out-of-distribution assumptions.
Authors:Andre He, Daniel Fried, Sean Welleck
Abstract:
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter -- the number of updates per batch -- that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves competitive performance to DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation at https://github.com/AndreHe02/rewarding-unlikely-release
中文:强化学习算法如GRPO可能仅优化语言模型现有解题能力的分布,但通过引入罕见解奖励可纠正这种偏差,在定理证明中提升模型表现。
English: Reinforcement learning algorithms like GRPO may only refine a language model's existing problem-solving distribution rather than expanding its capabilities, but introducing an unlikeliness reward can counteract this bias and improve performance in theorem proving.
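A minimal sketch of the unlikeliness-reward idea: within a GRPO group, correct solutions that the policy currently assigns low probability receive a larger reward, counteracting the rank bias toward already-likely trajectories. The rank-based bonus below is an assumed form for illustration, not the paper's exact formula.

```python
import torch

def unlikeliness_adjusted_rewards(rewards: torch.Tensor,
                                  logprobs: torch.Tensor,
                                  beta: float = 0.5) -> torch.Tensor:
    """Up-weight correct-but-unlikely trajectories within a GRPO group.

    rewards:  (G,) binary correctness rewards for G sampled solutions.
    logprobs: (G,) sequence log-probabilities under the current policy.
    The bonus grows with the trajectory's rank in descending likelihood,
    so rare correct solutions are reinforced more.
    """
    rank = torch.argsort(torch.argsort(-logprobs)).float()  # 0 = most likely
    bonus = 1.0 + beta * rank / max(len(rewards) - 1, 1)
    return rewards * bonus
```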
Authors:Herun Wan, Jiaying Wu, Minnan Luo, Zhi Zeng, Zhixiong Su
Abstract:
Misinformation detection models often rely on superficial cues (i.e., \emph{shortcuts}) that correlate with misinformation in training data but fail to generalize to the diverse and evolving nature of real-world misinformation. This issue is exacerbated by large language models (LLMs), which can easily generate convincing misinformation through simple prompts. We introduce TruthOverTricks, a unified evaluation paradigm for measuring shortcut learning in misinformation detection. TruthOverTricks categorizes shortcut behaviors into intrinsic shortcut induction and extrinsic shortcut injection, and evaluates seven representative detectors across 14 popular benchmarks, along with two new factual misinformation datasets, NQ-Misinfo and Streaming-Misinfo. Empirical results reveal that existing detectors suffer severe performance degradation when exposed to both naturally occurring and adversarially crafted shortcuts. To address this, we propose SMF, an LLM-augmented data augmentation framework that mitigates shortcut reliance through paraphrasing, factual summarization, and sentiment normalization. SMF consistently enhances robustness across 16 benchmarks, encouraging models to rely on deeper semantic understanding rather than shortcut cues. To promote the development of misinformation detectors, we have published the resources publicly at https://github.com/whr000001/TruthOverTricks.
中文:当前虚假信息检测模型常依赖表面线索,难以应对现实场景,尤其面对大语言模型生成的虚假信息时,但提出的SMF框架通过数据增强鼓励深层语义分析,有效提升了检测的鲁棒性。
English: Current misinformation detection models often depend on superficial shortcuts that fail in real-world scenarios, especially with LLM-generated misinformation, but the proposed SMF framework enhances robustness by using data augmentation to encourage deeper semantic analysis.
Authors:Duo Liu, Zhiquan Tan, Linglan Zhao, Zhongqiang Zhang, Xiangzhong Fang, Weiran Huang
Abstract:
Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer from inferior base discrimination due to unreliable self-supervision. To address this issue, we propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch devoted to base classification. During training, the main branch filters the pseudo-base samples to the auxiliary branch. In response, the auxiliary branch provides more reliable soft labels for the main branch, leading to a virtuous cycle. Furthermore, we introduce Class-wise Distribution Regularization (CDR) to mitigate the learning bias towards base classes. CDR essentially increases the prediction confidence of the unlabeled data and boosts the novel class performance. Combined with both components, our proposed method, RLCD, achieves superior performance in all classes with negligible extra computation. Comprehensive experiments across seven GCD datasets validate its superiority. Our codes are available at https://github.com/APORduo/RLCD.
中文摘要:提出的互学习框架(RLF)结合类间分布正则化(CDR),通过主分支与辅助分支的协同优化,有效提升广义类别发现中基类与新类别的识别能力,在七个数据集上以可忽略的额外计算成本实现了最优性能。
English Summary: The proposed Reciprocal Learning Framework (RLF) with Class-wise Distribution Regularization (CDR) enhances base and novel class recognition in Generalized Category Discovery by creating a synergistic loop between main and auxiliary branches, achieving superior performance across seven datasets with minimal computational overhead.
Authors:Cristian-Ioan Blaga, Paul Suganthan, Sahil Dua, Krishna Srinivasan, Enrique Alfonseca, Peter Dornbach, Tom Duerig, Imed Zitouni, Zhe Dong
Abstract:
Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datasets are accessible through the GitHub page: https://github.com/google-research-datasets/wit-retrieval.
中文摘要:本文提出了一个包含两个数据集的新型基准,用于评估需要深度跨模态理解的混合模态图像检索,并通过实证测试和人工标注验证了其有效性。
English Summary: This paper introduces a novel benchmark with two datasets for evaluating mixed-modal image retrieval that requires deep cross-modal understanding, validated through empirical testing and human annotations.
Authors:Qinsi Wang, Jinghan Ke, Hancheng Ye, Yueqian Lin, Yuzhe Fu, Jianyi Zhang, Kurt Keutzer, Chenfeng Xu, Yiran Chen
Abstract:
Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from sample inefficiency due to the redundant exposure of identical queries under uniform data sampling. While previous work has explored curriculum learning via heuristic difficulty metrics, these strategies exhibit limitations by neglecting the intrinsic learning signals generated by the model itself, thus leading to suboptimal training regimes. In this paper, we identify a model-inherent signal termed angle concentration that effectively reflects an LLM's capacity to learn from specific data. We theoretically and empirically demonstrate a correlation between the angular distribution of token hidden state vectors and the resulting gradient, revealing a learning preference for data exhibiting higher angle concentration. Inspired by this finding, we propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates and thus significantly enhancing overall training efficiency. Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5x acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. Furthermore, GAIN-RL (GRPO)'s efficient sampling yields data-efficient training, achieving better performance with half the original data compared to vanilla GRPO with full training data. Code is released at https://github.com/wangqinsi1/GAINRL/tree/main.
中文: 当前大型语言模型的强化微调因数据冗余而效率低下,但提出的GAIN-RL框架利用模型内在的角度集中信号动态选择训练数据,显著提升了训练效率,仅用一半数据即可获得更优性能。
English: Current Reinforcement Fine-tuning for LLMs is inefficient due to redundant data exposure, but the proposed GAIN-RL framework uses the model's intrinsic angle concentration signal to dynamically select data, significantly boosting training efficiency and performance with less data.
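One plausible way to compute an angle-concentration score is the mean pairwise cosine similarity of a query's token hidden states, sketched below. The exact definition in GAIN-RL may differ; treat this as an assumed proxy used for data ranking.

```python
import torch

def angle_concentration(hidden: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity of token hidden states.

    hidden: (T, D) hidden-state vectors for one query's T tokens (T >= 2).
    Higher values mean the vectors point in more similar directions, the
    kind of signal GAIN-RL uses to prioritize training data.
    """
    h = torch.nn.functional.normalize(hidden, dim=-1)
    sim = h @ h.T                                  # (T, T) cosine matrix
    off_diag = sim.sum() - sim.diagonal().sum()    # exclude self-similarity
    t = h.shape[0]
    return off_diag / (t * (t - 1))
```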
Authors:Asha Ramanujam, Adam Elyoumi, Hao Chen, Sai Madhukiran Kompalli, Akshdeep Singh Ahluwalia, Shraman Pal, Dimitri J. Papageorgiou, Can Li
Abstract:
Most existing safe reinforcement learning (RL) benchmarks focus on robotics and control tasks, offering limited relevance to high-stakes domains that involve structured constraints, mixed-integer decisions, and industrial complexity. This gap hinders the advancement and deployment of safe RL in critical areas such as energy systems, manufacturing, and supply chains. To address this limitation, we present SafeOR-Gym, a benchmark suite of nine operations research (OR) environments tailored for safe RL under complex constraints. Each environment captures a realistic planning, scheduling, or control problem characterized by cost-based constraint violations, planning horizons, and hybrid discrete-continuous action spaces. The suite integrates seamlessly with the Constrained Markov Decision Process (CMDP) interface provided by OmniSafe. We evaluate several state-of-the-art safe RL algorithms across these environments, revealing a wide range of performance: while some tasks are tractable, others expose fundamental limitations in current approaches. SafeOR-Gym provides a challenging and practical testbed that aims to catalyze future research in safe RL for real-world decision-making problems. The SafeOR-Gym framework and all accompanying code are available at: https://github.com/li-group/SafeOR-Gym.
Chinese: SafeOR-Gym推出了包含九个运筹学环境的基准测试套件,旨在通过模拟具有结构化约束和混合决策空间的现实问题,推动安全强化学习在复杂工业场景中的应用,弥补现有机器人领域基准的不足。
English: SafeOR-Gym introduces a benchmark suite of nine operations research environments to advance safe reinforcement learning in complex, real-world domains with structured constraints and hybrid decision spaces, addressing limitations in existing robotics-focused benchmarks.
Authors:Johannes Schusterbauer, Ming Gui, Frank Fundel, Björn Ommer
Abstract:
Diffusion models have revolutionized generative tasks through high-fidelity outputs, yet flow matching (FM) offers faster inference and empirical performance gains. However, current foundation FM models are computationally prohibitive for finetuning, while diffusion models like Stable Diffusion benefit from efficient architectures and ecosystem support. This work addresses the critical challenge of efficiently transferring knowledge from pre-trained diffusion models to flow matching. We propose Diff2Flow, a novel framework that systematically bridges diffusion and FM paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields from diffusion predictions. This alignment enables direct and efficient FM finetuning of diffusion priors with no extra computation overhead. Our experiments demonstrate that Diff2Flow outperforms naïve FM and diffusion finetuning particularly under parameter-efficient constraints, while achieving superior or competitive performance across diverse downstream tasks compared to state-of-the-art methods. We will release our code at https://github.com/CompVis/diff2flow.
中文: Diff2Flow通过重新调整时间步长和对齐插值,将预训练扩散模型的知识高效迁移至流匹配,实现了无需额外计算的卓越微调性能。
English: Diff2Flow efficiently transfers knowledge from pre-trained diffusion models to flow matching by aligning their paradigms, enabling superior finetuning performance without extra computation.
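A hedged sketch of the diffusion-to-flow-matching bridge: assuming a variance-preserving diffusion model that predicts noise, one can recover an estimate of the clean image and form a linear-interpolant velocity target. The timestep rescaling and interpolant alignment that Diff2Flow also performs are omitted here for brevity, so this is only the velocity-derivation step under those assumptions.

```python
import torch

def fm_velocity_from_eps(x_t: torch.Tensor, eps_hat: torch.Tensor,
                         alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """Derive a flow-matching velocity from a VP diffusion eps-prediction.

    Assumes x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps and a
    linear FM interpolant x_s = (1 - s) * x0 + s * x1 with velocity x1 - x0.
    Treating the predicted noise as the endpoint x1, the FM velocity becomes
    eps_hat - x0_hat.
    """
    a = alpha_bar_t.sqrt()
    s = (1.0 - alpha_bar_t).sqrt()
    x0_hat = (x_t - s * eps_hat) / a   # estimate of the clean sample
    return eps_hat - x0_hat            # linear-interpolant velocity target
```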
Authors:Navid NaderiAlizadeh, Darian Salehi, Xinran Liu, Soheil Kolouri
Abstract:
Sliced Wasserstein (SW) distances offer an efficient method for comparing high-dimensional probability measures by projecting them onto multiple 1-dimensional probability distributions. However, identifying informative slicing directions has proven challenging, often necessitating a large number of slices to achieve desirable performance and thereby increasing computational complexity. We introduce a constrained learning approach to optimize the slicing directions for SW distances. Specifically, we constrain the 1D transport plans to approximate the optimal plan in the original space, ensuring meaningful slicing directions. By leveraging continuous relaxations of these transport plans, we enable a gradient-based primal-dual approach to train the slicer parameters, alongside the remaining model parameters. We demonstrate how this constrained slicing approach can be applied to pool high-dimensional embeddings into fixed-length permutation-invariant representations. Numerical results on foundation models trained on images, point clouds, and protein sequences showcase the efficacy of the proposed constrained learning approach in learning more informative slicing directions. Our implementation code can be found at https://github.com/Stranja572/constrainedswe.
Chinese: 该约束学习方法通过将一维传输计划与原始空间的最优计划对齐,优化了切片瓦瑟斯坦距离的切片方向,实现了基于梯度的训练,并在多种数据类型上获得了更具信息量的切片结果。
English: The proposed constrained learning approach optimizes slicing directions for Sliced Wasserstein distances by aligning 1D transport plans with the original space's optimal plan, enabling gradient-based training and yielding more informative slices across various data types.
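The differentiable core of sliced Wasserstein with trainable slicing directions fits in a short module: project both point sets, sort along each slice (which solves 1D optimal transport), and compare. The constrained-learning machinery that keeps the 1D transport plans close to the original-space plan is omitted; this is only the unconstrained skeleton.

```python
import torch
import torch.nn as nn

class LearnableSlicedWasserstein(nn.Module):
    """SW distance with trainable slicing directions (the paper's
    constraint machinery is omitted; this is the differentiable core)."""

    def __init__(self, dim: int, n_slices: int = 64):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(n_slices, dim))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x, y: (N, dim) empirical measures with equal sample counts.
        theta = torch.nn.functional.normalize(self.theta, dim=1)
        px = x @ theta.T                      # (N, n_slices) projections
        py = y @ theta.T
        # 1D optimal transport = match sorted projections per slice.
        px_sorted, _ = torch.sort(px, dim=0)
        py_sorted, _ = torch.sort(py, dim=0)
        return ((px_sorted - py_sorted) ** 2).mean()
```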
Authors:Michael Li, Nishant Subramani
Abstract:
Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today's language models, we investigate how 25 models - from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) - represent lexical identity and inflectional morphology across six typologically diverse languages. Using linear and nonlinear classifiers trained on hidden activations, we predict word lemmas and inflectional features layer by layer. We find that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout. Additional experiments probe the nature of these encodings: attention and residual analyses examine where within layers information can be recovered, steering vector experiments test what information can be functionally manipulated, and intrinsic dimensionality analyses explore how the representational structure evolves across layers. Remarkably, these encoding patterns emerge across all models we test, despite differences in architecture, size, and training regime (pretrained and instruction-tuned variants). This suggests that, even with substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties are important for next token prediction and are learned early during pretraining. Our code is available at https://github.com/ml5885/model_internal_sleuthing
中文: 研究表明,无论架构如何差异,Transformer语言模型都一致地在早期层线性编码词汇信息,在后期层非线性编码,同时保持屈折形态信息在各层中均匀可访问且线性可分。
English: This study reveals that transformer language models consistently encode lexical information linearly in early layers and nonlinearly in later layers, while maintaining inflectional morphology as uniformly accessible linear representations across all layers, regardless of architectural differences.
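Layer-wise probing of the kind described reduces to fitting one classifier per layer on cached activations. The sketch below uses scikit-learn logistic regression as the linear probe; in practice one would score on a held-out split rather than training accuracy, which is used here only to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def layerwise_probe_accuracy(activations: list[np.ndarray],
                             labels: np.ndarray) -> list[float]:
    """Fit one linear probe per layer and report its accuracy.

    activations: list over layers of (n_words, hidden_dim) arrays,
    e.g., hidden states at each word's final subtoken; labels hold
    lemma or inflection classes for the same words.
    """
    scores = []
    for layer_acts in activations:
        probe = LogisticRegression(max_iter=1000).fit(layer_acts, labels)
        scores.append(probe.score(layer_acts, labels))
    return scores
```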
Authors:Xuefeng Jiang, Tian Wen, Zhiqin Yang, Lvhua Wu, Yufeng Chen, Sheng Sun, Yuwei Wang, Min Liu
Abstract:
In recent years, federated learning (FL) has made significant advances in privacy-sensitive applications. However, it can be hard to ensure that FL participants provide well-annotated data for training. The corresponding annotations from different clients often contain complex label noise at varying levels. This label noise issue has a substantial impact on the performance of the trained models, and much of this degradation can be attributed to clients with higher noise levels. It is therefore necessary to develop an effective optimization strategy to alleviate the adverse effects of these noisy clients. In this study, we present a two-stage optimization framework, MaskedOptim, to address this intricate label noise problem. The first stage is designed to facilitate the detection of noisy clients with higher label noise rates. The second stage focuses on rectifying the labels of the noisy clients' data through an end-to-end label correction mechanism, aiming to mitigate the negative impacts caused by misinformation within datasets. This is achieved by learning the potential ground-truth labels of the noisy clients' datasets via backpropagation. To further enhance the training robustness, we apply geometric median based model aggregation instead of the commonly-used vanilla averaged model aggregation. We implement sixteen related methods and conduct evaluations on three image datasets and one text dataset with diverse label noise patterns for a comprehensive comparison. Extensive experimental results indicate that our proposed framework shows its robustness in different scenarios. Additionally, our label correction framework effectively enhances the data quality of the detected noisy clients' local datasets. Our codes are available at https://github.com/Sprinter1999/MaskedOptim.
Chinese: 联邦学习面临来自不同客户端的复杂标签噪声问题,这会降低模型性能,本研究提出了MaskedOptim框架,通过检测噪声客户端并利用反向传播纠正其标签,有效提升了训练鲁棒性和数据质量。
English: Federated learning faces challenges from complex label noise across clients, which degrades model performance, and this study introduces MaskedOptim, a two-stage framework that detects noisy clients and corrects their labels via backpropagation to improve robustness and data quality.
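The geometric-median aggregation step can be implemented with a few Weiszfeld iterations over flattened client parameter vectors, as sketched below; the iteration count and tolerance are illustrative choices rather than the paper's settings.

```python
import torch

def geometric_median(models: torch.Tensor, iters: int = 10,
                     eps: float = 1e-8) -> torch.Tensor:
    """Weiszfeld iteration for the geometric median of client updates.

    models: (K, P) flattened parameter vectors from K clients. The
    geometric median down-weights outlying (e.g., very noisy) clients
    compared with vanilla averaging.
    """
    median = models.mean(dim=0)
    for _ in range(iters):
        dist = (models - median).norm(dim=1).clamp_min(eps)  # (K,)
        w = 1.0 / dist
        median = (w.unsqueeze(1) * models).sum(dim=0) / w.sum()
    return median
```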
Authors:Mengliang He, Jiayi Zeng, Yankai Jiang, Wei Zhang, Zeming Liu, Xiaoming Shi, Aimin Zhou
Abstract:
While large language models (LLMs) show promise in code generation, existing benchmarks neglect flowchart-based code generation. To promote further research on flowchart-based code generation, this work presents Flow2Code, a novel benchmark for flowchart-based code generation evaluation. The evaluation dataset spans 15 programming languages and includes 5,622 code segments paired with 16,866 flowcharts of three types: code, UML, and pseudocode. Extensive experiments with 13 multimodal LLMs reveal that current LLMs cannot yet generate code from flowcharts reliably. Moreover, the results show that supervised fine-tuning contributes greatly to the models' performance. We publicly release our code and datasets at https://github.com/hml-github/Flow2Code.
中文摘要:本文提出了Flow2Code这一新颖的流程图代码生成评估基准,涵盖15种编程语言,实验表明当前多模态大语言模型在基于流程图的代码生成方面存在不足,而监督微调技术能显著提升模型性能。
English Summary: This paper introduces Flow2Code, a novel benchmark for evaluating flowchart-based code generation across 15 programming languages, revealing current multimodal LLMs' limitations in this task while demonstrating supervised fine-tuning's significant performance benefits.
Authors:Nikola Balic
Abstract:
Autonomous multi-agent AI systems are poised to transform various industries, particularly software development and knowledge work. Understanding current perceptions among professionals is crucial for anticipating adoption challenges, ethical considerations, and future workforce development. This study analyzes responses from 130 participants to a survey on the capabilities, impact, and governance of AI agents. We explore expected timelines for AI replacing programmers, identify perceived barriers to deployment, and examine beliefs about responsibility when agents make critical decisions. Key findings reveal three distinct clusters of respondents. While the study explored factors associated with current AI agent deployment, the initial logistic regression model did not yield statistically significant predictors, suggesting that deployment decisions are complex and may be influenced by factors not fully captured or that a larger sample is needed. These insights highlight the need for organizations to address compliance concerns (a commonly cited barrier) and establish clear governance frameworks as they integrate autonomous agents into their workflows.
中文摘要:一项针对130名专业人士的研究揭示了关于AI智能体能力和治理的三种不同观点,指出合规问题是主要部署障碍,同时发现当前采用决策缺乏单一预测因素。
English Summary: This study of 130 professionals reveals three distinct perspectives on AI agents' capabilities and governance, highlighting compliance concerns as a key deployment barrier while finding no single predictor for current adoption decisions.
Authors:Xu Zhang, Haoye Qiu, Weixuan Liang, Hui Liu, Junhui Hou, Yuheng Jia
Abstract:
Ensemble clustering has demonstrated great success in practice; however, its theoretical foundations remain underexplored. This paper examines the generalization performance of ensemble clustering, focusing on generalization error, excess risk and consistency. We derive generalization error and excess risk bounds, both with a convergence rate of $\mathcal{O}(\sqrt{\frac{\log n}{m}}+\frac{1}{\sqrt{n}})$, with $n$ and $m$ being the numbers of samples and base clusterings. Based on this, we prove that when $m$ and $n$ approach infinity and $m$ is significantly larger than $\log n$, i.e., $m,n\to \infty$, $m\gg \log n$, ensemble clustering is consistent. Furthermore, recognizing that $n$ and $m$ are finite in practice, the generalization error cannot be reduced to zero. Thus, by assigning varying weights to the finite base clusterings, we minimize the error between the empirical average of clusterings and its expectation. From this, we theoretically demonstrate that to achieve better clustering performance, we should minimize the deviation (bias) of each base clustering from its expectation and maximize the differences (diversity) among the base clusterings. Additionally, we show that maximizing diversity is nearly equivalent to a robust (min-max) optimization model. Finally, we instantiate our theory to develop a new ensemble clustering algorithm. Compared with SOTA methods, our approach achieves average improvements of 6.1%, 7.3%, and 6.0% on 10 datasets w.r.t. NMI, ARI, and Purity. The code is available at https://github.com/xuz2019/GPEC.
中文: 本文通过推导泛化误差界并证明特定条件下的聚类一致性,为集成聚类建立了理论基础,同时提出新算法在多个数据集上实现了6%以上的性能提升。
English: This paper establishes theoretical foundations for ensemble clustering by deriving generalization error bounds and proving consistency under certain conditions, while proposing a new algorithm that achieves state-of-the-art performance improvements across multiple datasets.
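To ground the bias/diversity weighting intuition, here is a minimal weighted co-association consensus, with the weights taken as given (the paper learns them from its theory). Both the co-association construction and the final agglomerative step are common choices assumed for illustration, not necessarily the authors' algorithm.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def weighted_consensus(base_labels: np.ndarray, weights: np.ndarray,
                       n_clusters: int) -> np.ndarray:
    """Consensus clustering from weighted base partitions.

    base_labels: (m, n) labels from m base clusterings of n samples;
    weights: (m,) non-negative clustering weights. The weighted
    co-association matrix records how often two samples co-cluster.
    """
    m, n = base_labels.shape
    co = np.zeros((n, n))
    for k in range(m):
        same = (base_labels[k][:, None] == base_labels[k][None, :])
        co += weights[k] * same
    co /= weights.sum()
    dist = 1.0 - co  # turn co-association similarity into a distance
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed", linkage="average")
    return model.fit_predict(dist)
```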
Authors:Shuo Yan, Yuliang Yan, Bin Ma, Chenao Li, Haochun Tang, Jiahua Lu, Minhua Lin, Yuyuan Feng, Hui Xiong, Enyan Dai
Abstract:
Recently, extensive deep learning architectures and pretraining strategies have been explored to support downstream protein applications. Additionally, domain-specific models incorporating biological knowledge have been developed to enhance performance in specialized tasks. In this work, we introduce $\textbf{Protap}$, a comprehensive benchmark that systematically compares backbone architectures, pretraining strategies, and domain-specific models across diverse and realistic downstream protein applications. Specifically, Protap covers five applications: three general tasks and two novel specialized tasks, i.e., enzyme-catalyzed protein cleavage site prediction and targeted protein degradation, which are industrially relevant yet missing from existing benchmarks. For each application, Protap compares various domain-specific models and general architectures under multiple pretraining settings. Our empirical studies suggest that: (i) though large-scale pretrained encoders achieve strong results, they often underperform supervised encoders trained on small downstream training sets; (ii) incorporating structural information during downstream fine-tuning can match or even outperform protein language models pretrained on large-scale sequence corpora; (iii) domain-specific biological priors can enhance performance on specialized downstream tasks. Code and datasets are publicly available at https://github.com/Trust-App-AI-Lab/protap.
中文:本文介绍了Protap基准测试,通过系统评估多种蛋白质建模方法,发现在特定任务中监督式编码器和结构信息整合能超越大规模预训练模型的性能。
English: This paper introduces Protap, a benchmark that systematically evaluates various protein modeling approaches, revealing that supervised encoders and structural information integration can outperform large-scale pretrained models in specific tasks.
Authors:Qingyu Xiao, Yuanlin Chang, Youtian Du
Abstract:
Effective agent exploration remains a core challenge in reinforcement learning (RL) for complex discrete state-space environments, particularly under partial observability. This paper presents a decoupled hierarchical RL framework integrating state abstraction (DcHRL-SA) to address this issue. The proposed method employs a dual-level architecture, consisting of a high-level RL-based actor and a low-level rule-based policy, to promote effective exploration. Additionally, a state abstraction method is incorporated to cluster discrete states, effectively lowering state dimensionality. Experiments conducted in two customized discrete grid environments demonstrate that the proposed approach consistently outperforms PPO in terms of exploration efficiency, convergence speed, cumulative reward, and policy stability. These results demonstrate a practical approach for integrating decoupled hierarchical policies and state abstraction in discrete grids with large-scale exploration spaces. Code will be available at https://github.com/XQY169/DcHRL-SA.
中文: 本文提出了一种结合状态抽象的分离式分层强化学习框架(DcHRL-SA),通过高层强化学习策略与底层规则策略的双层架构及状态聚类降维,在复杂离散网格环境中显著提升了探索效率、收敛速度和策略稳定性,实验证明其性能全面优于PPO算法。
English: This paper introduces a decoupled hierarchical reinforcement learning framework with state abstraction (DcHRL-SA) that enhances exploration efficiency in complex discrete environments by combining high-level RL policies with low-level rule-based actions and reducing state dimensionality, demonstrating superior performance over PPO in grid world experiments.
Authors:Beichen Huang, Ran Cheng, Kay Chen Tan
Abstract:
We introduce EvoGit, a decentralized multi-agent framework for collaborative software development driven by autonomous code evolution. EvoGit deploys a population of independent coding agents, each proposing edits to a shared codebase without centralized coordination, explicit message passing, or shared memory. Instead, all coordination emerges through a Git-based phylogenetic graph that tracks the full version lineage and enables agents to asynchronously read from and write to the evolving code repository. This graph-based structure supports fine-grained branching, implicit concurrency, and scalable agent interaction while preserving a consistent historical record. Human involvement is minimal but strategic: users define high-level goals, periodically review the graph, and provide lightweight feedback to promote promising directions or prune unproductive ones. Experiments demonstrate EvoGit's ability to autonomously produce functional and modular software artifacts across two real-world tasks: (1) building a web application from scratch using modern frameworks, and (2) constructing a meta-level system that evolves its own language-model-guided solver for the bin-packing optimization problem. Our results underscore EvoGit's potential to establish a new paradigm for decentralized, automated, and continual software development. EvoGit is open-sourced at https://github.com/BillHuang2001/evogit.
中文: EvoGit是一个去中心化的多智能体框架,通过基于Git的系统谱系图实现自主协作软件开发,使智能体能够异步演化代码,仅需人类进行高层次目标设定和轻量级反馈指导。
English: EvoGit is a decentralized multi-agent framework that enables autonomous collaborative software development through a Git-based phylogenetic graph, allowing agents to asynchronously evolve code with minimal human oversight for strategic goal-setting and feedback.
Authors:Xinxu Wei, Kanhao Zhao, Yong Jiao, Lifang He, Yu Zhang
Abstract:
As large language models (LLMs) continue to revolutionize AI research, there is a growing interest in building large-scale brain foundation models to advance neuroscience. While most existing brain foundation models are pre-trained on time-series signals or connectome features, we propose a novel graph-based pre-training paradigm for constructing a brain graph foundation model. In this paper, we introduce the Brain Graph Foundation Model, termed BrainGFM, a unified framework that leverages graph contrastive learning and graph masked autoencoders for large-scale fMRI-based pre-training. BrainGFM is pre-trained on a diverse mixture of brain atlases with varying parcellations, significantly expanding the pre-training corpus and enhancing the model's ability to generalize across heterogeneous fMRI-derived brain representations. To support efficient and versatile downstream transfer, we integrate both graph prompts and language prompts into the model design, enabling BrainGFM to flexibly adapt to a wide range of atlases, neurological and psychiatric disorders, and task settings. Furthermore, we employ meta-learning to optimize the graph prompts, facilitating strong generalization to previously unseen disorders under both few-shot and zero-shot learning conditions via language-guided prompting. BrainGFM is pre-trained on 27 neuroimaging datasets spanning 25 common neurological and psychiatric disorders, encompassing 2 types of brain atlases (functional and anatomical) across 8 widely-used parcellations, and covering over 25,000 subjects, 60,000 fMRI scans, and a total of 400,000 graph samples aggregated across all atlases and parcellations. The code is available at: https://github.com/weixinxu666/BrainGFM
中文: 本文提出BrainGFM这一基于图结构的脑科学基础模型,通过图对比学习和掩码自编码器进行大规模fMRI预训练,结合图提示与语言提示实现跨多种脑疾病与任务的灵活适配。
English: This paper introduces BrainGFM, a novel graph-based foundation model for neuroscience that uses graph contrastive learning and masked autoencoders for large-scale fMRI pre-training, enabling flexible adaptation to various brain disorders and tasks through integrated graph and language prompts.
Authors:Aditya Kanade, Tanuja Ganu
Abstract:
Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.
中文: 多模态大语言模型存在严重视觉感知缺陷,在人类准确率达96.49%的新基准测试中其平均准确率不足50%,表明即使能给出正确答案,模型在处理精细视觉细节时仍存在根本性缺陷。
English: Multimodal Large Language Models exhibit significant visual perception deficiencies, as demonstrated by their under 50% accuracy on a new benchmark where humans achieve 96.49%, revealing critical failures in processing fine-grained visual details despite sometimes producing correct answers.
Authors:Youze Xue, Dian Li, Gang Liu
Abstract:
With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples, elucidating the role of hard negatives in updating model parameters. Building upon this analysis, we propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings. Our multi-modal embedding model, trained with the proposed Explicit Gradient Amplifier and based on the LLaVA-OneVision-7B architecture, achieves state-of-the-art performance on the MMEB benchmark compared to previous methods utilizing the same MLLM backbone. Furthermore, when integrated with our self-developed MLLM, QQMM, our approach attains the top rank on the MMEB leaderboard. Code and models are released on https://github.com/QQ-MM/QQMM-embed.
Chinese: 本研究提出显式梯度放大器来增强对比学习中困难负样本的梯度,基于LLaVA-OneVision-7B架构和自研QQMM多模态大模型,在MMEB基准测试中实现了最优性能。
English: This study introduces an Explicit Gradient Amplifier to enhance hard negative sample gradients in contrastive learning, achieving state-of-the-art performance on the MMEB benchmark with both the LLaVA-OneVision-7B architecture and a proprietary MLLM called QQMM.
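One simple way to amplify only the gradients of hard negatives, while leaving the forward loss value unchanged, is the detach-based rescaling trick sketched below inside an info-NCE loss. The top-k hard-negative selection and amplification factor are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def info_nce_amplified(q: torch.Tensor, pos: torch.Tensor,
                       negs: torch.Tensor, tau: float = 0.05,
                       gamma: float = 2.0, top_k: int = 4) -> torch.Tensor:
    """InfoNCE with gradients of the hardest negatives amplified.

    q: (B, D) query embeddings, pos: (B, D) positives, negs: (B, N, D)
    negatives, all L2-normalized. The identity
    s + (gamma - 1) * (s - s.detach()) leaves the forward value of s
    unchanged but scales its gradient by gamma, applied here only to the
    top-k hardest negative similarities.
    """
    s_pos = (q * pos).sum(-1, keepdim=True) / tau          # (B, 1)
    s_neg = torch.einsum("bd,bnd->bn", q, negs) / tau      # (B, N)
    hard = s_neg.topk(top_k, dim=1).indices                # hardest negatives
    amp = torch.ones_like(s_neg)
    amp.scatter_(1, hard, gamma)
    s_neg = s_neg + (amp - 1.0) * (s_neg - s_neg.detach())
    logits = torch.cat([s_pos, s_neg], dim=1)              # positive at index 0
    targets = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, targets)
```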
Authors:E Fan, Kang Hu, Zhuowen Wu, Jiangyang Ge, Jiawei Miao, Yuzhi Zhang, He Sun, Weizong Wang, Tianhan Zhang
Abstract:
Computational Fluid Dynamics (CFD) is essential for advancing scientific and engineering fields but is hindered by operational complexity, high expertise requirements, and limited accessibility. This paper introduces ChatCFD, an automated agent system for OpenFOAM simulations that processes multi-modal inputs (e.g., research papers, meshes) via an interactive interface, leveraging DeepSeek-R1 and DeepSeek-V3 large language models, a multi-agent architecture, and OpenFOAM knowledge. Its four-stage pipeline (Knowledge Base Construction, User Input Processing, Case File Generation, and Execution and Error Reflection) enables iterative trial-reflection-refinement for intricate setups, supporting diverse physical models and external meshes. Validation on 205 benchmark tutorial cases, 110 perturbed variants, and 2 literature-derived cases shows ChatCFD's 82.1 percent operational success rate on basic cases, outperforming MetaOpenFOAM (6.2 percent) and Foam-Agent (42.3 percent), and 60-80 percent on literature-derived complex cases. Turbulence model studies show a 40 percent success rate for common models versus 10 percent for rare ones like RNG k-epsilon. Physics coupling analyses reveal higher resource demands for multi-physics-coupled cases, while LLM bias toward simpler setups introduces persistent errors, such as dimensional inconsistency. Ablation studies highlight the efficacy of RAG-based modules and reflection mechanisms. By automating hypothesis testing and parameter exploration, ChatCFD accelerates scientific discovery in fluid mechanics and engineering, addressing LLM limitations through structured design and showing strong potential as a modular component in MCP-based agent networks for collaborative multi-agent systems, paving the way for scalable AI-driven CFD innovation. The code for ChatCFD is available at https://github.com/ConMoo/ChatCFD.
Chinese: ChatCFD是一种基于大语言模型的自动化代理系统,通过多阶段流程简化OpenFOAM仿真,在各类流体力学案例中取得较高成功率,其模块化设计有效克服了大模型的局限性。
English: ChatCFD is an automated agent system that simplifies OpenFOAM simulations using large language models and a multi-stage pipeline, achieving high success rates in diverse fluid dynamics cases while addressing LLM limitations through structured design.
Authors:Christopher Lee Lübbers
Abstract:
Paraphrasing re-expresses meaning to enhance applications like text simplification, machine translation, and question-answering. Specific paraphrase types facilitate accurate semantic analysis and robust language models. However, existing paraphrase-type generation methods often misalign with human preferences due to reliance on automated metrics and limited human-annotated training data, obscuring crucial aspects of semantic fidelity and linguistic transformations.
This study addresses this gap by leveraging a human-ranked paraphrase-type dataset and integrating Direct Preference Optimization (DPO) to align model outputs directly with human judgments. DPO-based training increases paraphrase-type generation accuracy by 3 percentage points over a supervised baseline and raises human preference ratings by 7 percentage points. A newly created human-annotated dataset supports more rigorous future evaluations. Additionally, a paraphrase-type detection (PTD) model achieves F1 scores of 0.91 for addition/deletion, 0.78 for same polarity substitution, and 0.70 for punctuation changes.
These findings demonstrate that preference data and DPO training produce more reliable, semantically accurate paraphrases, enabling downstream applications such as improved summarization and more robust question-answering. The PTD model surpasses automated metrics and provides a more reliable framework for evaluating paraphrase quality, advancing paraphrase-type research toward richer, user-aligned language generation and establishing a stronger foundation for future evaluations grounded in human-centric criteria.
中文: 本研究通过采用人工排序数据和直接偏好优化(DPO)改进了复述生成,提高了准确率和人工偏好评分,并引入了检测模型和数据集以改善评估及下游应用。
English: This study improves paraphrase generation by using human-ranked data and Direct Preference Optimization (DPO), resulting in higher accuracy and human preference ratings, and introduces a detection model and dataset for better evaluation and downstream applications.
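For reference, the standard DPO objective used to align paraphrase generation with human rankings looks like the sketch below; it assumes precomputed sequence log-probabilities under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on human-ranked paraphrase pairs.

    Each tensor holds (B,) sequence log-probabilities of the preferred
    ("chosen") and dispreferred ("rejected") paraphrase under the policy
    and the frozen reference model.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```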
Authors:Xiao-Yang Liu Yanglet, Yupeng Cao, Li Deng
Abstract:
Financial Large Language Models (FinLLMs), such as open FinGPT and proprietary BloombergGPT, have demonstrated great potential in select areas of financial services. Beyond this earlier language-centric approach, Multimodal Financial Foundation Models (MFFMs) can digest interleaved multimodal financial data, including fundamental data, market data, data analytics, macroeconomic, and alternative data (e.g., natural language, audio, images, and video). In this position paper, presented at the MFFM Workshop held jointly with the ACM International Conference on AI in Finance (ICAIF) 2024, we describe the progress, prospects, and challenges of MFFMs. This paper also highlights ongoing research on FinAgents in the \textbf{SecureFinAI Lab}\footnote{https://openfin.engineering.columbia.edu/} at Columbia University. We believe that MFFMs will enable a deeper understanding of the underlying complexity associated with numerous financial tasks and data, streamlining the operation of financial services and investment processes. GitHub repo: https://github.com/Open-Finance-Lab/Awesome-MFFMs/.
中文:金融大语言模型在金融领域展现出潜力,而多模态金融基础模型能处理多种数据类型以深化金融理解和优化操作,哥伦比亚大学SecureFinAI实验室的研究正持续推进这一领域。
English: Financial Large Language Models show promise in finance, while Multimodal Financial Foundation Models can process diverse data types to enhance financial understanding and operations, as discussed in ongoing research at Columbia University's SecureFinAI Lab.
Authors:Pengcuo Dege, Qiuming Luo, Rui Mao, Chang Kong
Abstract:
Deploying the DeepSeek-R1 671B model on a single multi-GPU server poses a significant challenge for efficient Multi-Head Latent Attention (MLA) inference. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the $M$-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE ($1.25 \times 10^{-5}$) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks like FlashAttention-3 and FlashInfer, supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at https://github.com/pengcuo/FlashMLA-ETAP.
中文: 本文提出的FlashMLA-ETAP框架通过高效转置注意力管道重新配置注意力计算,在单台NVIDIA H20 GPU服务器上显著提升了DeepSeek-R1模型的多头潜在注意力推理效率,实现了最高2.78倍的加速比,同时保持了数值稳定性。
English: This paper introduces FlashMLA-ETAP, a novel framework that significantly accelerates Multi-Head Latent Attention inference for the DeepSeek-R1 model on single NVIDIA H20 GPU servers by reducing redundant computations through its Efficient Transpose Attention Pipeline, achieving up to 2.78x speedup while maintaining numerical stability.
Authors:Jennifer Chen, Aidar Myrzakhan, Yaxin Luo, Hassaan Muhammad Khan, Sondos Mahmoud Bsharat, Zhiqiang Shen
Abstract:
Retrieval-Augmented Generation (RAG) methods have proven highly effective for tasks requiring factual consistency and robust knowledge retrieval. However, large-scale RAG systems consume significant computational resources and are prone to generating hallucinated content. In this work, we introduce $\texttt{DRAG}$, a novel framework for distilling RAG knowledge from large-scale Language Models (LLMs) into small LMs (SLMs). Our approach leverages evidence- and knowledge graph-based distillation, ensuring that the distilled model retains critical factual knowledge while significantly reducing model size and computational cost. By aligning the smaller model's predictions with a structured knowledge graph and ranked evidence, $\texttt{DRAG}$ effectively mitigates hallucinations and improves factual accuracy. We further present a case demonstrating how our framework mitigates user privacy risks and introduce a corresponding benchmark. Experimental evaluations on multiple benchmarks demonstrate that our method outperforms the prior competitive RAG methods like MiniRAG for SLMs by up to 27.7% using the same models, preserving high-level efficiency and reliability. With $\texttt{DRAG}$, we provide a practical and resource-efficient roadmap to deploying enhanced retrieval and generation capabilities in small-sized LLMs.
中文: DRAG框架通过证据和知识图谱将大型语言模型的知识提炼到小型模型中,在显著降低计算成本的同时,有效提升事实准确性并减少幻觉生成。
English: The DRAG framework effectively distills knowledge from large to small language models using evidence and knowledge graphs, significantly reducing computational costs while enhancing factual accuracy and mitigating hallucinations.
Authors:Jiajun Jiang, Yiming Zhu, Zirui Wu, Jie Song
Abstract:
We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries. Designed for efficient semantic mapping and adaptability to changing environments, DualMap meets the essential requirements for real-world robot navigation applications. Our proposed hybrid segmentation frontend and object-level status check eliminate the costly 3D object merging required by prior methods, enabling efficient online scene mapping. The dual-map representation combines a global abstract map for high-level candidate selection with a local concrete map for precise goal-reaching, effectively managing and updating dynamic changes in the environment. Through extensive experiments in both simulation and real-world scenarios, we demonstrate state-of-the-art performance in 3D open-vocabulary segmentation, efficient scene mapping, and online language-guided navigation. Project page: https://eku127.github.io/DualMap/
中文: DualMap 是一种在线开放词汇映射系统,通过自然语言查询使机器人能在动态环境中导航,采用混合分割前端和双地图表示法,实现高效的实时语义建图与导航。
English: DualMap is an online open-vocabulary mapping system that enables robots to navigate dynamic environments using natural language queries, featuring a hybrid segmentation frontend and dual-map representation for efficient, real-time semantic mapping and navigation.
Authors:Fei Shen, Xiaoyu Du, Yutong Gao, Jian Yu, Yushe Cao, Xing Lei, Jinhui Tang
Abstract:
Recent diffusion models have advanced image editing by enhancing visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing largely overlooks multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to enhance editing accuracy and structural consistency. In addition, we observe that diffusion models are susceptible to initial noise and exhibit strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. The code and model are available at https://github.com/muzishen/IMAGHarmony.
中文摘要:IMAGHarmony框架通过融合感知语义的协调模块和偏好引导的噪声选择策略,有效解决了多物体场景中数量与布局控制的难题,仅用少量训练数据就实现了卓越的结构一致性和语义准确性。
English Summary: The IMAGHarmony framework addresses multi-object image editing challenges by integrating a harmony-aware module and preference-guided noise selection, achieving superior structural and semantic accuracy with minimal training data.
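The preference-guided noise selection (PNS) strategy described above lends itself to a short sketch: sample several initial noise tensors, run a cheap preview generation for each, score the previews against the prompt with a vision-language matcher, and keep the best-scoring seed. The generate_preview and vl_score hooks below are hypothetical stand-ins for the diffusion pipeline and the matcher, not the paper's exact procedure.

    import torch

    def select_initial_noise(prompt, generate_preview, vl_score,
                             n_candidates=8, shape=(4, 64, 64)):
        """Return the initial noise whose preview best matches the prompt."""
        best_noise, best_score = None, float("-inf")
        for _ in range(n_candidates):
            noise = torch.randn(shape)
            preview = generate_preview(noise, prompt)  # e.g., few-step sampling
            score = vl_score(preview, prompt)          # semantic alignment score
            if score > best_score:
                best_noise, best_score = noise, score
        return best_noise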
Authors:Fei Shen, Yutong Gao, Jian Yu, Xiaoyu Du, Jinhui Tang
Abstract:
Recent diffusion models have advanced image editing by improving fidelity and controllability across creative and personalized applications. However, multi-object scenes remain challenging, as reliable control over object categories, counts, and spatial layout is difficult to achieve. To that end, we first study quantity and layout consistent image editing, abbreviated as QL-Edit, which targets control of object quantity and spatial layout in multi-object scenes. Then, we present IMAGHarmony, a straightforward framework featuring a plug-and-play harmony-aware (HA) module that fuses perception semantics while modeling object counts and locations, resulting in accurate edits and strong structural consistency. We further observe that diffusion models are sensitive to the choice of initial noise and tend to prefer certain noise patterns. Based on this finding, we present a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision and language matching, thereby further improving generation stability and layout consistency in multi-object editing. To support evaluation, we develop HarmonyBench, a comprehensive benchmark that covers a diverse range of quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony outperforms prior methods in both structural alignment and semantic accuracy, utilizing only 200 training images and 10.6M trainable parameters. Code, models, and data are available at https://github.com/muzishen/IMAGHarmony.
中文摘要:IMAGHarmony框架通过融合感知语义的协调模块和偏好引导的噪声选择策略,有效解决了多物体场景中数量与布局控制的难题,仅用少量训练数据就实现了卓越的结构一致性和语义准确性。
English Summary: The IMAGHarmony framework addresses multi-object image editing challenges by integrating a harmony-aware module and preference-guided noise selection, achieving superior structural and semantic accuracy with minimal training data.
Authors:Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin
Abstract:
Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture the multi-object interactions crucial to complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the drawback of multi-object feature fusion present during interaction in prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
中文摘要:RoboMaster框架通过将多物体交互分解为三个子阶段并利用主导物体特征,有效解决了现有视频扩散模型在复杂机器人操作中难以捕捉多物体交互的问题,显著提升了轨迹控制视频生成的性能。
English Summary: The RoboMaster framework addresses limitations in existing video diffusion models by decomposing multi-object interactions into three distinct stages and using dominant object features to enhance visual fidelity and control in robotic manipulation tasks.
Authors:Salwa K. Al Khatib, Ahmed ElHagry, Shitong Shao, Zhiqiang Shen
Abstract:
Training large neural networks on large-scale datasets requires substantial computational resources, particularly for dense prediction tasks such as object detection. Although dataset distillation (DD) has been proposed to alleviate these demands by synthesizing compact datasets from larger ones, most existing work focuses solely on image classification, leaving the more complex detection setting largely unexplored. In this paper, we introduce OD3, a novel optimization-free data distillation framework specifically designed for object detection. Our approach involves two stages: first, a candidate selection process in which object instances are iteratively placed in synthesized images based on their suitable locations, and second, a candidate screening process using a pre-trained observer model to remove low-confidence objects. We apply our data synthesis framework to MS COCO and PASCAL VOC, two popular detection datasets, with compression ratios ranging from 0.25% to 5%. Compared to the only prior dataset distillation method for detection and to conventional core-set selection methods, OD3 delivers superior accuracy and establishes new state-of-the-art results, surpassing the prior best method by more than 14% on COCO mAP50 at a compression ratio of 1.0%. Code and condensed datasets are available at: https://github.com/VILA-Lab/OD3.
中文: 本文提出OD3这一专为物体检测设计的无优化数据集蒸馏框架,通过候选选择和筛选流程,在1%压缩率下以超过14%的优势刷新了COCO mAP50的最优性能。
English: This paper introduces OD3, an optimization-free dataset distillation framework for object detection that outperforms existing methods by over 14% on COCO mAP50 at 1% compression through candidate selection and screening processes.
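The second-stage candidate screening described above reduces to a simple filter: a pre-trained observer model re-detects objects in each synthesized image, and pasted instances whose class is not recovered with sufficient confidence are dropped. A minimal sketch, with the observer interface as a hypothetical stand-in:

    def screen_candidates(synthetic_images, observer, conf_thresh=0.5):
        """Keep only pasted instances the observer re-detects confidently.

        synthetic_images: iterable of (image, pasted_labels) pairs
        observer(image):  assumed to return (label, score, box) detections
        """
        screened = []
        for image, pasted_labels in synthetic_images:
            confident = {label for label, score, _ in observer(image)
                         if score >= conf_thresh}
            kept = [label for label in pasted_labels if label in confident]
            screened.append((image, kept))
        return screened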
Authors:Chi-Jane Chen, Yuhang Chen, Sukwon Yun, Natalie Stanley, Tianlong Chen
Abstract:
Image mass cytometry (IMC) enables high-dimensional spatial profiling by combining mass cytometry's analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single-cell LLMs face two major challenges: (1) Integration of spatial information: they struggle to generalize spatial coordinates and effectively encode spatial context as text, and (2) Treating each cell independently: they overlook cell-cell interactions, limiting their ability to capture biological relationships. To address these limitations, we propose Spatial2Sentence, a novel framework that integrates single-cell expression and spatial information into natural language using a multi-sentence approach. Spatial2Sentence constructs expression similarity and distance matrices, pairing spatially adjacent and expressionally similar cells as positive pairs while using distant and dissimilar cells as negatives. These multi-sentence representations enable LLMs to learn cellular interactions in both expression and spatial contexts. Equipped with multi-task learning, Spatial2Sentence outperforms existing single-cell LLMs on preprocessed IMC datasets, improving cell-type classification by 5.98% and clinical status prediction by 4.18% on the diabetes dataset while enhancing interpretability. The source code can be found here: https://github.com/UNITES-Lab/Spatial2Sentence.
中文: Spatial2Sentence框架通过将空间和表达数据转化为多句子表示,克服了现有单细胞大语言模型的局限性,在糖尿病数据集上显著提升了细胞类型分类和临床预测的准确性。
English: The proposed Spatial2Sentence framework overcomes limitations in existing single-cell large language models by integrating spatial and expression data into multi-sentence representations, achieving significant improvements in cell-type classification and clinical prediction on diabetes datasets.
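The pairing rule at the heart of the multi-sentence construction can be sketched in a few lines of NumPy: positives are cells that are both spatially adjacent and expressionally similar, negatives are distant and dissimilar ones. The quantile thresholds here are illustrative assumptions, not the paper's settings.

    import numpy as np

    def build_pairs(expr, coords, sim_q=0.9, dist_q=0.1):
        """Build positive/negative cell pairs from expression and position.

        expr:   (n_cells, n_markers) expression matrix
        coords: (n_cells, 2) spatial coordinates
        """
        x = expr / np.linalg.norm(expr, axis=1, keepdims=True)
        sim = x @ x.T                                   # cosine similarity
        diff = coords[:, None, :] - coords[None, :, :]
        dist = np.sqrt((diff ** 2).sum(-1))             # Euclidean distance

        similar = sim >= np.quantile(sim, sim_q)
        adjacent = dist <= np.quantile(dist, dist_q)

        i, j = np.triu_indices(len(expr), k=1)          # unordered pairs
        pos = [(a, b) for a, b in zip(i, j) if similar[a, b] and adjacent[a, b]]
        neg = [(a, b) for a, b in zip(i, j) if not similar[a, b] and not adjacent[a, b]]
        return pos, neg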
Authors:Krishna Acharya, Aleksandr V. Petrov, Juba Ziani
Abstract:
We propose Generative Low-rank language model with Semantic Search (GLoSS), a generative recommendation framework that combines large language models with dense retrieval for sequential recommendation. Unlike prior methods such as GPT4Rec, which rely on lexical matching via BM25, GLoSS uses semantic search to retrieve relevant items beyond lexical matching. For query generation, we employ 4-bit quantized LlaMA-3 models fine-tuned with low-rank adaptation (LoRA), enabling efficient training and inference on modest hardware. We evaluate GLoSS on three real-world Amazon review datasets: Beauty, Toys, and Sports, and find that it achieves state-of-the-art performance. Compared to traditional ID-based baselines, GLoSS improves Recall@5 by 33.3%, 52.8%, and 15.2%, and NDCG@5 by 30.0%, 42.6%, and 16.1%, respectively. It also outperforms LLM-based recommenders such as P5, GPT4Rec, LlamaRec and E4SRec with Recall@5 gains of 4.3%, 22.8%, and 29.5%. Additionally, user segment evaluations show that GLoSS performs particularly well for cold-start users in the Amazon Toys and Sports datasets, and benefits from longer user histories in Amazon Beauty dataset, demonstrating robustness across different levels of interaction lengths.
中文: GLoSS是一种生成式推荐框架,将大型语言模型与语义搜索相结合进行序列推荐,在多个数据集上实现最优性能,显著超越现有方法,尤其对冷启动用户效果显著并能有效利用长用户历史。
English: GLoSS is a generative recommendation framework that integrates large language models with semantic search for sequential recommendations, achieving state-of-the-art performance across multiple datasets and significantly outperforming existing methods, especially benefiting cold-start users and leveraging longer user histories.
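The dense-retrieval half of the pipeline boils down to ranking item embeddings by cosine similarity to the embedding of the LLM-generated query. A minimal PyTorch sketch, assuming the embedding model exists elsewhere:

    import torch

    def semantic_search(query_emb, item_embs, k=5):
        """Rank catalog items by cosine similarity to a generated query.

        query_emb: (d,) embedding of the generated query
        item_embs: (n_items, d) precomputed item embeddings
        """
        q = query_emb / query_emb.norm()
        items = item_embs / item_embs.norm(dim=1, keepdim=True)
        scores = items @ q                  # cosine similarity to every item
        top = torch.topk(scores, k)
        return top.indices.tolist(), top.values.tolist()

Unlike BM25-style lexical matching, nothing here requires the query to share surface tokens with the item text, which is the point of moving beyond GPT4Rec's retrieval.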
Authors:Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, Si Liu
Abstract:
Understanding real-world videos with complex semantics and long temporal dependencies remains a fundamental challenge in computer vision. Recent progress in multimodal large language models (MLLMs) has demonstrated strong capabilities in vision-language tasks, while reinforcement learning tuning (RLT) has further improved their reasoning abilities. In this work, we explore RLT as a post-training strategy to enhance the video-specific reasoning capabilities of MLLMs. Built upon the Group Relative Policy Optimization (GRPO) framework, we propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. To facilitate effective preference-based optimization, we introduce a variance-aware data selection strategy based on repeated inference to identify samples that provide informative learning signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Our method consistently outperforms supervised fine-tuning and existing RLT baselines, achieving superior performance with significantly less training data. These results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs. Notably, the initial code release (two months ago) has now been expanded with updates, including optimized reward mechanisms and additional datasets. The latest version is available at https://github.com/appletea233/Temporal-R1.
中文摘要:本研究通过强化学习调优,采用双奖励机制和方差感知数据选择策略,有效提升了多模态大语言模型在视频理解中的推理能力,在八个视频任务中以更少训练数据实现了更优性能。
English Summary: This study enhances video reasoning in multimodal large language models through reinforcement learning tuning with a dual-reward system and variance-aware data selection, achieving superior performance across eight video understanding tasks with less training data.
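The variance-aware data selection step has a simple core: roll each candidate sample out several times and keep only samples whose rewards vary, since uniformly-correct or uniformly-wrong samples provide no relative preference signal to a GRPO-style update. A sketch, with rollout_reward as a hypothetical hook onto the model:

    import numpy as np

    def select_informative_samples(samples, rollout_reward,
                                   n_rollouts=8, min_var=1e-4):
        """Keep samples whose reward varies across repeated inference."""
        kept = []
        for sample in samples:
            rewards = np.array([rollout_reward(sample)
                                for _ in range(n_rollouts)])
            if rewards.var() > min_var:     # zero variance = no signal
                kept.append(sample)
        return kept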
Authors:Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu
Abstract:
Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, such models' multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni
中文: 针对当前多模态模型在3D理解和生成方面的局限,本研究提出了ShapeLLM-Omni这一原生3D大语言模型,通过3D VQVAE和3D-Alpaca数据集训练,实现了3D资源与文本的任意序列处理,将AI能力拓展至三维领域。
English: To address the limitations of current multimodal models in 3D understanding and generation, this study introduces ShapeLLM-Omni, a native 3D large language model that processes 3D assets and text in any sequence, trained using a 3D VQVAE and the 3D-Alpaca dataset to extend AI capabilities into the 3D domain.
Authors:Kwanghee Choi, Masao Someki, Emma Strubell, Shinji Watanabe
Abstract:
Discrete speech units (DSUs) are derived from clustering the features of self-supervised speech models (S3Ms). DSUs offer significant advantages for on-device streaming speech applications due to their rich phonetic information, high transmission efficiency, and seamless integration with large language models. However, conventional DSU-based approaches are impractical as they require full-length speech input and computationally expensive S3Ms. In this work, we reduce both the attention window and the model size while preserving the effectiveness of DSUs. Our results demonstrate that we can reduce floating-point operations (FLOPs) by 50% with only a relative increase of 6.5% in character error rate (CER) on the ML-SUPERB 1h dataset. These findings highlight the potential of DSUs for real-time speech processing in resource-constrained environments.
中文: 离散语音单元通过自监督模型提取,适用于高效流式语音应用,但传统方法计算成本高昂;本研究通过缩小注意力窗口和模型规模,在保持性能的同时将浮点运算减少50%,显著提升了资源受限环境下的实时处理潜力。
English: Discrete speech units derived from self-supervised models offer efficient streaming speech applications but face impractical computational demands, which this work addresses by reducing attention windows and model size to achieve 50% fewer FLOPs with minimal performance loss, enhancing real-time processing in resource-limited settings.
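For readers unfamiliar with DSUs, the extraction step is essentially k-means clustering over frame-level S3M features. A minimal scikit-learn sketch; the layer choice and unit count are illustrative, not the paper's configuration:

    import numpy as np
    from sklearn.cluster import KMeans

    def extract_dsus(features: np.ndarray, n_units: int = 500) -> np.ndarray:
        """Cluster self-supervised speech features into discrete units.

        features: (n_frames, d) hidden states from an S3M layer.
        Returns one integer unit ID per frame; runs of identical units are
        often collapsed downstream for transmission efficiency.
        """
        km = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)
        return km.labels_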
Authors:Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene
Abstract:
Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated as well as real-world robotic benchmarks and release all code, pretrained models, and training data.
中文: SmolVLA是一种紧凑高效的视觉语言动作模型,在显著降低计算成本的同时保持竞争力,支持单GPU训练并可在消费级硬件上部署。
English: SmolVLA is a compact and efficient vision-language-action model that significantly reduces computational costs while maintaining competitive performance, enabling training on a single GPU and deployment on consumer hardware.
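The asynchronous inference stack can be pictured as a producer-consumer pair: one thread keeps refilling a small buffer of action chunks with VLA forward passes, while the control loop drains actions at a fixed rate so execution never stalls on the model. A schematic sketch, with predict_chunk and execute_action as hypothetical hooks rather than SmolVLA's actual interfaces:

    import queue
    import threading
    import time

    def run_async_control(predict_chunk, execute_action, control_hz=30.0):
        """Decouple chunked action prediction from action execution.

        predict_chunk():    one VLA forward pass -> list of low-level actions
        execute_action(a):  send a single action to the robot
        """
        buffer = queue.Queue(maxsize=2)        # holds whole action chunks

        def predictor():
            while True:
                buffer.put(predict_chunk())    # blocks while buffer is full

        threading.Thread(target=predictor, daemon=True).start()

        period = 1.0 / control_hz
        while True:
            for action in buffer.get():        # consume one chunk
                execute_action(action)
                time.sleep(period)             # fixed-rate control loop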
Authors:Zhao Yang, Jiwei Zhu, Bing Su
Abstract:
Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our Species-Profile Adaptive Collaborative Experts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. The code is available at https://github.com/ZhuJiwei111/SPACE.
中文摘要:与无监督DNA序列预训练相比,基于基因组谱监督训练的SPACE模型通过混合专家机制捕捉跨物种多谱系关联,实现了最优性能表现。
English Summary: Supervised training for genomic profile prediction is more effective than unsupervised DNA sequence pre-training, and the proposed SPACE model with Mixture of Experts achieves state-of-the-art performance by capturing cross-species and multi-profile relationships.
Authors:Sicheng Li, Chengzhen Wu, Hao Li, Xiang Gao, Yiyi Liao, Lu Yu
Abstract:
3D Gaussian Splatting and its extension to 4D dynamic scenes enable photorealistic, real-time rendering from real-world captures, positioning Gaussian Splats (GS) as a promising format for next-generation immersive media. However, their high storage requirements pose significant challenges for practical use in sharing, transmission, and storage. Despite various studies exploring GS compression from different perspectives, these efforts remain scattered across separate repositories, complicating benchmarking and the integration of best practices. To address this gap, we present GSCodec Studio, a unified and modular framework for GS reconstruction, compression, and rendering. The framework incorporates a diverse set of 3D/4D GS reconstruction methods and GS compression techniques as modular components, facilitating flexible combinations and comprehensive comparisons. By integrating best practices from community research and our own explorations, GSCodec Studio supports the development of compact representation and compression solutions for static and dynamic Gaussian Splats, namely our Static and Dynamic GSCodec, achieving competitive rate-distortion performance in static and dynamic GS compression. The code for our framework is publicly available at https://github.com/JasonLSC/GSCodec_Studio , to advance the research on Gaussian Splats compression.
中文:GSCodec Studio是一个集成了多种3D/4D高斯泼溅重建与压缩技术的统一框架,能够高效开发静态和动态场景的紧凑表示,并实现卓越的压缩性能。
English: GSCodec Studio is a unified framework that integrates various 3D/4D Gaussian Splat reconstruction and compression techniques, enabling efficient development of compact representations for static and dynamic scenes with competitive performance.
Authors:Manuel-Andreas Schneider, Lukas Höllein, Matthias Nießner
Abstract:
Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside of a scene, i.e., they produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360-degree panorama. Then, we expand each scene by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling realistic and unrestricted exploration for the first time. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
中文摘要:WorldExplorer提出了一种基于自回归视频轨迹生成的新方法,通过迭代场景生成流程和三维高斯溅射优化,首次实现了在大型相机运动下保持稳定的高质量可导航三维场景,突破了传统方法在场景探索中的视角限制。
English Summary: WorldExplorer introduces an autoregressive video trajectory method that creates navigable 3D scenes with consistent visual quality across viewpoints, overcoming previous limitations in scene exploration by integrating multi-view generation with 3D Gaussian Splatting.
Authors:Yijin Guo, Kaiyuan Ji, Xiaorong Zhu, Junying Wang, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai
Abstract:
Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving Deepseek R1, OpenAI o3 mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is https://github.com/yijinguo/Human-Centric-Evaluation.
中文摘要:本研究提出以人为中心的评估框架,通过问题解决能力和交互体验等主观维度弥补客观评估的不足,在大量用户参与的实验中发现Grok 3表现最佳,为LLM开发提供了新的评估方法和丰富数据集。
English Summary: This study introduces a Human-Centric Evaluation framework to address the limitations of objective metrics by assessing foundation models through subjective dimensions like problem-solving and interaction quality, revealing Grok 3's top performance among tested models through extensive user-driven experiments.
Authors:Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong
Abstract:
We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach, IF-Guide, which leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-Guide does not rely on human-preference data, which is typically required by existing alignment methods. In evaluation, we demonstrate that IF-Guide substantially reduces both explicit and implicit toxicity, by up to 10x compared to uncensored models and up to 3x compared to baseline alignment methods such as DPO and RAD, across both pre-training and fine-tuning scenarios. IF-Guide is computationally efficient: a billion-parameter model is not necessary for computing influence scores; a million-parameter model, with 7.5x fewer parameters, can effectively serve as a proxy for identifying harmful data. Our code is publicly available at: https://github.com/ztcoalson/IF-Guide
Chinese: 本文提出IF-Guide方法,通过影响函数主动识别并抑制训练数据中的有害标记,无需依赖人工偏好数据即可显著降低模型毒性,其效果优于现有对齐技术。
English: This paper introduces IF-Guide, a proactive method that uses influence functions to detect and suppress harmful tokens in training data, significantly reducing model toxicity without relying on human-preference data and outperforming existing alignment techniques.
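To make the influence-function idea concrete, here is a simplified, sequence-level sketch of first-order influence scoring in PyTorch: the influence of a training batch on toxicity is approximated by the dot product of its loss gradient with the gradient of a loss on toxic probe text, with Hessian terms omitted. The actual method works at token level; this only illustrates the scoring principle under those stated simplifications.

    import torch

    def influence_scores(model, train_batches, toxic_batch, loss_fn):
        """Approximate each batch's influence on a toxicity probe set."""
        params = [p for p in model.parameters() if p.requires_grad]

        # Gradient of the toxicity query loss, computed once.
        model.zero_grad()
        loss_fn(model, toxic_batch).backward()
        g_toxic = [p.grad.detach().clone() for p in params]

        scores = []
        for batch in train_batches:
            model.zero_grad()
            loss_fn(model, batch).backward()
            # High positive alignment -> batch pushes the model toward toxicity.
            score = sum((p.grad * g).sum() for p, g in zip(params, g_toxic))
            scores.append(score.item())
        return scores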
Authors:Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
Abstract:
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
中文: 本立场文件提出DataRubrics框架,通过基于量规的指标和LLM驱动的评估,系统性地评判数据集质量,以解决数据研究中原创性、透明度和可复现性不足的问题。
English: This position paper proposes DataRubrics, a structured framework using rubric-based metrics and LLM-powered evaluation to systematically assess dataset quality, addressing gaps in originality, transparency, and reproducibility in data-centric research.
Authors:Yafei Yang, Zihui Zhang, Bo Yang
Abstract:
We study the challenging problem of unsupervised multi-object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real-world objects. In this paper, we introduce unMORE, a novel two-stage pipeline designed to identify many complex objects in real-world images. The key to our approach involves explicitly learning three levels of carefully defined object-centric representations in the first stage. Subsequently, our multi-object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network-free and does not require human labels. Extensive experiments demonstrate that unMORE significantly outperforms all existing unsupervised methods across 6 real-world benchmark datasets, including the challenging COCO dataset, achieving state-of-the-art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse.
中文: 本文提出unMORE这一新型两阶段无监督方法,通过习得物体中心化表征并采用无需网络结构的推理模块,在复杂真实图像上实现了最先进的多目标分割效果,显著超越了现有方法。
English: This paper introduces unMORE, a novel two-stage unsupervised method that learns object-centric representations and uses a network-free reasoning module to achieve state-of-the-art multi-object segmentation on complex real-world images, significantly outperforming existing approaches.
Authors:Zeming Wei, Chengcan Wu, Meng Sun
Abstract:
Large Language Models (LLMs) have achieved significant success in various tasks, yet concerns about their safety and security have emerged. In particular, they pose risks in generating harmful content and vulnerability to jailbreaking attacks. To analyze and monitor machine learning models, model-based analysis has demonstrated notable potential in stateful deep neural networks, yet suffers from scalability issues when extending to LLMs due to their vast feature spaces. In this paper, we propose ReGA, a model-based analysis framework with representation-guided abstraction, to safeguard LLMs against harmful prompts and generations. By leveraging safety-critical representations, which are low-dimensional directions emerging in hidden states that indicate safety-related concepts, ReGA effectively addresses the scalability issue when constructing the abstract model for safety modeling. Our comprehensive evaluation shows that ReGA performs sufficiently well in distinguishing between safe and harmful inputs, achieving an AUROC of 0.975 at the prompt level and 0.985 at the conversation level. Additionally, ReGA exhibits robustness to real-world attacks and generalization across different safety perspectives, outperforming existing safeguard paradigms in terms of interpretability and scalability. Overall, ReGA serves as an efficient and scalable solution to enhance LLM safety by integrating representation engineering with model-based abstraction, paving the way for new paradigms to utilize software insights for AI safety. Our code is available at https://github.com/weizeming/ReGA.
中文摘要:本文提出ReGA框架,通过表征引导的抽象方法构建可扩展的安全模型,有效提升大语言模型对恶意内容的识别与防御能力,在安全性和可解释性方面表现优异。
English Summary: This paper introduces ReGA, a representation-guided abstraction framework that enhances Large Language Model safety by effectively identifying and blocking harmful content through scalable model-based analysis.
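The notion of a safety-critical representation can be made tangible with a difference-of-means probe: estimate a direction in hidden-state space that separates safe from harmful prompts, then score new inputs by their projection onto it. This is a common, simple estimator for concept directions, offered as an illustration rather than ReGA's actual abstraction procedure.

    import numpy as np

    def fit_safety_direction(h_safe, h_harmful):
        """Estimate a safety-critical direction from hidden states.

        h_safe, h_harmful: (n, d) hidden-state matrices collected from the
        LLM on safe and harmful prompts, respectively.
        """
        direction = h_harmful.mean(axis=0) - h_safe.mean(axis=0)
        return direction / np.linalg.norm(direction)

    def harmfulness_score(hidden, direction):
        """Project a hidden state onto the direction; higher = riskier."""
        return float(hidden @ direction)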
Authors:Tao Yang, Ruibin Li, Yangming Shi, Yuqi Zhang, Qide Dong, Haoran Cheng, Weiguo Feng, Shilei Wen, Bingyue Peng, Lei Zhang
Abstract:
Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially text-to-video (T2V) generation, while many other works focus on finetuning the pretrained T2V model for image-to-video (I2V), video-to-video (V2V), image and video manipulation tasks, etc. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or several tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. Our joint learning leads to a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. Our models and source code are available at https://github.com/leeruibin/MfM.git.
中文: 本文提出了一种多对多的统一框架,通过联合图像-视频学习和深度图条件,训练单一模型处理多种视觉生成与编辑任务,实现了可执行十余种任务的高竞争力性能。
English: This paper introduces a unified many-for-many framework that trains a single model for multiple visual generation and manipulation tasks using joint image-video learning and depth maps, achieving competitive performance with models capable of over 10 different tasks.
Authors:Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi
Abstract:
In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models. Code is available at https://github.com/hrtan/InfoMax.
中文:InfoMax是一种新颖的数据剪枝方法,通过将核心集选择构建为离散二次规划问题来最大化样本信息量并最小化冗余,实验证明其在多项任务中均表现出卓越性能。
English: InfoMax is a novel data pruning method that maximizes information content and minimizes redundancy in selected samples by formulating coreset selection as a discrete quadratic programming problem, with experiments showing its superior performance across multiple tasks.
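The objective is easy to state: maximize the sum of selected importance scores minus a penalty on pairwise similarity inside the coreset. The paper solves this DQP with a gradient-based solver at scale; the greedy loop below is only a small-scale illustration of the same trade-off.

    import numpy as np

    def greedy_infomax(importance, similarity, budget, lam=1.0):
        """Greedy coreset selection maximizing sum(s_i) - lam * sum(K_ij).

        importance: (n,) per-sample importance scores
        similarity: (n, n) pairwise similarity (redundancy) matrix
        """
        n = len(importance)
        mask = np.zeros(n, dtype=bool)
        redundancy = np.zeros(n)             # similarity mass to selected set
        selected = []
        for _ in range(budget):
            gain = np.where(mask, -np.inf, importance - lam * redundancy)
            best = int(np.argmax(gain))
            selected.append(best)
            mask[best] = True
            redundancy += similarity[best]   # update marginal penalties
        return selected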
Authors:Anya Sims, Thom Foster, Klara Kaleb, Tuan-Duy H. Nguyen, Joseph Lee, Jakob N. Foerster, Yee Whye Teh, Cong Lu
Abstract:
Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like "How many 'r's in 'strawberry'?". A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper, we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok's simplicity allows seamless integration at any stage of the training pipeline; and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: https://github.com/anyasims/stochastok.
Chinese Summary: 本文提出StochasTok随机分词方法,通过在训练中随机分割词汇来增强大语言模型的子词理解能力,有效提升了字符计数和数学运算等任务的性能,且无需昂贵的重新预训练。
English Summary: This paper introduces StochasTok, a stochastic tokenization method that enhances large language models' subword-level understanding by randomly splitting tokens during training, improving performance on tasks like character counting and math problems without requiring costly retraining.
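The scheme itself is a one-loop change to preprocessing. A simplified sketch assuming a Hugging Face-style tokenizer interface: with some probability, a token's string is cut at a random character position and each half re-encoded, so the decoded text is unchanged while the model sees intra-token structure. The released implementation is more careful, e.g., about where valid splits exist; this is illustrative only.

    import random

    def stochastok_expand(token_ids, tokenizer, p_split=0.1, rng=random):
        """Randomly split tokens into equivalent sub-token sequences."""
        out = []
        for tid in token_ids:
            text = tokenizer.decode([tid])
            if len(text) > 1 and rng.random() < p_split:
                cut = rng.randrange(1, len(text))
                # Re-encode each half; their concatenation decodes to the
                # same text as the original token.
                out += tokenizer.encode(text[:cut], add_special_tokens=False)
                out += tokenizer.encode(text[cut:], add_special_tokens=False)
            else:
                out.append(tid)
        return out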
Authors:Chong Li, Chenglin Zhu, Tao Zhang, Mingan Lin, Zenan Zhou, Jian Xie
Abstract:
Multimodal large language models have demonstrated remarkable reasoning capabilities in various visual tasks. However, their abilities in K12 scenarios are still systematically underexplored. Previous studies suffer from various limitations including narrow subject coverage, insufficient data scale, lack of diversity in question types, and naive answer-centric evaluation methods, resulting in insufficient exploration of model capabilities. To address these gaps, we propose K12Vista, the most comprehensive multimodal benchmark for Chinese K12 subject knowledge understanding and reasoning to date, featuring 33,000 questions across five core subjects from primary to high school and three question types. Moreover, beyond the final outcome, we are also concerned with the correctness of MLLMs' reasoning processes. For this purpose, we meticulously compile errors from MLLMs' reasoning processes and leverage an automated data pipeline to construct K12-PEM-800K, the largest process evaluation dataset offering detailed step-by-step judgement annotations for MLLMs' reasoning. Subsequently, we develop K12-PEM, an advanced process evaluation model that integrates an overall assessment of both the reasoning process and answer correctness. Moreover, we introduce K12-PEBench, the first high-quality, human-annotated benchmark specifically designed for evaluating reasoning-process evaluation abilities. Extensive experiments reveal that current MLLMs exhibit significant flaws when reasoning within K12Vista, providing critical insights for the development of more capable MLLMs. We open our resources at https://github.com/lichongod/K12Vista.
中文: 该摘要介绍了K12Vista,这是一个用于评估多模态大语言模型在中文K12学科知识理解与推理能力的综合性基准,同时构建了K12-PEM-800K和K12-PEBench进行细粒度过程评估,揭示了现有模型在推理过程中存在显著缺陷。
English: This abstract introduces K12Vista, a comprehensive multimodal benchmark for evaluating Chinese K12 subject knowledge understanding and reasoning in MLLMs, along with K12-PEM-800K and K12-PEBench for detailed process evaluation, revealing significant reasoning flaws in current models.
Authors:Sunkyung Lee, Minjin Choi, Eunseong Choi, Hye-young Kim, Jongwuk Lee
Abstract:
Generative recommendation is an emerging paradigm that leverages the extensive knowledge of large language models by formulating recommendations into a text-to-text generation task. However, existing studies face two key limitations in (i) incorporating implicit item relationships and (ii) utilizing rich yet lengthy item information. To address these challenges, we propose a Generative Recommender via semantic-Aware Multi-granular late fusion (GRAM), introducing two synergistic innovations. First, we design semantic-to-lexical translation to encode implicit hierarchical and collaborative item relationships into the vocabulary space of LLMs. Second, we present multi-granular late fusion to integrate rich semantics efficiently with minimal information loss. It employs separate encoders for multi-granular prompts, delaying the fusion until the decoding stage. Experiments on four benchmark datasets show that GRAM outperforms eight state-of-the-art generative recommendation models, achieving significant improvements of 11.5-16.0% in Recall@5 and 5.3-13.6% in NDCG@5. The source code is available at https://github.com/skleee/GRAM.
中文:GRAM模型通过将隐含的项目关系编码至语言模型词汇空间,并采用多粒度延迟融合高效整合丰富语义,在生成式推荐任务中相比现有方法实现了显著性能提升。
English: The proposed GRAM model enhances generative recommendation by encoding implicit item relationships into language model vocabulary and employing multi-granular late fusion to efficiently integrate rich item semantics, achieving significant performance improvements over existing methods.
Authors:Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil
Abstract:
Efficiently compiling quantum operations remains a major bottleneck in scaling quantum computing. Today's state-of-the-art methods achieve low compilation error by combining search algorithms with gradient-based parameter optimization, but they incur long runtimes and require multiple calls to quantum hardware or expensive classical simulations, making their scaling prohibitive. Recently, machine-learning models have emerged as an alternative, though they are currently restricted to discrete gate sets. Here, we introduce a multimodal denoising diffusion model that simultaneously generates a circuit's structure and its continuous parameters for compiling a target unitary. It leverages two independent diffusion processes, one for discrete gate selection and one for parameter prediction. We benchmark the model across a range of experiments, analyzing its accuracy for varying qubit counts, circuit depths, and proportions of parameterized gates. Finally, by exploiting its rapid circuit generation, we create large datasets of circuits for particular operations and use these to extract valuable heuristics that can help us discover new insights into quantum circuit synthesis.
中文摘要:本文提出一种多模态去噪扩散模型,能同时生成量子电路结构和连续参数来高效编译目标酉矩阵,突破了现有方法的局限,并实现快速电路生成以构建数据集和提取启发式规则。
English Summary: This paper introduces a multimodal denoising diffusion model that simultaneously generates quantum circuit structures and continuous parameters to efficiently compile target unitaries, overcoming limitations of current methods while enabling rapid circuit generation for dataset creation and heuristic extraction.
Authors:Xuan Yu, Dayan Guan, Yanfeng Gu
Abstract:
Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately, where fine-grained details are crucial for complex visual understanding. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Zoom-Refine operates through a synergistic process of Localized Zoom and Self-Refinement. In the Localized Zoom step, Zoom-Refine leverages the MLLM to provide a preliminary response to an input query and identifies the most task-relevant image region by predicting its bounding box coordinates. During the Self-Refinement step, Zoom-Refine then integrates fine-grained details from the high-resolution crop (identified by Localized Zoom) with its initial reasoning to re-evaluate and refine its preliminary response. Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts. Comprehensive experiments demonstrate the efficacy of Zoom-Refine on two challenging high-resolution multimodal benchmarks. Code is available at https://github.com/xavier-yu114/Zoom-Refine
中文: Zoom-Refine是一种无需训练的方法,通过初步响应、定位关键区域和利用细节信息自我优化的协同过程,提升多模态大语言模型对高分辨率图像的精确理解能力。
English: Zoom-Refine is a training-free method that enhances multimodal large language models' ability to interpret high-resolution images by first providing a preliminary response, identifying key regions, and then refining the answer using detailed visual information.
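Because the method is training-free, its whole control flow fits in a short function. The sketch below assumes a hypothetical mllm.answer(image, prompt) interface and a PIL-style image.crop; the prompts and the box parser are illustrative, not the paper's exact ones.

    import re

    def parse_box(text):
        """Pull four integers (x1, y1, x2, y2) out of the model's reply."""
        return tuple(map(int, re.findall(r"-?\d+", text)[:4]))

    def zoom_refine(mllm, image, query):
        """Localized Zoom followed by Self-Refinement."""
        draft = mllm.answer(image, query)                  # preliminary answer
        box_reply = mllm.answer(
            image, "Give the bounding box (x1, y1, x2, y2) of the region "
                   f"most relevant to: {query}")
        crop = image.crop(parse_box(box_reply))            # high-res region
        return mllm.answer(
            crop, f"Initial answer: {draft}. Using this high-resolution "
                  f"crop, verify and refine the answer to: {query}")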
Authors:Zixiao Zhu, Kezhi Mao
Abstract:
Pre-trained language models such as BERT have proven powerful in many natural language processing tasks. But in some text classification applications such as emotion recognition and sentiment analysis, BERT may not lead to satisfactory performance. This often happens in applications where keywords play critical roles in the prediction of class labels. Our investigation found that the root cause of the problem is that the context-based BERT embedding of the keywords may not be discriminative enough to produce discriminative text representation for classification. Motivated by this finding, we develop a method to enhance word embeddings using domain-specific lexical knowledge. The knowledge-based embedding enhancement model projects the BERT embedding into a new space where within-class similarity and between-class difference are maximized. To implement the knowledge-based word embedding enhancement model, we also develop a knowledge acquisition algorithm for automatically collecting lexical knowledge from online open sources. Experimental results on three classification tasks, including sentiment analysis, emotion recognition, and question answering, have shown the effectiveness of our proposed word embedding enhancing model. The code and datasets are available at https://github.com/MidiyaZhu/KVWEFFER.
Chinese Summary: 本研究提出了一种利用领域特定词汇知识增强BERT词嵌入的方法,通过最大化类内相似性和类间差异,有效提升了情感分析等分类任务的性能。
English Summary: The study introduces a method to enhance BERT word embeddings using domain-specific lexical knowledge, which improves classification performance in tasks like sentiment analysis by maximizing within-class similarity and between-class differences.
Authors:Karl El Hajal, Enno Hermann, Sevada Hovsepyan, Mathew Magimai.-Doss
Abstract:
Automatic speech recognition (ASR) systems struggle with dysarthric speech due to high inter-speaker variability and slow speaking rates. To address this, we explore dysarthric-to-healthy speech conversion for improved ASR performance. Our approach extends the Rhythm and Voice (RnV) conversion framework by introducing a syllable-based rhythm modeling method suited for dysarthric speech. We assess its impact on ASR by training LF-MMI models and fine-tuning Whisper on converted speech. Experiments on the Torgo corpus reveal that LF-MMI achieves significant word error rate reductions, especially for more severe cases of dysarthria, while fine-tuning Whisper on converted data has minimal effect on its performance. These results highlight the potential of unsupervised rhythm and voice conversion for dysarthric ASR. Code available at: https://github.com/idiap/RnV
中文: 本研究通过在RnV转换框架中引入音节节奏建模方法,显著改善了针对构音障碍语音的自动语音识别性能,其中LF-MMI模型在严重病例上大幅降低了词错率,而微调Whisper模型效果甚微。
English: This study enhances ASR performance for dysarthric speech by introducing a syllable-based rhythm modeling method within the RnV conversion framework, achieving significant word error rate reductions in severe cases through LF-MMI models while showing minimal impact when fine-tuning Whisper.
Authors:Wangyou Zhang, Kohei Saijo, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Wei Wang, Yihui Fu, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian
Abstract:
The URGENT 2024 Challenge aims to foster speech enhancement (SE) techniques with great universality, robustness, and generalizability, featuring a broader task definition, large-scale multi-domain data, and comprehensive evaluation metrics. Nourished by the challenge outcomes, this paper presents an in-depth analysis of two key, yet understudied, issues in SE system development: data cleaning and evaluation metrics. We highlight several overlooked problems in traditional SE pipelines: (1) mismatches between declared and effective audio bandwidths, along with label noise even in various "high-quality" speech corpora; (2) lack of both effective SE systems to conquer the hardest conditions (e.g., speech overlap, strong noise / reverberation) and reliable measure of speech sample difficulty; (3) importance of combining multifaceted metrics for a comprehensive evaluation correlating well with human judgment. We hope that this endeavor can inspire improved SE pipeline designs in the future.
Chinese: URGENT 2024挑战赛旨在推动通用稳健的语音增强技术,本文则深入分析了数据质量与评估标准等关键问题,以优化系统开发。
English: The URGENT 2024 Challenge promotes universal and robust speech enhancement techniques, while this paper analyzes critical issues like data quality and evaluation metrics to improve system development.
Authors:Andy Bonnetto, Haozhe Qi, Franklin Leong, Matea Tashkovska, Mahdi Rad, Solaiman Shokur, Friedhelm Hummel, Silvestro Micera, Marc Pollefeys, Alexander Mathis
Abstract:
Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens 2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multi-modal action recognition benchmark, 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights to understand the nature of ecologically-valid human behavior. Code and data are available at https://github.com/amathislab/EPFL-Smart-Kitchen
中文: EPFL-Smart-Kitchen-30数据集通过同步多模态传感器捕捉烹饪时的人类行为,并设立四项基准测试以推动生态效度行为研究的发展。
English: The EPFL-Smart-Kitchen-30 dataset captures multi-modal human movements during cooking tasks using synchronized sensors to advance behavior understanding through four proposed benchmarks.
Authors:Yuan Gan, Jiaxu Miao, Yunze Wang, Yi Yang
Abstract:
Advances in talking-head animation based on Latent Diffusion Models (LDM) enable the creation of highly realistic, synchronized videos. These fabricated videos are indistinguishable from real ones, increasing the risk of potential misuse for scams, political manipulation, and misinformation. Hence, addressing these ethical concerns has become a pressing issue in AI security. Recent proactive defense studies focused on countering LDM-based models by adding perturbations to portraits. However, these methods are ineffective at protecting reference portraits from advanced image-to-video animation. The limitations are twofold: 1) they fail to prevent images from being manipulated by audio signals, and 2) diffusion-based purification techniques can effectively eliminate protective perturbations. To address these challenges, we propose Silencer, a two-stage method designed to proactively protect the privacy of portraits. First, a nullifying loss is proposed to ignore audio control in talking-head generation. Second, we apply anti-purification loss in LDM to optimize the inverted latent feature to generate robust perturbations. Extensive experiments demonstrate the effectiveness of Silencer in proactively protecting portrait privacy. We hope this work will raise awareness among the AI security community regarding critical ethical issues related to talking-head generation techniques. Code: https://github.com/yuangan/Silencer.
中文: 基于潜在扩散模型的说话头部动画技术进展带来了高度逼真的视频,引发了伦理担忧,为此提出了Silencer这一两阶段方法,通过忽略音频控制和抵抗净化技术来主动保护肖像隐私。
English: Advances in talking-head animation using Latent Diffusion Models create realistic videos that raise ethical concerns, leading to the development of Silencer, a two-stage method that protects portrait privacy by ignoring audio control and resisting purification techniques.
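To make the nullifying idea concrete, the toy sketch below optimizes a perturbation so that a generator's audio-conditioned output matches its output under null (silent) audio; the generator here is a stub standing in for a talking-head model, whereas Silencer's actual losses operate on an LDM and add the anti-purification stage:

```python
import torch

class StubGenerator(torch.nn.Module):
    """Stand-in for a talking-head generator mapping (image, audio) features to a frame."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(16 + 8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 16)
        )
    def forward(self, image, audio):
        return self.net(torch.cat([image, audio], dim=-1))

gen = StubGenerator().requires_grad_(False)          # attack the input, not the model
image = torch.randn(1, 16)                           # portrait features
audio, null_audio = torch.randn(1, 8), torch.zeros(1, 8)
delta = torch.zeros_like(image, requires_grad=True)  # protective perturbation

for _ in range(50):
    # nullifying-style loss: audio should stop influencing the output
    loss = ((gen(image + delta, audio) - gen(image + delta, null_audio)) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        delta -= 0.1 * delta.grad
        delta.grad.zero_()
```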
Authors:Satvik Dixit, Sungjoon Park, Chris Donahue, Laurie M. Heller
Abstract:
Temporal envelope morphing, the process of interpolating between the amplitude dynamics of two audio signals, is an emerging problem in generative audio systems that lacks sufficient perceptual grounding. Morphing of temporal envelopes in a perceptually intuitive manner should enable new methods for sound blending in creative media and for probing perceptual organization in psychoacoustics. However, existing audio morphing techniques often fail to produce intermediate temporal envelopes when input sounds have distinct temporal structures; many morphers effectively overlay both temporal structures, leading to perceptually unnatural results. In this paper, we introduce a novel workflow for learning envelope morphing with perceptual guidance: we first derive perceptually grounded morphing principles through human listening studies, then synthesize large-scale datasets encoding these principles, and finally train machine learning models to create perceptually intermediate morphs. Specifically, we present: (1) perceptual principles that guide envelope morphing, derived from our listening studies, (2) a supervised framework to learn these principles, (3) an autoencoder that learns to compress temporal envelope structures into latent representations, and (4) benchmarks for evaluating audio envelope morphs, using both synthetic and naturalistic data, and show that our approach outperforms existing methods in producing temporally intermediate morphs. All code, models, and checkpoints are available at https://github.com/TemporalMorphing/EnvelopeMorphing.
中文: 本文提出了一种感知引导的时间包络变形工作流程,通过人类听觉研究得出变形原则并训练机器学习模型,在生成中间音频变形方面优于现有方法。
English: This paper introduces a perceptually guided workflow for temporal envelope morphing, using human listening studies to derive principles and training machine learning models that outperform existing methods in producing intermediate audio morphs.
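The overlay failure mode described in the abstract is easy to see numerically: linearly mixing two envelopes with distinct event times keeps both events instead of producing one intermediate event (synthetic toy values):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 500)
env_a = np.exp(-((t - 0.2) / 0.05) ** 2)   # early burst
env_b = np.exp(-((t - 0.8) / 0.05) ** 2)   # late burst

naive = 0.5 * (env_a + env_b)              # two bumps: perceptually "both sounds"
# A perceptually intermediate morph would instead place one event near t = 0.5,
# which is the behavior the learned morphing models are trained to produce.
intermediate = np.exp(-((t - 0.5) / 0.05) ** 2)
print(t[naive.argmax()], t[intermediate.argmax()])   # ~0.2 vs. 0.5
```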
Authors:Yiming Zhong, Yumeng Liu, Chuyang Xiao, Zemin Yang, Youzhuo Wang, Yufei Zhu, Ye Shi, Yujing Sun, Xinge Zhu, Yuexin Ma
Abstract:
Learning effective visuomotor policies for robotic manipulation is challenging, as it requires generating precise actions while maintaining computational efficiency. Existing methods remain unsatisfactory due to inherent limitations in the essential action representation and the basic network architectures. We observe that representing actions in the frequency domain captures the structured nature of motion more effectively: low-frequency components reflect global movement patterns, while high-frequency components encode fine local details. Additionally, robotic manipulation tasks of varying complexity demand different levels of modeling precision across these frequency bands. Motivated by this, we propose a novel paradigm for visuomotor policy learning that progressively models hierarchical frequency components. To further enhance precision, we introduce continuous latent representations that maintain smoothness and continuity in the action space. Extensive experiments across diverse 2D and 3D robotic manipulation benchmarks demonstrate that our approach outperforms existing methods in both accuracy and efficiency, showcasing the potential of a frequency-domain autoregressive framework with continuous tokens for generalized robotic manipulation. Code is available at https://github.com/4DVLab/Freqpolicy
Chinese: 本文提出了一种频域自回归框架,通过连续潜在表示渐进建模分层频率分量,有效提升了机器人视觉运动策略在操作任务中的精确度和效率。
English: This paper introduces a frequency-domain autoregressive framework with continuous latent representations for robotic visuomotor policy learning, which progressively models hierarchical frequency components to enhance both accuracy and efficiency in manipulation tasks.
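The frequency-domain intuition can be sketched with a plain DCT of a one-dimensional action trajectory: the low-frequency coefficients recover the global movement, and the residual is the fine local detail. This standalone decomposition illustrates the representation only, not the paper's policy network:

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(size=64))       # toy 1-D action trajectory
coeffs = dct(traj, norm="ortho")

coarse = coeffs.copy()
coarse[8:] = 0.0                            # keep 8 low-frequency terms
recon = idct(coarse, norm="ortho")          # smooth global movement pattern
print(np.abs(traj - recon).mean())          # residual = fine local detail
```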
Authors:Roman Plaud, Alexandre Perez-Lebel, Matthieu Labeau, Antoine Saillenfest, Thomas Bonald
Abstract:
Hierarchical classification offers an approach to incorporate the concept of mistake severity by leveraging a structured, labeled hierarchy. However, decoding in such settings frequently relies on heuristic decision rules, which may not align with task-specific evaluation metrics. In this work, we propose a framework for the optimal decoding of an output probability distribution with respect to a target metric. We derive optimal decision rules for increasingly complex prediction settings, providing universal algorithms when candidates are limited to the set of nodes. In the most general case of predicting a subset of nodes, we focus on rules dedicated to the hierarchical $hF_\beta$ scores, tailored to hierarchical settings. To demonstrate the practical utility of our approach, we conduct extensive empirical evaluations, showcasing the superiority of our proposed optimal strategies, particularly in underdetermined scenarios. These results highlight the potential of our methods to enhance the performance and reliability of hierarchical classifiers in real-world applications. The code is available at https://github.com/RomanPlaud/hierarchical_decision_rules
中文摘要:本文提出了一种针对层次分类的最优解码框架,通过将决策规则与任务特定指标对齐,在实证评估中展现出优越性能。
English Summary: This paper introduces an optimal decoding framework for hierarchical classification that aligns decision rules with task-specific metrics, demonstrating improved performance in empirical evaluations.
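As a toy instance of metric-aware decoding (our illustration, not the paper's algorithms verbatim): given a leaf posterior and the standard ancestor-set definition of the hierarchical F-score, one can select the single node maximizing the expected $hF_\beta$, which hedges to an internal node exactly when the posterior is underdetermined:

```python
# Toy hierarchy: four leaves under one internal node "animal".
parent = {"cat": "animal", "dog": "animal", "fox": "animal", "wolf": "animal",
          "animal": None}
leaf_posterior = {"cat": 0.25, "dog": 0.25, "fox": 0.25, "wolf": 0.25}

def ancestors(n):
    out = set()
    while n is not None:
        out.add(n)
        n = parent[n]
    return out

def hF(pred, true, beta=1.0):
    """Hierarchical F-beta between a predicted node and a true leaf."""
    P, T = ancestors(pred), ancestors(true)
    inter = len(P & T)
    prec, rec = inter / len(P), inter / len(T)
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

best = max(parent, key=lambda n: sum(p * hF(n, leaf)
                                     for leaf, p in leaf_posterior.items()))
print(best)  # "animal": under this flat posterior, hedging to the parent wins
```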
Authors:Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang
Abstract:
Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, making the mapping difficult to learn and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, but the complexity of the navigation task makes perfect CoT labels unavailable and may lead to overfitting through pure CoT supervised fine-tuning. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.
中文: 本文提出EvolveNav自演进框架,通过链式思维监督微调和自反思后训练两阶段方法,利用大语言模型的推理能力提升视觉语言导航任务的准确性与可解释性。
English: This paper introduces EvolveNav, a self-improving framework that enhances vision-language navigation by training LLMs with formalized chain-of-thought reasoning and iterative self-reflective post-training to boost both accuracy and interpretability.
Authors:Wenhao Liu, Zhenyi Lu, Xinyu Hu, Jierui Zhang, Dailin Li, Jiacheng Cen, Huilin Cao, Haiteng Wang, Yuhan Li, Kun Xie, Dandan Li, Pei Zhang, Chengbo Zhang, Yuxiang Ren, Xiaohong Huang, Yan Ma
Abstract:
High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficiently challenging content, neglect of human-like reasoning, and limited reliability due to single-LLM generation. To address these, we introduce STORM-BORN, an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues. To ensure reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework, integrating reasoning-dense filters, multi-agent collaboration, and human mathematicians' evaluations. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even the most advanced models, such as GPT-o1, solved fewer than 5% of them. Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B). As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at https://github.com/lwhere/STORM-BORN.
中文: STORM-BORN数据集通过从学术论文中提取极具挑战性的数学推导问题,结合人类式推理线索和多智能体生成框架,解决了现有数据集内容陈旧和可靠性不足的问题,即使最先进模型也仅能解决不足5%的题目,但使用该数据集微调可显著提升模型性能。
English: The STORM-BORN dataset addresses limitations in existing math datasets by providing ultra-challenging problems derived from academic papers with human-like reasoning cues, and its multi-agent generation framework ensures high quality, significantly boosting model performance despite low solve rates by advanced models.
Authors:Kaixun Jiang, Zhaoyu Chen, Haijing Guo, Jinglun Li, Jiyuan Fu, Pinxue Guo, Hao Tang, Bo Li, Wenqiang Zhang
Abstract:
Preference alignment in diffusion models has primarily focused on benign human preferences (e.g., aesthetic). In this paper, we propose a novel perspective: framing unrestricted adversarial example generation as a problem of aligning with adversary preferences. Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectiveness, which often lead to unstable optimization and reward hacking (e.g., reducing visual quality to improve attack success). To address this, we propose APA (Adversary Preferences Alignment), a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. In the first stage, APA fine-tunes LoRA to improve visual consistency using rule-based similarity reward. In the second stage, APA updates either the image latent or prompt embedding based on feedback from a substitute classifier, guided by trajectory-level and step-wise rewards. To enhance black-box transferability, we further incorporate a diffusion augmentation strategy. Experiments demonstrate that APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective. Code will be available at https://github.com/deep-kaixun/APA.
中文摘要:本文提出APA框架,通过解耦视觉一致性与攻击效果的冲突性偏好,实现扩散模型与对抗性偏好的对齐,在保持图像质量的同时显著提升攻击迁移性。
English Summary: This paper introduces APA, a two-stage framework that aligns diffusion models with adversary preferences by decoupling conflicting visual and attack objectives, achieving superior transferability while preserving image quality.
Authors:Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
Abstract:
Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Values Corpus (CVC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results show that CVC-guided scenarios outperform direct generation ones in value boundaries and content diversity. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with CVC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics. All data are available at https://huggingface.co/datasets/Beijing-AISI/CVC, and the code is available at https://github.com/Beijing-AISI/CVC.
中文摘要:本研究提出基于中国核心价值观的分层价值框架,构建大规模中文价值观语料库(CVC)以解决大语言模型价值对齐中的文化偏见问题,实验证明CVC在价值边界界定和文化适应性方面表现优异,并为价值评估提供可扩展的基准场景。
English Summary: This study introduces a hierarchical Chinese values framework and constructs a large-scale Chinese Values Corpus (CVC) to address cultural biases in LLM alignment, demonstrating through experiments that CVC effectively enhances value boundary definition and cultural relevance while providing scalable evaluation scenarios.
Authors:Long Yao, Wenzhong Yang, Yabo Yin, Fuyuan Wei, Hongzhen Lv, Jiaren Peng, Liejun Wang, Xiaoming Tao
Abstract:
Cross-document Event Coreference Resolution (CD-ECR) is a fundamental task in natural language processing (NLP) that seeks to determine whether event mentions across multiple documents refer to the same real-world occurrence. However, current CD-ECR approaches predominantly rely on trigger features within input mention pairs, which induce spurious correlations between surface-level lexical features and coreference relationships, impairing the overall performance of the models. To address this issue, we propose a novel cross-document event coreference resolution method based on Argument-Centric Causal Intervention (ACCI). Specifically, we construct a structural causal graph to uncover confounding dependencies between lexical triggers and coreference labels, and introduce backdoor-adjusted interventions to isolate the true causal effect of argument semantics. To further mitigate spurious correlations, ACCI integrates a counterfactual reasoning module that quantifies the causal influence of trigger word perturbations, and an argument-aware enhancement module to promote greater sensitivity to semantically grounded information. In contrast to prior methods that depend on costly data augmentation or heuristic-based filtering, ACCI enables effective debiasing in a unified end-to-end framework without altering the underlying training procedure. Extensive experiments demonstrate that ACCI achieves CoNLL F1 of 88.4% on ECB+ and 85.2% on GVC, achieving state-of-the-art performance. The implementation and materials are available at https://github.com/era211/ACCI.
中文: 本文提出了一种新颖的跨文档事件共指消解方法ACCI,通过基于论据的因果干预消除词汇触发词的伪相关,在基准测试中实现了最先进的性能。
English: This paper introduces a novel cross-document event coreference resolution method called ACCI, which uses argument-centric causal intervention to eliminate spurious correlations from lexical triggers and achieves state-of-the-art performance on benchmark datasets.
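The backdoor adjustment underlying ACCI has a simple numeric form: the interventional effect of trigger similarity $X$ on coreference $Y$ weights the conditional by the marginal of the confounder $Z$ rather than by $P(z \mid x)$. A toy example with invented numbers:

```python
# Backdoor formula: P(Y | do(X=x)) = sum_z P(Y | X=x, Z=z) * P(z)
P_z = {"same_topic": 0.3, "diff_topic": 0.7}          # confounder marginal
P_y_given_xz = {                                       # P(coref | trigger match, z)
    ("match", "same_topic"): 0.9,
    ("match", "diff_topic"): 0.2,
}

# Observational P(Y|X) would weight by P(z | X) and inherit the confounding;
# the intervention weights by the marginal P(z) instead.
p_do = sum(P_y_given_xz[("match", z)] * pz for z, pz in P_z.items())
print(p_do)  # 0.9 * 0.3 + 0.2 * 0.7 = 0.41
```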
Authors:Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang
Abstract:
Stage lighting plays an essential role in live music performances, influencing the engaging experience of both musicians and audiences. Given the high costs associated with hiring or training professional lighting engineers, Automatic Stage Lighting Control (ASLC) has gained increasing attention. However, most existing approaches only classify music into limited categories and map them to predefined light patterns, resulting in formulaic and monotonous outcomes that lack rationality. To address this issue, this paper presents an end-to-end solution that directly learns from experienced lighting engineers -- Skip-BART. To the best of our knowledge, this is the first work to conceptualize ASLC as a generative task rather than merely a classification problem. Our method modifies the BART model to take audio music as input and produce light hue and value (intensity) as output, incorporating a novel skip connection mechanism to enhance the relationship between music and light within the frame grid. We validate our method through both quantitative analysis and a human evaluation, demonstrating that Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only a limited gap compared to real lighting engineers. Specifically, our method yields a p-value of 0.72 in a statistical comparison based on human evaluations with human lighting engineers, suggesting that the proposed approach closely matches human lighting engineering performance. To support further research, we have made our self-collected dataset, code, and trained model parameters available at https://github.com/RS2002/Skip-BART.
中文: 本文提出Skip-BART端到端生成模型,通过向专业灯光师学习直接将音频音乐转化为灯光效果,在性能上超越传统规则方法并接近人类灯光师水平。
English: This paper introduces Skip-BART, an end-to-end generative model that directly converts audio music into lighting cues by learning from professional lighting engineers, outperforming traditional rule-based methods and closely matching human performance.
Authors:Matthew D. Fuchs
Abstract:
Policies are designed to distinguish between correct and incorrect actions; they are types. But badly typed actions may cause not compile errors, but financial and reputational harm. We demonstrate how even the most complex ABAC policies can be expressed as types in dependently typed languages such as Agda and Lean, providing a single framework to express, analyze, and implement policies. We then go head-to-head with Rego, the popular and powerful open-source ABAC policy language. We show the superior safety that comes with a powerful type system and built-in proof assistant. In passing, we discuss various access control models, sketch how to integrate into a future in which attributes are distributed and signed (as discussed at the W3C), and show how policies can be communicated using just the syntax of the language. Our examples are in Agda.
中文:本研究证明,在依赖类型语言(如Agda和Lean)中,复杂的基于属性的访问控制策略可被编码为类型,相比传统策略语言Rego,其强大的类型系统和内置证明助手能提供更高级别的安全保障。
English: This research demonstrates that complex Attribute-Based Access Control (ABAC) policies can be effectively encoded as types in dependently typed languages like Agda and Lean, offering enhanced safety through powerful type systems and built-in proof assistants compared to traditional policy languages like Rego.
Authors:Jakob Schmid, Azin Jahedi, Noah Berenguel Senn, Andrés Bruhn
Abstract:
Although multi-scale concepts have recently proven useful for recurrent network architectures in the field of optical flow and stereo, they have not yet been considered for image-based scene flow. Hence, based on a single-scale recurrent scene flow backbone, we develop a multi-scale approach that generalizes successful hierarchical ideas from optical flow to image-based scene flow. By considering suitable concepts for the feature and the context encoder, the overall coarse-to-fine framework, and the training loss, we succeed in designing a scene flow approach that outperforms the current state of the art on KITTI and Spring by 8.7% (3.89 vs. 4.26) and 65.8% (9.13 vs. 26.71), respectively. Our code is available at https://github.com/cv-stuttgart/MS-RAFT-3D.
中文: 我们提出了一种多尺度循环场景流方法,将光流中的分层理念成功应用于图像场景流,在KITTI和Spring基准测试中以显著降低误差的表现超越了现有最优方法。
English: We introduce a multi-scale recurrent scene flow method that adapts hierarchical concepts from optical flow, achieving state-of-the-art performance on KITTI and Spring benchmarks with significant error reductions.
Authors:Rafael Flor-Rodríguez, Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Sergio Lafuente-Arroyo, Roberto J. López-Sastre
Abstract:
Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, both in simulated and real world settings. We also introduce a newly curated dataset, i.e. the SEMNAV dataset, designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. We release SEMNAV dataset, code and trained models at https://github.com/gramuah/semnav
Chinese: SEMNAV是一种新颖的视觉语义导航方法,利用语义分割作为视觉输入来增强感知和决策能力,通过提升泛化性和缩小仿真与现实差距,在模拟和真实环境中均实现了卓越性能。
English: SEMNAV is a novel visual semantic navigation approach that uses semantic segmentation as visual input to enhance perception and decision-making, achieving superior performance in both simulation and real-world settings by improving generalization and bridging the sim-to-real gap.
Authors:Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun
Abstract:
Existing large language models (LLMs) face challenges in following complex instructions, especially when multiple constraints are present and organized in parallel, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve the capabilities of LLMs. However, we find that vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints to identify their relationships across hierarchies of types and dimensions. To this end, we propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, starting from the decomposition of complex instructions under existing taxonomies, we propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate a steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Evaluation on OOD constraints also confirms the generalizability of RAIF. Codes and data are available at https://github.com/yuleiqin/RAIF. Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions
中文: 现有大语言模型在处理包含多重约束的复杂指令时存在困难,而提出的RAIF方法通过基于可验证奖励的强化学习来增强推理能力,显著提升了小规模模型的性能表现。
English: Current large language models struggle with complex instructions containing multiple constraints, but the proposed RAIF method uses reinforcement learning with verifiable rewards to enhance reasoning capabilities, significantly improving performance even in smaller models.
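A toy illustration of what verifiable rule-centric reward signals can look like (the rules here are invented for illustration, not taken from RAIF): each constraint is a checkable predicate, and the reward is the fraction satisfied, so no learned reward model is needed:

```python
import re

rules = [
    lambda s: len(s.split()) <= 50,            # length constraint
    lambda s: "summary:" in s.lower(),         # format constraint
    lambda s: re.search(r"\d", s) is None,     # content constraint: no digits
]

def verifiable_reward(response: str) -> float:
    """Fraction of constraints the response satisfies, in [0, 1]."""
    return sum(rule(response) for rule in rules) / len(rules)

print(verifiable_reward("Summary: short and digit-free."))  # 1.0
```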
Authors:Minjeong Park, Hongbeen Park, Jinkyu Kim
Abstract:
The Pedestrian Attribute Recognition (PAR) task aims to identify various detailed attributes of an individual, such as clothing, accessories, and gender. To enhance PAR performance, a model must capture features ranging from coarse-grained global attributes (e.g., for identifying gender) to fine-grained local details (e.g., for recognizing accessories) that may appear in diverse regions. Recent research suggests that body part representation can enhance the model's robustness and accuracy, but these methods are often restricted to attribute classes within fixed horizontal regions, leading to degraded performance when attributes appear in varying or unexpected body locations. In this paper, we propose Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition, dubbed ViTA-PAR, to enhance attribute recognition through specialized multimodal prompting and vision-language alignment. We introduce visual attribute prompts that capture global-to-local semantics, enabling diverse attribute representations. To enrich textual embeddings, we design a learnable prompt template, termed person and attribute context prompting, to learn person and attribute context. Finally, we align visual and textual attribute features for effective fusion. ViTA-PAR is validated on four PAR benchmarks, achieving competitive performance with efficient inference. We release our code and model at https://github.com/mlnjeongpark/ViTA-PAR.
中文:ViTA-PAR模型通过多模态提示和视觉-文本属性对齐技术,结合全局到局部的语义捕捉与上下文感知特征融合,在四个基准测试中实现了优越的行人属性识别性能。
English: The proposed ViTA-PAR model enhances pedestrian attribute recognition by integrating visual and textual attribute alignment with multimodal prompting, achieving competitive performance across four benchmarks through global-to-local semantic capture and context-aware feature fusion.
Authors:Xiang Zhao, Ruijie Li, Qiao Ning, Shikai Guo, Hui Li, Qian Ma
Abstract:
The identification of drug-target interactions (DTI) is critical for drug discovery and repositioning, as it reveals potential therapeutic uses of existing drugs, accelerating development and reducing costs. However, most existing models focus only on direct similarity in homogeneous graphs, failing to exploit the rich similarity in heterogeneous graphs. To address this gap, inspired by real-world social interaction behaviors, we propose SOC-DGL, which comprises two specialized modules: the Affinity-Driven Graph Learning (ADGL) module, learning global similarity through an affinity-enhanced drug-target graph, and the Equilibrium-Driven Graph Learning (EDGL) module, capturing higher-order similarity by amplifying the influence of even-hop neighbors using an even-polynomial graph filter based on balance theory. This dual approach enables SOC-DGL to effectively capture similarity information across multiple interaction scales within affinity and association matrices. To address the issue of imbalance in DTI datasets, we propose an adjustable imbalance loss function that adjusts the weight of negative samples via a tunable parameter. Extensive experiments on four benchmark datasets demonstrate that SOC-DGL consistently outperforms existing state-of-the-art methods across both balanced and imbalanced scenarios. Moreover, SOC-DGL successfully predicts the top 9 drugs known to bind ABL1, and further analyzes the 10th drug, which has not been experimentally confirmed to interact with ABL1, providing supporting evidence for its potential binding.
中文: 该研究提出了SOC-DGL模型,通过亲和力与均衡驱动的双模块捕捉药物-靶点相互作用中的多尺度相似性,并采用可调节损失函数解决数据不平衡问题,在实验中展现出优越性能及精准预测能力。
English: The study introduces SOC-DGL, a novel model that captures multi-scale similarities in drug-target interactions using affinity and equilibrium-driven modules, and it addresses dataset imbalance with an adjustable loss function, demonstrating superior performance and accurate predictions in validation experiments.
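A minimal sketch of an even-polynomial graph filter in the spirit of the EDGL module (coefficients and normalization are illustrative assumptions): only even powers of the normalized adjacency enter the filter, so information aggregates over even-hop neighbors, consistent with balance theory:

```python
import numpy as np

def even_poly_filter(A, X, coeffs=(0.5, 0.3, 0.2)):
    """H = sum_k c_k * A_hat^(2k) @ X, with A_hat row-normalized."""
    A_hat = A / np.clip(A.sum(axis=1, keepdims=True), 1e-12, None)
    A2 = A_hat @ A_hat                      # one "even hop" = two steps
    H, P = np.zeros_like(X, dtype=float), np.eye(A.shape[0])
    for c in coeffs:
        H += c * (P @ X)
        P = P @ A2                          # advance by another even hop
    return H

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
X = np.eye(3)                                                  # one-hot features
print(even_poly_filter(A, X))
```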
Authors:Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, Maosong Sun
Abstract:
The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability. However, practical deployment of such agents remains constrained by several key challenges. Existing training data is often noisy and lacks semantic diversity, which hinders the learning of precise grounding and planning. Models trained purely by imitation tend to overfit to seen interface patterns and fail to generalize in unfamiliar scenarios. Moreover, most prior work focuses on English interfaces while overlooking the growing diversity of non-English applications such as those in the Chinese mobile ecosystem. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. We also introduce a compact action space that reduces output length and supports low-latency execution on mobile devices. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks and a new Chinese GUI benchmark called CAGUI, reaching $96.9\%$ Type-Match and $91.3\%$ Exact-Match. To facilitate reproducibility and further research, we publicly release all code, model checkpoints, and evaluation data.
中文: 大型语言模型代理在自动化图形界面任务方面展现出潜力,但面临训练数据噪声大、泛化能力差等挑战;AgentCPM-GUI通过增强训练流程和精简动作空间,在多个基准测试中取得了领先性能。
English: Large language model agents show promise for automating GUI tasks, yet face challenges like noisy training data and limited generalization, which AgentCPM-GUI addresses through a robust training pipeline and compact action space to achieve top performance on benchmarks.
Authors:Minghao Xu, Jiaze Song, Keming Wu, Xiangxin Zhou, Bin Cui, Wentao Zhang
Abstract:
Understanding the various properties of glycans with machine learning has shown some preliminary promise. However, previous methods mainly focused on modeling the backbone structure of glycans as graphs of monosaccharides (i.e., sugar units), while they neglected the atomic structures underlying each monosaccharide, which are actually important indicators of glycan properties. We fill this gap by introducing the GlycanAA model for All-Atom-wise Glycan modeling. GlycanAA models a glycan as a heterogeneous graph with monosaccharide nodes representing its global backbone structure and atom nodes representing its local atomic-level structures. Based on such a graph, GlycanAA performs hierarchical message passing to capture interactions ranging from local atomic-level to global monosaccharide-level. To further enhance model capability, we pre-train GlycanAA on a high-quality unlabeled glycan dataset, deriving the PreGlycanAA model. We design a multi-scale mask prediction algorithm to endow the model with knowledge of different levels of dependencies in a glycan. Extensive benchmark results show the superiority of GlycanAA over existing glycan encoders and verify the further improvements achieved by PreGlycanAA. We maintain all resources at https://github.com/kasawa1234/GlycanAA
中文: GlycanAA模型通过将糖链表示为包含单糖骨架结构和原子级细节的异质图,采用分层信息传递和预训练技术,显著提升了糖链性质预测的性能。
English: The GlycanAA model introduces an all-atom approach to glycan modeling by representing glycans as heterogeneous graphs that capture both monosaccharide-level backbone structures and atomic-level details, achieving superior performance through hierarchical message passing and pre-training enhancements.
Authors:Tomasz Stanczyk, Seongro Yoon, Francois Bremond
Abstract:
Multi-object tracking (MOT) is essential for sports analytics, enabling performance evaluation and tactical insights. However, tracking in sports is challenging due to fast movements, occlusions, and camera shifts. Traditional tracking-by-detection methods require extensive tuning, while segmentation-based approaches struggle with track processing. We propose McByte, a tracking-by-detection framework that integrates temporally propagated segmentation mask as an association cue to improve robustness without per-video tuning. Unlike many existing methods, McByte does not require training, relying solely on pre-trained models and object detectors commonly used in the community. Evaluated on SportsMOT, DanceTrack, SoccerNet-tracking 2022 and MOT17, McByte demonstrates strong performance across sports and general pedestrian tracking. Our results highlight the benefits of mask propagation for a more adaptable and generalizable MOT approach. Code will be made available at https://github.com/tstanczyk95/McByte.
中文:McByte是一种无需训练的多目标跟踪框架,通过整合传播的分割掩码作为关联线索来增强鲁棒性,在多种体育和行人跟踪数据集上表现出色,且无需针对每个视频进行调整。
English: McByte is a training-free multi-object tracking framework that enhances robustness by integrating propagated segmentation masks for association, demonstrating strong performance across various sports and pedestrian tracking datasets without requiring per-video tuning.
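One way to picture the mask-as-association-cue idea (the details below are our assumptions, not McByte's exact cost): score each detection box by how much of a track's temporally propagated mask it covers, and combine that with the usual IoU and appearance costs:

```python
import numpy as np

def mask_box_overlap(mask: np.ndarray, box) -> float:
    """Fraction of the propagated mask's pixels inside `box` = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = map(int, box)
    inside = mask[y1:y2, x1:x2].sum()
    return float(inside) / max(int(mask.sum()), 1)

mask = np.zeros((100, 100), dtype=np.uint8)
mask[20:40, 20:40] = 1                            # propagated mask of one track
print(mask_box_overlap(mask, (15, 15, 45, 45)))   # 1.0: detection covers the track
print(mask_box_overlap(mask, (60, 60, 90, 90)))   # 0.0: unrelated detection
```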
Authors:Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, Bo An
Abstract:
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance during inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating its capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling. Our code is available at https://github.com/mansicer/self-verification.
中文: 大语言模型因特定后训练生成器与通用奖励模型间的分布差异导致扩展收益有限,我们提出的自验证框架通过单一强化学习过程统一答案生成与验证,无需外部验证器即可实现有效的测试时性能扩展。
English: Large Language Models achieve limited gains from post-training scaling due to distribution mismatches with external reward models, but our proposed self-verification framework unifies generation and verification in a single RL process to enable effective test-time scaling without external verifiers.
Authors:Dongwon Choi, Sunwoo Kim, Juyeon Kim, Kyungho Kim, Geon Lee, Shinhwan Kang, Myunghwan Kim, Kijung Shin
Abstract:
Relational databases (RDBs) are composed of interconnected tables, where relationships between them are defined through foreign keys. Recent research on applying machine learning to RDBs has explored graph-based representations of RDBs, where rows of tables are modeled as nodes, and foreign key relationships are modeled as edges. RDB-to-graph modeling helps capture cross-table dependencies, ultimately leading to enhanced performance across diverse tasks. However, there are numerous ways to model RDBs as graphs, and performance varies significantly depending on the chosen graph model. In our analysis, applying a common heuristic rule for graph modeling leads to up to a 10% drop in performance compared to the best-performing graph model, which remains non-trivial to identify. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph-performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 9 automatic RDB-to-graph modeling methods on the 12 tasks over 600x faster than on-the-fly evaluation, which requires repeated model training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling.
中文: 近期研究将关系数据库建模为图以捕捉跨表依赖关系,但不同图模型的性能差异显著,为此我们推出了首个基准框架RDB2G-Bench,用于高效可复现地评估这些方法。
English: Recent research models relational databases as graphs to capture cross-table dependencies, but performance varies significantly with different graph models, prompting the introduction of RDB2G-Bench as the first benchmark framework for efficient and reproducible evaluation of these methods.
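The rows-as-nodes, foreign-keys-as-edges modeling that the benchmark evaluates reduces to a few lines; the table names and keys below are invented for illustration:

```python
import networkx as nx

users = [{"user_id": 1}, {"user_id": 2}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 2}]

G = nx.Graph()
for row in users:
    G.add_node(("users", row["user_id"]))
for row in orders:
    G.add_node(("orders", row["order_id"]))
    # foreign key orders.user_id -> users.user_id becomes an edge
    G.add_edge(("orders", row["order_id"]), ("users", row["user_id"]))

print(G.number_of_nodes(), G.number_of_edges())  # 4 2
```

Which tables become nodes, which become edges, and which get merged is exactly the design space the benchmark's ~50k graph-performance pairs measure.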
Authors:Zhiyang Qi, Takumasa Kaneko, Keiko Takamizo, Mariko Ukiyo, Michimasa Inaba
Abstract:
Generating psychological counseling responses with language models relies heavily on high-quality datasets. Crowdsourced data collection methods require strict worker training, and data from real-world counseling environments may raise privacy and ethical concerns. While recent studies have explored using large language models (LLMs) to augment psychological counseling dialogue datasets, the resulting data often suffers from limited diversity and authenticity. To address these limitations, this study adopts a role-playing approach where trained counselors simulate counselor-client interactions, ensuring high-quality dialogues while mitigating privacy risks. Using this method, we construct KokoroChat, a Japanese psychological counseling dialogue dataset comprising 6,589 long-form dialogues, each accompanied by comprehensive client feedback. Experimental results demonstrate that fine-tuning open-source LLMs with KokoroChat improves both the quality of generated counseling responses and the automatic evaluation of counseling dialogues. The KokoroChat dataset is available at https://github.com/UEC-InabaLab/KokoroChat.
中文: 本研究通过专业咨询师角色扮演构建了KokoroChat日语心理辅导数据集,解决了现有大模型生成数据在多样性和真实性上的不足,实验表明使用该数据集微调模型能显著提升辅导回复质量和对话评估效果。
English: This study introduces KokoroChat, a Japanese psychological counseling dataset created through role-playing by trained counselors to overcome limitations in diversity and authenticity of existing LLM-generated data, enhancing both response quality and dialogue evaluation when fine-tuning models.
Authors:Haoyu Li, Xiangru Zhong, Bin Hu, Huan Zhang
Abstract:
Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimations of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their framework. In this work, we propose a novel two-stage training framework to jointly synthesize the controller and Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training data sampling strategy and a domain updating mechanism that significantly reduces the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend the state-of-the-art neural network verifier $\alpha,\!\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield regions of attraction with volumes $5 - 1.5\cdot 10^{5}$ times larger compared to the baselines, and our verification on continuous systems can be up to $40 - 10000$ times faster compared to the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
中文: 本文提出了一种两阶段训练框架,通过联合合成控制器与李雅普诺夫函数,在连续时间系统中实现了比现有方法大得多的吸引域估计和快数千倍的验证速度。
English: This paper introduces a two-stage training framework that synthesizes neural network controllers and Lyapunov functions for continuous-time systems, achieving significantly larger regions of attraction and faster verification compared to existing methods.
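A sampled sanity check of the Lyapunov decrease condition, ∇V(x)·f(x) < 0, is a few lines of autograd on a toy system with a toy quadratic V; the paper instead learns V jointly with the controller and certifies the condition formally with the extended α,β-CROWN rather than by sampling:

```python
import torch

def f(x):                      # toy stable dynamics: x' = -x + 0.1 * x^3
    return -x + 0.1 * x**3

def V(x):                      # candidate Lyapunov function V(x) = ||x||^2
    return (x**2).sum(dim=-1)

x = torch.randn(1000, 2, requires_grad=True)          # sampled states
grad_V = torch.autograd.grad(V(x).sum(), x)[0]        # gradient of V at each sample
V_dot = (grad_V * f(x)).sum(dim=-1)                   # directional derivative along f
print((V_dot < 0).float().mean())  # fraction of samples satisfying decrease
```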
Authors:Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng
Abstract:
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B and Qwen3-4B on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples -- without reinforcing correct responses -- can be highly effective: it consistently improves performance over the base model across the entire Pass@$k$ spectrum ($k$ up to $256$), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@$1$ but degrades performance at higher $k$, due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines the model's existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@$k$ performance on MATH, AIME 2025, and AMC23. Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed.
中文: 带有可验证奖励的强化学习(RLVR)通过惩罚错误答案来有效训练语言模型在推理任务上的表现,这种方法能抑制错误生成并重新分配概率给其他合理选项,在多项评估指标上常达到或超越传统方法的效果。
English: Reinforcement learning with verifiable rewards (RLVR) effectively trains language models on reasoning tasks by penalizing incorrect responses, which suppresses wrong answers and redistributes probability to other plausible options, often matching or surpassing traditional methods while improving performance across various evaluation metrics.
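The PSR/NSR decomposition amounts to separate weights on correct and incorrect samples in a plain policy-gradient loss: setting the positive weight to zero gives the NSR-only variant, while the paper's proposed objective upweights NSR. A schematic sketch (shapes and weights are illustrative):

```python
import torch

def psr_nsr_loss(logp, correct, w_pos=1.0, w_neg=1.0):
    """logp: (B,) sequence log-probs; correct: (B,) verifiable outcomes."""
    pos = correct.float()
    # reinforce correct samples (raise logp), penalize incorrect ones (lower logp)
    return -(w_pos * pos * logp - w_neg * (1.0 - pos) * logp).mean()

logp = torch.tensor([-2.0, -5.0, -1.0])
correct = torch.tensor([True, False, True])
print(psr_nsr_loss(logp, correct, w_pos=0.0))  # NSR-only: only wrong answers move
```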
Authors:Zeming Li, Xiangyue Liu, Xiangyu Zhang, Ping Tan, Heung-Yeung Shum
Abstract:
Diffusion models have emerged as powerful generative frameworks, creating data samples by progressively denoising an initial random state. Traditionally, this initial state is sampled from a simple, fixed distribution like an isotropic Gaussian, inherently lacking structure and a direct mechanism for external control. While recent efforts have explored ways to introduce controllability into the diffusion process, particularly at the initialization stage, they often rely on deterministic or heuristic approaches. These methods can be suboptimal, lack expressiveness, and are difficult to scale or integrate into more sophisticated optimization frameworks. In this paper, we introduce NoiseAR, a novel method for AutoRegressive Initial Noise Prior for Diffusion Models. Instead of a static, unstructured source, NoiseAR learns to generate a dynamic and controllable prior distribution for the initial noise. We formulate the generation of the initial noise prior's parameters as an autoregressive probabilistic modeling task over spatial patches or tokens. This approach enables NoiseAR to capture complex spatial dependencies and introduce learned structure into the initial state. Crucially, NoiseAR is designed to be conditional, allowing text prompts to directly influence the learned prior, thereby achieving fine-grained control over the diffusion initialization. Our experiments demonstrate that NoiseAR can generate initial noise priors that lead to improved sample quality and enhanced consistency with conditional inputs, offering a powerful, learned alternative to traditional random initialization. A key advantage of NoiseAR is its probabilistic formulation, which naturally supports seamless integration into probabilistic frameworks like Markov Decision Processes and Reinforcement Learning. Our code will be available at https://github.com/HKUST-SAIL/NoiseAR/
English Summary: NoiseAR introduces an autoregressive method to learn dynamic and controllable initial noise priors for diffusion models, enabling text-conditioned control and improved sample quality through structured initialization.
Authors:Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang
Abstract:
Ultra-high-resolution image synthesis holds significant potential, yet remains an underexplored challenge due to the absence of standardized benchmarks and computational constraints. In this paper, we establish Aesthetic-4K, a meticulously curated dataset containing dedicated training and evaluation subsets specifically designed for comprehensive research on ultra-high-resolution image synthesis. This dataset consists of high-quality 4K images accompanied by descriptive captions generated by GPT-4o. Furthermore, we propose Diffusion-4K, an innovative framework for the direct generation of ultra-high-resolution images. Our approach incorporates the Scale Consistent Variational Auto-Encoder (SC-VAE) and Wavelet-based Latent Fine-tuning (WLF), which are designed for efficient visual token compression and the capture of intricate details in ultra-high-resolution images, thereby facilitating direct training with photorealistic 4K data. This method is applicable to various latent diffusion models and demonstrates its efficacy in synthesizing highly detailed 4K images. Additionally, we propose novel metrics, namely the GLCM Score and Compression Ratio, to assess the texture richness and fine details in local patches, in conjunction with holistic measures such as FID, Aesthetics, and CLIPScore, enabling a thorough and multifaceted evaluation of ultra-high-resolution image synthesis. Consequently, Diffusion-4K achieves impressive performance in ultra-high-resolution image synthesis, particularly when powered by state-of-the-art large-scale diffusion models (e.g., Flux-12B). The source code is publicly available at https://github.com/zhang0jhon/diffusion-4k.
中文: 本文提出了包含GPT-4o生成描述的Aesthetic-4K数据集和采用SC-VAE与WLF技术的Diffusion-4K框架,能够直接生成超高清图像,并通过新指标验证了其在大规模模型支持下取得的优异合成效果。
English: This paper introduces Aesthetic-4K, a curated dataset with GPT-4o captions, and Diffusion-4K, a framework using SC-VAE and WLF for direct ultra-high-resolution image synthesis, validated by novel metrics and achieving impressive results with large-scale models.
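For the texture side of the evaluation, a GLCM-based statistic of the kind the GLCM Score builds on can be computed per local patch with scikit-image; the paper's exact score definition may differ from this sketch:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

rng = np.random.default_rng(0)
patch = (rng.random((64, 64)) * 255).astype(np.uint8)   # stand-in local patch

glcm = graycomatrix(patch, distances=[1], angles=[0],
                    levels=256, symmetric=True, normed=True)
print(graycoprops(glcm, "contrast")[0, 0])  # higher on texture-rich patches
```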
Authors:Haiyang Mei, Pengyu Zhang, Mike Zheng Shou
Abstract:
Foundation models like the Segment Anything Model (SAM) have significantly advanced promptable image segmentation in computer vision. However, extending these capabilities to videos presents substantial challenges, particularly in ensuring precise and temporally consistent mask propagation in dynamic scenes. SAM 2 attempts to address this by training a model on massive image and video data from scratch to learn complex spatiotemporal associations, resulting in huge training costs that hinder research and practical deployment. In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. Our approach strategically upgrades the pre-trained SAM to support PVS, significantly reducing training complexity and resource requirements. To achieve this, we introduce three key innovations: (i) an image-to-video feature extraction upgrader built upon SAM's static image encoder to enable spatiotemporal video perception, (ii) a memory filtering strategy that selects the most relevant past frames for more effective utilization of historical information, and (iii) a memory-as-prompt mechanism leveraging object memory to ensure temporally consistent mask propagation in dynamic scenes. Comprehensive experiments demonstrate that our method achieves over 90% of SAM 2's performance while using only 0.2% of its training cost. Our work presents a resource-efficient pathway to PVS, lowering barriers for further research in PVS model design and enabling broader applications and advancements in the field. Code and model are available at: https://github.com/showlab/SAM-I2V.
中文: SAM-I2V 提出了一种资源高效的图像到视频升级方法,可将预训练的 SAM 模型转化为可提示视频分割模型,仅用 0.2% 的训练成本即可实现接近顶尖水平的性能。
English: SAM-I2V offers a resource-efficient method to upgrade the Segment Anything Model for promptable video segmentation, achieving near state-of-the-art performance with drastically reduced training costs.
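The memory-filtering strategy can be sketched as a top-k selection of past-frame features by similarity to the current frame (feature shapes and the cosine-similarity choice are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def filter_memory(curr_feat, past_feats, k=3):
    """curr_feat: (D,); past_feats: (T, D) -> the k most relevant past frames."""
    sims = F.cosine_similarity(past_feats, curr_feat.unsqueeze(0), dim=1)
    topk = sims.topk(min(k, past_feats.shape[0])).indices
    return past_feats[topk], topk

past = torch.randn(10, 256)     # features of 10 past frames
curr = torch.randn(256)         # current-frame feature
selected, idx = filter_memory(curr, past, k=3)
print(idx)                      # indices of the retained memory frames
```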
Authors:Chong Li, Xiangyang Xue, Jianfeng Feng, Taiping Zeng
Abstract:
Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. While large pretrained models have shown remarkable progress in modeling semantic memory, the mechanisms for forming associative structures that support episodic memory remain underexplored. Inspired by hippocampal CA3 dynamics and its role in associative memory, we propose the Latent Structured Hopfield Network (LSHN), a biologically inspired framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. LSHN mimics the cortical-hippocampal pathway: a semantic encoder extracts compact latent representations, a latent Hopfield network performs associative refinement through attractor convergence, and a decoder reconstructs perceptual input. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval. Experiments on MNIST, CIFAR-10, and a simulated episodic memory task demonstrate superior performance in recalling corrupted inputs under occlusion and noise, outperforming existing associative memory models. Our work provides a computational perspective on how semantic elements can be dynamically bound into episodic memory traces through biologically grounded attractor mechanisms. Code: https://github.com/fudan-birlab/LSHN.
中文:潜在结构Hopfield网络(LSHN)是一种受生物学启发的模型,将Hopfield吸引子动力学融入自编码器,通过动态绑定语义元素形成情景记忆,在多个数据集上展现出对受损输入的卓越回忆性能。
English: The Latent Structured Hopfield Network (LSHN) is a biologically inspired model that integrates Hopfield attractor dynamics into an autoencoder to dynamically bind semantic elements into episodic memories, demonstrating superior performance in recalling corrupted inputs across multiple datasets.
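The attractor retrieval at the core of such models can be sketched with the continuous (modern) Hopfield update z ← Mᵀ softmax(β M z); LSHN embeds dynamics of this kind inside an autoencoder and trains them end-to-end, which this standalone toy does not capture:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

M = np.array([[1.0, 0.0, 0.0, 0.0],      # three stored 4-d patterns
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
z = np.array([0.8, 0.3, 0.0, 0.0])       # corrupted query

for _ in range(5):                       # iterate toward an attractor
    z = M.T @ softmax(8.0 * (M @ z))     # inverse temperature beta = 8.0

print(np.round(z, 3))                    # ~[1, 0, 0, 0]: first pattern recalled
```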
Authors:Jinmei Liu, Fuhong Liu, Jianye Hao, Bo Wang, Huaxiong Li, Chunlin Chen, Zhi Wang
Abstract:
Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend the promise to decision domains. Due to involving more complex dynamics and temporal correlations, existing ICRL approaches may face challenges in learning from suboptimal trajectories and achieving precise in-context inference. In this paper, we propose Scalable In-Context Q-Learning (SICQL), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that facilitates fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper expectile of the Q-function, and distill the in-context value functions into policy extraction using advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at https://github.com/NJU-RL/SICQL
中文摘要:本文提出SICQL框架,通过结合动态规划和世界建模,有效提升了上下文强化学习的性能与泛化能力,尤其在从次优轨迹中学习时表现突出。
English Summary: The paper introduces SICQL, a scalable in-context Q-learning framework that combines dynamic programming and world modeling to enhance reinforcement learning performance and generalization, particularly when learning from suboptimal trajectories.
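Two ingredients named in the abstract have standard forms that a short sketch can make concrete: fitting a state value function to an upper expectile of the Q-function, and advantage-weighted regression for policy extraction. The snippet below shows these textbook losses; the hyperparameters tau and beta are illustrative, not the paper's settings.

```python
import torch

def expectile_loss(q, v, tau=0.7):
    # Fit V to an upper expectile of Q: an asymmetric squared error that
    # up-weights residuals where Q exceeds V (tau > 0.5 biases V upward).
    diff = q - v
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def awr_weight(q, v, beta=3.0, max_weight=100.0):
    # Advantage-weighted regression: exponentiated advantages reweight the
    # policy's log-likelihood on dataset actions (clipped for stability).
    return torch.clamp(torch.exp((q - v) / beta), max=max_weight)

q, v = torch.randn(128), torch.randn(128)
loss_v = expectile_loss(q, v)
w = awr_weight(q, v)   # multiplies -log pi(a|s, prompt) in the policy loss
```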
Authors:Ya Wen, Jixuan Cai, Qiyao Ma, Linyan Li, Xinhua Chen, Chris Webster, Yulun Zhou
Abstract:
Representation learning of geospatial locations remains a core challenge in achieving general geospatial intelligence. Current embedding methods often lack versatility, limiting their utility across diverse tasks in both human and natural domains. We present MobCLIP, the first nationwide general-purpose location encoder, integrating an unprecedented diversity of data modalities through effective and scalable multimodal fusion. Adopting a novel CLIP-based architecture, our framework aligns 100M+ POIs, nationwide remote sensing imagery, and structured demographic statistics with a billion-edge mobility graph. By tokenizing spatial locations into grid cells inspired by Vision Transformers, we establish a unified representation space bridging mobility patterns and multimodal features. To rigorously evaluate the general-purpose effectiveness of MobCLIP, we construct a benchmark dataset composed of 11 downstream prediction tasks across social, economic, and natural domains. Experiments show that MobCLIP, with four input modalities and a compact 128-dimensional representation space, achieves significantly superior general-purpose predictive performances than state-of-the-art models by an average of 35%. Thanks to the effective integration of human-centric modalities, the performance gain is particularly profound in human-centric tasks, such as energy consumption (+260%), offline retail consumption amount (+98%), and crime cases (+95%) predictions. Echoing LLM scaling laws, we further demonstrate the scaling behavior in geospatial representation learning. We open-source code and pretrained models at: https://github.com/ylzhouchris/MobCLIP.
中文:MobCLIP提出了一种全国通用的位置编码器,通过基于CLIP的架构融合多种数据模态,将移动图与兴趣点、遥感图像和人口统计数据对齐,在多样化任务中实现了预测性能平均提升35%的显著效果。
English: MobCLIP introduces a versatile nationwide location encoder that integrates multiple data modalities through a CLIP-based architecture, achieving a 35% average improvement in predictive performance across diverse tasks by aligning mobility graphs with POIs, imagery, and demographics.
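The CLIP-based alignment at the core of MobCLIP can be illustrated with a generic symmetric InfoNCE objective between a grid cell's mobility-graph embedding and its fused multimodal features. This is a standard sketch of the objective family, not MobCLIP's released code; shapes and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_loss(mob_emb, feat_emb, temperature=0.07):
    # Symmetric InfoNCE: cell i's mobility embedding should match row i of
    # the fused multimodal features, and vice versa.
    mob = F.normalize(mob_emb, dim=-1)
    feat = F.normalize(feat_emb, dim=-1)
    logits = mob @ feat.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(len(mob))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 128), torch.randn(8, 128))  # 8 grid cells
```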
Authors:Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Min Zhang, Wen Zhang, Huajun Chen
Abstract:
Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. To address this gap, we propose a novel evaluation paradigm and devise M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. This benchmark leverages multi-modal knowledge graphs to synthesize images encapsulating subgraph architectures enriched with multi-modal entities. M3STR necessitates that MLLMs not only recognize the multi-modal entities within the visual inputs but also decipher intricate relational topologies among them. We delineate the benchmark's statistical profiles and automated construction pipeline, accompanied by an extensive empirical analysis of 26 state-of-the-art MLLMs. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs' holistic reasoning capacities. Our code and data are released at https://github.com/zjukg/M3STR
中文: M3STR是一种新型基准测试,旨在评估多模态大语言模型理解视觉形式结构化知识的能力,尽管对26种先进模型进行了广泛测试,结果仍显示其推理能力存在显著不足。
English: M3STR is a novel benchmark designed to evaluate multi-modal large language models' ability to comprehend structured knowledge in visual form, revealing significant gaps in their reasoning capacities despite extensive testing of 26 advanced models.
Authors:Yudong Lu, Yazhe Niu, Shuai Hu, Haolin Wang
Abstract:
CleanS2S is a framework for human-like speech-to-speech interaction that advances conversational AI through single-file implementation and proactive dialogue capabilities. Our system integrates automatic speech recognition, large language models, and text-to-speech synthesis into a unified pipeline with real-time interruption handling, achieving low transition latency through full-duplex websocket connections and non-blocking I/O. Beyond conventional chatbot paradigms, we pioneer a proactive interaction mechanism, which combines memory systems with a Subjective Action Judgement module, enabling five human-like response strategies: interruption, refusal, deflection, silence, and standard response. The memory module dynamically aggregates historical and contextual data to inform interaction decisions. This approach breaks the rigid turn-based convention by allowing system-initiated dialog control and context-aware response selection. We also propose Action Judgement SFT, which assesses input streams to select response strategies. The framework's single-file implementation with atomic configurations offers researchers unprecedented transparency and extensibility for interaction agents. The code of CleanS2S is released at https://github.com/opendilab/CleanS2S.
Chinese: CleanS2S是一个类人语音交互框架,通过统一流程集成实时打断处理和主动对话机制,实现灵活响应策略,并为对话AI提供透明可扩展的单文件实现方案。
English: CleanS2S is a human-like speech-to-speech interaction framework that integrates real-time interruption handling and proactive dialogue mechanisms through a unified pipeline, enabling flexible response strategies and offering transparent, extensible implementation for conversational AI.
Authors:Jisoo Mok, Ik-hwan Kim, Sangkwon Park, Sungroh Yoon
Abstract:
Personalized AI assistants, a hallmark of the human-like capabilities of Large Language Models (LLMs), are a challenging application that intertwines multiple problems in LLM research. Despite the growing interest in the development of personalized assistants, the lack of an open-source conversational dataset tailored for personalization remains a significant obstacle for researchers in the field. To address this research gap, we introduce HiCUPID, a new benchmark to probe and unleash the potential of LLMs to deliver personalized responses. Alongside a conversational dataset, HiCUPID provides a Llama-3.2-based automated evaluation model whose assessment closely mirrors human preferences. We release our dataset, evaluation model, and code at https://github.com/12kimih/HiCUPID.
中文: 个性化AI助手是大型语言模型研究中的关键挑战,为解决缺乏开源对话数据的问题,HiCUPID基准被推出,它包含一个数据集和一个基于Llama-3.2的自动评估模型,其评估结果与人类偏好高度一致。
English: Personalized AI assistants represent a key challenge in LLM research, and to overcome the lack of open-source conversational data for personalization, the HiCUPID benchmark is introduced, which includes a dataset and an automated evaluation model based on Llama-3.2 that aligns with human preferences.
Authors:Yimin Du
Abstract:
FastText has established itself as a fundamental algorithm for learning word representations, demonstrating exceptional capability in handling out-of-vocabulary words through character-level n-gram embeddings. However, its hash-based bucketing mechanism introduces critical limitations for large-scale industrial deployment: hash collisions cause semantic drift, and memory requirements become prohibitively expensive when dealing with real-world vocabularies containing millions of terms. This paper presents a comprehensive memory optimization framework that fundamentally reimagines FastText's memory management through the integration of double-array trie (DA-trie) structures and mark-compact garbage collection principles. Our approach leverages the linguistic insight that n-grams sharing common prefixes or suffixes exhibit highly correlated embeddings due to co-occurrence patterns in natural language. By systematically identifying and merging semantically similar embeddings based on structural relationships, we achieve compression ratios of 4:1 to 10:1 while maintaining near-perfect embedding quality. The algorithm consists of four sophisticated phases: prefix trie construction with embedding mapping, prefix-based similarity compression, suffix-based similarity compression, and mark-compact memory reorganization. Comprehensive experiments on a 30-million Chinese vocabulary dataset demonstrate memory reduction from over 100GB to approximately 30GB with negligible performance degradation. Our industrial deployment results show significant cost reduction, faster loading times, and improved model reliability through the elimination of hash collision artifacts. Code and experimental implementations are available at: https://github.com/initial-d/me_fasttext
中文摘要:本文提出了一种针对FastText的内存优化框架,通过结合双数组字典树和标记-压缩原理来识别语义相似的n-gram进行嵌入压缩,在保持性能的同时实现了4:1至10:1的压缩比。
English Summary: This paper introduces a memory optimization framework for FastText that combines double-array tries and mark-compact principles to compress embeddings by identifying semantically similar n-grams, achieving 4:1 to 10:1 compression ratios while preserving performance.
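As a rough illustration of the prefix-based similarity compression phase, the sketch below buckets n-grams by a shared prefix and remaps any embedding whose cosine similarity to an earlier bucket member clears a threshold onto that member's row. The bucketing rule and threshold are assumptions for illustration; the paper's DA-trie construction and mark-compact reorganization are not reproduced here.

```python
import numpy as np
from collections import defaultdict

def compress_by_prefix(ngrams, emb, sim_thresh=0.95):
    """Merge near-duplicate embeddings among n-grams sharing a prefix,
    returning a remap table from original rows to shared rows."""
    groups = defaultdict(list)
    for i, g in enumerate(ngrams):
        groups[g[:2]].append(i)                  # bucket by a 2-char prefix
    remap = list(range(len(ngrams)))
    for idxs in groups.values():
        kept = []
        for i in idxs:
            v = emb[i] / np.linalg.norm(emb[i])
            for j in kept:
                w = emb[j] / np.linalg.norm(emb[j])
                if v @ w >= sim_thresh:
                    remap[i] = j                 # share row j, free row i
                    break
            else:
                kept.append(i)
    return remap

emb = np.random.randn(4, 8)
emb[1] = emb[0] + 1e-3                           # near-duplicate of row 0
remap = compress_by_prefix(["abcd", "abce", "abxy", "zzzz"], emb)
```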
Authors:Shufeng Kong, Xingru Yang, Yuanyuan Wei, Zijie Wang, Hao Tang, Jiuqi Qin, Shuting Lan, Yingheng Wang, Junwen Bai, Zhuangbin Chen, Zibin Zheng, Caihua Liu, Hao Liang
Abstract:
Traditional Chinese Medicine (TCM) is a holistic medical system with millennia of accumulated clinical experience, playing a vital role in global healthcare, particularly across East Asia. However, the implicit reasoning, diverse textual forms, and lack of standardization in TCM pose major challenges for computational modeling and evaluation. Large Language Models (LLMs) have demonstrated remarkable potential in processing natural language across diverse domains, including general medicine. Yet, their systematic evaluation in the TCM domain remains underdeveloped. Existing benchmarks either focus narrowly on factual question answering or lack domain-specific tasks and clinical realism. To fill this gap, we introduce MTCMB, a Multi-Task Benchmark for Evaluating LLMs on TCM Knowledge, Reasoning, and Safety. Developed in collaboration with certified TCM experts, MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, providing an authentic and comprehensive testbed for TCM-capable models. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance. These findings highlight the urgent need for domain-aligned benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. All datasets, code, and evaluation tools are publicly available at: https://github.com/Wayyuanyuan/MTCMB.
中文摘要:本文提出了MTCMB这一综合性多任务基准,通过与中医专家合作开发,系统评估大语言模型在中医知识、临床推理及安全合规方面的能力,发现现有模型虽掌握基础知识,但在实际临床应用方面存在不足。
English Summary: This paper introduces MTCMB, a comprehensive multi-task benchmark developed with TCM experts to systematically evaluate large language models' capabilities in Traditional Chinese Medicine knowledge, clinical reasoning, and safety compliance, revealing current models' strengths in basic knowledge but deficiencies in practical clinical applications.
Authors:SungHo Kim, Nayeon Kim, Taehee Jeon, SangKeun Lee
Abstract:
We introduce the $\underline{Ko}rean \underline{G}rammar \underline{E}valuation Bench\underline{M}ark (KoGEM)$, designed to assess the linguistic competence of LLMs and humans in Korean. KoGEM consists of 1.5k multiple-choice QA pairs covering five main categories and 16 subcategories. The zero-shot evaluation of 27 LLMs of various sizes and types reveals that while LLMs perform remarkably well on straightforward tasks requiring primarily definitional knowledge, they struggle with tasks that demand the integration of real-world experiential knowledge, such as phonological rules and pronunciation. Furthermore, our in-depth analysis suggests that incorporating such experiential knowledge could enhance the linguistic competence of LLMs. With KoGEM, we not only highlight the limitations of current LLMs in linguistic competence but also uncover hidden facets of LLMs in linguistic competence, paving the way for enhancing comprehensive language understanding. Our code and dataset are available at: https://github.com/SungHo3268/KoGEM.
中文: KoGEM韩语语法评估基准测试了27个大语言模型,发现其在定义性知识任务表现出色,但在需要结合现实经验知识时存在不足,表明融入经验知识可提升语言能力。
English: KoGEM is a Korean grammar benchmark evaluating 27 LLMs, revealing their proficiency in definitional tasks but struggles with real-world knowledge integration, suggesting experiential learning could enhance their linguistic competence.
Authors:Majdi Hassan, Cristian Gabellini, Hatem Helal, Dominique Beaini, Kirill Neklyudov
Abstract:
Density Functional Theory (DFT) allows for predicting all the chemical and physical properties of molecular systems from first principles by finding an approximate solution to the many-body Schrödinger equation. However, the cost of these predictions becomes prohibitive as the scale of the energy evaluations increases, e.g., when calculating the ground-state energy for simulating molecular dynamics. Recent works have demonstrated that, for substantially large datasets of molecular conformations, Deep Learning-based models can predict the outputs of the classical DFT solvers by amortizing the corresponding optimization problems. In this paper, we propose a novel method that reduces the dependency of amortized DFT solvers on large pre-collected datasets by introducing a self-refining training strategy. Namely, we propose an efficient method that simultaneously trains a deep-learning model to predict the DFT outputs and samples molecular conformations that are used as training data for the model. We derive our method as a minimization of the variational upper bound on the KL-divergence measuring the discrepancy between the generated samples and the target Boltzmann distribution defined by the ground state energy. To demonstrate the utility of the proposed scheme, we perform an extensive empirical study comparing it with the models trained on the pre-collected datasets. Finally, we open-source our implementation of the proposed algorithm, optimized with asynchronous training and sampling stages, which enables simultaneous sampling and training. Code is available at https://github.com/majhas/self-refining-dft.
中文: 本文提出了一种自优化训练方法,通过同时训练模型和采样分子构象,降低了基于深度学习的DFT求解器对大型预收集数据集的依赖,并通过广泛实证研究验证了其有效性。
English: This paper introduces a self-refining training method that reduces the reliance of deep learning-based DFT solvers on large pre-collected datasets by simultaneously training a model and sampling molecular conformations, validated through extensive empirical comparisons.
Authors:William B. James
Abstract:
This work is part of a larger music technology research project. The objective is to find a method for materially enhancing music using hardware and software. There is a strong likelihood that there exists a new medium for experiencing music via a wearable device that ordinary listeners prefer over the current state of the art. If such a medium is discovered, it is a step towards altruistic, prosocial reform in the music industry. A new playback system infrastructure has a chance to soothe some of the societal problems tied to the larger entertainment industry ecosystem. Iola Walker is a music playback system that allows musicians to compose music that changes in accordance with the listener's gait. Artifacts are available here: https://github.com/willbjames/iolawalker
中文: 本研究旨在开发名为Iola Walker的可穿戴音乐播放系统,它能根据用户的步态调整音乐,可能提供更优的聆听体验并推动音乐行业的积极变革。
English: This research aims to develop a wearable music playback system called Iola Walker that adapts music to the listener's gait, potentially offering a preferred listening experience and fostering positive changes in the music industry.
Authors:Antonia Karamolegkou, Oliver Eberle, Phillip Rust, Carina Kauf, Anders Søgaard
Abstract:
Detecting ambiguity is important for language understanding, including uncertainty estimation, humour detection, and processing garden path sentences. We assess language models' sensitivity to ambiguity by introducing an adversarial ambiguity dataset that includes syntactic, lexical, and phonological ambiguities along with adversarial variations (e.g., word-order changes, synonym replacements, and random-based alterations). Our findings show that direct prompting fails to robustly identify ambiguity, while linear probes trained on model representations can decode ambiguity with high accuracy, sometimes exceeding 90\%. Our results offer insights into the prompting paradigm and how language models encode ambiguity at different layers. We release both our code and data: https://github.com/coastalcph/lm_ambiguity.
Chinese: 本研究通过对抗性数据集评估语言模型检测歧义的能力,发现基于模型表征训练的线性探针显著优于直接提示法,在解码歧义时准确率超过90%。
English: This study evaluates language models' ability to detect ambiguity through an adversarial dataset, finding that linear probes trained on model representations significantly outperform direct prompting, achieving over 90% accuracy in decoding ambiguity.
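The linear-probe recipe reported to decode ambiguity with high accuracy is simple enough to sketch end-to-end. Here random features stand in for the language model's hidden states so the pipeline runs as-is; in practice each row would be a sentence representation from some intermediate layer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 768))     # stand-in for per-sentence hidden states
y = rng.integers(0, 2, size=200)    # 1 = ambiguous, 0 = unambiguous

probe = LogisticRegression(max_iter=1000).fit(H[:150], y[:150])
print("held-out accuracy:", probe.score(H[150:], y[150:]))  # ~0.5 on random data
```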
Authors:Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch
Abstract:
Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.
Chinese: 本文提出了一种改进的稀疏自编码器架构,能够学习概念间的语义层次结构,从而提升大型语言模型的重构能力、可解释性及计算效率。
English: This paper introduces a modified sparse autoencoder architecture that learns a semantic hierarchy of concepts, improving reconstruction, interpretability, and computational efficiency in large language models.
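The abstract does not spell out the architecture, but one plausible way for an SAE to "explicitly model a semantic hierarchy" is to let each parent latent gate a block of child latents, so a child concept only activates under its parent. The hypothetical sketch below illustrates that idea; it is not the paper's design.

```python
import torch
import torch.nn as nn

class HierarchicalSAE(nn.Module):
    """Toy SAE where each parent latent softly gates a block of child
    latents (one possible encoding of a concept hierarchy)."""
    def __init__(self, d_model=512, n_parents=32, children_per_parent=8):
        super().__init__()
        n_latents = n_parents * children_per_parent
        self.parent_enc = nn.Linear(d_model, n_parents)
        self.child_enc = nn.Linear(d_model, n_latents)
        self.dec = nn.Linear(n_latents, d_model)
        self.cpp = children_per_parent

    def forward(self, x):
        parents = torch.relu(self.parent_enc(x))            # (B, n_parents)
        children = torch.relu(self.child_enc(x))            # (B, n_latents)
        gate = parents.repeat_interleave(self.cpp, dim=-1)  # parent over its block
        z = children * gate                                 # hierarchical sparsity
        return self.dec(z), z

recon, z = HierarchicalSAE()(torch.randn(4, 512))
```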
Authors:Aleksandr Kutsakov, Alexandr Maximenko, Georgii Gospodinov, Pavel Bogomolov, Fyodor Minkin
Abstract:
Self-Supervised Learning (SSL) has demonstrated strong performance in speech processing, particularly in automatic speech recognition. In this paper, we explore an SSL pretraining framework that leverages masked language modeling with targets derived from a speech recognition model. We also present chunkwise attention with dynamic chunk size sampling during pretraining to enable both full-context and streaming fine-tuning. Our experiments examine scaling with respect to model size and the amount of data. Using our method, we train the GigaAM family of models, including a state-of-the-art model for Russian speech recognition that outperforms Whisper-large-v3 by 50%. We have released our foundation and ASR models, along with the inference code, under the MIT license as open-source resources to the research community. Available at https://github.com/salute-developers/gigaam.
中文: 本文提出了一种结合掩码语言建模和分块注意力的自监督学习框架,训练出的GigaAM模型在俄语语音识别上性能超越Whisper-large-v3达50%,并已将全部资源开源发布。
English: This paper introduces a self-supervised learning framework using masked language modeling and chunkwise attention to train the GigaAM models, achieving a 50% performance improvement over Whisper-large-v3 in Russian speech recognition and releasing all resources as open-source.
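Chunkwise attention with dynamic chunk size sampling can be sketched as a boolean attention mask in which each frame attends within its own chunk and to all previous chunks, with a fresh chunk size drawn per batch. The candidate sizes below are illustrative assumptions, not the paper's schedule.

```python
import random
import torch

def chunkwise_mask(seq_len, chunk_size):
    # True = attention allowed: each frame sees its whole chunk and all
    # previous chunks, which is compatible with streaming inference.
    chunk_id = torch.arange(seq_len) // chunk_size
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)   # (query, key)

# Dynamic chunk sampling: draw a new size per batch, occasionally using the
# full context.
chunk_size = random.choice([4, 8, 16, 10**9])               # 10**9 ~ full context
mask = chunkwise_mask(32, chunk_size)
```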
Authors:Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi
Abstract:
This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM
中文: 本文提出了一种双重稳健偏好优化算法,用于基于人类反馈的强化学习,该算法在偏好模型或参考策略任一正确设定时均能保持一致性,相比现有方法展现出更优越的稳健性和性能表现。
English: This paper introduces a doubly robust preference optimization algorithm for reinforcement learning from human feedback that maintains consistency when either the preference model or reference policy is correctly specified, showing superior robustness and performance compared to existing methods.
Authors:Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.
中文摘要:ZapFormat基于Earley算法提出动态剪枝策略,有效降低约束解码的计算负担,使Formatron引擎在保持高合规性的同时,将推理速度提升高达两倍,并适用于多种大语言模型架构。
English Summary: ZapFormat introduces a dynamic pruning strategy based on the Earley algorithm to reduce computational overhead in constrained decoding, enabling the Formatron engine to maintain high compliance while doubling inference speed across various LLM architectures.
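The logits-mask mechanics of grammar-constrained decoding are easy to sketch: at each step, every vocabulary entry the grammar cannot accept is driven to negative infinity before sampling. Computing the allowed set at each step is exactly the expensive part that ZapFormat's Earley-state pruning targets; the allowed ids below are placeholders.

```python
import torch

def apply_grammar_mask(logits, allowed_token_ids):
    # Keep only tokens the grammar can accept next; everything else -> -inf.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

logits = torch.randn(50_000)               # one decoding step over the vocab
allowed = torch.tensor([11, 42, 7])        # placeholder ids valid under the grammar
next_id = int(torch.argmax(apply_grammar_mask(logits, allowed)))
```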
Authors:Chenxiang Ma, Xinyi Chen, Kay Chen Tan, Jibin Wu
Abstract:
Spiking neural networks (SNNs) have gained significant attention for their potential to enable energy-efficient artificial intelligence. However, effective and efficient training of SNNs remains an unresolved challenge. While backpropagation through time (BPTT) achieves high accuracy, it incurs substantial memory overhead. In contrast, biologically plausible local learning methods are more memory-efficient but struggle to match the accuracy of BPTT. To bridge this gap, we propose spatio-temporal decouple learning (STDL), a novel training framework that decouples the spatial and temporal dependencies to achieve both high accuracy and training efficiency for SNNs. Specifically, to achieve spatial decoupling, STDL partitions the network into smaller subnetworks, each of which is trained independently using an auxiliary network. To address the decreased synergy among subnetworks resulting from spatial decoupling, STDL constructs each subnetwork's auxiliary network by selecting the largest subset of layers from its subsequent network layers under a memory constraint. Furthermore, STDL decouples dependencies across time steps to enable efficient online learning. Extensive evaluations on seven static and event-based vision datasets demonstrate that STDL consistently outperforms local learning methods and achieves comparable accuracy to the BPTT method with considerably reduced GPU memory cost. Notably, STDL achieves a 4x reduction in GPU memory compared to BPTT on the ImageNet dataset. Therefore, this work opens up a promising avenue for memory-efficient SNN training. Code is available at https://github.com/ChenxiangMA/STDL.
中文摘要:提出的时空解耦学习(STDL)框架通过解耦时空依赖关系,在脉冲神经网络训练中成功兼顾了高精度与低内存消耗,相比传统方法在保持精度的同时大幅降低了GPU内存需求。
English Summary: The proposed Spatio-Temporal Decouple Learning (STDL) framework effectively bridges the performance gap between memory-intensive backpropagation and less accurate local learning methods for spiking neural networks, achieving comparable accuracy with significantly reduced GPU memory usage.
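The spatial-decoupling idea can be shown in miniature: each subnetwork trains against its own auxiliary head, and gradients are blocked between subnetworks via a detach. This toy sketch omits STDL's memory-constrained construction of auxiliary networks from subsequent layers and its temporal decoupling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two subnetworks, each trained by its own (auxiliary) head; detach blocks
# gradients between them, in the style of local learning.
sub1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
aux1 = nn.Linear(64, 10)                       # auxiliary classifier supervising sub1
sub2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
h1 = sub1(x)
loss1 = F.cross_entropy(aux1(h1), y)           # local loss trains sub1 (+ aux1)
loss2 = F.cross_entropy(sub2(h1.detach()), y)  # no gradient flows back into sub1
(loss1 + loss2).backward()
```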
Authors:Yavuz Bakman, Duygu Nur Yaldiz, Sungmin Kang, Tuo Zhang, Baturalp Buyukates, Salman Avestimehr, Sai Praneeth Karimireddy
Abstract:
Large Language Model (LLM) Uncertainty Estimation (UE) methods have become a crucial tool for detecting hallucinations in recent years. While numerous UE methods have been proposed, most existing studies evaluate them in isolated short-form QA settings using threshold-independent metrics such as AUROC or PRR. However, real-world deployment of UE methods introduces several challenges. In this work, we systematically examine four key aspects of deploying UE methods in practical settings. Specifically, we assess (1) the sensitivity of UE methods to decision threshold selection, (2) their robustness to query transformations such as typos, adversarial prompts, and prior chat history, (3) their applicability to long-form generation, and (4) strategies for handling multiple UE scores for a single query. Our evaluations on 19 UE methods reveal that most of them are highly sensitive to threshold selection when there is a distribution shift in the calibration dataset. While these methods generally exhibit robustness against previous chat history and typos, they are significantly vulnerable to adversarial prompts. Additionally, while existing UE methods can be adapted for long-form generation through various strategies, there remains considerable room for improvement. Lastly, ensembling multiple UE scores at test time provides a notable performance boost, which highlights its potential as a practical improvement strategy. Code is available at: https://github.com/duygunuryldz/uncertainty_in_the_wild.
中文: 本研究系统评估了19种大语言模型不确定性估计方法在四个实际部署挑战中的表现,发现它们对阈值选择敏感、易受对抗性提示影响,同时通过集成策略展现出显著改进潜力。
English: This study systematically evaluates 19 LLM uncertainty estimation methods across four practical deployment challenges, revealing their sensitivity to threshold selection, vulnerability to adversarial prompts, and potential improvements through ensembling strategies.
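One simple instantiation of the test-time ensembling the paper finds helpful is to rank-normalize each method's scores, so that differently scaled UE methods become comparable, and then average. The uniform weighting below is an assumption, not the paper's exact strategy.

```python
import numpy as np

def ensemble_ue(scores, weights=None):
    """Combine per-query uncertainty scores from several UE methods by
    rank-normalizing each method and taking a weighted average."""
    scores = np.asarray(scores, dtype=float)          # (n_methods, n_queries)
    ranks = scores.argsort(axis=1).argsort(axis=1)    # per-method rank transform
    normed = ranks / (scores.shape[1] - 1)
    w = np.ones(len(scores)) / len(scores) if weights is None else np.asarray(weights)
    return w @ normed

ue = ensemble_ue([[0.1, 0.9, 0.4],   # method A, three queries
                  [3.0, 9.0, 1.0]])  # method B, different scale, same ranking
```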
Authors:Shunian Chen, Xinyuan Xie, Zheshu Chen, Liyan Zhao, Owen Lee, Zhan Su, Qilin Sun, Benyou Wang
Abstract:
High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data can be found in https://github.com/satsuki2486441738/FusionAudio.
中文: 本文提出了一种新颖的两阶段自动流程,利用专用预训练模型提取多样化多模态线索,并通过大型语言模型合成生成细致且上下文感知的音频描述,同时贡献了可扩展方法、FusionAudio数据集及增强音频模型,以提升复杂音频环境的自动化理解能力。
English: This paper introduces a novel two-stage automated pipeline that leverages specialized pretrained models to extract diverse multimodal cues and a large language model to synthesize them, generating detailed and context-aware audio captions, along with contributions including the scalable method, the FusionAudio dataset, and enhanced audio models for improved audio understanding.
Authors:Metehan Oguz, Yavuz Bakman, Duygu Nur Yaldiz
Abstract:
Large Language Models (LLMs) have demonstrated impressive performance in tasks related to coreference resolution. However, previous studies mostly assessed LLM performance on coreference resolution with nouns and third person pronouns. This study evaluates LLM performance on coreference resolution with indexicals like I, you, here, and tomorrow, which come with unique challenges due to their linguistic properties. We present the first study examining how LLMs interpret indexicals in English, releasing the English Indexical Dataset with 1600 multiple-choice questions. We evaluate pioneering LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and DeepSeek V3. Our results reveal that LLMs exhibit an impressive performance with some indexicals (I), while struggling with others (you, here, tomorrow), and that syntactic cues (e.g. quotation) contribute to LLM performance with some indexicals, while they reduce performance with others. Code and data are available at: https://github.com/metehanoguzz/LLMs-Indexicals-English.
中文: 本研究评估了大语言模型对“我”、“你”等指示词进行指代消解的能力,发现模型在不同指示词上表现不一,且句法线索对性能的影响存在差异。
English: This study evaluates large language models' ability to resolve coreference for indexicals like "I" and "you," revealing varied performance across different terms and the mixed impact of syntactic cues.
Authors:Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
Abstract:
Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60\%, with significant improvements in inference latency.
中文摘要:zip2zip框架让大型语言模型能够在推理时动态调整词汇表,通过实时将令牌压缩为可复用的超令牌,使序列长度减少20-60%,并显著提升推理速度。
English Summary: The zip2zip framework enables large language models to dynamically adjust their token vocabulary during inference, reducing sequence length by 20-60% and significantly improving latency through real-time token compression into reusable hypertokens.
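The LZW-style hypertoken construction can be illustrated with textbook LZW over token ids: recurring token sequences are incrementally assigned fresh ids above the base vocabulary. zip2zip additionally computes embeddings for each new hypertoken at runtime, which this sketch does not model.

```python
def lzw_hypertokens(token_ids, base_vocab_size):
    """Compress a token-id sequence by assigning new 'hypertoken' ids to
    recurring subsequences, classic LZW style."""
    table = {}                         # token tuple -> hypertoken id
    next_id = base_vocab_size
    out, cur = [], (token_ids[0],)
    for t in token_ids[1:]:
        if cur + (t,) in table:
            cur = cur + (t,)           # keep extending a known sequence
        else:
            out.append(table[cur] if len(cur) > 1 else cur[0])
            table[cur + (t,)] = next_id
            next_id += 1
            cur = (t,)
    out.append(table[cur] if len(cur) > 1 else cur[0])
    return out, table

ids, table = lzw_hypertokens([5, 6, 5, 6, 5, 6, 5, 6], base_vocab_size=50_000)
# ids == [5, 6, 50000, 50002, 6]: 8 base tokens compressed to 5
```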
Authors:Long Qian, Eric Wang, Bernardo Subercaseaux, Marijn J. H. Heule
Abstract:
We consider the problem of finding and enumerating polyominos that can be folded into multiple non-isomorphic boxes. While several computational approaches have been proposed, including SAT, randomized algorithms, and decision diagrams, none has been able to perform at scale. We argue that existing SAT encodings are hindered by the presence of global constraints (e.g., graph connectivity or acyclicity), which are generally hard to encode effectively and hard for solvers to reason about. In this work, we propose a new SAT-based approach that replaces these global constraints with simple local constraints that have substantially better propagation properties. Our approach dramatically improves the scalability of both computing and enumerating common box unfoldings: (i) while previous approaches could only find common unfoldings of two boxes up to area 88, ours easily scales beyond 150, and (ii) while previous approaches were only able to enumerate common unfoldings up to area 30, ours scales up to 60. This allows us to rule out 46, 54, and 58 as the smallest areas allowing a common unfolding of three boxes, thereby refuting a conjecture of Xu et al. (2017).
中文: 本研究提出了一种新的基于SAT的方法,用简单的局部约束替代复杂的全局约束,显著提升了多盒多联方块展开的计算与枚举效率,从而推翻了先前关于三盒展开最小面积的猜想。
English: This study introduces a novel SAT-based method that replaces complex global constraints with simpler local ones, significantly enhancing the scalability of finding and enumerating polyomino unfoldings for multiple boxes, thereby disproving a prior conjecture about minimal areas for three-box unfoldings.
Authors:Yufei Zhan, Ziheng Wu, Yousong Zhu, Rongkun Xue, Ruipu Luo, Zhenghao Chen, Can Zhang, Yifan Li, Zhentao He, Zheming Yang, Ming Tang, Minghui Qiu, Jinqiao Wang
Abstract:
Despite notable advancements in multimodal reasoning, leading Multimodal Large Language Models (MLLMs) still underperform on vision-centric multimodal reasoning tasks in general scenarios. This shortfall stems from their predominant reliance on logic- and knowledge-based slow-thinking strategies, which, while effective for domains like math and science, fail to integrate visual information effectively during reasoning. Consequently, these models often fail to adequately ground visual cues, resulting in suboptimal performance in tasks that require multiple plausible visual interpretations and inferences. To address this, we present GThinker (General Thinker), a novel reasoning MLLM excelling in multimodal reasoning across general scenarios, mathematics, and science. GThinker introduces Cue-Rethinking, a flexible reasoning pattern that grounds inferences in visual cues and iteratively reinterprets these cues to resolve inconsistencies. Building on this pattern, we further propose a two-stage training pipeline, including pattern-guided cold start and incentive reinforcement learning, designed to enable multimodal reasoning capabilities across domains. Furthermore, to support the training, we construct GThinker-11K, comprising 7K high-quality, iteratively-annotated reasoning paths and 4K curated reinforcement learning samples, filling the data gap toward general multimodal reasoning. Extensive experiments demonstrate that GThinker achieves 81.5% on the challenging comprehensive multimodal reasoning benchmark M$^3$CoT, surpassing the latest O4-mini model. It also shows an average improvement of 2.1% on general scenario multimodal reasoning benchmarks, while maintaining on-par performance in mathematical reasoning compared to counterpart advanced reasoning models. The code, model, and data will be released soon at https://github.com/jefferyZhan/GThinker.
中文: GThinker是一种新型多模态推理模型,通过引入线索重思机制有效整合视觉信息,在通用场景中表现卓越,同时在数学和科学领域保持强劲性能。
English: GThinker is a novel multimodal reasoning model that introduces Cue-Rethinking to effectively integrate visual cues, achieving superior performance in general scenarios and maintaining strong results in mathematics and science.
Authors:Yueqian Guo, Tianzhao Li, Xin Lyu, Jiehaolin Chen, Zhaohan Wang, Sirui Xiao, Yurun Chen, Yezi He, Helin Li, Fan Zhang
Abstract:
Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a per-sentence latency of 0.15 seconds on consumer-grade GPUs (GeForce RTX 3060). Extensive subjective and objective evaluations on the ZEGGS and BEAT datasets demonstrate that our model outperforms current state-of-the-art methods. TRiMM enhances the speed of co-speech gesture generation while ensuring gesture quality, enabling LLM-driven digital humans to respond to speech in real time and synthesize corresponding gestures. Our code is available at https://github.com/teroon/TRiMM-Transformer-Based-Rich-Motion-Matching
中文: 本文提出的TRiMM框架通过跨模态对齐、自回归建模和动作匹配技术,解决了实时合成与长文本理解的难题,在保持手势质量的同时实现了120帧/秒的实时生成性能。
English: This paper introduces TRiMM, a real-time 3D gesture generation framework that overcomes limitations in real-time synthesis and long-text comprehension through cross-modal alignment, autoregressive modeling, and motion matching, achieving 120 fps performance while maintaining gesture quality.
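The real-time retrieval step over a large gesture library can be sketched as cosine nearest-neighbor search over precomputed atomic-action embeddings. Dimensions and library size below are illustrative assumptions; TRiMM's actual matching system is more involved.

```python
import numpy as np

# Precomputed, L2-normalized embeddings for each atomic gesture.
library = np.random.randn(5000, 256)
library /= np.linalg.norm(library, axis=1, keepdims=True)

def match(speech_emb):
    """Return the library index of the gesture closest to the current
    speech-segment embedding (cosine similarity)."""
    q = speech_emb / np.linalg.norm(speech_emb)
    return int(np.argmax(library @ q))

gesture_id = match(np.random.randn(256))
```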
Authors:Amir Hossein Kargaran, Yihong Liu, François Yvon, Hinrich Schütze
Abstract:
Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model's concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model's concept space. Code is available at https://github.com/cisnlp/code-specific-neurons.
中文摘要:本研究探索大型语言模型中多种编程语言与英语在概念空间中的关系,发现模型的概念空间更接近英语,且不同编程语言在模型各层中的特定神经元分布存在差异,揭示了编程语言内部表征的结构模式。
English Summary: This study investigates how large language models represent multiple programming languages in relation to English, finding that their concept space aligns more closely with English and that language-specific neurons are distributed differently across model layers depending on programming language characteristics.
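Decoding the embeddings of intermediate layers, as done in the paper, follows the familiar logit-lens recipe: project each layer's hidden state through the model's output head and inspect the top token. The sketch below uses gpt2 as a stand-in for the Llama-based models studied.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("def add(a, b): return", return_tensors="pt").input_ids
with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states
for layer, h in enumerate(hidden):              # one hidden state per layer
    # Final layer norm + output embedding: what does this layer "predict"?
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, repr(tok.decode(logits.argmax().item())))
```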
Authors:Yudong Zhang, Ruobing Xie, Yiqing Huang, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Di Wang, Yu Wang
Abstract:
Recent advances in large vision-language models (LVLMs) have showcased their remarkable capabilities across a wide range of multimodal vision-language tasks. However, these models remain vulnerable to visual adversarial attacks, which can substantially compromise their performance. In this paper, we introduce F3, a novel adversarial purification framework that employs a counterintuitive ``fighting fire with fire'' strategy: intentionally introducing simple perturbations to adversarial examples to mitigate their harmful effects. Specifically, F3 leverages cross-modal attentions derived from randomly perturbed adversarial examples as reference targets. By injecting noise into these adversarial examples, F3 effectively refines their attention, resulting in cleaner and more reliable model outputs. Remarkably, this seemingly paradoxical approach of employing noise to counteract adversarial attacks yields impressive purification results. Furthermore, F3 offers several distinct advantages: it is training-free and straightforward to implement, and exhibits significant computational efficiency improvements compared to existing purification methods. These attributes render F3 particularly suitable for large-scale industrial applications where both robust performance and operational efficiency are critical priorities. The code is available at https://github.com/btzyd/F3.
中文: F3框架采用“以火攻火”的创新策略,通过有意引入噪声扰动来净化大型视觉语言模型中的对抗样本,提供了一种无需训练、高效且适用于工业应用的解决方案。
English: The F3 framework introduces a novel "fighting fire with fire" strategy that uses intentional noise perturbations to purify adversarial examples in large vision-language models, offering a training-free, efficient solution suitable for industrial applications.
Authors:Dahyeon Kye, Changhyun Roh, Sukhun Ko, Chanho Eom, Jihyong Oh
Abstract:
Video Frame Interpolation (VFI) is a fundamental Low-Level Vision (LLV) task that synthesizes intermediate frames between existing ones while maintaining spatial and temporal coherence. VFI techniques have evolved from classical motion-compensation-based approaches to deep learning-based approaches, including kernel-, flow-, hybrid-, phase-, GAN-, Transformer-, Mamba-, and more recently diffusion model-based approaches. We introduce AceVFI, the most comprehensive survey on VFI to date, covering 250+ papers across these approaches. We systematically organize and describe VFI methodologies, detailing the core principles, design assumptions, and technical characteristics of each approach. We categorize the learning paradigms of VFI methods, namely Center-Time Frame Interpolation (CTFI) and Arbitrary-Time Frame Interpolation (ATFI). We analyze key challenges of VFI such as large motion, occlusion, lighting variation, and non-linear motion. In addition, we review standard datasets, loss functions, and evaluation metrics. We examine applications of VFI including event-based, cartoon, medical image VFI and joint VFI with other LLV tasks. We conclude by outlining promising future research directions to support continued progress in the field. This survey aims to serve as a unified reference for both newcomers and experts seeking a deep understanding of modern VFI landscapes.
Chinese: AceVFI 是迄今为止最全面的视频帧插值综述,系统梳理了各类方法、关键挑战、数据集及应用,并展望了未来研究方向,旨在为该领域的新手和专家提供统一的参考指南。
English: AceVFI is the most comprehensive survey on Video Frame Interpolation to date, systematically organizing methodologies, challenges, datasets, and applications while outlining future research directions to serve as a unified reference for the field.
Authors:Attila Szász, Balázs Bánhelyi, Márk Jelasity
Abstract:
The ultimate goal of verification is to guarantee the safety of deployed neural networks. Here, we claim that all the state-of-the-art verifiers we are aware of fail to reach this goal. Our key insight is that theoretical soundness (bounding the full-precision output while computing with floating point) does not imply practical soundness (bounding the floating point output in a potentially stochastic environment). We prove this observation for the approaches that are currently used to achieve provable theoretical soundness, such as interval analysis and its variants. We also argue that achieving practical soundness is significantly harder computationally. We support our claims empirically as well by evaluating several well-known verification methods. To mislead the verifiers, we create adversarial networks that detect and exploit features of the deployment environment, such as the order and precision of floating point operations. We demonstrate that all the tested verifiers are vulnerable to our new deployment-specific attacks, which proves that they are not practically sound.
中文: 现有的神经网络验证器尽管具备理论上的严密性,但在实际应用中无法确保安全性,因为它们容易受到针对部署环境(如浮点运算特性)设计的对抗性攻击的破坏。
English: Current neural network verifiers fail to ensure practical safety despite theoretical soundness, as demonstrated by their vulnerability to deployment-specific adversarial attacks that exploit environmental features like floating-point operations.
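The gap between theoretical and practical soundness rests on a basic fact the adversarial networks exploit: floating-point arithmetic is not associative, so the same computation can yield different outputs depending on the order of operations in the deployment environment. A two-line demonstration:

```python
import numpy as np

# Floating-point summation is order-dependent: a verifier that bounds one
# evaluation order says nothing about another order the runtime may use.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(0.1)
print((a + b) + c)   # ~0.1: a and b cancel first, then c is added
print(a + (b + c))   # 0.0: c is absorbed into b before cancellation
```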
Authors:Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
Abstract:
Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that SGG integrates seamlessly with existing optimizers, and offers consistent gains and faster convergence over baselines, with various model sizes. Its stability across varying batch sizes and learning rates establishes SGG as a robust choice for LLM optimization.
中文: 本文提出SGG优化器包装器,通过动态梯度分组和簇特定缩放改进自适应学习率估计,在不同模型规模下均能提升训练稳定性、加速收敛并增强与参数高效微调技术的兼容性。
English: This paper introduces SGG, an optimizer wrapper that enhances adaptive learning rate estimation through dynamic gradient grouping and cluster-specific scaling, leading to improved training stability, faster convergence, and better compatibility with parameter-efficient fine-tuning across various model sizes.
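A rough sketch of the grouping-and-scaling idea: cluster a layer's per-parameter gradient magnitudes (here with a crude 1-D k-means) and scale each cluster's learning rate relative to the layer mean. The statistics, clustering method, and scaling rule below are all assumptions for illustration; SGG's actual design is specified in the paper.

```python
import torch

def group_scaled_lrs(grads, base_lr, n_clusters=3, iters=10):
    """Assign each parameter a learning rate calibrated by its gradient-
    magnitude cluster (damping groups with larger gradients)."""
    mag = grads.abs().flatten()
    centers = torch.quantile(mag, torch.linspace(0.1, 0.9, n_clusters))
    for _ in range(iters):                   # 1-D k-means on magnitudes
        assign = (mag.unsqueeze(1) - centers.unsqueeze(0)).abs().argmin(dim=1)
        for k in range(n_clusters):
            if (assign == k).any():
                centers[k] = mag[assign == k].mean()
    scale = centers.mean() / (centers[assign] + 1e-12)
    return (base_lr * scale).reshape(grads.shape)

lrs = group_scaled_lrs(torch.randn(64, 64), base_lr=1e-3)
```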
Authors:Wei Song, Zhenya Huang, Cheng Cheng, Weibo Gao, Bihan Xu, GuanHao Zhao, Fei Wang, Runze Wu
Abstract:
Large language models (LLMs) have demonstrated exceptional performance across a wide range of natural language tasks. However, selecting the optimal LLM to respond to a user query often necessitates a delicate balance between performance and cost. While powerful models deliver better results, they come at a high cost, whereas smaller models are more cost-effective but less capable. To address this trade-off, we propose IRT-Router, a multi-LLM routing framework that efficiently routes user queries to the most suitable LLM. Inspired by Item Response Theory (IRT), a psychological measurement methodology, IRT-Router explicitly models the relationship between LLM capabilities and user query attributes. This not only enables accurate prediction of response performance but also provides interpretable insights, such as LLM abilities and query difficulty. Additionally, we design an online query warm-up technique based on semantic similarity, further enhancing the online generalization capability of IRT-Router. Extensive experiments on 20 LLMs and 12 datasets demonstrate that IRT-Router outperforms most baseline methods in terms of effectiveness and interpretability. Its superior performance in cold-start scenarios further confirms the reliability and practicality of IRT-Router in real-world applications. Code is available at https://github.com/Mercidaiha/IRT-Router.
Chinese: IRT-Router是一种基于项目反应理论的多LLM路由框架,通过将用户查询智能分配给最合适的大语言模型,在性能与成本间实现优化,实验证明其具有卓越的有效性和可解释性。
English: IRT-Router is a novel multi-LLM routing framework that leverages Item Response Theory to optimize the balance between performance and cost by directing user queries to the most suitable large language model, demonstrating superior effectiveness and interpretability in experiments.
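The IRT machinery underneath the router is compact: in the two-parameter logistic model, a responder with ability theta answers an item with difficulty b and discrimination a correctly with probability sigmoid(a(theta - b)). A router can then send a query to the cheapest LLM whose predicted success clears a threshold; the fitted values below are illustrative, not from the paper.

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic IRT model: P(success) for a model with
    ability theta on a query with difficulty b and discrimination a."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

abilities = {"small-llm": 0.2, "large-llm": 1.5}   # hypothetical fitted abilities
difficulty, discrimination = 0.8, 1.2              # hypothetical query parameters
for name, theta in abilities.items():
    print(name, round(p_correct(theta, discrimination, difficulty), 3))
```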
Authors:Phan Anh Duong, Cat Luong, Divyesh Bommana, Tianyu Jiang
Abstract:
Emotions manifest through physical experiences and bodily reactions, yet identifying such embodied emotions in text remains understudied. We present an embodied emotion classification dataset, CHEER-Ekman, extending the existing binary embodied emotion dataset with Ekman's six basic emotion categories. Using automatic best-worst scaling with large language models, we achieve performance superior to supervised approaches on our new dataset. Our investigation reveals that simplified prompting instructions and chain-of-thought reasoning significantly improve emotion recognition accuracy, enabling smaller models to achieve competitive performance with larger ones. Our dataset is publicly available at: https://github.com/menamerai/cheer-ekman.
中文摘要:研究人员开发了CHEER-Ekman数据集,基于埃克曼六种基本情绪扩展了具身情绪分类,通过优化提示指令使大型语言模型超越传统方法,并让较小模型实现了与之媲美的性能。
English Summary: Researchers developed CHEER-Ekman, an enhanced dataset for classifying embodied emotions using Ekman's six basic categories, where large language models with optimized prompts outperformed traditional methods and enabled smaller models to achieve competitive accuracy.
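Best-worst scaling reduces to simple counting once the judgments are collected: an item's score is the number of times it was picked as best minus the number of times as worst, normalized by its appearances. In the paper's setting an LLM supplies the best/worst picks per tuple; the tuples below are toy data.

```python
from collections import Counter

def bws_scores(judgments):
    # judgments: (tuple_of_items, best_item, worst_item) per annotation.
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        best[b] += 1
        worst[w] += 1
        seen.update(items)
    return {i: (best[i] - worst[i]) / seen[i] for i in seen}

scores = bws_scores([
    (("s1", "s2", "s3", "s4"), "s1", "s4"),   # toy picks; in the paper an
    (("s1", "s2", "s3", "s4"), "s2", "s4"),   # LLM chooses best/worst per tuple
])
```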
Authors:Yu Zheng, Yuan Yuan, Yong Li, Paolo Santi
Abstract:
Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural representations to interpretable semantics. However, how neurons functionally co-activate with each other to give rise to emergent capabilities remains largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons and relating it to language generation performance. By analyzing internal neural graphs across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps, highlighting the sparsity and early emergence of topological patterns. Further graph matching analysis suggests that, despite significant distinctions in architectures, parameters, and training data, different LLMs develop intricate and consistent neural topological structures that may form the foundation for their language generation abilities. Codes and data for the graph probing toolbox are released at https://github.com/DavyMorgan/llm-graph-probing.
中文摘要:本研究提出图探针方法,揭示大语言模型中神经拓扑结构比神经激活更能有效预测语言生成性能,为提升模型效率与安全性提供了新途径。
English Summary: This study introduces graph probing to reveal how neural topology in large language models predicts language generation performance more effectively than neural activations, offering potential applications in enhancing model efficiency and safety.
Authors:Yu Zheng, Yuan Yuan, Yue Zhuo, Yong Li, Paolo Santi
Abstract:
Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural activations to interpretable semantics. However, the complex mechanisms that link neurons' functional co-activation to emergent model capabilities remain largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity of LLM neurons and relating it to language generation performance. By probing models across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology, which persists even when retaining just 1% of neuron connections. Strikingly, probing on topology outperforms probing on activation by up to 130.4%, suggesting that neural topology contains orders-of-magnitude richer information about LLM performance than neural activation, which can be easily extracted with simple linear or MLP probes. To explain the dependence between neural topology and language performance, we identify default networks and hub neurons in LLMs and provide causal evidence by interventional experiments on multiple benchmarks, showing that LLMs actually exploit these topological information. Further analyses suggest that neural topology can be effectively leveraged to improve the efficiency, reliability, and safety of LLMs through proof-of-concept applications in model pruning, hallucination detection, and LLM fingerprinting. Codes and data for the graph probing toolbox are available at https://github.com/DavyMorgan/llm-graph-probing.
中文摘要:本研究提出图探针方法,揭示大语言模型中神经拓扑结构比神经激活更能有效预测语言生成性能,为提升模型效率与安全性提供了新途径。
English Summary: This study introduces graph probing to reveal how neural topology in large language models predicts language generation performance more effectively than neural activations, offering potential applications in enhancing model efficiency and safety.
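The graph-probing pipeline can be sketched in a few lines: build a neuron co-activation graph per model, sparsify to the strongest 1% of edges (the regime both abstracts highlight), and fit a linear probe from edge weights to a performance score. Random arrays stand in for real activations so the recipe runs as-is.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(50, 100, 64))      # (models, tokens, neurons) stand-in
perf = rng.normal(size=50)                 # per-model next-token performance

feats = []
for A in acts:
    C = np.corrcoef(A.T)                   # neuron-by-neuron co-activation graph
    w = np.abs(C[np.triu_indices_from(C, k=1)])
    w[w < np.quantile(w, 0.99)] = 0.0      # keep only the strongest 1% of edges
    feats.append(w)
feats = np.array(feats)

probe = LinearRegression().fit(feats[:40], perf[:40])
r2 = probe.score(feats[40:], perf[40:])    # meaningless on random stand-in data;
                                           # computed only to complete the recipe
```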
Authors:Zuzheng Kuang, Haixia Bi, Chen Xu, Jian Sun
Abstract:
Recently, polarimetric synthetic aperture radar (PolSAR) image classification has been greatly promoted by deep neural networks. However, current deep learning-based PolSAR classification methods encounter difficulties due to their dependence on extensive labeled data and the computational inefficiency of architectures like Transformers. This paper presents ECP-Mamba, an efficient framework integrating multi-scale self-supervised contrastive learning with a state space model (SSM) backbone. Specifically, ECP-Mamba addresses annotation scarcity through a multi-scale predictive pretext task based on local-to-global feature correspondences, which uses a simplified self-distillation paradigm without negative sample pairs. To enhance computational efficiency, the Mamba architecture (a selective SSM) is first tailored to the pixel-wise PolSAR classification task by designing a spiral scan strategy. This strategy prioritizes causally relevant features near the central pixel, leveraging the localized nature of pixel-wise classification tasks. Additionally, the lightweight Cross Mamba module is proposed to facilitate complementary multi-scale feature interaction with minimal overhead. Extensive experiments across four benchmark datasets demonstrate ECP-Mamba's effectiveness in balancing high accuracy with resource efficiency. On the Flevoland 1989 dataset, ECP-Mamba achieves state-of-the-art performance with an overall accuracy of 99.70%, an average accuracy of 99.64%, and a Kappa coefficient of 0.9962. Our code will be available at https://github.com/HaixiaBi1982/ECP_Mamba.
Chinese: 本文提出ECP-Mamba框架,通过多尺度自监督对比学习和状态空间模型,有效解决了极化SAR图像分类中标注数据稀缺和计算效率低的问题,实现了最先进的分类精度。
English: This paper introduces ECP-Mamba, an efficient framework for PolSAR image classification that combines multi-scale self-supervised contrastive learning with a state space model backbone to overcome data annotation scarcity and computational inefficiency, achieving state-of-the-art accuracy.
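The spiral scan strategy is easy to picture as an ordering of pixel offsets. The sketch below is our own re-derivation of a center-out spiral for a (2r+1)x(2r+1) patch, not the ECP-Mamba implementation: the SSM would consume pixels in this order so that neighbors causally relevant to the central pixel come first.

```python
# Center-out spiral ordering of offsets around a central pixel.
def spiral_order(radius):
    x = y = 0
    dx, dy = 0, -1
    coords = []
    side = 2 * radius + 1
    for _ in range(side * side):
        if -radius <= x <= radius and -radius <= y <= radius:
            coords.append((x, y))
        # turn right at the spiral's corner points
        if x == y or (x < 0 and x == -y) or (x > 0 and x == 1 - y):
            dx, dy = -dy, dx
        x, y = x + dx, y + dy
    return coords

print(spiral_order(1))  # central pixel first, then its ring of 8 neighbors
```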
Authors:Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, Gustavo Carneiro
Abstract:
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches mainly follow two directions: (1) injecting adapters into the image encoder to receive audio signals, which incurs efficiency costs during prompt engineering, and (2) leveraging additional foundation models to generate visual prompts for the sounding objects, which are often imprecisely localised, leading to misguidance in SAM2. Moreover, these methods overlook the rich semantic interplay between hierarchical visual features and other modalities, resulting in suboptimal cross-modal fusion. In this work, we propose AuralSAM2, comprising the novel AuralFuser module, which externally attaches to SAM2 to integrate features from different modalities and generate feature-level prompts, guiding SAM2's decoder in segmenting sounding targets. Such integration is facilitated by a feature pyramid, further refining semantic understanding and enhancing object awareness in multimodal scenarios. Additionally, the audio-guided contrastive learning is introduced to explicitly align audio and visual representations and to also mitigate biases caused by dominant visual patterns. Results on public benchmarks show that our approach achieves remarkable improvements over the previous methods in the field. Code is available at https://github.com/yyliu01/AuralSAM2.
Chinese: AuralSAM2 提出外部连接的 AuralFuser 模块,通过特征金字塔和对比学习融合多模态特征,有效提升 SAM2 对发声物体的分割能力,在公开基准测试中显著优于现有方法。
English: AuralSAM2 introduces an external AuralFuser module that integrates multimodal features through a feature pyramid and contrastive learning to enhance SAM2's segmentation of sounding objects, achieving significant performance gains over existing methods.
Authors:Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
Abstract:
Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. The existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our decoupled visual prompts (DVP) are optimized using descriptions grouped by explicit causes (DVP-cse) or unsupervised clusters (DVP-cls). Then, we integrate the outputs of these visual prompts with a probabilistic reweighting matrix (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming. Our code is available at https://github.com/tmlr-group/DecoupledVP.
Chinese: 本文提出了一种解耦与重加权框架,通过分组描述并使用概率重加权矩阵整合解耦视觉提示(DVP)的输出,解决了现有视觉重编程方法的局限性,在11个数据集上提升了性能并增强了可解释性。
English: This paper introduces a decoupling-and-reweighting framework with decoupled visual prompts (DVP) that address limitations in existing visual reprogramming methods by grouping descriptions and integrating outputs via a probabilistic reweighting matrix, improving performance and interpretability across 11 datasets.
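A toy sketch of the decouple-and-reweight computation as we read it from the abstract (the shapes and the softmax parameterization of the PRM are our assumptions, not the released code): each of K decoupled prompts produces class logits, and the probabilistic reweighting matrix mixes them per class.

```python
# Toy decoupled-prompt integration with a probabilistic reweighting matrix.
import numpy as np

K, C = 3, 10                      # hypothetical: 3 prompt groups, 10 classes
logits = np.random.randn(K, C)    # stand-in per-prompt CLIP logits

# PRM: rows sum to 1 over the K prompts for each class.
prm_raw = np.random.randn(C, K)
prm = np.exp(prm_raw) / np.exp(prm_raw).sum(axis=1, keepdims=True)

# Integrated score of class c = sum_k PRM[c, k] * logits[k, c].
scores = (prm * logits.T).sum(axis=1)
print("predicted class:", scores.argmax())
```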
Authors:Xiaorong Zhu, Ziheng Jia, Jiarui Wang, Xiangyu Zhao, Haodong Duan, Xiongkuo Min, Jia Wang, Zicheng Zhang, Guangtao Zhai
Abstract:
The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities, concerning the fine-grained physical principles especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs' ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curates high-quality prompts of geometric optical scenarios and use MLLMs to construct GOBench-Gen-1k dataset.We then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs' generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM model, Gemini-2.5Pro, attains a mere 37.35\% accuracy in optical understanding. Database and codes are publicly available at https://github.com/aiben-ch/GOBench.
中文: 该研究提出了GOBench基准测试,发现当前多模态大语言模型在生成光学真实图像和理解几何光学方面存在明显不足,顶尖模型无法完美完成所有生成任务,且光学理解准确率仅为37.35%。
English: The study introduces GOBench, a benchmark revealing that current Multi-modality Large Language Models struggle with generating optically accurate images and understanding geometric optics, with top models achieving imperfect generation and only 37.35% accuracy in comprehension.
Authors:Alexander Sergeev, Valeriya Goloviznina, Mikhail Melnichenko, Evgeny Kotelnikov
Abstract:
Access to humanities research databases is often hindered by the limitations of traditional interaction formats, particularly in the methods of searching and response generation. This study introduces an LLM-based smart assistant designed to facilitate natural language communication with digital humanities data. The assistant, developed in a chatbot format, leverages the RAG approach and integrates state-of-the-art technologies such as hybrid search, automatic query generation, text-to-SQL filtering, semantic database search, and hyperlink insertion. To evaluate the effectiveness of the system, experiments were conducted to assess the response quality of various language models. The testing was based on the Prozhito digital archive, which contains diary entries from predominantly Russian-speaking individuals who lived in the 20th century. The chatbot is tailored to support anthropology and history researchers, as well as non-specialist users with an interest in the field, without requiring prior technical training. By enabling researchers to query complex databases with natural language, this tool aims to enhance accessibility and efficiency in humanities research. The study highlights the potential of Large Language Models to transform the way researchers and the public interact with digital archives, making them more intuitive and inclusive. Additional materials are presented in GitHub repository: https://github.com/alekosus/talking-to-data-intersys2025.
中文摘要:本研究开发了一种基于大语言模型的智能助手,采用RAG技术实现数字人文档案的自然语言查询,使研究者和公众无需专业技术背景即可便捷访问。
English Summary: This study develops an LLM-based chatbot using RAG technology to enable natural language queries for digital humanities archives, enhancing accessibility for researchers and the public without technical expertise.
Authors:Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
Abstract:
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it is less clear to what extent pre-training on specific languages improves the encoding of language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
中文摘要:与英语或多语言预训练相比,专门针对荷兰语进行预训练的Wav2Vec2自监督模型能更好地编码荷兰语音位和词汇特征,这种语言特异性优势可通过探测方法检测,并与自动语音识别性能提升相关。
English Summary: Pre-training self-supervised Wav2Vec2 models specifically on Dutch enhances the encoding of Dutch phonetic and lexical features compared to English or multilingual pre-training, with this language-specific advantage detectable through probing methods and correlating with improved automatic speech recognition performance.
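As a sketch of the layer-wise probing protocol (synthetic stand-ins for the extracted Wav2Vec2 hidden states and phone labels; not the authors' pipeline), one can fit a classification probe per layer and compare accuracy curves across differently pre-trained checkpoints.

```python
# Layer-wise phone-classification probing sketch on assumed activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

n_layers, n_frames, dim, n_phones = 12, 500, 768, 40
hidden = np.random.randn(n_layers, n_frames, dim)   # stand-in hidden states
phones = np.random.randint(0, n_phones, size=n_frames)

for layer in range(n_layers):
    Xtr, Xte, ytr, yte = train_test_split(
        hidden[layer], phones, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=200).fit(Xtr, ytr)
    print(f"layer {layer}: phone probe accuracy = {probe.score(Xte, yte):.3f}")
```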
Authors:Zhan Li, Mingyu Zhao, Xin Dong, Haibin Ling, Bingyao Huang
Abstract:
Projector-based adversarial attack aims to project carefully designed light patterns (i.e., adversarial projections) onto scenes to deceive deep image classifiers. It has potential applications in privacy protection and the development of more robust classifiers. However, existing approaches primarily focus on individual classifiers and fixed camera poses, often neglecting the complexities of multi-classifier systems and scenarios with varying camera poses. This limitation reduces their effectiveness when introducing new classifiers or camera poses. In this paper, we introduce Classifier-Agnostic Projector-Based Adversarial Attack (CAPAA) to address these issues. First, we develop a novel classifier-agnostic adversarial loss and optimization framework that aggregates adversarial and stealthiness loss gradients from multiple classifiers. Then, we propose an attention-based gradient weighting mechanism that concentrates perturbations on regions of high classification activation, thereby improving the robustness of adversarial projections when applied to scenes with varying camera poses. Our extensive experimental evaluations demonstrate that CAPAA achieves both a higher attack success rate and greater stealthiness compared to existing baselines. Codes are available at: https://github.com/ZhanLiQxQ/CAPAA.
中文:本文提出CAPAA,一种与分类器无关的投影式对抗攻击方法,通过新颖的优化框架和基于注意力的梯度加权机制,提升了对多分类器和不同相机位姿的鲁棒性,实现了更高的攻击成功率与隐蔽性。
English: This paper introduces CAPAA, a classifier-agnostic projector-based adversarial attack that enhances robustness across multiple classifiers and varying camera poses through a novel optimization framework and attention-based gradient weighting, achieving superior attack success and stealthiness.
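A hedged sketch of one classifier-agnostic attack step as we abstract it from the description (the projector-camera model `scene_fn`, the attention map, and the loss weighting are stand-ins, not CAPAA's released code): adversarial losses from several classifiers are aggregated, a stealthiness penalty keeps the projection subtle, and an attention map concentrates the update on high-activation regions.

```python
# Hypothetical classifier-agnostic projector-attack step (untargeted).
import torch
import torch.nn.functional as F

def attack_step(proj_pattern, scene_fn, classifiers, attn_map, target, lr=1e-2):
    # proj_pattern: (1, 3, H, W) adversarial projection with requires_grad=True
    captured = scene_fn(proj_pattern)        # assumed differentiable capture
    adv_loss = sum(F.cross_entropy(clf(captured), target) for clf in classifiers)
    stealth_loss = proj_pattern.abs().mean()  # keep the projection subtle
    loss = -adv_loss + 0.1 * stealth_loss     # minimize => raise classifiers' CE
    loss.backward()
    with torch.no_grad():
        # attention-based gradient weighting: focus on salient regions
        proj_pattern -= lr * attn_map * proj_pattern.grad
        proj_pattern.grad = None
    return proj_pattern

clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
pattern = torch.zeros(1, 3, 8, 8, requires_grad=True)
attack_step(pattern, lambda p: p, [clf], torch.ones(1, 3, 8, 8),
            torch.tensor([3]))
```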
Authors:Youngmin Kim, Jiwan Chung, Jisoo Kim, Sunghyun Lee, Sangkyu Lee, Junhyeok Kim, Cheoljong Yang, Youngjae Yu
Abstract:
Nonverbal communication is integral to human interaction, with gestures, facial expressions, and body language conveying critical aspects of intent and emotion. However, existing large language models (LLMs) fail to effectively incorporate these nonverbal elements, limiting their capacity to create fully immersive conversational experiences. We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text, bridging this gap in conversational AI. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language. Leveraging VENUS, we train MARS with a next-token prediction objective, combining text with vector-quantized nonverbal representations to achieve multimodal understanding and generation within a unified framework. Based on various analyses of the VENUS datasets, we validate its substantial scale and high effectiveness. Our quantitative and qualitative results demonstrate that MARS successfully generates text and nonverbal languages, corresponding to conversational input.
中文: MARS是一种多模态语言模型,通过结合文本与表情、肢体语言等非语言线索,并利用VENUS数据集训练,实现了更沉浸式的对话AI体验。
English: MARS is a multimodal language model that integrates nonverbal cues like facial expressions and body language with text, using the VENUS dataset to enable more immersive conversational AI experiences.
Authors:Geonu Lee, Yujeong Oh, Geonhui Jang, Soyoung Lee, Jeonghyo Song, Sungmin Cha, YoungJoon Yoo
Abstract:
In this paper, we introduce a new benchmark for continual learning in anomaly detection, aimed at better reflecting real-world deployment scenarios. Our benchmark, Continual-MEGA, includes a large and diverse dataset that significantly expands existing evaluation settings by combining carefully curated existing datasets with our newly proposed dataset, ContinualAD. In addition to standard continual learning at an expanded scale, we propose a novel scenario that measures zero-shot generalization to unseen classes, i.e., classes not observed during continual adaptation. This poses a new problem setting in which continual adaptation should also enhance zero-shot performance. We also present a unified baseline algorithm that improves robustness in few-shot detection and maintains strong generalization. Through extensive evaluations, we report three key findings: (1) existing methods show substantial room for improvement, particularly in pixel-level defect localization; (2) our proposed method consistently outperforms prior approaches; and (3) the newly introduced ContinualAD dataset enhances the performance of strong anomaly detection models. We release the benchmark and code at https://github.com/Continual-Mega/Continual-Mega.
中文: 本文提出了持续学习异常检测的新基准Continual-MEGA,包含多样化数据集和零样本泛化新场景,评估表明现有方法仍需改进且所提方法优于先前方案。
English: This paper introduces Continual-MEGA, a new benchmark for continual learning in anomaly detection that includes a diverse dataset and a novel scenario for zero-shot generalization, with evaluations showing existing methods need improvement and the proposed approach outperforms prior ones.
Authors:Qiao Xiao, Boqian Wu, Andrey Poddubnyy, Elena Mocanu, Phuong H. Nguyen, Mykola Pechenizkiy, Decebal Constantin Mocanu
Abstract:
Federated learning (FL) enables collaborative model training across decentralized clients while preserving data privacy, leveraging aggregated updates to build robust global models. However, this training paradigm faces significant challenges due to data heterogeneity and limited local datasets, which often impede effective collaboration. In such scenarios, we identify the Layer-wise Inertia Phenomenon in FL, wherein the middle layers of global model undergo minimal updates after early communication rounds, ultimately limiting the effectiveness of global aggregation. We demonstrate the presence of this phenomenon across a wide range of federated settings, spanning diverse datasets and architectures. To address this issue, we propose LIPS (Layer-wise Inertia Phenomenon with Sparsity), a simple yet effective method that periodically introduces transient sparsity to stimulate meaningful updates and empower global aggregation. Experiments demonstrate that LIPS effectively mitigates layer-wise inertia, enhances aggregation effectiveness, and improves overall performance in various FL scenarios. This work not only deepens the understanding of layer-wise learning dynamics in FL but also paves the way for more effective collaboration strategies in resource-constrained environments. Our code is publicly available at: https://github.com/QiaoXiao7282/LIPS.
中文摘要:本研究揭示了联邦学习中的层间惯性现象,即模型中间层在早期训练后更新停滞,并提出LIPS方法通过周期性稀疏化激活层间更新,有效提升了不同联邦场景下的模型聚合性能。
English Summary: This study identifies the Layer-wise Inertia Phenomenon in federated learning where middle layers stagnate after initial training rounds, and proposes LIPS—a sparsity-based method that reactivates these layers to enhance global model performance across diverse federated scenarios.
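The transient-sparsity idea can be sketched in a few lines (our interpretation of the mechanism; the layer selection, schedule, and sparsity level below are assumptions, not the LIPS code): every few rounds, a random fraction of middle-layer weights is temporarily zeroed so that subsequent local training is forced to produce meaningful updates there.

```python
# Sketch of periodic transient sparsity on middle layers.
import torch

def apply_transient_sparsity(model, middle_layers, sparsity=0.2):
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if any(name.startswith(l) for l in middle_layers):
                mask = (torch.rand_like(p) > sparsity).float()
                p.mul_(mask)          # transiently drop ~20% of the weights
                masks[name] = mask
    return masks                      # masks could also gate later gradients

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8),
                            torch.nn.Linear(8, 2))
round_idx, period = 10, 5
if round_idx % period == 0:           # assumed schedule
    apply_transient_sparsity(model, middle_layers=["1."], sparsity=0.2)
```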
Authors:Yongqi Li, Shen Zhou, Xiaohu Li, Xin Miao, Jintao Wen, Mayi Xu, Jianhao Chen, Birong Pan, Hankun Kang, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
Abstract:
Vision-language models (VLMs) aligned with general human objectives, such as being harmless and hallucination-free, have become valuable assistants of humans in managing visual tasks. However, people with diversified backgrounds have different cognition even in the same situation. Consequently, they may have personalized expectations for VLM assistants. This highlights the urgent need to align VLM assistants with personalized situated cognition for real-world assistance. To study this problem, we first simplify it by characterizing individuals based on the sociological concept of Role-Set. Then, we propose to evaluate the individuals' actions to examine whether the personalized alignment is achieved. Further, we construct a benchmark named PCogAlignBench, which includes 18k instances and 20 individuals with different Role-Sets. Finally, we present a framework called PCogAlign, which constructs a cognition-aware and action-based reward model for personalized alignment. Experimental results and human evaluations demonstrate the reliability of the PCogAlignBench and the effectiveness of our proposed PCogAlign. We will open-source the constructed benchmark and code at https://github.com/NLPGM/PCogAlign.
中文摘要:视觉语言模型需要个性化对齐以适应不同的人类认知,为此构建了PCogAlignBench基准和PCogAlign框架,以实现有效的个性化辅助。
English Summary: Vision-language models need personalized alignment to match diverse human cognition, prompting the creation of PCogAlignBench and PCogAlign framework for effective individualized assistance.
Authors:Tianrui Pan, Jie Liu, Zewen Huang, Jie Tang, Gangshan Wu
Abstract:
To enhance immersive experiences, binaural audio offers spatial awareness of sounding objects in AR, VR, and embodied AI applications. While existing audio spatialization methods can generally map any available monaural audio to binaural audio signals, they often lack the flexible and interactive control needed in complex multi-object user-interactive environments. To address this, we propose a Text-guided Audio Spatialization (TAS) framework that utilizes flexible text prompts and evaluates our model from unified generation and comprehension perspectives. Due to the limited availability of premium and large-scale stereo data, we construct the SpatialTAS dataset, which encompasses 376,000 simulated binaural audio samples to facilitate the training of our model. Our model learns binaural differences guided by 3D spatial location and relative position prompts, augmented by flipped-channel audio. It outperforms existing methods on both simulated and real-recorded datasets, demonstrating superior generalization and accuracy. In addition, we develop an assessment model based on Llama-3.1-8B, which evaluates the spatial semantic coherence between our generated binaural audio and text prompts through a spatial reasoning task. Results demonstrate that text prompts provide flexible and interactive control to generate binaural audio with excellent quality and semantic consistency in spatial locations. The dataset is available at https://github.com/Alice01010101/TASU.
中文摘要:本文提出了一种文本引导的音频空间化(TAS)框架,通过文本提示生成具有精确空间控制的高质量双耳音频,利用新构建的数据集超越了现有方法,并在空间定位语义一致性方面表现出色。
English Summary: This paper introduces a Text-guided Audio Spatialization (TAS) framework that uses text prompts to generate high-quality binaural audio with precise spatial control, outperforming existing methods through a novel dataset and achieving superior semantic consistency in spatial positioning.
Authors:Lennart Bramlage, Cristóbal Curio
Abstract:
Uncertainty quantification is critical in safety-sensitive applications but is often omitted from off-the-shelf neural networks due to adverse effects on predictive performance. Retrofitting uncertainty estimates post-hoc typically requires access to model parameters or gradients, limiting feasibility in practice. We propose a theoretically grounded framework for post-hoc uncertainty estimation in regression tasks by fitting an auxiliary model to both original inputs and frozen model outputs. Drawing from principles of maximum likelihood estimation and sequential parameter fitting, we formalize an exact post-hoc optimization objective that recovers the canonical MLE of Gaussian parameters, without requiring sampling or approximation at inference. While prior work has used model outputs to estimate uncertainty, we explicitly characterize the conditions under which this is valid and demonstrate the extent to which structured outputs can support quasi-epistemic inference. We find that using diverse auxiliary data, such as augmented subsets of the original training data, significantly enhances OOD detection and metric performance. Our hypothesis that frozen model outputs contain generalizable latent information about model error and predictive uncertainty is tested and confirmed. Finally, we ensure that our method maintains proper estimation of input-dependent uncertainty without relying exclusively on base model forecasts. These findings are demonstrated in toy problems and adapted to both UCI and depth regression benchmarks. Code: https://github.com/biggzlar/IO-CUE.
Chinese: 本文提出了一种基于理论的后验不确定性估计框架,通过将辅助模型拟合到原始输入和冻结模型输出中,有效提升了分布外检测性能,并在不依赖基础模型参数的情况下保持了不确定性量化的准确性。
English: This paper introduces a theoretically grounded framework for post-hoc uncertainty estimation in regression by fitting an auxiliary model to inputs and frozen model outputs, which enhances out-of-distribution detection and maintains accurate uncertainty quantification without relying on base model parameters.
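A compact sketch of the post-hoc setup under our assumptions (the architecture and sizes are made up; this is not the IO-CUE implementation): an auxiliary network receives the original input together with the frozen base model's output and is fit with the Gaussian negative log-likelihood, recovering MLE mean and variance without touching base parameters or gradients.

```python
# Auxiliary Gaussian NLL model over (input, frozen base output) pairs.
import torch
import torch.nn as nn

class AuxUncertainty(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + out_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * out_dim))
    def forward(self, x, base_pred):
        mu, log_var = self.net(torch.cat([x, base_pred], dim=-1)).chunk(2, dim=-1)
        return mu, log_var

base = nn.Linear(4, 1)                  # stands in for the frozen base model
for p in base.parameters():
    p.requires_grad_(False)             # no access to base gradients needed

aux = AuxUncertainty(4, 1)
opt = torch.optim.Adam(aux.parameters(), lr=1e-3)
x, y = torch.randn(128, 4), torch.randn(128, 1)
for _ in range(100):
    mu, log_var = aux(x, base(x))
    nll = 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()  # Gaussian NLL
    opt.zero_grad(); nll.backward(); opt.step()
```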
Authors:Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Dongseop Kim, Sung Ju Hwang
Abstract:
Knowledge distillation (KD) is a widely used framework for training compact, task-specific models by transferring the knowledge from teacher models. However, its application to active learning (AL), which aims to minimize annotation costs through iterative sample selection, remains underexplored. This gap stems from the fact that KD typically assumes access to sufficient labeled data, whereas AL operates in data-scarce scenarios where task-specific teacher models are often unavailable. In this paper, we first introduce ActiveKD, a framework that integrates AL with KD by leveraging the zero- and few-shot capabilities of large vision-language models (VLMs). A key aspect of ActiveKD is the structured prediction bias of VLMs, i.e., their predictions form clusters in the probability space. We regard this structure as an inductive bias of the teacher model, capturing generalizable output patterns beneficial to student learning. To exploit this bias, we propose Probabilistic CoreSet (PCoreSet), a selection strategy that maximizes coverage in the probability space rather than the feature space. PCoreSet strategically selects probabilistically diverse unlabeled samples, facilitating more efficient transfer of teacher knowledge under limited annotation budgets. Extensive evaluations on 11 datasets show that ActiveKD consistently improves performance across selection methods (e.g., +29.07% on ImageNet, averaged over methods). Under ActiveKD, PCoreSet ranks first in 64/73 settings (approximately 87.7%) across 5 student and 3 teacher networks, achieving the best performance in all but the first two AL rounds. Our code is available at https://github.com/erjui/PCoreSet.
中文:ActiveKD框架将主动学习与知识蒸馏相结合,利用视觉语言模型的结构化预测偏差,在有限标注预算下实现高效知识迁移,在多个数据集上取得显著性能提升。
English: ActiveKD integrates active learning with knowledge distillation by leveraging the structured prediction bias of vision-language models to efficiently transfer knowledge under limited annotation budgets, achieving significant performance improvements across multiple datasets.
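Selection that maximizes coverage in the probability space can be sketched as greedy k-center over teacher softmax vectors (our reading of the abstract; the seeding rule and distance metric are assumptions, not the released PCoreSet code):

```python
# Greedy k-center selection in the probability simplex.
import numpy as np

def pcoreset_select(probs, budget):
    # probs: (N, C) softmax outputs of the VLM teacher on the unlabeled pool
    selected = [int(np.argmax(probs.std(axis=1)))]   # arbitrary seed point
    dists = np.linalg.norm(probs - probs[selected[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dists))                  # most "uncovered" sample
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(probs - probs[nxt], axis=1))
    return selected

pool = np.random.dirichlet(np.ones(10), size=1000)   # stand-in predictions
print(pcoreset_select(pool, budget=8))
```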
Authors:Jinfeng Zhou, Yuxuan Chen, Yihan Shi, Xuanming Zhang, Leqi Lei, Yi Feng, Zexuan Xiong, Miao Yan, Xunzhi Wang, Yaru Cao, Jianing Yin, Shuai Wang, Quanyu Dai, Zhenhua Dong, Hongning Wang, Minlie Huang
Abstract:
LLMs exhibit promising Social Intelligence (SI) in modeling human behavior, raising the need to evaluate LLMs' SI and their discrepancy with humans. SI equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation, which existing work fails to address. To this end, we propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts. Each script is structured as a world tree that contains plot lines driven by interpersonal ability, providing a comprehensive view of how LLMs navigate social interactions. Experiments show that LLMs fall behind humans on both SI evaluations, exhibit prosociality, and prefer more positive social behaviors, even if they lead to goal failure. Analysis of LLMs' formed representation space and neuronal activations reveals that LLMs have developed ability-specific functional partitions akin to the human brain.
中文摘要:大语言模型展现出有前景的社会智能,但在目标达成和人际能力方面仍落后于人类,这通过整合结果导向与过程导向评估的SocialEval基准测试得以验证。
English Summary: LLMs demonstrate promising social intelligence but still lag behind humans in both goal achievement and interpersonal abilities, as evaluated by the SocialEval benchmark which integrates outcome- and process-oriented assessments.
Authors:Keyuan Cheng, Xudong Shen, Yihao Yang, Tengyue Wang, Yang Cao, Muhammad Asif Ali, Hanbin Wang, Lijie Hu, Di Wang
Abstract:
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4O achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.
中文: 该摘要介绍了CODEMENV这一评估大语言模型在代码迁移任务中表现的新基准,结果显示平均通过率为26.50%,并揭示了模型对新版本函数的熟练度以及偶尔出现的逻辑不一致问题。
English: This abstract introduces CODEMENV, a new benchmark for evaluating large language models' performance in code migration tasks, revealing an average pass rate of 26.50% and highlighting both their proficiency with newer function versions and occasional logical inconsistencies.
Authors:Sa Zhu, Huashan Chen, Wanqian Zhang, Jinchao Zhang, Zexian Yang, Xiaoshuai Hao, Bo Li
Abstract:
Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments, wherein event modeling is crucial for partitioning the video into smaller temporal events that partially correspond to the text. Previous methods typically segment videos into a fixed number of equal-length clips, resulting in ambiguous event boundaries. Additionally, they rely on mean pooling to compute event representations, inevitably introducing undesired misalignment. To address these, we propose an Uneven Event Modeling (UEM) framework for PRVR. We first introduce the Progressive-Grouped Video Segmentation (PGVS) module, to iteratively formulate events in light of both temporal dependencies and semantic similarity between consecutive frames, enabling clear event boundaries. Furthermore, we also propose the Context-Aware Event Refinement (CAER) module to refine the event representation conditioned on the text's cross-attention. This enables event representations to focus on the most relevant frames for a given text, facilitating more precise text-video alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two PRVR benchmarks. Code is available at https://github.com/Sasa77777779/UEM.git.
中文摘要:本研究提出的非均匀事件建模(UEM)框架通过渐进式视频分割明确事件边界,并利用上下文感知优化实现精准的文本-视频对齐,在部分相关视频检索任务中取得了最优性能。
English Summary: The proposed Uneven Event Modeling (UEM) framework addresses limitations in partially relevant video retrieval by introducing progressive video segmentation for clear event boundaries and context-aware refinement for precise text-video alignment, achieving state-of-the-art results.
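A simplified sketch of similarity-driven uneven segmentation in the spirit of PGVS (the threshold and features are our own stand-ins): consecutive frames whose cosine similarity dips below a threshold start a new event, yielding variable-length clips instead of fixed-length ones.

```python
# Variable-length event segmentation from consecutive-frame similarity.
import numpy as np

def segment_events(frame_feats, sim_thresh=0.8):
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = (f[:-1] * f[1:]).sum(axis=1)      # cosine sim of consecutive frames
    boundaries = np.where(sims < sim_thresh)[0] + 1
    return np.split(np.arange(len(frame_feats)), boundaries)

feats = np.random.randn(20, 512)             # stand-in frame embeddings
events = segment_events(feats)
print([len(e) for e in events])              # uneven event lengths
```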
Authors:Parul Gupta, Shreya Ghosh, Tom Gedeon, Thanh-Toan Do, Abhinav Dhall
Abstract:
The rapid advancement of GenAI technology over the past few years has significantly contributed towards highly realistic deepfake content generation. Despite ongoing efforts, the research community still lacks a large-scale and reasoning capability driven deepfake benchmark dataset specifically tailored for person-centric object, context and scene manipulations. In this paper, we address this gap by introducing MultiFakeVerse, a large scale person-centric deepfake dataset, comprising 845,286 images generated through manipulation suggestions and image manipulations both derived from vision-language models (VLM). The VLM instructions were specifically targeted towards modifications to individuals or contextual elements of a scene that influence human perception of importance, intent, or narrative. This VLM-driven approach enables semantic, context-aware alterations such as modifying actions, scenes, and human-object interactions rather than synthetic or low-level identity swaps and region-specific edits that are common in existing datasets. Our experiments reveal that current state-of-the-art deepfake detection models and human observers struggle to detect these subtle yet meaningful manipulations. The code and dataset are available at https://github.com/Parul-Gupta/MultiFakeVerse.
中文: 本文提出了MultiFakeVerse这一大规模人物中心深度伪造数据集,通过视觉语言模型生成具有语义意义的篡改内容,现有检测方法和人类观察者均难以识别这些篡改。
English: This paper introduces MultiFakeVerse, a large-scale person-centric deepfake dataset created using vision-language models to generate semantically meaningful manipulations that current detection methods and humans find challenging to identify.
Authors:Nidhi Kowtal, Raviraj Joshi
Abstract:
Emotion recognition in low-resource languages like Marathi remains challenging due to limited annotated data. We present L3Cube-MahaEmotions, a high-quality Marathi emotion recognition dataset with 11 fine-grained emotion labels. The training data is synthetically annotated using large language models (LLMs), while the validation and test sets are manually labeled to serve as a reliable gold-standard benchmark. Building on the MahaSent dataset, we apply the Chain-of-Translation (CoTR) prompting technique, where Marathi sentences are translated into English and emotion labeled via a single prompt. GPT-4 and Llama3-405B were evaluated, with GPT-4 selected for training data annotation due to superior label quality. We evaluate model performance using standard metrics and explore label aggregation strategies (e.g., Union, Intersection). While GPT-4 predictions outperform fine-tuned BERT models, BERT-based models trained on synthetic labels fail to surpass GPT-4. This highlights both the importance of high-quality human-labeled data and the inherent complexity of emotion recognition. An important finding of this work is that generic LLMs like GPT-4 and Llama3-405B generalize better than fine-tuned BERT for complex low-resource emotion recognition tasks. The dataset and model are shared publicly at https://github.com/l3cube-pune/MarathiNLP
中文摘要:本研究推出L3Cube-MahaEmotions马拉地语情感数据集,通过合成标注与人工验证相结合的方式,证明通用大语言模型(如GPT-4)在低资源情感识别任务中优于微调的BERT模型。
English Summary: This study introduces L3Cube-MahaEmotions, a Marathi emotion dataset combining synthetic LLM annotations with human-verified labels, demonstrating that general-purpose LLMs like GPT-4 outperform fine-tuned BERT models in low-resource emotion recognition tasks.
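A hedged illustration of a Chain-of-Translation (CoTR) style prompt as the abstract describes it, translate first, then label in one prompt. The wording and the label set below are our own placeholders, not the paper's prompt or taxonomy:

```python
# Illustrative single-prompt CoTR template (label set is assumed).
EMOTIONS = ["anger", "joy", "sadness", "fear", "surprise", "disgust",
            "trust", "anticipation", "neutral", "love", "shame"]  # 11 assumed labels

def cotr_prompt(marathi_sentence: str) -> str:
    return (
        "Step 1: Translate the following Marathi sentence into English.\n"
        f"Marathi: {marathi_sentence}\n"
        "Step 2: Based on the English translation, choose the emotion "
        f"expressed, from: {', '.join(EMOTIONS)}.\n"
        "Answer with the translation followed by the single emotion label."
    )

print(cotr_prompt("उदाहरण वाक्य"))
```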
Authors:Haixin Wang, Jiashu Pan, Hao Wu, Fan Zhang, Tailin Wu
Abstract:
Modeling complex fluid systems, especially turbulence governed by partial differential equations (PDEs), remains a fundamental challenge in science and engineering. Recently, diffusion-based generative models have gained attention as a powerful approach for these tasks, owing to their capacity to capture long-range dependencies and recover hierarchical structures. However, we present both empirical and theoretical evidence showing that generative models struggle with significant spectral bias and common-mode noise when generating high-fidelity turbulent flows. Here we propose FourierFlow, a novel generative turbulence modeling framework that enhances the frequency-aware learning by both implicitly and explicitly mitigating spectral bias and common-mode noise. FourierFlow comprises three key innovations. Firstly, we adopt a dual-branch backbone architecture, consisting of a salient flow attention branch with local-global awareness to focus on sensitive turbulence areas. Secondly, we introduce a frequency-guided Fourier mixing branch, which is integrated via an adaptive fusion strategy to explicitly mitigate spectral bias in the generative model. Thirdly, we leverage the high-frequency modeling capabilities of the masked auto-encoder pre-training and implicitly align the features of the generative model toward high-frequency components. We validate the effectiveness of FourierFlow on three canonical turbulent flow scenarios, demonstrating superior performance compared to state-of-the-art methods. Furthermore, we show that our model exhibits strong generalization capabilities in challenging settings such as out-of-distribution domains, long-term temporal extrapolation, and robustness to noisy inputs. The code can be found at https://github.com/AI4Science-WestlakeU/FourierFlow.
中文: FourierFlow是一种创新的生成式湍流建模框架,通过双分支架构和频率感知学习解决频谱偏差和共模噪声问题,在湍流场景中展现出卓越性能和强大泛化能力。
English: FourierFlow is a novel generative turbulence modeling framework that addresses spectral bias and common-mode noise through dual-branch architecture and frequency-aware learning, demonstrating superior performance and strong generalization in turbulent flow scenarios.
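As a stand-in for the frequency-guided Fourier mixing branch (FourierFlow's exact design is not reproduced here), the following FNO-style layer shows the general mechanism: learnable complex weights rescale retained Fourier modes, giving the model a direct handle on frequency content and hence on spectral bias.

```python
# FNO-style frequency mixing layer as a hedged stand-in.
import torch
import torch.nn as nn

class FourierMix(nn.Module):
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weight = nn.Parameter(
            scale * torch.randn(channels, modes, modes, dtype=torch.cfloat))

    def forward(self, x):                      # x: (B, C, H, W)
        x_ft = torch.fft.rfft2(x)              # to the frequency domain
        out_ft = torch.zeros_like(x_ft)
        m = self.modes
        out_ft[:, :, :m, :m] = x_ft[:, :, :m, :m] * self.weight
        return torch.fft.irfft2(out_ft, s=x.shape[-2:])

layer = FourierMix(channels=4, modes=8)
print(layer(torch.randn(2, 4, 32, 32)).shape)  # torch.Size([2, 4, 32, 32])
```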
Authors:Parismita Gogoi, Vishwanath Pratap Singh, Seema Khadirnaikar, Soma Siddhartha, Sishir Kalita, Jagabandhu Mishra, Md Sahidullah, Priyankoo Sarmah, S. R. M. Prasanna
Abstract:
This study explores the potential of Rhythm Formant Analysis (RFA) to capture long-term temporal modulations in dementia speech. Specifically, we introduce RFA-derived rhythm spectrograms as novel features for dementia classification and regression tasks. We propose two methodologies: (1) handcrafted features derived from rhythm spectrograms, and (2) a data-driven fusion approach, integrating proposed RFA-derived rhythm spectrograms with vision transformer (ViT) for acoustic representations along with BERT-based linguistic embeddings. We compare these with existing features. Notably, our handcrafted features outperform eGeMAPs with a relative improvement of 14.2% in classification accuracy and comparable performance in the regression task. The fusion approach also shows improvement, with RFA spectrograms surpassing Mel spectrograms in classification with a relative improvement of around 13.1% and a comparable regression score with the baselines.
中文: 本研究引入节奏共振峰分析(RFA)生成节奏谱图用于痴呆症分类与回归任务,结果表明手工特征及与视觉变换器的融合方法均优于现有技术,准确率显著提升。
English: This study introduces Rhythm Formant Analysis (RFA) to create rhythm spectrograms for dementia classification and regression, demonstrating that both handcrafted features and a fusion approach with vision transformers outperform existing methods with significant accuracy improvements.
Authors:Sau Lai Yip, Sunan He, Yuxiang Nie, Shu Pui Chan, Yilin Ye, Sum Ying Lam, Hao Chen
Abstract:
The accelerating development of general medical artificial intelligence (GMAI), powered by multimodal large language models (MLLMs), offers transformative potential for addressing persistent healthcare challenges, including workforce deficits and escalating costs. The parallel development of systematic evaluation benchmarks emerges as a critical imperative to enable performance assessment and provide technological guidance. Meanwhile, as an invaluable knowledge source, the potential of medical textbooks for benchmark development remains underexploited. Here, we present MedBookVQA, a systematic and comprehensive multimodal benchmark derived from open-access medical textbooks. To curate this benchmark, we propose a standardized pipeline for automated extraction of medical figures while contextually aligning them with corresponding medical narratives. Based on this curated data, we generate 5,000 clinically relevant questions spanning modality recognition, disease classification, anatomical identification, symptom diagnosis, and surgical procedures. A multi-tier annotation system categorizes queries through hierarchical taxonomies encompassing medical imaging modalities (42 categories), body anatomies (125 structures), and clinical specialties (31 departments), enabling nuanced analysis across medical subdomains. We evaluate a wide array of MLLMs, including proprietary, open-source, medical, and reasoning models, revealing significant performance disparities across task types and model categories. Our findings highlight critical capability gaps in current GMAI systems and establish textbook-derived multimodal benchmarking as a critical paradigm for advancing clinical AI, exposing limitations of current systems while providing anatomically structured performance metrics across specialties.
中文: MedBookVQA通过从医学教科书提取多模态数据构建了系统性评估基准,揭示了通用医疗AI在临床任务中的显著能力差距,确立了基于教科书的多模态基准对推进临床AI发展的关键价值。
English: MedBookVQA introduces a comprehensive multimodal benchmark derived from medical textbooks to evaluate general medical AI systems, revealing significant performance gaps across clinical tasks while establishing textbook-based evaluation as essential for advancing clinical AI.
Authors:Jisheng Dang, Yizhou Zhang, Hao Ye, Teng Wang, Siming Chen, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu
Abstract:
Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving an optimal balance between cost and data quality. Second, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the optimization, explicitly preserves the model's language capability to avoid deviation of the optimization objective, and improves training efficiency by eliminating the need for the reference model. We extensively evaluate SynPO not only on video captioning benchmarks (e.g., VDC, VDD, VATEX) but also across well-established NLP tasks, including general language understanding and preference evaluation, using diverse pretrained models. Results demonstrate that SynPO consistently outperforms DPO variants while achieving a 20% improvement in training efficiency. Code is available at https://github.com/longmalongma/SynPO.
中文摘要:本文提出协同偏好优化(SynPO)新方法,通过构建高质量偏好对和优化训练过程,在细粒度视频描述任务中有效提升性能并实现20%的训练效率提升。
English Summary: This paper introduces Synergistic Preference Optimization (SynPO), a novel method that enhances fine-grained video captioning by improving training efficiency and preventing optimization issues found in direct preference optimization, achieving a 20% efficiency gain.
Authors:Keyuan Cheng, Zijian Kan, Zhixian He, Zhuoran Zhang, Muhammad Asif Ali, Ke Xu, Lijie Hu, Di Wang
Abstract:
Knowledge Editing, which efficiently modifies the knowledge in large language models, has gathered great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning, involving one-to-many relationships or multi-step logical intersections. To fill in this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models. For instance, MeLLo attains an accuracy of 39.47 on GPT-4O-MINI, but this drops sharply to 3.83 on QWEN2.5-3B. We further investigate the underlying causes of these disparities from both methodological and model-specific perspectives. The datasets are available at https://github.com/kzjkzj666/CompKE.
中文: 本文提出了COMPKE新基准,通过复杂的现实问题评估大语言模型的知识编辑能力,发现不同模型和方法的效果存在显著差异。
English: This paper introduces COMPKE, a new benchmark designed to evaluate knowledge editing in large language models through complex, real-life questions, revealing significant performance variations across different models and methods.
Authors:Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Zhengwen Feng, Hao Peng, Jianwei Yin
Abstract:
Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the "truth direction", which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is public at https://github.com/colored-dye/truthfulness_probe_generalization
Chinese Summary: 本研究探讨了大语言模型中“真实性方向”的一致性和泛化能力,发现其因模型而异,并能有效应用于多种任务以提升输出真实性,从而增强用户信任。
English Summary: This study investigates the consistency and generalizability of "truth directions" in large language models, finding they vary across models and can be effectively applied to enhance truthfulness in various tasks, thereby improving user trust.
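The classic recipe this line of work builds on can be sketched with a difference-of-means ("mass-mean") probe on synthetic data; the construction below is illustrative, not the paper's experimental setup:

```python
# Difference-of-means "truth direction" probe on synthetic hidden states.
import numpy as np

dim = 256
direction_gt = np.random.randn(dim)                 # synthetic ground truth
h_true = np.random.randn(500, dim) + 0.5 * direction_gt
h_false = np.random.randn(500, dim) - 0.5 * direction_gt

truth_dir = h_true.mean(axis=0) - h_false.mean(axis=0)   # mass-mean direction
truth_dir /= np.linalg.norm(truth_dir)

h_all = np.vstack([h_true, h_false])
labels = np.array([1] * 500 + [0] * 500)
midpoint = (h_true.mean(0) + h_false.mean(0)) @ truth_dir / 2
pred = h_all @ truth_dir > midpoint                 # classify by projection
print("accuracy:", (pred == labels).mean())
```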
Authors:Jiatong Li, Libo Zhu, Haotong Qin, Jingkai Wang, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Abstract:
Diffusion models have been achieving remarkable performance in face restoration. However, the heavy computations of diffusion models make it difficult to deploy them on devices like smartphones. In this work, we propose QuantFace, a novel low-bit quantization for one-step diffusion face restoration models, where the full-precision (i.e., 32-bit) weights and activations are quantized to 4 to 6 bits. We first analyze the data distribution within activations and find that they are highly variant. To preserve the original data information, we employ rotation-scaling channel balancing. Furthermore, we propose Quantization-Distillation Low-Rank Adaptation (QD-LoRA) that jointly optimizes for quantization and distillation performance. Finally, we propose an adaptive bit-width allocation strategy. We formulate such a strategy as an integer programming problem, which combines quantization error and perceptual metrics to find a satisfactory resource allocation. Extensive experiments on the synthetic and real-world datasets demonstrate the effectiveness of QuantFace under 6-bit and 4-bit. QuantFace achieves significant advantages over recent leading low-bit quantization methods for face restoration. The code is available at https://github.com/jiatongli2024/QuantFace.
中文摘要:QuantFace提出了一种针对一步扩散人脸修复模型的低比特量化方法,通过旋转缩放通道平衡和QD-LoRA等技术在降低计算量的同时保持性能,使其更适合在智能手机等设备上部署。
English Summary: QuantFace introduces a low-bit quantization method for one-step diffusion face restoration models, employing techniques like rotation-scaling channel balancing and QD-LoRA to optimize performance while reducing computations for deployment on devices like smartphones.
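For context, a generic per-channel symmetric uniform quantizer shows the 4-to-6-bit setting QuantFace starts from; the rotation-scaling channel balancing and QD-LoRA components are more involved and are not reproduced here.

```python
# Generic per-channel symmetric uniform quantization of a weight matrix.
import torch

def quantize_per_channel(w, bits=4):
    # w: (out_channels, in_features)
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                       # dequantized weights

w = torch.randn(64, 128)
for bits in (4, 6, 8):
    err = (quantize_per_channel(w, bits) - w).abs().mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```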
Authors:Xiang Zhang, Run He, Jiao Chen, Di Fang, Ming Li, Ziqian Zeng, Cen Chen, Huiping Zhuang
Abstract:
Class-incremental learning (CIL) enables models to learn new classes continually without forgetting previously acquired knowledge. Multi-label CIL (MLCIL) extends CIL to a real-world scenario where each sample may belong to multiple classes, introducing several challenges: label absence, which leads to incomplete historical information due to missing labels, and class imbalance, which results in the model bias toward majority classes. To address these challenges, we propose Label-Augmented Analytic Adaptation (L3A), an exemplar-free approach without storing past samples. L3A integrates two key modules. The pseudo-label (PL) module implements label augmentation by generating pseudo-labels for current phase samples, addressing the label absence problem. The weighted analytic classifier (WAC) derives a closed-form solution for neural networks. It introduces sample-specific weights to adaptively balance the class contribution and mitigate class imbalance. Experiments on MS-COCO and PASCAL VOC datasets demonstrate that L3A outperforms existing methods in MLCIL tasks. Our code is available at https://github.com/scut-zx/L3A.
中文:提出的标签增强解析适应(L3A)框架通过生成伪标签弥补历史标签缺失,并采用加权解析分类器缓解类别不平衡,在无需存储历史样本的情况下,于多标签类增量学习任务中实现了优于现有方法的性能。
English: The proposed Label-Augmented Analytic Adaptation (L3A) framework addresses multi-label class-incremental learning challenges by generating pseudo-labels to compensate for missing historical labels and employing a weighted analytic classifier to mitigate class imbalance, demonstrating superior performance on benchmark datasets without storing past exemplars.
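The weighted analytic classifier admits a closed form analogous to weighted ridge regression. The sketch below uses inverse-class-frequency sample weights as an assumed weighting scheme; the paper's actual weights are derived differently.

```python
# Closed-form weighted analytic classifier (weighted ridge regression).
import numpy as np

def weighted_analytic_classifier(X, Y, sample_weights, lam=1.0):
    # X: (N, D) features, Y: (N, C) multi-hot labels
    S = np.diag(sample_weights)
    return np.linalg.solve(X.T @ S @ X + lam * np.eye(X.shape[1]), X.T @ S @ Y)

X = np.random.randn(200, 32)
Y = (np.random.rand(200, 5) > 0.7).astype(float)       # multi-label targets
freq = Y.sum(axis=0) / len(Y)
w = (Y * (1.0 / np.maximum(freq, 1e-8))).sum(axis=1) + 1e-3  # rarer => heavier
W = weighted_analytic_classifier(X, Y, w)
print("classifier weights:", W.shape)                  # (32, 5)
```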
Authors:Jingyi Xi, Chenghao Mo, Benjamin Karsin, Artem Chirkin, Mingqin Li, Minjia Zhang
Abstract:
Vector search and database systems have become a keystone component in many AI applications. While much prior research has investigated how to accelerate generic vector search, emerging AI applications require running more sophisticated vector queries efficiently, such as vector search with attribute filters. Unfortunately, recent filtered-ANNS solutions are primarily designed for CPUs; filtered-ANNS that takes advantage of the massive parallelism offered by GPUs remains little explored and limited in performance. In this paper, we present VecFlow, a novel high-performance vector filtered search system that achieves unprecedented high throughput and recall while obtaining low latency for filtered-ANNS on GPUs. We propose a novel label-centric indexing and search algorithm that significantly improves the selectivity of ANNS with filters. In addition to algorithmic level optimization, we provide architectural-aware optimization for VecFlow's functional modules, effectively supporting both small batch and large batch queries, and single-label and multi-label query processing. Experimental results on NVIDIA A100 GPU over several publicly available datasets validate that VecFlow achieves 5 million QPS for recall 90%, outperforming state-of-the-art CPU-based solutions such as Filtered-DiskANN by up to 135 times. Alternatively, VecFlow can easily extend its support to the high recall 99% regime, whereas strong GPU-based baselines plateau at around 80% recall. The source code is available at https://github.com/Supercomputing-System-AI-Lab/VecFlow.
Chinese: VecFlow是一种基于GPU的高性能向量搜索系统,在带过滤条件的近似最近邻搜索中实现了前所未有的吞吐量和召回率,性能比最先进的CPU解决方案高出135倍。
English: VecFlow is a high-performance GPU-based vector search system that achieves unprecedented throughput and recall for filtered approximate nearest neighbor search, outperforming state-of-the-art CPU solutions by up to 135 times.
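The label-centric idea reduces, in its simplest CPU form, to grouping vectors per label so a filtered query scans only candidates that already satisfy the filter. The brute-force sketch below is ours and says nothing about VecFlow's GPU kernels:

```python
# Toy label-centric index: per-label candidate lists, brute-force distance.
import numpy as np
from collections import defaultdict

class LabelCentricIndex:
    def __init__(self, vectors, labels):
        self.vectors = vectors
        self.by_label = defaultdict(list)
        for i, labs in enumerate(labels):
            for l in labs:
                self.by_label[l].append(i)

    def search(self, query, label, k=5):
        ids = np.array(self.by_label.get(label, []), dtype=int)
        if len(ids) == 0:
            return []
        d = np.linalg.norm(self.vectors[ids] - query, axis=1)
        return ids[np.argsort(d)[:k]].tolist()

vecs = np.random.randn(1000, 64).astype(np.float32)
labs = [np.random.choice(10, size=2, replace=False).tolist() for _ in range(1000)]
index = LabelCentricIndex(vecs, labs)
print(index.search(np.random.randn(64).astype(np.float32), label=3))
```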
Authors:Xuejiao Ma, Haibo Zhao, Zinuo Guo, Yijie Guo, Guanhong Liu, Bo Jiang
Abstract:
Drama-in-education is an interdisciplinary instructional approach that integrates subjects such as language, history, and psychology. Its core component is playwriting. Based on need-finding interviews with 13 teachers, we found that current general-purpose AI tools cannot effectively assist teachers and students during playwriting. We therefore propose CO-OPERA, a collaborative playwriting tool integrating generative artificial intelligence capabilities. In CO-OPERA, users can both expand their thinking through discussions with a tutor and converge their thinking by operating agents to generate script elements. Additionally, the system allows for iterative modifications and regenerations based on user requirements. A system usability test conducted with middle school students shows that CO-OPERA helps users focus on developing a logically coherent overall narrative during playwriting. Our playwriting examples and raw data for qualitative and quantitative analysis are available at https://github.com/daisyinb612/CO-OPERA.
中文摘要:CO-OPERA是一款集成生成式人工智能的协作剧本创作工具,通过导师对话和角色操作帮助用户发散与收敛创作思维,可用性测试表明该系统能有效支持用户在创作过程中聚焦整体叙事逻辑。
English Summary: CO-OPERA is an AI-powered collaborative playwriting tool designed to help users expand and refine their creative thinking through tutor discussions and script element generation, with usability tests showing it effectively supports logical narrative development.
Authors:Rong Wu, Pinlong Cai, Jianbiao Mei, Licheng Wen, Tao Hu, Xuemeng Yang, Daocheng Fu, Botian Shi
Abstract:
Large language models (LLMs) have made remarkable strides in various natural language processing tasks, but their performance on complex reasoning problems remains hindered by a lack of explainability and trustworthiness. This issue, often manifesting as hallucinations or unattributable reasoning processes, limits their applicability in complex reasoning scenarios. To address this, we propose Knowledge Graph-constrained Trajectory Reasoning Attribution and Chain Explanation Supervision (KG-TRACES), a novel framework that enhances the reasoning ability of LLMs through explicit supervision over reasoning paths and processes. KG-TRACES jointly supervises the model to: (1) predict symbolic relation paths, (2) predict full triple-level reasoning paths, and (3) generate attribution-aware reasoning processes grounded in the reasoning paths. At inference time, the model adapts to both KG-available and KG-unavailable scenarios, retrieving reasoning paths from a KG when possible or predicting plausible reasoning paths with only intrinsic knowledge when not. This design enables the model to reason in an explainable and source-attributable manner. Through extensive experiments on complex reasoning tasks, we demonstrate that KG-TRACES significantly outperforms existing SOTA: it improves Hits@1 by 1.6% and F1 by 4.7% on WebQSP, and achieves improvements of 4.8% in Hits@1 and 2.1% in F1 on CWQ. Moreover, we show its transferability to specialized domains such as medicine. By visualizing the intermediate steps of reasoning processes, we further show that the explicit supervision introduced by KG-TRACES leads to more stable and goal-directed reasoning processes, aligning closely with correct answers. Code is available at https://github.com/Edaizi/KG-TRACES.
Chinese: 提出的KG-TRACES框架通过监督推理路径和过程来增强大语言模型的复杂推理能力,在知识图谱可用与不可用场景下均实现了更优的性能与可解释性。
English: The proposed KG-TRACES framework enhances LLMs' complex reasoning by supervising reasoning paths and processes, achieving superior performance and explainability in both KG-available and KG-unavailable scenarios.
Authors:Md Tahmid Rahman Laskar, Israt Jahan, Elham Dolatabadi, Chun Peng, Enamul Hoque, Jimmy Huang
Abstract:
Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction, even in zero-shot scenarios. However, evaluating LLMs in this task remains challenging due to their ability to generate human-like text, often producing synonyms or abbreviations of gold-standard answers, making traditional automatic evaluation metrics unreliable. On the other hand, while human evaluation is more reliable, it is costly and time-consuming, making it impractical for real-world applications. This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction. We benchmark 8 LLMs as judges to evaluate the responses generated by 5 other LLMs across 3 biomedical relation extraction datasets. Unlike other text-generation tasks, we observe that LLM-based judges perform quite poorly (usually below 50% accuracy) in the biomedical relation extraction task. Our findings reveal that it happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses that helps LLM-Judges to improve their performance by about 15% (on average). We also introduce a domain adaptation technique to further enhance LLM-Judge performance by effectively transferring knowledge between datasets. We release both our human-annotated and LLM-annotated judgment data (36k samples in total) for public use here: https://github.com/tahmedge/llm_judge_biomedical_re.
中文摘要:本研究探讨使用大语言模型作为生物医学关系抽取的评估者,发现其低准确率源于输出格式不一致,并提出结构化格式和领域自适应方法,将性能平均提升约15%。
English Summary: This study explores using LLMs as judges for biomedical relation extraction, finding their low accuracy stems from inconsistent output formats and proposing structured formatting and domain adaptation to boost performance by about 15%.
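To make the structured-output idea concrete, here is a minimal sketch of a JSON-formatted LLM judge for relation extraction. It is an illustration only, not the paper's code; `call_llm` is a hypothetical stand-in for any chat-completion client, and the JSON schema is our choice.

```python
import json

# Prompt that forces a machine-parseable verdict; the schema is illustrative.
JUDGE_TEMPLATE = """You are judging a biomedical relation-extraction answer.
Gold relation: {gold}
Model answer: {answer}
Respond ONLY with JSON: {{"match": true or false, "reason": "<one sentence>"}}"""

def judge_relation(call_llm, gold: str, answer: str) -> bool:
    """Ask an LLM judge whether `answer` expresses the gold relation."""
    raw = call_llm(JUDGE_TEMPLATE.format(gold=gold, answer=answer))
    try:
        return bool(json.loads(raw)["match"])
    except (json.JSONDecodeError, KeyError):
        return False  # treat unparsable verdicts as non-matches
```

Constraining the judge to a fixed output format is what makes its verdicts comparable across models and datasets.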
Authors:Milad Khanchi, Maria Amer, Charalambos Poullis
Abstract:
Current motion-based multiple object tracking (MOT) approaches rely heavily on Intersection-over-Union (IoU) for object association. Without using 3D features, they are ineffective in scenarios with occlusions or visually similar objects. To address this, our paper presents a novel depth-aware framework for MOT. We estimate depth using a zero-shot approach and incorporate it as an independent feature in the association process. Additionally, we introduce a Hierarchical Alignment Score that refines IoU by integrating both coarse bounding box overlap and fine-grained (pixel-level) alignment to improve association accuracy without requiring additional learnable parameters. To our knowledge, this is the first MOT framework to incorporate 3D features (monocular depth) as an independent decision matrix in the association step. Our framework achieves state-of-the-art results on challenging benchmarks without any training or fine-tuning. The code is available at https://github.com/Milad-Khanchi/DepthMOT.
Chinese Summary: 本文提出一种深度感知多目标跟踪框架,通过将估计深度作为独立特征并结合分层对齐评分来改进目标关联,无需训练即实现最先进的性能。
English Summary: This paper introduces a depth-aware multiple object tracking framework that enhances object association by incorporating estimated depth as an independent feature and a hierarchical alignment score, achieving state-of-the-art performance without training.
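As a rough illustration of treating monocular depth as an independent cue during association (this is not the paper's Hierarchical Alignment Score), the sketch below blends IoU with a depth-similarity term in a single assignment cost; the weight `w` and the [0, 1] depth normalization are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(track_boxes, det_boxes, track_depths, det_depths, w=0.5):
    """Match tracks to detections with a combined IoU + depth-similarity cost.
    Depths are assumed normalized to [0, 1]; `w` balances the two cues."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (tb, td) in enumerate(zip(track_boxes, track_depths)):
        for j, (db, dd) in enumerate(zip(det_boxes, det_depths)):
            depth_sim = 1.0 - min(abs(td - dd), 1.0)
            cost[i, j] = -(w * iou(tb, db) + (1.0 - w) * depth_sim)
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching
    return list(zip(rows.tolist(), cols.tolist()))
```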
Authors:Boheng Sheng, Jiacheng Yao, Meicong Zhang, Guoxiu He
Abstract:
Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks separating semantically relevant content, leading to ambiguity and compromising accurate understanding. To overcome this limitation, we propose a straightforward approach for dynamically separating and selecting chunks of long context, facilitating a more streamlined input for LLMs. In particular, we compute semantic similarities between adjacent sentences, using lower similarities to adaptively divide long contexts into variable-length chunks. We further train a question-aware classifier to select sensitive chunks that are critical for answering specific questions. Experimental results on both single-hop and multi-hop question-answering benchmarks show that the proposed approach consistently outperforms strong baselines. Notably, it maintains robustness across a wide range of input lengths, handling sequences of up to 256k tokens. Our datasets and code are available at the following link: https://github.com/ECNU-Text-Computing/DCS
中文摘要:该方法通过语义相似度动态分割长文本为变长片段,并利用问题感知分类器筛选关键内容,显著提升大语言模型在长文本问答任务中的表现。
English Summary: The proposed method dynamically segments long texts into variable-length chunks based on semantic similarity and uses a question-aware classifier to select key segments, significantly improving LLMs' performance on long-context question-answering tasks.
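A minimal sketch of the adjacent-sentence similarity chunking described above, assuming a sentence-transformers embedding model and an illustrative threshold; the authors' DCS pipeline may differ in details.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def dynamic_chunks(sentences, threshold=0.55):
    """Split a list of sentences where adjacent-sentence cosine similarity
    drops below `threshold`, yielding variable-length chunks."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(np.dot(emb[i - 1], emb[i]))  # cosine; vectors are unit-norm
        if sim < threshold:  # low similarity -> likely topic boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

The question-aware chunk classifier would then score each chunk against the query and keep only the top-scoring ones as LLM input.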
Authors:Zihang Liu, Tianyu Pang, Oleg Balabanov, Chaoqun Yang, Tianjin Huang, Lu Yin, Yaoqing Yang, Shiwei Liu
Abstract:
Recent studies have shown that supervised fine-tuning of LLMs on a small number of high-quality datasets can yield strong reasoning capabilities. However, full fine-tuning (Full FT), while powerful, is computationally expensive and susceptible to overfitting and catastrophic forgetting, particularly when data is limited. Sparse fine-tuning, which previously achieved notable success by updating only a small subset of model parameters, offers a promising trade-off between efficiency and effectiveness. Yet, it has lagged behind in the LLM era due to the difficulty of identifying parameters truly critical for reasoning. In this work, we state that weights with the largest magnitude after low-rank approximation are critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline on LLM fine-tuning, it becomes highly effective after rank reduction. These insights motivate our method: Low-rank Informed Sparse Fine-Tuning (LIFT). LIFT only updates the top 5% Principal Weights throughout training and consistently achieves better performance on reasoning tasks than Full FT, while maintaining memory efficiency on par with popular parameter-efficient fine-tuning methods. In addition to strong performance on target domains such as arithmetic reasoning, LIFT also retains up to 20% more source-domain knowledge, compared to Full FT and LoRA. Our code is available at: https://github.com/zihanghliu/LIFT.
中文: 近期研究提出低秩信息稀疏微调方法LIFT,通过低秩近似识别并仅更新前5%的主权重,在推理任务中表现优于全参数微调,同时保持高内存效率并显著保留源领域知识。
English: Recent research introduces Low-rank Informed Sparse Fine-Tuning (LIFT), a method that updates only the top 5% of principal weights identified through low-rank approximation, achieving superior reasoning performance and memory efficiency compared to full fine-tuning while better preserving source-domain knowledge.
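The principal-weight selection lends itself to a short sketch: low-rank approximate a weight matrix, then keep the top entries by magnitude as the trainable set. This PyTorch snippet is our reading of the idea, with illustrative rank and sparsity values, not the LIFT reference implementation.

```python
import torch

def principal_weight_mask(W: torch.Tensor, rank: int = 16, keep: float = 0.05):
    """Return a boolean mask over W selecting the largest-magnitude entries
    of its rank-`rank` approximation (the 'principal weights')."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_lr = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    k = max(1, int(keep * W.numel()))
    thresh = W_lr.abs().flatten().topk(k).values.min()
    return W_lr.abs() >= thresh

# During fine-tuning one would zero gradients outside the mask, e.g.
#   W.grad.mul_(mask)  # after loss.backward(), before optimizer.step()
```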
Authors:Zitao Chen, Yinjun Jia, Zitong Tian, Wei-Ying Ma, Yanyan Lan
Abstract:
Medicinal chemists often optimize drugs considering their 3D structures and designing structurally distinct molecules that retain key features, such as shapes, pharmacophores, or chemical properties. Previous deep learning approaches address this through supervised tasks like molecule inpainting or property-guided optimization. In this work, we propose a flexible zero-shot molecule manipulation method by navigating in a shared latent space of 3D molecules. We introduce a Variational AutoEncoder (VAE) for 3D molecules, named MolFLAE, which learns a fixed-dimensional, E(3)-equivariant latent space independent of atom counts. MolFLAE encodes 3D molecules using an E(3)-equivariant neural network into a fixed number of latent nodes, distinguished by learned embeddings. The latent space is regularized, and molecular structures are reconstructed via a Bayesian Flow Network (BFN) conditioned on the encoder's latent output. MolFLAE achieves competitive performance on standard unconditional 3D molecule generation benchmarks. Moreover, the latent space of MolFLAE enables zero-shot molecule manipulation, including atom number editing, structure reconstruction, and coordinated latent interpolation for both structure and properties. We further demonstrate our approach on a drug optimization task for the human glucocorticoid receptor, generating molecules with improved hydrophilicity while preserving key interactions, under computational evaluations. These results highlight the flexibility, robustness, and real-world utility of our method, opening new avenues for molecule editing and optimization.
中文: 本研究提出的MolFLAE方法通过变分自编码器构建等变潜在空间,实现无需重新训练的零样本分子编辑与性质优化,在药物优化任务中展现出实际应用价值。
English: This study introduces MolFLAE, a zero-shot 3D molecule manipulation method that uses a variational autoencoder to create an equivariant latent space, enabling structure editing and property optimization without retraining, as demonstrated in drug optimization tasks.
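Zero-shot manipulation by navigating a shared latent space reduces, in its simplest form, to interpolating between encodings. The sketch below assumes generic `encoder`/`decoder` callables that map molecules to and from latent tensors; it is not the MolFLAE API.

```python
import torch

def latent_interpolate(encoder, decoder, mol_a, mol_b, steps=5):
    """Decode evenly spaced points on the line between two latent codes."""
    za, zb = encoder(mol_a), encoder(mol_b)
    return [decoder((1.0 - t) * za + t * zb)
            for t in torch.linspace(0.0, 1.0, steps)]
```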
Authors:Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, Anush Sankaran, Jack W. Stokes, Tong Wang, Amir Abdi, William Blum, Muhammad Abdul-Mageed
Abstract:
Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.
Chinese: DefenderBench 是一个评估大语言模型在网络安全任务中表现的开源工具包,其中 Claude-3.7-sonnet 在基准测试中以 81.65 分获得最高分。
English: DefenderBench is an open-source toolkit for evaluating LLM agents in cybersecurity tasks, with Claude-3.7-sonnet achieving the highest score of 81.65 in benchmark tests.
Authors:Tianze Yang, Tyson Jordan, Ninghao Liu, Jin Sun
Abstract:
We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through a multimodal large language model assessment. Our analysis reveals significant patterns in semantic priors that influence inpainting success across object categories. We demonstrate three key tasks enabled by COinCO: (1) training context classifiers that effectively determine whether existing objects belong in their context; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique levels, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision and image forensics. Our code and data are at: https://github.com/YangTianze009/COinCO.
中文摘要:COinCO数据集通过系统替换图像中的对象创建了97,722张包含上下文一致与不一致场景的图像,为推进计算机视觉中的上下文感知理解和图像取证研究提供了重要基础。
English Summary: The COinCO dataset introduces 97,722 images with systematically replaced objects to study contextual coherence, enabling advancements in context-aware computer vision tasks and image forensics.
Authors:Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang
Abstract:
Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.
中文: QoQ-Med-7B/32B 是首个开放通用的临床基础模型,能够联合推理医学图像、时间序列信号和文本报告,并通过创新的领域感知相对策略优化(DRPO)显著提升诊断性能,有效应对临床数据分布不均的问题。
English: QoQ-Med-7B/32B is the first open generalist clinical foundation model that integrates reasoning across medical images, time-series signals, and text reports, utilizing a novel Domain-aware Relative Policy Optimization (DRPO) to enhance diagnostic performance and address data imbalance across clinical specialties.
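A hedged sketch of the reward-scaling intuition behind DRPO: normalize rewards within a batch, then upweight samples from rare domains. The hierarchical rule is the paper's; the power-law scaling below is an assumption for illustration only.

```python
import numpy as np

def scaled_rewards(rewards, domain_ids, domain_counts, alpha=0.5):
    """Batch-normalize rewards, then upweight samples from rare domains."""
    r = np.asarray(rewards, dtype=float)
    r = (r - r.mean()) / (r.std() + 1e-8)  # normalized advantage
    total = sum(domain_counts.values())
    scale = np.array([(total / domain_counts[d]) ** alpha for d in domain_ids])
    return r * scale / scale.mean()  # rare domains get larger weight
```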
Authors:Valter Hudovernik, Minkai Xu, Juntong Shi, Lovro Šubelj, Stefano Ermon, Erik Štrumbelj, Jure Leskovec
Abstract:
Real-world databases are predominantly relational, comprising multiple interlinked tables that contain complex structural and statistical dependencies. Learning generative models on relational data has shown great promise in generating synthetic data and imputing missing values. However, existing methods often struggle to capture this complexity, typically reducing relational data to conditionally generated flat tables and imposing limiting structural assumptions. To address these limitations, we introduce RelDiff, a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff combines a joint graph-conditioned diffusion process across all tables for attribute synthesis, and a 2K+SBM graph generator based on the Stochastic Block Model for structure generation. The decomposition of graph structure and relational attributes ensures both high fidelity and referential integrity, both of which are crucial aspects of synthetic relational database generation. Experiments on 11 benchmark datasets demonstrate that RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases. Code is available at https://github.com/ValterH/RelDiff.
中文摘要:RelDiff是一种新颖的扩散模型,通过显式建模外键图结构来合成完整的关系数据库,结合属性合成与图生成技术,在生成真实合成数据方面优于现有方法。
English Summary: RelDiff is a novel diffusion model that synthesizes complete relational databases by explicitly modeling foreign key graph structures, combining attribute synthesis with graph generation to outperform existing methods in producing realistic synthetic data.
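The structure-generation half can be illustrated with an off-the-shelf Stochastic Block Model sampler; this uses networkx's standard SBM generator, not the paper's 2K+SBM variant, and the block sizes and probabilities below are made up.

```python
import networkx as nx

sizes = [40, 30, 30]                 # e.g., rows per table/block
probs = [[0.20, 0.03, 0.01],         # block-to-block link probabilities
         [0.03, 0.15, 0.02],
         [0.01, 0.02, 0.25]]
G = nx.stochastic_block_model(sizes, probs, seed=0)
print(G.number_of_nodes(), G.number_of_edges())
```

Sampling structure separately from attributes is what lets the method guarantee referential integrity by construction.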
Authors:Tianze Yang, Yucheng Shi, Mengnan Du, Xuansheng Wu, Qiaoyu Tan, Jin Sun, Ninghao Liu
Abstract:
Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs -- the codebook of discrete tokens -- is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing the transparency of VQGMs, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at https://github.com/YangTianze009/CORTEX.
Chinese: 本文提出CORTEX框架,通过样本级和码本级的解释方法识别概念相关的令牌组合来解读矢量量化生成模型,有效提升模型透明度并支持图像编辑和特征检测等应用。
English: This paper introduces CORTEX, a novel framework that interprets Vector-Quantized Generative Models by identifying concept-specific token combinations through sample-level and codebook-level methods, enhancing model transparency and enabling applications like image editing and feature detection.
Authors:Yunguan Fu, Wenjia Bai, Weixi Yi, Charlotte Manisty, Anish N Bhuva, Thomas A Treibel, James C Moon, Matthew J Clarkson, Rhodri Huw Davies, Yipeng Hu
Abstract:
Here we present a versatile foundation model that can perform a range of clinically-relevant image analysis tasks, including segmentation, landmark localisation, diagnosis, and prognostication. A multi-view convolution-transformer masked autoencoder, named CineMA, was trained on 15 million cine images from 74,916 subjects. The model was validated on multiple image analysis tasks and compared to existing models on >4,500 images from eight independent datasets with diverse population characteristics, representing the largest benchmark study for cine CMR so far. CineMA consistently outperformed conventional convolutional neural networks (CNNs) in delineating ventricular boundaries and estimating ejection fraction, a key measure of cardiac function. The improved performance was preserved, even when the model only used half of the fine-tuning data. CineMA also surpassed CNNs in disease detection and matched their performance in long-axis function measurement. Interestingly, we found that CineMA can also detect cardiac changes in systemic diseases, such as diabetes, hypertension and cancer, and can also predict mortality. Finally, we assessed model fairness and demonstrated consistent model performance across demographic subgroups. These findings highlight CineMA's accuracy, learning efficiency, adaptability, and fairness, underscoring its potential as a foundation model for automated cardiac image analysis to support clinical workflow and cardiovascular research. All training and inference code and models are made publicly available at https://github.com/mathpluscode/CineMA.
中文: CineMA作为一种基于1500万电影图像训练的多视角卷积-变换器掩码自编码器,在心脏图像分割、疾病检测和死亡率预测等任务中持续优于传统卷积神经网络,同时展现出卓越的学习效率和在人口统计学群体间的公平性。
English: CineMA, a multi-view convolution-transformer masked autoencoder trained on 15 million cine images, consistently outperforms conventional CNNs in cardiac image analysis tasks including segmentation, disease detection, and mortality prediction while demonstrating high learning efficiency and fairness across demographic groups.
Authors:Saad Hossain, Samanvay Vajpayee, Sirisha Rambhatla
Abstract:
As large language models (LLMs) become ubiquitous, parameter-efficient fine-tuning methods and safety-first defenses have proliferated rapidly. However, the rapid growth in the number of approaches has resulted in diverse evaluations (varied datasets, metrics, and inconsistent threat settings), making it difficult to fairly compare safety, utility, and robustness across methods. To address this, we introduce SafeTuneBed, a benchmark and toolkit unifying fine-tuning and defense evaluation. SafeTuneBed (i) curates a diverse repository of multiple fine-tuning datasets spanning sentiment analysis, question-answering, multi-step reasoning, and open-ended instruction tasks, and allows for the generation of harmful-variant splits; (ii) enables integration of state-of-the-art defenses, including alignment-stage immunization, in-training safeguards, and post-tuning repair; and (iii) provides evaluators for safety (attack success rate, refusal consistency) and utility. Built on Python-first, dataclass-driven configs and plugins, SafeTuneBed requires minimal additional code to specify any fine-tuning regime, defense method, and metric suite, while ensuring end-to-end reproducibility. We showcase its value by benchmarking representative defenses across varied poisoning scenarios and tasks. By standardizing data, code, and metrics, SafeTuneBed is the first focused toolkit of its kind to accelerate rigorous and comparable research in safe LLM fine-tuning. Code is available at: https://github.com/criticalml-uw/SafeTuneBed.
Chinese: 针对大语言模型参数高效微调与安全防御方法评估标准不统一的问题,SafeTuneBed工具包通过整合多样化数据集、集成先进防御机制、提供标准化评估指标,建立了首个专注于安全微调的可复现基准框架,有力推动该领域研究的规范化和可比性。
English: The proliferation of diverse evaluation methods for parameter-efficient fine-tuning and safety defenses in large language models has created challenges in fair comparison, prompting the introduction of SafeTuneBed—a unified benchmark and toolkit that standardizes datasets, defenses, and metrics to enable rigorous and reproducible safety research.
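A sketch of what a dataclass-driven config in this style might look like; the field names and defaults below are hypothetical, not SafeTuneBed's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class FinetuneConfig:
    """Hypothetical config: one object fully specifies a fine-tuning run."""
    model_name: str = "meta-llama/Llama-3.1-8B"
    dataset: str = "sst2"
    poison_rate: float = 0.05          # fraction of harmful-variant samples
    defense: str = "alignment_stage"   # or "in_training", "post_tuning"
    metrics: list = field(default_factory=lambda: [
        "attack_success_rate", "refusal_consistency"])
```

Declaring runs as plain dataclasses keeps every experiment serializable and diffable, which is what underwrites the toolkit's reproducibility claim.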
Authors:Yu Huang, Junhao Chen, Shuliang Liu, Hanqian Li, Qi Zheng, Yi R. Fung, Xuming Hu
Abstract:
The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to effectively balance the trade-off between video quality and watermark extraction. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models, which enables implicit and adaptive watermark integration during generation. Specifically, we achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios. Our code is available at https://github.com/hardenyu21/Video-Signature.
Chinese: VIDSIG提出了一种针对潜在视频扩散模型的生成内水印方法,在视频生成过程中自适应地嵌入水印,在提取精度、视觉质量和效率方面表现优异,同时具备强大的抗篡改鲁棒性。
English: VIDSIG introduces an in-generation watermarking method for latent video diffusion models that integrates watermarks adaptively during video creation, achieving superior performance in extraction accuracy, visual quality, and efficiency while maintaining robustness against tampering.
Authors:Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
Abstract:
Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric CT remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages. Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance. Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation. Code at https://github.com/cosbidev/Text2CT.
中文摘要:本研究提出了一种结合三维对比视觉语言预训练与潜在扩散模型的新型文本到CT生成方法,在根据文本描述合成具有临床意义的CT体积方面展现出卓越性能。
English Summary: This study introduces a novel text-to-CT generation method combining 3D contrastive vision-language pretraining with latent diffusion models, demonstrating superior performance in synthesizing clinically relevant CT volumes from text descriptions.
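The dual-encoder alignment stage follows the standard CLIP-style symmetric contrastive objective, sketched below for paired CT-volume and report embeddings; this is the textbook InfoNCE formulation, assumed rather than taken from the Text2CT code.

```python
import torch
import torch.nn.functional as F

def clip_loss(ct_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE over a batch of paired CT/report embeddings."""
    ct = F.normalize(ct_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = ct @ txt.T / tau                         # pairwise similarities
    labels = torch.arange(len(ct), device=ct.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```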
Authors:Ruiming Min, Minghao Liu
Abstract:
With the advancement of modern medicine and the development of technologies such as MRI, CT, and cellular analysis, it has become increasingly critical for clinicians to accurately interpret various diagnostic images. However, modern medical education often faces challenges due to limited access to high-quality teaching materials, stemming from privacy concerns and a shortage of educational resources (Balogh et al., 2015). In this context, image data generated by machine learning models, particularly generative models, presents a promising solution. These models can create diverse and comparable imaging datasets without compromising patient privacy, thereby supporting modern medical education. In this study, we explore the use of convolutional neural networks (CNNs) and CycleGAN (Zhu et al., 2017) for generating synthetic medical images. The source code is available at https://github.com/mliuby/COMP4211-Project.
中文: 机器学习的卷积神经网络和CycleGAN等模型为生成合成医学图像提供了可行方案,既解决了医学教育资源匮乏的问题,又有效保护了患者隐私。
English: Machine learning models like CNNs and CycleGAN offer a promising solution to generate synthetic medical images, addressing the shortage of educational resources while preserving patient privacy in modern medical education.
Authors:Yufa Zhou, Shaobo Wang, Xingyu Dong, Xiangqi Jin, Yifang Chen, Yue Min, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
Abstract:
Directly training Large Language Models (LLMs) for Multi-Agent Systems (MAS) remains challenging due to intricate reward modeling, dynamic agent interactions, and demanding generalization requirements. This paper explores whether post-training techniques, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), can effectively generalize to multi-agent scenarios. We use economic reasoning as a testbed, leveraging its strong foundations in mathematics and game theory, its demand for structured analytical reasoning, and its relevance to real-world applications such as market design, resource allocation, and policy analysis. We introduce Recon (Reasoning like an ECONomist), a 7B-parameter open-source LLM post-trained on a hand-curated dataset of 2,100 high-quality economic reasoning problems. Comprehensive evaluation on economic reasoning benchmarks and multi-agent games reveals clear improvements in structured reasoning and economic rationality. These results underscore the promise of domain-aligned post-training for enhancing reasoning and agent alignment, shedding light on the roles of SFT and RL in shaping model behavior. Code is available at https://github.com/MasterZhou1/Recon.
中文: 本文介绍了Recon,一个通过监督微调和可验证奖励强化学习在经济学推理问题上进行后训练的70亿参数大语言模型,在多智能体场景中展现出增强的结构化推理和经济理性能力。
English: This paper introduces Recon, a 7B-parameter LLM post-trained using SFT and RLVR techniques on economic reasoning problems, demonstrating improved structured reasoning and economic rationality in multi-agent scenarios.
Authors:Yule Zhu, Ping Liu, Zhedong Zheng, Wei Liu
Abstract:
Diffusion models have recently enabled precise and photorealistic facial editing across a wide range of semantic attributes. Beyond single-step modifications, a growing class of applications now demands the ability to analyze and track sequences of progressive edits, such as stepwise changes to hair, makeup, or accessories. However, sequential editing introduces significant challenges in edit attribution and detection robustness, further complicated by the lack of large-scale, finely annotated benchmarks tailored explicitly for this task. We introduce SEED, a large-scale Sequentially Edited facE Dataset constructed via state-of-the-art diffusion models. SEED contains over 90,000 facial images with one to four sequential attribute modifications, generated using diverse diffusion-based editing pipelines (LEdits, SDXL, SD3). Each image is annotated with detailed edit sequences, attribute masks, and prompts, facilitating research on sequential edit tracking, visual provenance analysis, and manipulation robustness assessment. To benchmark this task, we propose FAITH, a frequency-aware transformer-based model that incorporates high-frequency cues to enhance sensitivity to subtle sequential changes. Comprehensive experiments, including systematic comparisons of multiple frequency-domain methods, demonstrate the effectiveness of FAITH and the unique challenges posed by SEED. SEED offers a challenging and flexible resource for studying progressive diffusion-based edits at scale. Dataset and code will be publicly released at: https://github.com/Zeus1037/SEED.
中文: 扩散模型实现了精确的面部编辑,但追踪连续编辑仍具挑战,为此开发了SEED数据集和FAITH模型,以推动该领域研究。
English: Diffusion models enable precise facial editing, but tracking sequential edits remains challenging, leading to the creation of the SEED dataset and FAITH model to advance research in this area.
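A frequency-aware detector like FAITH consumes high-frequency cues; one common way to obtain such cues is a high-pass residual in the Fourier domain, sketched below with an arbitrary cutoff. This is a generic illustration, not the paper's module.

```python
import numpy as np

def highpass_residual(img: np.ndarray, cutoff: int = 8) -> np.ndarray:
    """Zero out the low-frequency band of a 2D grayscale image and return
    the remaining high-frequency residual."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    f[cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))
```

Subtle inpainting edits tend to leave traces in exactly this residual, which is why frequency cues help with sequential edit detection.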
Authors:Ming Wang, Peidong Wang, Lin Wu, Xiaocui Yang, Daling Wang, Shi Feng, Yuxin Chen, Bixuan Wang, Yifei Zhang
Abstract:
Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health research, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers' mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose AnnaAgent, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator's configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found at https://github.com/sci-m-wang/AnnaAgent.
中文:研究人员开发了AnnaAgent,这是一种具备动态情绪控制和多会话记忆功能的先进对话代理,通过整合真实心理咨询数据克服了模拟心理健康求助者的局限性,并在评估中展现出更优越的性能。
English: Researchers have developed AnnaAgent, an advanced conversational agent with dynamic emotional control and multi-session memory, to overcome limitations in simulating realistic mental health seekers by incorporating real counseling data and achieving superior performance in evaluations.
Authors:Hyangsuk Min, Yuho Lee, Minjeong Ban, Jiaqi Deng, Nicole Hee-Yeon Kim, Taewon Yun, Hang Su, Jason Cai, Hwanjun Song
Abstract:
Evaluation frameworks for text summarization have evolved in terms of both domain coverage and metrics. However, existing benchmarks still lack domain-specific assessment criteria, remain predominantly English-centric, and face challenges with human annotation due to the complexity of reasoning. To address these, we introduce MSumBench, which provides a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. It also incorporates specialized assessment criteria for each domain and leverages a multi-agent debate system to enhance annotation quality. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages. We further examine large language models as summary evaluators, analyzing the correlation between their evaluation and summarization capabilities, and uncovering systematic bias in their assessment of self-generated summaries. Our benchmark dataset is publicly available at https://github.com/DISL-Lab/MSumBench.
中文:该摘要介绍了MSumBench,一个针对英文和中文文本摘要的多维、多领域评估基准,它通过引入专业评估标准和多智能体辩论系统解决了领域特定标准缺失和人工标注的难题,揭示了不同模型的表现模式以及大型语言模型作为评估者时的系统性偏见。
English: This abstract introduces MSumBench, a multi-dimensional and multi-domain evaluation benchmark for text summarization in English and Chinese, which addresses gaps in domain-specific criteria and human annotation challenges by incorporating specialized assessments and a multi-agent debate system, revealing distinct model performance patterns and biases in large language models as evaluators.
Authors:Changyue Wang, Weihang Su, Qingyao Ai, Yujia Zhou, Yiqun Liu
Abstract:
Knowledge editing aims to efficiently update Large Language Models (LLMs) by modifying specific knowledge without retraining the entire model. Among knowledge editing approaches, in-context editing (ICE) offers a lightweight solution by injecting new knowledge directly into the input context, leaving model parameters unchanged. However, existing ICE approaches do not explicitly separate the newly injected knowledge from the model's original reasoning process. This entanglement often results in conflicts between external updates and internal parametric knowledge, undermining the consistency and accuracy of the reasoning path. In this work, we conduct preliminary experiments to examine how parametric knowledge influences reasoning path planning. We find that the model's reasoning is tightly coupled with its internal knowledge, and that naively injecting new information without adapting the reasoning path often leads to performance degradation, particularly in multi-hop tasks. To this end, we propose DecKER, a novel ICE framework that decouples reasoning from knowledge editing by generating a masked reasoning path and then resolving knowledge edits via hybrid retrieval and model-based validation. Experiments on multi-hop QA benchmarks show that DecKER significantly outperforms existing ICE methods by mitigating knowledge conflicts and preserving reasoning consistency. Our code is available at: https://github.com/bebr2/DecKER.
中文: 知识编辑中的上下文编辑方法存在推理与知识纠缠的问题,DecKER通过解耦推理路径并采用混合验证,有效缓解知识冲突,在多跳任务中显著提升了推理一致性和准确性。
English: Knowledge editing in LLMs, particularly in-context editing (ICE), faces challenges from entangled reasoning and knowledge, which DecKER addresses by decoupling reasoning paths and using hybrid validation to enhance consistency and accuracy in multi-hop tasks.
Authors:Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang
Abstract:
The choice of a suitable visual language projector (VLP) is critical to the successful training of large visual language models (LVLMs). Mainstream VLPs can be broadly categorized into compressed and uncompressed projectors, and each offers distinct advantages in performance and computational efficiency. However, their security implications have not been thoroughly examined. Our comprehensive evaluation reveals significant differences in their security profiles: compressed projectors exhibit substantial vulnerabilities, allowing adversaries to successfully compromise LVLMs even with minimal knowledge of structure information. In stark contrast, uncompressed projectors demonstrate robust security properties and do not introduce additional vulnerabilities. These findings provide critical guidance for researchers in selecting optimal VLPs that enhance the security and reliability of visual language models. The code is available at https://github.com/btzyd/TCP.
中文: 研究发现,压缩型视觉语言投影器会给大型视觉语言模型带来严重安全漏洞,而非压缩型投影器具备稳健安全性,为选择安全的投影器提供了关键指导。
English: The study finds that compressed visual language projectors (VLPs) introduce significant security vulnerabilities in large visual language models (LVLMs), while uncompressed projectors maintain robust security, offering crucial guidance for selecting secure VLPs.
Authors:Tianhui Liu, Jie Feng, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Yong Li
Abstract:
Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of large language-vision models (LLVMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize three evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LLVMs across these tasks. Our results reveal that while LLVMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LLVMs to understand and predict urban socioeconomic patterns. Our codes and datasets are open-sourced via https://github.com/tsinghua-fib-lab/CityLens.
中文摘要:CityLens是一个通过卫星和街景图像评估大语言视觉模型预测城市社会经济指标能力的综合基准,既揭示了现有模型的局限性,也为未来发展提供了统一框架。
English Summary: CityLens is a comprehensive benchmark that evaluates large language-vision models' ability to predict urban socioeconomic indicators from visual data, revealing their current limitations while providing a framework for future improvements.
Authors:Runtao Ren, Jian Ma, Jianxi Luo
Abstract:
Retrieval-Augmented Generation (RAG) systems in the Intellectual Property (IP) field often struggle with diverse user queries, including colloquial expressions, spelling errors, and ambiguous terminology, leading to inaccurate retrieval and suboptimal responses. To address this challenge, we propose the Multi-Angle Question Generation and Retrieval Fine-Tuning Method (MQG-RFM), a novel framework that leverages large language models (LLMs) to simulate varied user inquiries and fine-tunes retrieval models to align semantically equivalent but linguistically diverse questions. Unlike complex architectural modifications, MQG-RFM adopts a lightweight Data-to-Tune paradigm, combining prompt-engineered query generation with hard negative mining to enhance retrieval robustness without costly infrastructure changes. Experimental results on a Taiwan patent Q&A dataset show a 185.62% improvement in retrieval accuracy on the Patent Consultation dataset and a 262.26% improvement on the Novel Patent Technology Report dataset, with 14.22% and 53.58% improvements in generation quality over the baselines, respectively. By bridging the gap between user intent and system comprehension through semantic-aware retrieval optimization, MQG-RFM offers a practical, scalable approach for rapid, cost-effective deployment among small and medium-sized agencies seeking reliable patent intelligence solutions. Additionally, our proposed method has already been adopted by ScholarMate, the largest professional research social networking platform in China, to support real-world development and deployment. A demo version of the instantiated system is available at https://github.com/renruntao/patent_rag.
Chinese: MQG-RFM框架通过利用大语言模型模拟多样化用户查询并微调检索模型,显著提升了知识产权领域检索增强生成的准确性和生成质量,无需复杂架构改动即可实现高效部署。
English: The MQG-RFM framework enhances retrieval-augmented generation in intellectual property by using large language models to simulate diverse user queries and fine-tune retrieval models, achieving significant improvements in accuracy and generation quality without complex architectural changes.
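Retrieval fine-tuning on generated query variants with mined hard negatives can be sketched with the public sentence-transformers API; the example triplet and model choice below are placeholders, not the paper's data or configuration.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [
    # (LLM-generated query variant, gold passage, mined hard negative)
    InputExample(texts=["how do i patent a gadget?",
                        "Utility patents protect new inventions...",
                        "Trademarks protect brand names and logos..."]),
    # ... many more generated triplets
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch + explicit negatives
model.fit(train_objectives=[(loader, loss)], epochs=1)
```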
Authors:Xingtong Ge, Xin Zhang, Tongda Xu, Yi Zhang, Xinjie Zhang, Yan Wang, Jun Zhang
Abstract:
Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion (SD) 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models, such as SD 3.5 and FLUX. In this paper, we first analyze the issues that arise when applying vanilla DMD to large-scale models. Then, to overcome the scalability challenge, we propose implicit distribution alignment (IDA) to regularize the distance between the generator and the fake distribution. Furthermore, we propose intra-segment guidance (ISG) to relocate the timestep importance distribution from the teacher model. With IDA alone, DMD converges for SD 3.5; employing both IDA and ISG, DMD converges for SD 3.5 and FLUX.1 dev. Along with other improvements such as scaled-up discriminator models, our final model, dubbed SenseFlow, achieves superior performance in distillation for both diffusion-based text-to-image models such as SDXL, and flow-matching models such as SD 3.5 Large and FLUX. The source code will be available at https://github.com/XingtongGe/SenseFlow.
中文: 本文提出SenseFlow,通过引入隐式分布对齐(IDA)和段内引导(ISG)来改进分布匹配蒸馏(DMD),解决了在SD 3.5和FLUX等大规模文本到图像模型中的收敛难题,实现了卓越的蒸馏性能。
English: The paper introduces SenseFlow, which enhances Distribution Matching Distillation (DMD) with implicit distribution alignment (IDA) and intra-segment guidance (ISG) to overcome convergence issues in large-scale text-to-image models like SD 3.5 and FLUX, achieving superior performance.
Authors:Yuxi Sun, Aoqi Zuo, Wei Gao, Jing Ma
Abstract:
Large Language Models (LLMs) often exhibit knowledge disparities across languages. Encouraging LLMs to abstain when faced with knowledge gaps is a promising strategy to reduce hallucinations in multilingual settings. Current abstention strategies for multilingual scenarios primarily rely on generating feedback in various languages using LLMs and performing self-reflection. However, these methods can be adversely impacted by inaccuracies and biases in the generated feedback. To address this, from a causal perspective, we introduce CausalAbstain, a method that helps LLMs determine whether to utilize multiple generated feedback responses and how to identify the most useful ones. Extensive experiments demonstrate that CausalAbstain effectively selects helpful feedback and enhances abstention decisions with interpretability in both native language (Causal-native) and multilingual (Causal-multi) settings, outperforming strong baselines on two benchmark datasets covering encyclopedic and commonsense knowledge QA tasks. Our code and data are open-sourced at https://github.com/peachch/CausalAbstain.
Chinese: 为解决多语言大语言模型中的知识差异导致的幻觉问题,CausalAbstain方法通过因果分析筛选有效反馈以优化弃权决策,在单语和多语言场景下均展现出优于基准方法的性能表现。
English: To mitigate hallucinations caused by knowledge gaps in multilingual Large Language Models, the proposed CausalAbstain method employs causal analysis to select useful feedback for improving abstention decisions, demonstrating superior performance in both native and multilingual settings.
Authors:Zherui Li, Yan Mi, Zhenhong Zhou, Houcheng Jiang, Guibin Zhang, Kun Wang, Junfeng Fang
Abstract:
Large Language Model-based Multi-Agent Systems (MASs) have demonstrated strong advantages in addressing complex real-world tasks. However, due to the introduction of additional attack surfaces, MASs are particularly vulnerable to misinformation injection. To facilitate a deeper understanding of misinformation propagation dynamics within these systems, we introduce MisinfoTask, a novel dataset featuring complex, realistic tasks designed to evaluate MAS robustness against such threats. Building upon this, we propose ARGUS, a two-stage, training-free defense framework leveraging goal-aware reasoning for precise misinformation rectification within information flows. Our experiments demonstrate that in challenging misinformation scenarios, ARGUS exhibits significant efficacy across various injection attacks, achieving an average reduction in misinformation toxicity of approximately 28.17% and improving task success rates under attack by approximately 10.33%. Our code and dataset are available at: https://github.com/zhrli324/ARGUS.
中文:基于大语言模型的多智能体系统易受虚假信息攻击,而提出的ARGUS防御框架能显著降低28.17%的毒性并提升10.33%的任务成功率。
English: Large Language Model-based Multi-Agent Systems are vulnerable to misinformation, but the proposed ARGUS defense framework effectively reduces toxicity by 28.17% and improves task success rates by 10.33%.
Authors:Dohyun Lee, Seungil Chad Lee, Chanwoo Yang, Yujin Baek, Jaegul Choo
Abstract:
Large language models (LLMs) have demonstrated strong performance across various tasks, leveraging their exceptional in-context learning ability with only a few examples. Accordingly, the selection of optimal in-context examples has been actively studied in the field of machine translation. However, these studies presuppose the presence of a demonstration pool with human-annotated pairs, making them less applicable to low-resource languages where such an assumption is challenging to meet. To overcome this limitation, this paper explores the research direction of in-context example generation for machine translation. Specifically, we propose Demonstration Augmentation for Translation (DAT), a simple yet effective approach that generates example pairs without relying on any external resources. This method builds upon two prior criteria, relevance and diversity, which have been highlighted in previous work as key factors for in-context example selection. Through experiments and analysis on low-resource languages where human-annotated pairs are scarce, we show that DAT achieves superior translation quality compared to the baselines. Furthermore, we investigate the potential of progressively accumulating generated pairs during test time to build and reuse a demonstration pool. Our implementation is publicly available at https://github.com/aiclaudev/DAT.
中文: 大语言模型在机器翻译中通过上下文学习表现出色,但其依赖人工标注示例对的特性限制了在低资源语言中的应用,因此本研究提出DAT方法,无需外部资源即可生成相关且多样的示例对,从而提升翻译质量。
English: Large language models excel in machine translation through in-context learning, but their reliance on human-annotated example pairs limits their use for low-resource languages, prompting this study to propose DAT, a method that generates relevant and diverse examples without external resources to enhance translation quality.
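A minimal sketch of the generate-then-prompt loop: ask an LLM for relevant, diverse source-target pairs, then assemble them into a few-shot translation prompt. `call_llm` and the pair format are hypothetical; DAT's actual prompting may differ.

```python
def build_dat_prompt(call_llm, src_sentence, src_lang, tgt_lang, n=3):
    """Generate n demonstration pairs with an LLM, then build a few-shot prompt."""
    gen = call_llm(
        f"Write {n} diverse {src_lang} sentences related to: '{src_sentence}', "
        f"each followed by its {tgt_lang} translation, one pair per line, "
        f"formatted as: source ||| translation"
    )
    demos = [line.split("|||", 1) for line in gen.splitlines() if "|||" in line]
    shots = "\n".join(f"{s.strip()} => {t.strip()}" for s, t in demos)
    return f"{shots}\n{src_sentence} =>"
```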
Authors:Junseo Kim, Jongwook Han, Dongmin Choi, Jongwook Yoon, Eun-Ju Lee, Yohan Jo
Abstract:
Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.
中文摘要:本研究发布个性化视觉说服(PVP)数据集,填补图像说服力与评估者个人信息关联数据的空白,实验证明结合心理特征能有效提升人工智能生成说服性图像的效果。
English Summary: The study introduces the Personalized Visual Persuasion (PVP) dataset to address the lack of comprehensive data linking image persuasiveness with evaluators' personal information, demonstrating that incorporating psychological traits improves AI-generated persuasive images.
Authors:Leila Mahmoodi, Peyman Moghadam, Munawar Hayat, Christian Simon, Mehrtash Harandi
Abstract:
We introduce Flashback Learning (FL), a novel method designed to harmonize the stability and plasticity of models in Continual Learning (CL). Unlike prior approaches that primarily focus on regularizing model updates to preserve old information while learning new concepts, FL explicitly balances this trade-off through a bidirectional form of regularization. This approach effectively guides the model to swiftly incorporate new knowledge while actively retaining its old knowledge. FL operates through a two-phase training process and can be seamlessly integrated into various CL methods, including replay, parameter regularization, distillation, and dynamic architecture techniques. In designing FL, we use two distinct knowledge bases: one to enhance plasticity and another to improve stability. FL ensures a more balanced model by utilizing both knowledge bases to regularize model updates. Theoretically, we analyze how the FL mechanism enhances the stability-plasticity balance. Empirically, FL demonstrates tangible improvements over baseline methods within the same training budget. By integrating FL into at least one representative baseline from each CL category, we observed an average accuracy improvement of up to 4.91% in Class-Incremental and 3.51% in Task-Incremental settings on standard image classification benchmarks. Additionally, measurements of the stability-to-plasticity ratio confirm that FL effectively enhances this balance. FL also outperforms state-of-the-art CL methods on more challenging datasets like ImageNet.
中文摘要:Flashback Learning(FL)是一种新颖的持续学习方法,通过双向正则化有效平衡模型的稳定性与可塑性,在多个基准测试中显著提升准确率并优化了稳定性与可塑性的平衡。
English Summary: Flashback Learning (FL) is a novel continual learning method that balances stability and plasticity through bidirectional regularization, achieving significant accuracy improvements and better stability-plasticity balance across various benchmarks.
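The bidirectional regularization can be sketched as a three-term objective: a task loss plus two distillation pulls, one toward a stability knowledge base (e.g., the old model) and one toward a plasticity knowledge base (e.g., a new-task expert). The PyTorch snippet below is our toy rendering with illustrative weights, not the paper's formulation.

```python
import torch.nn.functional as F

def flashback_loss(logits, targets, stable_logits, plastic_logits,
                   lam_s=0.5, lam_p=0.5, T=2.0):
    """Task loss plus opposing distillation terms toward a stability
    knowledge base and a plasticity knowledge base."""
    task = F.cross_entropy(logits, targets)
    logp = F.log_softmax(logits / T, dim=-1)
    stable = F.kl_div(logp, F.softmax(stable_logits / T, dim=-1),
                      reduction="batchmean")
    plastic = F.kl_div(logp, F.softmax(plastic_logits / T, dim=-1),
                       reduction="batchmean")
    return task + lam_s * stable + lam_p * plastic
```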
Authors:Cunhang Fan, Ying Chen, Jian Zhou, Zexu Pan, Jingjing Zhang, Youdian Gao, Xiaoke Yang, Zhengqi Wen, Zhao Lv
Abstract:
Brain-assisted target speaker extraction (TSE) aims to extract the attended speech from mixed speech by utilizing brain neural activities, for example, electroencephalography (EEG). However, existing models overlook the issue of temporal misalignment between the speech and EEG modalities, which hampers TSE performance. In addition, the speech encoder in current models typically uses basic temporal operations (e.g., one-dimensional convolution), which are unable to effectively extract target speaker information. To address these issues, this paper proposes a multi-scale and multi-modal alignment network (M3ANet) for brain-assisted TSE. Specifically, to eliminate the temporal inconsistency between the EEG and speech modalities, a modal alignment module that uses a contrastive learning strategy is applied to align the temporal features of both modalities. Additionally, to fully extract speech information, multi-scale convolutions with GroupMamba modules are used as the speech encoder, which scans speech features at each scale from different directions, enabling the model to capture deep sequence information. Experimental results on three publicly available datasets show that the proposed model outperforms current state-of-the-art methods across various evaluation metrics, highlighting the effectiveness of our proposed method. The source code is available at: https://github.com/fchest/M3ANet.
中文: 本文提出M3ANet多尺度多模态对齐网络,通过对比学习解决脑电与语音的时序失配问题,并采用集成GroupMamba模块的多尺度卷积充分提取语音特征,在脑辅助目标说话人提取任务中实现了最优性能。
English: This paper introduces M3ANet, a novel multi-scale and multi-modal alignment network that addresses temporal misalignment between EEG and speech through contrastive learning and enhances speech feature extraction using multi-scale convolutions with GroupMamba, achieving state-of-the-art performance in brain-assisted target speaker extraction.
Authors:Ioan-Paul Ciobanu, Andrei-Iulian Hiji, Nicolae-Catalin Ristea, Paul Irofti, Cristian Rusu, Radu Tudor Ionescu
Abstract:
Recent advances in audio generation have led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To this end, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across the training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested "in the wild". Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad-bench/.
中文:当前音频伪造检测器在受控环境中准确率接近完美,但在跨领域场景下表现不佳,凸显了开发能跨语言、说话人和生成方法泛化的鲁棒模型的必要性。
English: Recent audio deepfake detectors achieve near-perfect accuracy in controlled settings but perform poorly in cross-domain scenarios, highlighting the need for more robust models that generalize across languages, speakers, and generation methods.
Authors:Junwoo Park, Hyuck Lee, Dohyun Lee, Daehoon Gwak, Jaegul Choo
Abstract:
Large Language Models (LLMs) have shown remarkable performance across diverse tasks without domain-specific training, fueling interest in their potential for time-series forecasting. While LLMs have shown potential in zero-shot forecasting through prompting alone, recent studies suggest that LLMs lack inherent effectiveness in forecasting. Given these conflicting findings, a rigorous validation is essential for drawing reliable conclusions. In this paper, we evaluate the effectiveness of LLMs as zero-shot forecasters compared to state-of-the-art domain-specific models. Our experiments show that LLM-based zero-shot forecasters often struggle to achieve high accuracy due to their sensitivity to noise, underperforming even simple domain-specific models. We have explored solutions to reduce LLMs' sensitivity to noise in the zero-shot setting, but improving their robustness remains a significant challenge. Our findings suggest that rather than emphasizing zero-shot forecasting, a more promising direction would be to focus on fine-tuning LLMs to better process numerical sequences. Our experimental code is available at https://github.com/junwoopark92/revisiting-LLMs-zeroshot-forecaster.
Chinese: 大语言模型在零样本时间序列预测中因对噪声敏感而表现不佳,相比追求零样本能力,对其进行微调以更好地处理数值序列是更有前景的方向。
English: Large Language Models (LLMs) underperform in zero-shot time-series forecasting due to noise sensitivity, and fine-tuning them for numerical data processing is a more viable approach than pursuing zero-shot capabilities.
Authors:Seohyun Park, Chitralekha Gupta, Michelle Kah Yian Kwan, Xinhui Fung, Alexander Wenjun Yip, Suranga Nanayakkara
Abstract:
Dysarthria, a motor speech disorder, affects intelligibility and requires targeted interventions for effective communication. In this work, we investigate automated mispronunciation feedback by collecting a dysarthric speech dataset from six speakers reading two passages, annotated by a speech therapist with temporal markers and mispronunciation descriptions. We design a three-stage framework for explainable mispronunciation evaluation: (1) overall clarity scoring, (2) mispronunciation localization, and (3) mispronunciation type classification. We systematically analyze pretrained Automatic Speech Recognition (ASR) models in each stage, assessing their effectiveness in dysarthric speech evaluation (Code available at: https://github.com/augmented-human-lab/interspeech25_speechtherapy, Supplementary webpage: https://apps.ahlab.org/interspeech25_speechtherapy/). Our findings offer clinically relevant insights for automating actionable feedback for pronunciation assessment, which could enable independent practice for patients and help therapists deliver more effective interventions.
中文: 本研究开发了一个三阶段框架用于构音障碍语音的自动评估,通过清晰度评分、发音错误定位和分类,利用预训练语音识别模型为言语治疗提供临床可行的反馈方案。
English: This study develops a three-stage framework for automated dysarthric speech evaluation, combining clarity scoring, mispronunciation localization, and classification using pretrained ASR models to provide clinically actionable feedback for speech therapy.
Authors:Hao Li, Hao Wan, Yuzhou Chen, Dongsheng Ye, Yulia Gel, Hao Jiang
Abstract:
Dynamic graphs evolve continuously, presenting challenges for traditional graph learning due to their changing structures and temporal dependencies. Recent advancements have shown potential in addressing these challenges by developing suitable meta-learning-based dynamic graph neural network models. However, most meta-learning approaches for dynamic graphs rely on fixed weight update parameters, neglecting the essential intrinsic complex high-order topological information of dynamically evolving graphs. We have designed Dowker Zigzag Persistence (DZP), an efficient and stable dynamic graph persistent homology representation method based on Dowker complex and zigzag persistence, to capture the high-order features of dynamic graphs. Armed with the DZP ideas, we propose TMetaNet, a new meta-learning parameter update model based on dynamic topological features. By utilizing the distances between high-order topological features, TMetaNet enables more effective adaptation across snapshots. Experiments on real-world datasets demonstrate TMetaNet's state-of-the-art performance and resilience to graph noise, illustrating its high potential for meta-learning and dynamic graph analysis. Our code is available at https://github.com/Lihaogx/TMetaNet.
Chinese: 本文提出TMetaNet,一种基于Dowker Zigzag持久性捕捉动态图高阶拓扑特征的新型元学习模型,实验证明其具有卓越性能和抗噪能力。
English: This paper introduces TMetaNet, a novel meta-learning model that leverages Dowker Zigzag Persistence to capture high-order topological features of dynamic graphs, achieving superior performance and noise resilience in experiments.
Authors:Suhas BN, Han-Chin Shing, Lei Xu, Mitch Strong, Jon Burnsky, Jessica Ofor, Jordan R. Mason, Susan Chen, Sundararajan Srinivasan, Chaitanya Shivade, Jack Moriarty, Joseph Paul Cohen
Abstract:
Hallucinations in large language models (LLMs) during summarization of patient-clinician dialogues pose significant risks to patient care and clinical decision-making. However, the phenomenon remains understudied in the clinical domain, with uncertainty surrounding the applicability of general-domain hallucination detectors. The rarity and randomness of hallucinations further complicate their investigation. In this paper, we conduct an evaluation of hallucination detection methods in the medical domain, and construct two datasets for the purpose: A fact-controlled Leave-N-out dataset -- generated by systematically removing facts from source dialogues to induce hallucinated content in summaries; and a natural hallucination dataset -- arising organically during LLM-based medical summarization. We show that general-domain detectors struggle to detect clinical hallucinations, and that performance on fact-controlled hallucinations does not reliably predict effectiveness on natural hallucinations. We then develop fact-based approaches that count hallucinations, offering explainability not available with existing methods. Notably, our LLM-based detectors, which we developed using fact-controlled hallucinations, generalize well to detecting real-world clinical hallucinations. This research contributes a suite of specialized metrics supported by expert-annotated datasets to advance faithful clinical summarization systems.
中文: 大语言模型在临床对话摘要中易产生幻觉,通用检测器效果不佳,为此开发了基于事实的专业化可解释检测方法,能有效识别真实医疗场景中的错误信息。
English: Large language models face challenges with hallucinations in clinical dialogue summarization, where general detectors prove inadequate, prompting the development of specialized, explainable metrics that effectively identify real-world medical inaccuracies.
Authors:Mehedi Ahamed, Radib Bin Kabir, Tawsif Tashwar Dipto, Mueeze Al Mushabbir, Sabbir Ahmed, Md. Hasanul Kabir
Abstract:
This study investigates the performance of few-shot learning (FSL) approaches in recognizing Bangla handwritten characters and numerals using limited labeled data. It demonstrates the applicability of these methods to scripts with intricate and complex structures, where dataset scarcity is a common challenge. Given the complexity of Bangla script, we hypothesize that models performing well on these characters can generalize effectively to languages of similar or lower structural complexity. To this end, we introduce SynergiProtoNet, a hybrid network designed to improve the recognition accuracy of handwritten characters and digits. The model integrates advanced clustering techniques with a robust embedding framework to capture fine-grained details and contextual nuances. It leverages multi-level (both high- and low-level) feature extraction within a prototypical learning framework. We rigorously benchmark SynergiProtoNet against several state-of-the-art few-shot learning models: BD-CSPN, Prototypical Network, Relation Network, Matching Network, and SimpleShot, across diverse evaluation settings including Monolingual Intra-Dataset Evaluation, Monolingual Inter-Dataset Evaluation, Cross-Lingual Transfer, and Split Digit Testing. Experimental results show that SynergiProtoNet consistently outperforms existing methods, establishing a new benchmark in few-shot learning for handwritten character and digit recognition. The code is available on GitHub: https://github.com/MehediAhamed/SynergiProtoNet.
中文摘要:本研究提出SynergiProtoNet混合小样本学习模型,通过结合聚类技术和多层次特征提取,显著提升了孟加拉语手写字符与数字的识别准确率,在多种评估场景下均优于现有先进方法。
English Summary: This study introduces SynergiProtoNet, a hybrid few-shot learning model that enhances recognition accuracy for Bangla handwritten characters and numerals by integrating clustering techniques and multi-level feature extraction, outperforming existing methods across diverse evaluation settings.
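For readers unfamiliar with the prototypical-learning framework that SynergiProtoNet builds on, the following is a minimal sketch of its core step: class prototypes are the means of embedded support examples, and queries are classified by distance to the nearest prototype. The function name and the choice of squared Euclidean distance are illustrative assumptions, not details of SynergiProtoNet itself.

```python
import torch

def prototypical_logits(support, support_labels, query, n_way):
    """Nearest-prototype classification as in prototypical networks.

    support:        (n_support, d) embedded support examples
    support_labels: (n_support,) integer class ids in [0, n_way)
    query:          (n_query, d) embedded query examples
    Returns (n_query, n_way) logits = negative squared distances.
    """
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_way)]
    )  # (n_way, d): one prototype per class
    # Squared Euclidean distance between every query and every prototype
    dists = torch.cdist(query, prototypes, p=2) ** 2
    return -dists  # closer prototype -> larger logit
```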
Authors:Shihao Cai, Chongming Gao, Yang Zhang, Wentao Shi, Jizhi Zhang, Keqin Bao, Qifan Wang, Fuli Feng
Abstract:
To adapt large language models (LLMs) to ranking tasks, existing list-wise methods, represented by list-wise Direct Preference Optimization (DPO), focus on optimizing partial-order or full-order list ranking consistency for LLMs to enhance their ranking abilities. However, we argue that optimizing top-K ranking consistency could be more appropriate for real-world applications. There are two main reasons: (1) users are typically concerned with only the top-K results, making top-K ranking more important, and (2) tail items often lack precise feedback, making top-K ranking more reliable. Based on this, we propose K-order Ranking Preference Optimization (KPO) by extending the DPO's Plackett-Luce model to accommodate top-K rankings. Additionally, recognizing that the number of important items can vary across queries, we extend KPO to dynamically determine appropriate K for different samples and introduce a curriculum learning strategy to boost training efficiency. Extensive experiments demonstrate the effectiveness of KPO, highlighting its high sample efficiency and robustness to noise. The code is available at https://github.com/Lanyu0303/KPO.
中文摘要:现有列表排序方法针对大语言模型优化全序一致性,但关注前K项排序在实际应用中更具实用性和可靠性,因此提出KPO方法,能动态调整不同查询的K值并提升训练效率。
English Summary: Existing list-wise methods optimize full-order rankings for LLMs, but focusing on top-K ranking consistency is more practical and reliable for real-world applications, leading to the proposed KPO method that dynamically adjusts K per query and improves training efficiency.
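To make the top-K extension of the Plackett-Luce model concrete, here is a hedged sketch of a top-K listwise negative log-likelihood, where only the first K choice probabilities of the ranking are constrained. This is the generic objective family that KPO builds on, not the paper's exact loss, and the function name is hypothetical.

```python
import torch

def topk_plackett_luce_nll(scores, ranking, k):
    """Negative log-likelihood of the top-k prefix of a ranking under
    the Plackett-Luce model (a generic sketch of the objective family
    KPO extends, not the paper's exact formulation).

    scores:  (n_items,) model scores for all candidate items
    ranking: (n_items,) item indices ordered from best to worst
    k:       only the top-k positions are constrained
    """
    ordered = scores[ranking]  # scores in ranked order
    nll = 0.0
    for i in range(k):
        # P(item i chosen next) = softmax over the remaining items
        nll = nll - (ordered[i] - torch.logsumexp(ordered[i:], dim=0))
    return nll
```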
Authors:Tuan-Luc Huynh, Thanh-Danh Le, Tam V. Nguyen, Trung-Nghia Le, Minh-Triet Tran
Abstract:
In this paper, we address the crucial task of brain tumor segmentation in medical imaging and propose innovative approaches to enhance its performance. The current state-of-the-art nnU-Net has shown promising results but suffers from extensive training requirements and underutilization of pre-trained weights. To overcome these limitations, we integrate Axial-Coronal-Sagittal convolutions and pre-trained weights from ImageNet into the nnU-Net framework, resulting in reduced training epochs, reduced trainable parameters, and improved efficiency. Two strategies for transferring 2D pre-trained weights to the 3D domain are presented, ensuring the preservation of learned relationships and feature representations critical for effective information propagation. Furthermore, we explore a joint classification and segmentation model that leverages pre-trained encoders from a brain glioma grade classification proxy task, leading to enhanced segmentation performance, especially for challenging tumor labels. Experimental results demonstrate that our proposed methods in the fast training settings achieve comparable or even outperform the ensemble of cross-validation models, a common practice in the brain tumor segmentation literature.
中文: 本文通过将轴状-冠状-矢状卷积和预训练ImageNet权重融入nnU-Net框架,在减少训练需求的同时,实现了与现有最优方法相当甚至更优的脑肿瘤分割效果。
English: This paper enhances brain tumor segmentation by integrating Axial-Coronal-Sagittal convolutions and pre-trained ImageNet weights into nnU-Net, reducing training requirements while maintaining or exceeding state-of-the-art performance.
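One way to picture the kind of 2D-to-3D weight transfer the paper explores is the classic inflation trick, sketched below under the assumption of an I3D-style replicate-and-rescale scheme; the paper's two actual strategies may differ in detail.

```python
import torch

def inflate_conv2d_to_conv3d(w2d, depth):
    """Inflate a pre-trained 2D conv kernel to 3D by replicating it
    along the new depth axis and rescaling so activations keep the
    same magnitude (an I3D-style sketch of 2D->3D weight transfer).

    w2d:   (out_c, in_c, kh, kw) pre-trained 2D weights
    depth: depth kernel size of the target 3D conv
    """
    w3d = w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1)  # (out_c, in_c, depth, kh, kw)
    return w3d / depth  # rescale so the summed response matches the 2D one
```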
Authors:Seunghan Lee, Taeyoung Park, Kibok Lee
Abstract:
Channel identifiability (CID) refers to the ability to distinguish between individual channels in time series (TS) modeling. The absence of CID often results in producing identical outputs for identical inputs, disregarding channel-specific characteristics. In this paper, we highlight the importance of CID and propose Channel Normalization (CN), a simple yet effective normalization strategy that enhances CID by assigning distinct affine transformation parameters to each channel. We further extend CN in two ways: 1) Adaptive CN (ACN) dynamically adjusts parameters based on the input TS, improving adaptability in TS models, and 2) Prototypical CN (PCN) introduces a set of learnable prototypes instead of per-channel parameters, enabling applicability to datasets with unknown or varying number of channels and facilitating use in TS foundation models. We demonstrate the effectiveness of CN and its variants by applying them to various TS models, achieving significant performance gains for both non-CID and CID models. In addition, we analyze the success of our approach from an information theory perspective. Code is available at https://github.com/seunghan96/CN.
Chinese: 本文提出了通道归一化(CN)及其自适应与原型变体,旨在增强时间序列建模中的通道可识别性,显著提升了多种模型的性能表现。
English: This paper introduces Channel Normalization (CN) and its adaptive and prototypical variants to enhance channel identifiability in time series modeling, significantly improving model performance across various applications.
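A minimal sketch of the core CN idea follows: per-channel affine parameters ensure that two channels receiving identical inputs produce distinct outputs. The choice of normalization statistics (per-series mean and standard deviation) and the input layout are assumptions for illustration, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelNorm(nn.Module):
    """Sketch of Channel Normalization: a distinct affine transform per
    channel, so identical inputs on different channels yield distinct
    outputs. Assumes input shape (batch, n_channels, seq_len)."""

    def __init__(self, n_channels: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_channels))  # per-channel scale
        self.beta = nn.Parameter(torch.zeros(n_channels))  # per-channel shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each channel's series, then apply its own affine map
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        x_hat = (x - mean) / (std + self.eps)
        return self.gamma[None, :, None] * x_hat + self.beta[None, :, None]
```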
Authors:Mohammad Saqib Hasan, Saikat Chakraborty, Santu Karmaker, Niranjan Balasubramanian
Abstract:
LLM-generated code often contains security issues. We address two key challenges in improving secure code generation. First, obtaining high-quality training data covering a broad set of security issues is critical. To address this, we introduce a method for distilling a preference dataset of insecure and secure code pairs from frontier LLMs, along with security reasoning that explains the issues and the fix. The key idea here is to make use of security knowledge sources to devise a systematic prompting strategy that ensures broad coverage. Second, aligning models to secure code requires focusing on localized regions of code. Direct preference optimization methods, like SimPO, are not designed to handle these localized differences and turn out to be ineffective. We address this with a new localized preference optimization algorithm that masks the security-related tokens in both the winning (secure) and losing (insecure) responses. To prevent loss in code quality, we also add a regularizer. Evaluations show that both training on our dataset, DiSCo, and the new preference optimization algorithm, LPO, yield substantial reductions in code insecurity while also improving overall code quality. Code and dataset are available at https://github.com/StonyBrookNLP/disco-lpo.
中文: 本研究提出了一种从大语言模型中提取安全与不安全代码对数据集的方法,并开发了一种局部偏好优化算法,在显著降低代码安全漏洞的同时提升了整体代码质量。
English: This study introduces a method to distill a dataset of secure and insecure code pairs from LLMs and proposes a localized preference optimization algorithm to significantly reduce security vulnerabilities while enhancing overall code quality.
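The localized-preference idea can be sketched as a SimPO-style margin loss computed only over masked, security-relevant tokens. Everything below (the mask semantics, the length normalization, the beta margin scale) is an illustrative assumption rather than LPO's exact formulation.

```python
import torch

def masked_preference_loss(logp_win, logp_lose, mask_win, mask_lose, beta=2.0):
    """Sketch of a localized preference objective in the spirit of LPO:
    token log-probs of the secure (winning) and insecure (losing)
    responses are aggregated only over security-relevant tokens, then a
    SimPO-style margin loss is applied.

    logp_*: (seq_len,) per-token log-probs under the policy model
    mask_*: (seq_len,) 1.0 on security-related tokens, 0.0 elsewhere
    """
    # Length-normalized scores over the masked (localized) regions only
    s_win = (logp_win * mask_win).sum() / mask_win.sum().clamp(min=1.0)
    s_lose = (logp_lose * mask_lose).sum() / mask_lose.sum().clamp(min=1.0)
    return -torch.nn.functional.logsigmoid(beta * (s_win - s_lose))
```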
Authors:Ziwen Wang
Abstract:
Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular processes by enabling gene expression analysis at the individual cell level. Clustering allows for the identification of cell types and the further discovery of intrinsic patterns in single-cell data. However, the high dimensionality and sparsity of scRNA-seq data continue to challenge existing clustering models. In this paper, we introduce JojoSCL, a novel self-supervised contrastive learning framework for scRNA-seq clustering. By incorporating a shrinkage estimator based on hierarchical Bayesian estimation, which adjusts gene expression estimates towards more reliable cluster centroids to reduce intra-cluster dispersion, and optimized using Stein's Unbiased Risk Estimate (SURE), JojoSCL refines both instance-level and cluster-level contrastive learning. Experiments on ten scRNA-seq datasets substantiate that JojoSCL consistently outperforms prevalent clustering methods, with further validation of its practicality through robustness analysis and ablation studies. JojoSCL's code is available at: https://github.com/ziwenwang28/JojoSCL.
中文: JojoSCL是一种新颖的自监督对比学习框架,通过基于分层贝叶斯估计的收缩估计器和Stein无偏风险估计优化,有效降低单细胞RNA测序数据聚类中的簇内离散度,在多个数据集上均优于现有聚类方法。
English: JojoSCL is a novel self-supervised contrastive learning framework that enhances scRNA-seq clustering by reducing intra-cluster dispersion through hierarchical Bayesian estimation and Stein's Unbiased Risk Estimate, consistently outperforming existing methods across multiple datasets.
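The shrinkage step can be pictured as a convex combination of each cell's embedding with its assigned cluster centroid. The fixed weight lam below stands in for the paper's hierarchical Bayesian estimator optimized with Stein's Unbiased Risk Estimate; it is a simplification for illustration.

```python
import torch

def shrink_toward_centroids(x, assignments, centroids, lam=0.5):
    """Sketch of the shrinkage idea in JojoSCL: pull each cell's
    representation toward its cluster centroid to reduce intra-cluster
    dispersion. The fixed weight `lam` is an illustrative stand-in for
    the paper's SURE-optimized hierarchical Bayesian estimator.

    x:           (n_cells, d) cell embeddings
    assignments: (n_cells,) cluster index per cell
    centroids:   (n_clusters, d) current cluster centroids
    """
    return (1.0 - lam) * x + lam * centroids[assignments]
```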
Authors:Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Abstract:
Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottleneck, we introduce $\textbf{MagiCodec}$, a novel single-layer, streaming Transformer-based audio codec. MagiCodec is designed with a multistage training pipeline that incorporates Gaussian noise injection and latent regularization, explicitly targeting the enhancement of semantic expressiveness in the generated codes while preserving high reconstruction fidelity. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components and fostering robust tokenization. Extensive experimental evaluations show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks. Notably, the tokens produced by MagiCodec exhibit Zipf-like distributions, as observed in natural languages, thereby improving compatibility with language-model-based generative architectures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec.
中文:MagiCodec是一种创新的流式音频编解码器,在保持高重建质量的同时增强了编码的语义表达能力,在音频保真度和下游任务性能上均优于现有模型。
English: MagiCodec is a novel streaming audio codec that enhances semantic expressiveness in token representations while maintaining high reconstruction quality, outperforming existing models in both audio fidelity and downstream task performance.
Authors:Siavash Shams, Richard Antonello, Gavin Mischler, Stephan Bickel, Ashesh Mehta, Nima Mesgarani
Abstract:
Decoding continuous language from neural signals remains a significant challenge in the intersection of neuroscience and artificial intelligence. We introduce Neuro2Semantic, a novel framework that reconstructs the semantic content of perceived speech from intracranial EEG (iEEG) recordings. Our approach consists of two phases: first, an LSTM-based adapter aligns neural signals with pre-trained text embeddings; second, a corrector module generates continuous, natural text directly from these aligned embeddings. This flexible method overcomes the limitations of previous decoding approaches and enables unconstrained text generation. Neuro2Semantic achieves strong performance with as little as 30 minutes of neural data, outperforming a recent state-of-the-art method in low-data settings. These results highlight the potential for practical applications in brain-computer interfaces and neural decoding technologies.
Chinese: Neuro2Semantic是一种创新框架,通过两阶段方法从颅内脑电图记录中重建感知语音的语义内容,实现了无约束的文本生成,并在少量神经数据下展现出优异性能。
English: Neuro2Semantic is an innovative framework that reconstructs the semantic content of perceived speech from intracranial EEG recordings using a two-phase approach, enabling unconstrained text generation and achieving strong performance with minimal neural data.
Authors:Yubai Wei, Jiale Han, Yi Yang
Abstract:
Text embedding models play a cornerstone role in AI applications, such as retrieval-augmented generation (RAG). While general-purpose text embedding models demonstrate strong performance on generic retrieval benchmarks, their effectiveness diminishes when applied to private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and lingo. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. By leveraging the well-established keyword-based retrieval technique (BM25), we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals contribute to improving embeddings by fostering alignment and uniformity, highlighting the value of this approach in adapting models to domain-specific data. We release the source code available at https://github.com/BaileyWei/BMEmbed for the research community.
中文摘要:BMEmbed通过利用BM25检索技术构建监督信号,有效提升通用文本嵌入模型在私有数据集上的检索性能,实现了跨领域的一致改进。
English Summary: BMEmbed enhances general-purpose text embedding models for private datasets by using BM25-based retrieval signals to improve domain-specific performance, consistently boosting retrieval effectiveness across various domains.
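A hedged sketch of how BM25 rankings can serve as supervisory signals: rank the private corpus per query and mine positives and negatives from the ranking. The pair-mining heuristic and the use of the rank_bm25 package are assumptions here; the paper derives its signals from BM25 rankings in its own way.

```python
from rank_bm25 import BM25Okapi

def bm25_training_pairs(corpus, queries, n_pos=1, n_neg=5):
    """Sketch of BM25-derived supervision: treat top-ranked documents
    as positives and lower-ranked ones as negatives for adapting an
    embedding model to a private corpus.

    corpus:  list of document strings
    queries: list of query strings
    """
    tokenized = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized)
    pairs = []
    for q in queries:
        scores = bm25.get_scores(q.lower().split())
        order = sorted(range(len(corpus)), key=lambda i: -scores[i])
        positives = [corpus[i] for i in order[:n_pos]]
        negatives = [corpus[i] for i in order[n_pos:n_pos + n_neg]]
        pairs.append({"query": q, "pos": positives, "neg": negatives})
    return pairs
```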
Authors:Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo
Abstract:
While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, offer minimal improvements in performance under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on $\texttt{VGGSOUND-2C}$. We hope that $\texttt{AVROBUSTBENCH}$ will steer the development of more effective and robust audio-visual TTA approaches. Our code is available at https://github.com/sarthaxxxxx/AV-C-Robustness-Benchmark.
中文: 该摘要介绍了AVROBUSTBENCH这一基准测试,用于评估视听模型在双模态同时受损时的鲁棒性,发现现有模型和适应方法对此类变化表现不佳,并提出了一种新的方法显示出改进效果。
English: The abstract introduces AVROBUSTBENCH, a benchmark for evaluating audio-visual models' robustness to simultaneous corruptions in both modalities, revealing that current models and adaptation methods struggle with such shifts, while proposing a new approach that shows improvement.
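The entropy-penalizing idea behind AV2C can be sketched as follows: compute the prediction entropy of the fused audio-visual logits and adapt only on the more confident samples. The quantile cutoff is an illustrative assumption, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def entropy_filtered_tta_loss(logits, quantile=0.5):
    """Sketch of an entropy-penalizing TTA step in the spirit of AV2C:
    minimize prediction entropy, but only on the more confident half
    of the batch so unreliable (high-entropy) samples are discarded.

    logits: (batch, n_classes) fused audio-visual predictions
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)  # (batch,)
    cutoff = entropy.quantile(quantile)
    keep = entropy <= cutoff  # drop high-entropy (unreliable) samples
    return entropy[keep].mean()
```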
Authors:Muhammad Adnan, Nithesh Kurella, Akhil Arunkumar, Prashant J. Nair
Abstract:
Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. However, their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics, leading to suboptimal trade-offs between speed and quality. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance. Foresight dynamically identifies and reuses DiT block outputs for all layers across steps, adapting to generation parameters such as resolution and denoising schedules to optimize efficiency. Applied to OpenSora, Latte, and CogVideoX, Foresight achieves substantial end-to-end speedups while maintaining video quality. The source code of Foresight is available at https://github.com/STAR-Laboratory/foresight.
Chinese: Foresight 提出了一种自适应层重用技术,在保持视频生成质量的同时,有效降低了扩散变换器的计算成本,显著提升了 OpenSora 等模型的运行速度。
English: Foresight introduces an adaptive layer-reuse technique that reduces computational costs in Diffusion Transformers for video generation while maintaining quality, achieving significant speed improvements in models like OpenSora.
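A minimal sketch of adaptive layer reuse: cache each block's output and skip recomputation when the block's input has barely changed since the step at which the cache was filled. The fixed relative-change threshold is an assumption; Foresight adapts its reuse decisions to resolution and the denoising schedule.

```python
import torch

class BlockReuseCache:
    """Sketch of adaptive layer reuse across denoising steps: if a DiT
    block's input barely changed since the cached step, return the
    cached output instead of recomputing the block."""

    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold  # illustrative fixed threshold
        self.inputs = {}   # layer_id -> input at the cached step
        self.outputs = {}  # layer_id -> cached block output

    def __call__(self, layer_id, block, x):
        cached_in = self.inputs.get(layer_id)
        if cached_in is not None and cached_in.shape == x.shape:
            rel_change = (x - cached_in).norm() / (cached_in.norm() + 1e-8)
            if rel_change < self.threshold:
                return self.outputs[layer_id]  # reuse, skip computation
        out = block(x)  # recompute and refresh the cache
        self.inputs[layer_id] = x.detach()
        self.outputs[layer_id] = out.detach()
        return out
```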
Authors:Long Xu, Peng Gao, Wen-Jia Tang, Fei Wang, Ru-Yue Yuan
Abstract:
Although deep learning-based visual tracking methods have made significant progress, they exhibit vulnerabilities when facing carefully designed adversarial attacks, which can lead to a sharp decline in tracking performance. To address this issue, this paper proposes, for the first time, a novel adversarial defense method based on denoising diffusion probabilistic models, termed DiffDf, aimed at effectively improving the robustness of existing visual tracking methods against adversarial attacks. DiffDf establishes a multi-scale defense mechanism by combining pixel-level reconstruction loss, semantic consistency loss, and structural similarity loss, effectively suppressing adversarial perturbations through a gradual denoising process. Extensive experimental results on several mainstream datasets show that the DiffDf method demonstrates excellent generalization performance for trackers with different architectures, significantly improving various evaluation metrics while achieving real-time inference speeds of over 30 FPS, showcasing outstanding defense performance and efficiency. Codes are available at https://github.com/pgao-lab/DiffDf.
中文摘要:本文首次提出基于去噪扩散概率模型的DiffDf防御方法,通过多尺度防御机制有效提升视觉跟踪器对抗攻击的鲁棒性,并在保持实时性能的同时显著改善各项评估指标。
English Summary: This paper introduces DiffDf, a novel adversarial defense method using denoising diffusion models to significantly enhance the robustness of visual trackers against attacks while maintaining real-time performance.
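Diffusion-based purification of this kind typically diffuses the input to a moderate noise level and then denoises it back, washing out adversarial perturbations along the way. The sketch below assumes a standard noise-prediction DDPM and deterministic DDIM reverse steps; DiffDf's multi-scale training losses operate at training time and are not shown.

```python
import torch

@torch.no_grad()
def diffusion_purify(x_adv, model, alphas_cumprod, t_star=100):
    """Sketch of diffusion purification: diffuse the (possibly
    adversarial) frame to step t_star, then denoise back to t = 0.
    Assumes `model(x, t)` predicts the noise component (standard DDPM
    parameterization) and `alphas_cumprod` is a 1-D tensor.
    """
    a_bar = alphas_cumprod[t_star]
    # Forward diffusion to step t_star
    x = a_bar.sqrt() * x_adv + (1 - a_bar).sqrt() * torch.randn_like(x_adv)
    # Reverse denoising back to t = 0 (DDIM-style deterministic steps)
    for t in range(t_star, 0, -1):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = model(x, torch.tensor([t]))
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
    return x
```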
Authors:Sofiane Mahiou, Amir Dizche, Reza Nazari, Xinmin Wu, Ralph Abbey, Jorge Silva, Georgi Ganev
Abstract:
We propose dpmm, an open-source library for synthetic data generation with Differentially Private (DP) guarantees. It includes three popular marginal models -- PrivBayes, MST, and AIM -- that achieve superior utility and offer richer functionality compared to alternative implementations. Additionally, we adopt best practices to provide end-to-end DP guarantees and address well-known DP-related vulnerabilities. Our goal is to accommodate a wide audience with easy-to-install, highly customizable, and robust model implementations.
Our codebase is available from https://github.com/sassoftware/dpmm.
中文: 我们推出dpmm开源库,用于生成具有差分隐私保障的合成数据,包含三种高效边际模型,并提供端到端的隐私保护功能。
English: We introduce dpmm, an open-source library for generating synthetic data with differential privacy guarantees, featuring three high-utility marginal models and robust end-to-end privacy protections.
Authors:Sara Ghazanfari, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg
Abstract:
Recent work has shown that eliciting Large Language Models (LLMs) to generate reasoning traces in natural language before answering the user's request can significantly improve their performance across tasks. This approach has been extended to multimodal LLMs, where the models can produce chain-of-thoughts (CoT) about the content of input images and videos. In this work, we propose to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant video frames. For this, we first create CoF-Data, a large dataset of diverse questions, answers, and corresponding frame-grounded reasoning traces about both natural and synthetic videos, spanning various topics and tasks. Then, we fine-tune existing video LLMs on this chain-of-frames (CoF) data. Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks to select or caption relevant frames. We show that our models based on CoF are able to generate chain-of-thoughts that accurately refer to the key frames to answer the given question. This, in turn, leads to improved performance across multiple video understanding benchmarks, for example, surpassing leading video LLMs on Video-MME, MVBench, and VSI-Bench, and notably reducing the hallucination rate. Code available at https://github.com/SaraGhazanfari/CoF.
Chinese Summary: 近期研究通过训练视频大语言模型生成明确基于相关视频帧的推理轨迹,无需辅助网络即可在多项基准测试中显著提升性能并减少幻觉现象。
English Summary: Recent research enhances video large language models by training them to generate reasoning traces explicitly grounded in relevant video frames, leading to improved performance on multiple benchmarks without needing auxiliary networks.
Authors:Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban
Abstract:
Most materials science datasets are limited to atomic geometries (e.g., XYZ files), restricting their utility for multimodal learning and comprehensive data-centric analysis. These constraints have historically impeded the adoption of advanced machine learning techniques in the field. This work introduces MultiCrystalSpectrumSet (MCS-Set), a curated framework that expands materials datasets by integrating atomic structures with 2D projections and structured textual annotations, including lattice parameters and coordination metrics. MCS-Set enables two key tasks: (1) multimodal property and summary prediction, and (2) constrained crystal generation with partial cluster supervision. Leveraging a human-in-the-loop pipeline, MCS-Set combines domain expertise with standardized descriptors for high-quality annotation. Evaluations using state-of-the-art language and vision-language models reveal substantial modality-specific performance gaps and highlight the importance of annotation quality for generalization. MCS-Set offers a foundation for benchmarking multimodal models, advancing annotation practices, and promoting accessible, versatile materials science datasets. The dataset and implementations are available at https://github.com/KurbanIntelligenceLab/MultiCrystalSpectrumSet.
中文摘要:本文提出的MCS-Set框架通过整合原子结构、二维投影和文本标注,构建多模态材料科学数据集,支持属性预测与晶体生成等任务,为多模态模型基准测试和材料数据分析提供新基础。
English Summary: This paper introduces MCS-Set, a multimodal framework that enhances materials science datasets by integrating atomic structures with visual projections and textual annotations to enable advanced machine learning applications like property prediction and crystal generation.
Authors:Boshra Khajehpiri, Eric Granger, Massimiliano de Zambotti, Fiona C. Baker, Mohamad Forouzanfar
Abstract:
Despite extensive research on the relationship between sleep and cognition, the connection between sleep microstructure and human performance across specific cognitive domains remains underexplored. This study investigates whether deep learning models can predict executive functions, particularly cognitive adaptability and conceptual reasoning, from physiological processes during a night's sleep. To address this, we introduce CogPSGFormer, a multi-scale convolutional-transformer model designed to process multi-modal polysomnographic data. This model integrates one-channel ECG and EEG signals along with extracted features, including EEG power bands and heart rate variability parameters, to capture complementary information across modalities. A thorough evaluation of the CogPSGFormer architecture was conducted to optimize the processing of extended sleep signals and identify the most effective configuration. The proposed framework was evaluated on 817 individuals from the STAGES dataset using cross-validation. The model achieved 80.3% accuracy in classifying individuals into low vs. high cognitive performance groups on unseen data based on Penn Conditional Exclusion Test (PCET) scores. These findings highlight the effectiveness of our multi-scale feature extraction and multi-modal learning approach in leveraging sleep-derived signals for cognitive performance prediction. To facilitate reproducibility, our code is publicly accessible (https://github.com/boshrakh95/CogPSGFormer.git).
中文: 本研究提出的CogPSGFormer深度学习模型通过分析多模态睡眠数据,能以80.3%的准确率预测认知表现,揭示了睡眠微结构在评估执行功能方面的重要价值。
English: This study introduces CogPSGFormer, a deep learning model that analyzes multi-modal sleep data to predict cognitive performance with 80.3% accuracy, demonstrating the potential of sleep microstructure for assessing executive functions.
Authors:Dang Nguyen, Ali Payani, Baharan Mirzasoleiman
Abstract:
Hallucination in large language models (LLMs) can be detected by assessing the uncertainty of model outputs, typically measured using entropy. Semantic entropy (SE) enhances traditional entropy estimation by quantifying uncertainty at the semantic cluster level. However, as modern LLMs generate longer one-sentence responses, SE becomes less effective because it overlooks two crucial factors: intra-cluster similarity (the spread within a cluster) and inter-cluster similarity (the distance between clusters). To address these limitations, we propose a simple black-box uncertainty quantification method inspired by nearest neighbor estimates of entropy. Our approach can also be easily extended to white-box settings by incorporating token probabilities. Additionally, we provide theoretical results showing that our method generalizes semantic entropy. Extensive empirical results demonstrate its effectiveness compared to semantic entropy across two recent LLMs (Phi3 and Llama3) and three common text generation tasks: question answering, text summarization, and machine translation. Our code is available at https://github.com/BigML-CS-UCLA/SNNE.
中文摘要:本文提出了一种新的黑盒不确定性量化方法,通过考虑簇内和簇间相似性改进了语义熵,在多种大语言模型和文本生成任务中展现出更优性能。
English Summary: This paper introduces a new black-box uncertainty quantification method that improves upon semantic entropy by accounting for intra-cluster and inter-cluster similarities, demonstrating superior performance across multiple LLMs and text generation tasks.
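The nearest-neighbor flavor of the estimator can be sketched as a Kozachenko-Leonenko-style score over embeddings of sampled responses: tightly clustered samples imply low uncertainty, widely spread ones imply high uncertainty. The formula below is a generic instance of that idea, not the paper's exact method.

```python
import torch

def nn_semantic_uncertainty(embeddings, k=3):
    """Sketch of a nearest-neighbor uncertainty score: sample several
    responses, embed them, and use the average distance to each
    sample's k nearest neighbors as an entropy proxy.

    embeddings: (n_samples, d) sentence embeddings of sampled
    responses, with n_samples > k.
    """
    dists = torch.cdist(embeddings, embeddings)          # (n, n)
    dists.fill_diagonal_(float("inf"))                   # ignore self-distance
    knn_dists, _ = dists.topk(k, largest=False, dim=-1)  # (n, k)
    return knn_dists.mean().item()  # larger spread -> higher uncertainty
```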
Authors:Rebekah A. Gelpí, Yibing Ju, Ethan C. Jackson, Yikai Tang, Shon Verch, Claas Voelcker, William A. Cunningham
Abstract:
We introduce Sorrel (https://github.com/social-ai-uoft/sorrel), a simple Python interface for generating and testing new multi-agent reinforcement learning environments. This interface places a high degree of emphasis on simplicity and accessibility, and uses a more psychologically intuitive structure for the basic agent-environment loop, making it a useful tool for social scientists to investigate how learning and social interaction lead to the development and change of group dynamics. In this short paper, we outline the basic design philosophy and features of Sorrel.
中文: Sorrel是一个简洁的Python接口,专注于生成和测试多智能体强化学习环境,其设计强调易用性和心理直观性,帮助社会科学家通过学习和社交互动研究群体动态的发展与变化。
English: Sorrel is a user-friendly Python interface designed for creating and evaluating multi-agent reinforcement learning environments, emphasizing simplicity and a psychologically intuitive structure to aid social scientists in studying group dynamics through learning and social interactions.
Authors:Bernardo Subercaseaux, Ethan Mackey, Long Qian, Marijn J. H. Heule
Abstract:
We present a computational methodology for obtaining rotationally symmetric sets of points satisfying discrete geometric constraints, and demonstrate its applicability by discovering new solutions to some well-known problems in combinatorial geometry. Our approach takes the usage of SAT solvers in discrete geometry further by directly embedding rotational symmetry into the combinatorial encoding of geometric configurations. Then, to realize concrete point sets corresponding to abstract designs provided by a SAT solver, we introduce a novel local-search realizability solver, which shows excellent practical performance despite the intrinsic $\exists \mathbb{R}$-completeness of the problem. Leveraging this combined approach, we provide symmetric extremal solutions to the Erdős-Szekeres problem, as well as a minimal odd-sized solution with 21 points for the everywhere-unbalanced-points problem, improving on the previously known 23-point configuration. The imposed symmetries yield more aesthetically appealing solutions, enhancing human interpretability, and simultaneously offer computational benefits by significantly reducing the number of variables required to encode discrete geometric problems.
Chinese: 本研究提出一种计算方法,将旋转对称性嵌入SAT编码,并利用局部搜索求解器高效发现对称几何构型,为经典问题如Erdős-Szekeres问题和处处不平衡点问题提供了更优解。
English: This study introduces a computational approach that embeds rotational symmetry into SAT-based encodings and employs a local-search solver to efficiently discover symmetric geometric configurations, yielding improved solutions for classic problems like the Erdős-Szekeres and everywhere-unbalanced-points problems.
Authors:Anoop Kini, Andreas Jansche, Timo Bernthaler, Gerhard Schneider
Abstract:
FastCAR is a novel task consolidation approach in Multi-Task Learning (MTL) for a classification task and a regression task, despite the non-trivial heterogeneity of the two tasks and only a subtle correlation between them. The approach addresses the classification of a detected object (occupying the entire image frame) and regression for modeling a continuous property variable (for instances of an object class), a crucial use case in science and engineering. FastCAR involves a label transformation approach that is amenable for use with only a single-task regression network architecture. FastCAR outperforms traditional MTL model families, parametrized in the landscape of architecture and loss weighting schemes, when learning of both tasks is considered collectively (classification accuracy of 99.54%, regression mean absolute percentage error of 2.4%). The experiments performed used the "Advanced Steel Property Dataset" contributed by us: https://github.com/fastcandr/AdvancedSteel-Property-Dataset. The dataset comprises 4536 images of 224x224 pixels, annotated with discrete object classes and their hardness property, which can take continuous values. Our proposed FastCAR approach for task consolidation achieves training time efficiency (2.52x quicker) and reduced inference latency (55% faster) than benchmark MTL networks.
Chinese: FastCAR是一种新颖的多任务学习方法,通过标签转换技术有效整合分类与回归任务,在达到99.54%分类精度和2.4%回归误差的同时,相比传统方法训练速度提升2.52倍、推理延迟降低55%。
English: FastCAR is a novel multi-task learning approach that efficiently consolidates classification and regression tasks using a label transformation method, achieving superior performance with 99.54% classification accuracy and 2.4% regression error while being 2.52x faster in training and 55% quicker in inference than traditional methods.
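The label transformation can be illustrated by packing the class and the continuous property into disjoint numeric bands of a single regression target, so one regression head carries both tasks. The band width and decoding rule below are assumptions for illustration, not FastCAR's actual transformation.

```python
import numpy as np

def consolidate_labels(class_ids, hardness, offset=1000.0):
    """Sketch of a label transformation that lets one regression head
    carry both tasks: each class maps to a disjoint numeric band and
    the continuous property is embedded inside it.
    """
    # e.g. class 2, hardness 37.5 -> consolidated target 2037.5
    return np.asarray(class_ids) * offset + np.asarray(hardness)

def decode(y_pred, offset=1000.0):
    """Recover (class, property) from a consolidated prediction."""
    class_id = int(y_pred // offset)
    hardness = float(y_pred - class_id * offset)
    return class_id, hardness
```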
Authors:Linyuan Gong, Alvin Cheung, Mostafa Elhoushi, Sida Wang
Abstract:
Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where models complete code segments given surrounding context. However, existing LLMs treat code as plain text and mask random character spans. We propose and evaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees (ASTs) to mask complete syntactic structures at scale, ensuring coherent training examples better aligned with universal code structures and common code editing patterns such as blocks, expressions, or functions. To evaluate real-world FIM programming tasks, we introduce Real-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12 languages. On infilling tasks, experiments on 1B and 8B parameter models show that AST-FIM is particularly beneficial for real-world code editing, as it outperforms standard random-character FIM by up to 5 points on standard FIM benchmarks. Our code is publicly available at https://github.com/gonglinyuan/ast_fim.
中文: AST-FIM是一种利用抽象语法树掩码完整语法结构的新型预训练方法,在代码填充任务中比标准随机字符掩码性能提升高达5个百分点,更贴合实际代码编辑模式。
English: AST-FIM is a novel pretraining method that uses Abstract Syntax Trees to mask complete syntactic structures, outperforming standard random-character masking by up to 5 points on fill-in-the-middle benchmarks and better aligning with real-world code editing patterns.
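The AST-based masking idea can be sketched with Python's built-in ast module: choose a complete syntactic node and split the file into prefix/middle/suffix around it. The node-selection rule below (the first matching statement block found) is a simplifying assumption; AST-FIM's sampling strategy is more elaborate.

```python
import ast

def ast_fim_example(source: str):
    """Sketch of AST-based FIM data construction: parse the code, pick
    a complete syntactic node, and split the file into prefix, middle,
    and suffix so the masked span is always a coherent structure
    rather than a random character span.
    """
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.If, ast.For, ast.While)):
            start, end = node.lineno - 1, node.end_lineno  # 1-based -> slice
            prefix = "".join(lines[:start])
            middle = "".join(lines[start:end])
            suffix = "".join(lines[end:])
            return prefix, middle, suffix
    return source, "", ""  # no suitable node: nothing to mask
```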
Authors:Edward Fish, Richard Bowden
Abstract:
Recent progress in Sign Language Translation (SLT) has focused primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincaré ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fréchet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, surpassing SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.
中文: 本研究提出Geo-Sign方法,通过将骨骼特征投影到双曲空间来增强手语翻译中的骨架表示,从而更好地捕捉层次化运动结构,在提升精度的同时提高了计算效率。
English: This work introduces Geo-Sign, a method that enhances skeletal representations for sign language translation by projecting features into hyperbolic space to better capture hierarchical kinematic structures, improving both accuracy and computational efficiency over existing approaches.
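The projection into the Poincaré ball is typically done with the exponential map at the origin, sketched below with the textbook formula; the paper's weighted Fréchet mean aggregation and hyperbolic contrastive loss are built on top of a step like this.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball with
    curvature -c: the standard way to project Euclidean features
    (e.g. ST-GCN skeletal embeddings) into hyperbolic space.

    v: (..., d) Euclidean tangent vectors at the origin.
    """
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    # tanh keeps the result strictly inside the unit ball (for c = 1)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)
```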
Authors:Liangrui Pan, Qingchun Liang, Shen Zhao, Songqing Fan, Shaoliang Peng
Abstract:
Accurately predicting gene mutations, mutation subtypes and their exons in lung cancer is critical for personalized treatment planning and prognostic assessment. Given regional disparities in medical resources and the high cost of genomic assays, using artificial intelligence to infer these mutations and exon variants from routine histopathology images could greatly facilitate precision therapy. Although some prior studies have shown that deep learning can accelerate the prediction of key gene mutations from lung cancer pathology slides, their performance remains suboptimal and has so far been limited mainly to early screening tasks. To address these limitations, we have assembled PathGene, which comprises histopathology images paired with next-generation sequencing reports from 1,576 patients at the Second Xiangya Hospital, Central South University, and 448 TCGA-LUAD patients. This multi-center dataset links whole-slide images to driver gene mutation status, mutation subtypes, exon, and tumor mutational burden (TMB) status, with the goal of leveraging pathology images to predict mutations, subtypes, exon locations, and TMB for early genetic screening and to advance precision oncology. Unlike existing datasets, we provide molecular-level information related to histopathology images in PathGene to facilitate the development of biomarker prediction models. We benchmarked 11 multiple-instance learning methods on PathGene for mutation, subtype, exon, and TMB prediction tasks. These methods provide valuable alternatives for early genetic screening of lung cancer patients and assist clinicians in quickly developing personalized precision targeted treatment plans. Code and data are available at https://github.com/panliangrui/NIPS2025/.
中文: PathGene多中心数据集整合病理图像与基因组数据,通过评估11种机器学习方法,实现了基于人工智能的肺癌基因突变、亚型、外显子位置及肿瘤突变负荷预测,以推动精准肿瘤学发展。
English: The PathGene dataset integrates histopathology images with genomic data from multiple centers to enable AI-based prediction of lung cancer gene mutations, subtypes, exon locations, and tumor mutational burden, benchmarking 11 machine learning methods to advance precision oncology.
Authors:Hyundong Jin, Sicheol Sung, Shinwoo Park, SeungYeop Baik, Yo-Sub Han
Abstract:
The reasoning, writing, text-editing, and retrieval capabilities of proprietary large language models (LLMs) have advanced rapidly, providing users with an ever-expanding set of functionalities. However, this growing utility has also led to a serious societal concern: the over-reliance on LLMs. In particular, users increasingly delegate tasks such as homework, assignments, or the processing of sensitive documents to LLMs without meaningful engagement. This form of over-reliance and misuse is emerging as a significant social issue. To mitigate these issues, we propose a method that injects imperceptible phantom tokens into documents, causing LLMs to generate outputs that appear plausible to users but are in fact incorrect. Based on this technique, we introduce TRAPDOC, a framework designed to deceive over-reliant LLM users. Through empirical evaluation, we demonstrate the effectiveness of our framework on proprietary LLMs, comparing its impact against several baselines. TRAPDOC serves as a strong foundation for promoting more responsible and thoughtful engagement with language models. Our code is available at https://github.com/jindong22/TrapDoc.
中文摘要:专有大语言模型的快速发展导致用户过度依赖,为此我们提出TRAPDOC框架,通过注入隐形幻影令牌使模型生成看似合理实则错误的输出,从而促进用户更负责任地使用人工智能系统。
English Summary: The rapid advancement of proprietary large language models has led to user over-reliance, prompting the development of TRAPDOC—a framework using imperceptible phantom tokens to generate plausible but incorrect outputs, thereby encouraging more responsible engagement with AI systems.
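The phantom-token idea can be sketched by interleaving characters that render invisibly, so the document looks unchanged to a human reader but tokenizes differently for an LLM. The zero-width-space choice and the fixed placement stride below are illustrative assumptions, not TRAPDOC's actual injection scheme.

```python
def inject_phantom_tokens(text: str, stride: int = 8) -> str:
    """Sketch of phantom-token injection: interleave invisible
    characters so a human sees the original document while an LLM
    processes a perturbed token stream.
    """
    ZWSP = "\u200b"  # zero-width space: renders as nothing
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if (i + 1) % stride == 0:
            out.append(ZWSP)  # invisible to readers, visible to tokenizers
    return "".join(out)
```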
Authors:Dipam Goswami, Liying Wang, Bartłomiej Twardowski, Joost van de Weijer
Abstract:
Text embedding models enable semantic search, powering several NLP applications like Retrieval Augmented Generation by efficient information retrieval (IR). However, text embedding models are commonly studied in scenarios where the training data is static, thus limiting their applicability to dynamic scenarios where new training data emerges over time. IR methods generally encode a huge corpus of documents to low-dimensional embeddings and store them in a database index. During retrieval, a semantic search over the corpus is performed and the document whose embedding is most similar to the query embedding is returned. When updating an embedding model with new training data, using the already indexed corpus is suboptimal due to the non-compatibility issue, since the model which was used to obtain the embeddings of the corpus has changed. While re-indexing of old corpus documents using the updated model enables compatibility, it requires substantially more computation and time. Thus, it is critical to study how the already indexed corpus can still be effectively used without the need of re-indexing. In this work, we establish a continual learning benchmark with large-scale datasets and continually train dense retrieval embedding models on query-document pairs from new datasets in each task, observing forgetting on old tasks due to significant drift of embeddings. We employ embedding distillation on both query and document embeddings to maintain stability and propose a novel query drift compensation method during retrieval to project new model query embeddings to the old embedding space. This enables compatibility with previously indexed corpus embeddings extracted using the old model and thus reduces the forgetting. We show that the proposed method significantly improves performance without any re-indexing. Code is available at https://github.com/dipamgoswami/QDC.
中文: 本研究针对动态数据场景下嵌入模型不兼容的问题,提出查询漂移补偿方法,将新查询嵌入映射至旧模型空间,无需重新索引即可保持检索性能,显著提升系统效率。
English: This study addresses the challenge of embedding model incompatibility in dynamic data scenarios by proposing a query drift compensation method that projects new query embeddings into the old model's space, eliminating the need for costly re-indexing while maintaining retrieval performance.
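One simple way to picture drift compensation is a linear map, fit on paired query embeddings, that sends new-model queries back into the old embedding space; retrieval then searches the old index without re-indexing. The closed-form least-squares fit below is an illustrative choice, not necessarily the paper's projection.

```python
import numpy as np

def fit_drift_compensation(q_new: np.ndarray, q_old: np.ndarray):
    """Sketch of query drift compensation: fit a linear map W so that
    q_new @ W approximates q_old, keeping new-model queries compatible
    with the corpus indexed by the old model.

    q_new, q_old: (n_queries, d) embeddings of the same queries under
    the new and old models (n_queries >= d for a well-posed fit).
    """
    W, *_ = np.linalg.lstsq(q_new, q_old, rcond=None)  # (d, d)
    return W

# At retrieval time: search the old index with q @ W instead of q.
```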
Authors:Shuai Liu, Quanmin Liang, Zefeng Li, Boyang Li, Kai Huang
Abstract:
Multi-sensor fusion is crucial for improving the performance and robustness of end-to-end autonomous driving systems. Existing methods predominantly adopt either attention-based flatten fusion or bird's eye view fusion through geometric transformations. However, these approaches often suffer from limited interpretability or dense computational overhead. In this paper, we introduce GaussianFusion, a Gaussian-based multi-sensor fusion framework for end-to-end autonomous driving. Our method employs intuitive and compact Gaussian representations as intermediate carriers to aggregate information from diverse sensors. Specifically, we initialize a set of 2D Gaussians uniformly across the driving scene, where each Gaussian is parameterized by physical attributes and equipped with explicit and implicit features. These Gaussians are progressively refined by integrating multi-modal features. The explicit features capture rich semantic and spatial information about the traffic scene, while the implicit features provide complementary cues beneficial for trajectory planning. To fully exploit rich spatial and semantic information in Gaussians, we design a cascade planning head that iteratively refines trajectory predictions through interactions with Gaussians. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate the effectiveness and robustness of the proposed GaussianFusion framework. The source code will be released at https://github.com/Say2L/GaussianFusion.
中文: GaussianFusion提出了一种基于高斯表示的多传感器融合框架,通过整合显式和隐式特征来增强自动驾驶系统的轨迹规划能力和鲁棒性。
English: GaussianFusion introduces a multi-sensor fusion framework using Gaussian representations to enhance autonomous driving by integrating explicit and implicit features for improved trajectory planning and robustness.
Authors:Dingjun Wu, Yukun Yan, Zhenghao Liu, Zhiyuan Liu, Maosong Sun
Abstract:
Retrieval-Augmented Generation (RAG) improves factual accuracy by grounding responses in external knowledge. However, existing methods typically rely on a single source, either unstructured text or structured knowledge. Moreover, they lack cognitively inspired mechanisms for activating relevant knowledge. To address these issues, we propose KG-Infused RAG, a framework that integrates knowledge graphs (KGs) into RAG systems to implement spreading activation, a cognitive process that enables concept association and inference. KG-Infused RAG retrieves KG facts, expands the query accordingly, and enhances generation by combining corpus passages with structured facts, enabling interpretable, multi-source retrieval grounded in semantic structure. We further improve KG-Infused RAG via preference learning on sampled key stages in the pipeline. Experiments on five QA benchmarks show that KG-Infused RAG consistently outperforms vanilla RAG (by 3.8% to 13.8%). Additionally, when integrated into Self-RAG, KG-Infused RAG brings further performance gains, demonstrating its effectiveness and versatility as a plug-and-play enhancement module for corpus-based RAG methods.
Chinese: KG-Infused RAG通过整合知识图谱和认知扩散激活机制,显著提升了检索增强生成的性能,在多个基准测试中表现优异,并展现出作为即插即用增强模块的广泛适用性。
English: KG-Infused RAG enhances retrieval-augmented generation by integrating knowledge graphs and cognitive spreading activation, achieving significant performance gains across benchmarks and demonstrating versatility as a plug-and-play module.
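Spreading activation over a KG can be sketched as decaying propagation from query entities to their neighbors; the most activated entities then guide query expansion and fact retrieval. The decay factor, hop count, and undirected treatment of edges below are illustrative assumptions.

```python
from collections import defaultdict

def spreading_activation(kg_edges, seed_entities, decay=0.5, hops=2):
    """Sketch of spreading activation: activation starts at entities
    mentioned in the query and propagates to neighbors with a decay
    per hop; highly activated entities guide query expansion.

    kg_edges: iterable of (head, relation, tail) triples
    """
    neighbors = defaultdict(set)
    for h, _, t in kg_edges:
        neighbors[h].add(t)
        neighbors[t].add(h)  # treat the graph as undirected here
    activation = {e: 1.0 for e in seed_entities}
    frontier = set(seed_entities)
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for nb in neighbors[node]:
                gain = activation[node] * decay
                if gain > activation.get(nb, 0.0):
                    activation[nb] = gain
                    next_frontier.add(nb)
        frontier = next_frontier
    return sorted(activation.items(), key=lambda kv: -kv[1])
```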
Authors:MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengda Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Baoxi Ji, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Xin Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Lushi Pu, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Zheng Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Xiaoyue Xu, Yukun Yan, Jiarui Yuan, Jinqian Zhang, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Chuyue Zhou, Ge Zhou, Jie Zhou, Wei Zhou, Yanghao Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun
Abstract:
This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose CPM.cu that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Furthermore, we construct a hybrid reasoning model, MiniCPM4.1, which can be used in both deep reasoning mode and non-reasoning mode. Evaluation results demonstrate that MiniCPM4 and MiniCPM4.1 outperform similar-sized open-source models across benchmarks, with the 8B variants showing significant speed improvements on long sequence understanding and generation.
中文:MiniCPM4是一款专为终端设备设计的高效大语言模型,通过架构、训练数据、算法和推理系统的创新,以精简参数量在多项基准测试中超越同类开源模型,并显著提升长序列处理速度。
English: MiniCPM4 is an efficient large language model optimized for end-side devices through innovations in architecture, training data, algorithms, and inference systems, achieving superior performance and speed with compact parameter sizes.
Authors:Yongjian Li, HaoCheng Chu, Yukun Yan, Zhenghao Liu, Shi Yu, Zheni Zeng, Ruobing Wang, Sen Song, Zhiyuan Liu, Maosong Sun
Abstract:
Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access broader knowledge sources, yet factual inconsistencies persist due to noise in retrieved documents, even with advanced retrieval methods. We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance. In this paper, we present KARE-RAG (Knowledge-Aware Refinement and Enhancement for RAG), which improves knowledge utilization through three key innovations: (1) structured knowledge representations that facilitate error detection during training, (2) Dense Direct Preference Optimization (DDPO), a refined training objective that prioritizes correction of critical errors, and (3) a contrastive data generation pipeline that maintains semantic consistency while rectifying factual inaccuracies. Experiments show our method significantly enhances standard RAG pipelines across model scales, improving both in-domain and out-of-domain task performance without compromising general capabilities. Notably, these gains are achieved with modest training data, suggesting data-efficient optimization is possible through targeted learning strategies. Our findings establish a new direction for RAG improvement: by improving how models learn to process retrieved content, we can enhance performance across diverse inference paradigms. All data and code will be publicly available on Github.
中文: KARE-RAG通过结构化知识表示、优化训练目标和对比数据生成三大创新,有效提升了检索增强生成的知识利用效率,在各类任务中显著改进性能且不损害模型通用能力。
English: KARE-RAG enhances retrieval-augmented generation by improving knowledge utilization through structured representations, refined training objectives, and contrastive data generation, significantly boosting performance across tasks without compromising general capabilities.
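As a rough illustration of what a refined, error-focused preference objective can look like, here is a token-weighted variant of the standard DPO loss. The per-token weighting scheme is hypothetical; the paper defines the actual DDPO objective.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      w_weights, l_weights, beta=0.1):
    """DPO-style preference loss where per-token weights can emphasise the
    tokens that carry the factual correction (hypothetical weighting; the
    paper's DDPO objective may differ in detail)."""
    # logp_*: per-token log-probs of chosen/rejected responses, shape (T,)
    chosen = ((logp_w - ref_logp_w) * w_weights).sum()
    rejected = ((logp_l - ref_logp_l) * l_weights).sum()
    return -F.logsigmoid(beta * (chosen - rejected))

T = 12
logp_w, logp_l = -torch.rand(T), -torch.rand(T)
ref_w, ref_l = -torch.rand(T), -torch.rand(T)
w = torch.ones(T)
w[5] = 3.0  # up-weight the corrected span (toy choice)
print(weighted_dpo_loss(logp_w, logp_l, ref_w, ref_l, w, torch.ones(T)))
```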
Authors:Yixu Chen, Bowen Chen, Hai Wei, Alan C. Bovik, Baojun Li, Wei Sun, Linhan Cao, Kang Fu, Dandan Zhu, Jun Jia, Menghan Hu, Xiongkuo Min, Guangtao Zhai, Dounia Hammou, Fei Yin, Rafal Mantiuk, Amritha Premkumar, Prajit T Rajendran, Vignesh V Menon
Abstract:
This paper reports on the IEEE International Conference on Multimedia & Expo (ICME) 2025 Grand Challenge on Generalizable HDR and SDR Video Quality Measurement. With the rapid development of video technology, especially High Dynamic Range (HDR) and Standard Dynamic Range (SDR) content, the need for robust and generalizable Video Quality Assessment (VQA) methods has become increasingly pressing. Existing VQA models often struggle to deliver consistent performance across varying dynamic ranges, distortion types, and diverse content. This challenge was established to benchmark and promote VQA approaches capable of jointly handling HDR and SDR content. In the final evaluation phase, five teams submitted seven models along with technical reports to the Full Reference (FR) and No Reference (NR) tracks. Among them, four methods outperformed the VMAF baseline, while the top-performing model achieved state-of-the-art performance, setting a new benchmark for generalizable video quality assessment.
中文:本文介绍了ICME 2025关于可泛化HDR与SDR视频质量评估的大挑战,旨在推动能同时处理两种动态范围内容的评估方法,其中最优模型刷新了性能基准。
English: This paper introduces the ICME 2025 Grand Challenge focused on developing generalizable Video Quality Assessment methods that effectively handle both HDR and SDR content, where top-performing models surpassed existing benchmarks.
Authors:Xiaohong Liu, Xiongkuo Min, Qiang Hu, Xiaoyun Zhang, Jie Guo, Guangtao Zhai, Shushi Wang, Yingjie Zhou, Lu Liu, Jingxin Li, Liu Yang, Farong Wen, Li Xu, Yanwei Jiang, Xilei Zhu, Chunyi Li, Zicheng Zhang, Huiyu Duan, Xiele Wu, Yixuan Gao, Yuqin Cao, Jun Jia, Wei Sun, Jiezhang Cao, Radu Timofte, Baojun Li, Jiamian Huang, Dan Luo, Tao Liu, Weixia Zhang, Bingkun Zheng, Junlin Chen, Ruikai Zhou, Meiya Chen, Yu Wang, Hao Jiang, Xiantao Li, Yuxiang Jiang, Jun Tang, Yimeng Zhao, Bo Hu, Zelu Qi, Chaoyang Zhang, Fei Zhao, Ping Shi, Lingzhi Fu, Heng Cong, Shuai He, Rongyu Zhang, Jiarong He, Zongyao Hu, Wei Luo, Zihao Yu, Fengbin Guan, Yiting Lu, Xin Li, Zhibo Chen, Mengjing Su, Yi Wang, Tuo Chen, Chunxiao Li, Shuaiyu Zhao, Jiaxin Wen, Chuyi Lin, Sitong Liu, Ningxin Chu, Jing Wan, Yu Zhou, Baoying Chen, Jishen Zeng, Jiarui Liu, Xianjin Liu, Xin Chen, Lanzhi Zhou, Hangyu Li, You Han, Bibo Xiang, Zhenjie Liu, Jianzhang Lu, Jialin Gui, Renjie Lu, Shangfei Wang, Donghao Zhou, Jingyu Lin, Quanjian Song, Jiancheng Huang, Yufeng Yang, Changwei Wang, Shupeng Zhong, Yang Yang, Lihuo He, Jia Liu, Yuting Xing, Tida Fang, Yuchun Jin
Abstract:
This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The challenge addresses a major problem in the field of video and talking-head processing and is divided into three tracks: user-generated video, AI-generated video, and talking head. The user-generated video track uses FineVD-GC, which contains 6,284 user-generated videos; it attracted 125 registered participants, with 242 submissions received in the development phase and 136 in the test phase, and 5 participating teams ultimately submitted their models and fact sheets. The AI-generated video track uses Q-Eval-Video, which contains 34,029 AI-Generated Videos (AIGVs) generated by 11 popular Text-to-Video (T2V) models; 133 participants registered, with 396 submissions received in the development phase and 226 in the test phase, and 6 participating teams ultimately submitted their models and fact sheets. The talking head track uses THQA-NTIRE, which contains 12,247 2D and 3D talking heads; 89 participants registered, with 225 submissions received in the development phase and 118 in the test phase, and 8 participating teams ultimately submitted their models and fact sheets. In every track, each participating team proposed a method that outperforms the baseline, contributing to progress in all three areas.
中文: NTIRE 2025 XGC质量评估挑战赛在CVPR 2025举办,包含用户生成视频、AI生成视频和说话头像三个赛道,所有参赛团队均超越基线表现,推动了相关领域的发展。
English: The NTIRE 2025 XGC Quality Assessment Challenge at CVPR 2025 features three tracks—user-generated videos, AI-generated videos, and talking heads—where all participating teams surpassed baseline performance, advancing the field.
Authors:Jinbo Wen, Cheng Su, Jiawen Kang, Jiangtian Nie, Yang Zhang, Jianhang Tang, Dusit Niyato, Chau Yuen
Abstract:
Low-Altitude Economy Networks (LAENets) are emerging as a promising paradigm to support various low-altitude services through integrated air-ground infrastructure. To satisfy low-latency and high-computation demands, the integration of Unmanned Aerial Vehicles (UAVs) with Mobile Edge Computing (MEC) systems plays a vital role, which offloads computing tasks from terminal devices to nearby UAVs, enabling flexible and resilient service provisions for ground users. To promote the development of LAENets, it is significant to achieve low-carbon multi-UAV-assisted MEC networks. However, several challenges hinder this implementation, including the complexity of multi-dimensional UAV modeling and the difficulty of multi-objective coupled optimization. To this end, this paper proposes a novel Retrieval Augmented Generation (RAG)-based Large Language Model (LLM) agent framework for model formulation. Specifically, we develop HybridRAG by combining KeywordRAG, VectorRAG, and GraphRAG, empowering LLM agents to efficiently retrieve structural information from expert databases and generate more accurate optimization problems compared with traditional RAG-based LLM agents. After customizing carbon emission optimization problems for multi-UAV-assisted MEC networks, we propose a Double Regularization Diffusion-enhanced Soft Actor-Critic (R²DSAC) algorithm to solve the formulated multi-objective optimization problem. The R²DSAC algorithm incorporates diffusion entropy regularization and action entropy regularization to improve the performance of the diffusion policy. Furthermore, we dynamically mask unimportant neurons in the actor network to reduce the carbon emissions associated with model training. Simulation results demonstrate the effectiveness and reliability of the proposed HybridRAG-based LLM agent framework and the R²DSAC algorithm.
English Summary: This paper introduces a HybridRAG-enhanced LLM agent framework to formulate optimization models for low-carbon multi-UAV MEC networks, along with an R²DSAC algorithm that reduces carbon emissions through dual entropy regularization and neural network optimization.
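A minimal sketch of how scores from keyword, vector, and graph retrievers might be fused, assuming a weighted reciprocal-rank rule; the paper's HybridRAG may combine retrievers differently.

```python
from collections import defaultdict

def hybrid_rag_rank(keyword_hits, vector_hits, graph_hits,
                    weights=(0.3, 0.4, 0.3), top_n=3):
    """Fuse ranked document lists from three retrievers with a simple
    weighted reciprocal-rank score (illustrative fusion rule)."""
    scores = defaultdict(float)
    for w, hits in zip(weights, (keyword_hits, vector_hits, graph_hits)):
        for rank, doc in enumerate(hits):
            scores[doc] += w / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(hybrid_rag_rank(keyword_hits=["d1", "d2", "d3"],
                      vector_hits=["d2", "d4", "d1"],
                      graph_hits=["d4", "d2", "d5"]))
```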
Authors:Ying Zhang, Yu Zhao, Xuhui Sui, Baohang Zhou, Xiangrui Cai, Li Shen, Xiaojie Yuan, Dacheng Tao
Abstract:
With increasing multimodal knowledge privatization requirements, multimodal knowledge graphs at different institutes are usually decentralized, lacking an effective collaboration system with both strong reasoning ability and transmission safety guarantees. In this paper, we propose the Federated Multimodal Knowledge Graph Completion (FedMKGC) task, aiming to train over federated MKGs to better predict missing links at clients without sharing sensitive knowledge. We propose a framework named MMFeD3-HidE for addressing the multimodal uncertain unavailability and multimodal client heterogeneity challenges of FedMKGC. (1) Inside the clients, our proposed Hyper-modal Imputation Diffusion Embedding model (HidE) recovers the complete multimodal distributions from incomplete entity embeddings constrained by available modalities. (2) Among clients, our proposed Multimodal FeDerated Dual Distillation (MMFeD3) transfers knowledge mutually between clients and the server with logit and feature distillation to improve both global convergence and semantic consistency. We propose a FedMKGC benchmark for a comprehensive evaluation, consisting of a general FedMKGC backbone named MMFedE, datasets with heterogeneous multimodal information, and three groups of constructed baselines. Experiments conducted on our benchmark validate the effectiveness, semantic consistency, and convergence robustness of MMFeD3-HidE.
中文摘要:本文提出联邦多模态知识图谱补全任务及MMFeD3-HidE框架,通过客户端内部的多模态嵌入恢复和客户端间的双向蒸馏机制,在保护隐私的前提下实现分布式多模态知识图谱的协同推理与补全。
English Summary: This paper introduces the FedMKGC task and proposes the MMFeD3-HidE framework to address multimodal knowledge graph completion across decentralized institutions while preserving data privacy through local training and mutual knowledge distillation between clients and server.
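The dual-distillation idea (logits plus features) can be sketched with a generic loss; the weighting, temperature, and client/server direction here are illustrative assumptions rather than MMFeD3's exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_distillation_loss(student_logits, teacher_logits,
                           student_feat, teacher_feat,
                           tau=2.0, alpha=0.5):
    """Combine logit distillation (KL on softened distributions) with
    feature distillation (MSE on embeddings), as in generic dual
    distillation; MMFeD3's exact weighting and direction may differ."""
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    feat = F.mse_loss(student_feat, teacher_feat)
    return alpha * kd + (1 - alpha) * feat

s_logits, t_logits = torch.randn(8, 100), torch.randn(8, 100)
s_feat, t_feat = torch.randn(8, 64), torch.randn(8, 64)
print(dual_distillation_loss(s_logits, t_logits, s_feat, t_feat))
```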
Authors:Guozheng Ma, Lu Li, Zilin Wang, Li Shen, Pierre-Luc Bacon, Dacheng Tao
Abstract:
Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motivating various targeted interventions such as periodic reset and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond their dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
中文摘要:通过简单的一次性随机剪枝引入静态网络稀疏性,使深度强化学习模型在扩展潜力上超越密集网络,在多种场景下实现了更高的参数效率和更强的训练稳定性。
English Summary: Introducing static network sparsity through simple one-shot random pruning enables deep reinforcement learning models to surpass dense networks in scaling potential, achieving greater parameter efficiency and resilience to training challenges across various scenarios.
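One-shot random pruning is simple enough to show directly with PyTorch's built-in pruning utility; the 90% sparsity level is an illustrative choice, not a recommendation from the paper.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# One-shot random pruning: remove a fixed fraction of weights once,
# before any training, and keep the mask static thereafter.
net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 4))

sparsity = 0.9  # fraction of weights removed (illustrative value)
for module in net.modules():
    if isinstance(module, nn.Linear):
        prune.random_unstructured(module, name="weight", amount=sparsity)

# The masks are registered as buffers and applied on every forward pass,
# so the pruned weights stay at zero throughout training.
total = sum(m.weight.numel() for m in net.modules() if isinstance(m, nn.Linear))
zeros = sum((m.weight == 0).sum().item() for m in net.modules()
            if isinstance(m, nn.Linear))
print(f"sparsity: {zeros / total:.2f}")
```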
Authors:Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang
Abstract:
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge. We analyze this phenomenon through the lens of learning dynamics, showing that RFT reinforces correct samples that are naturally aligned with the base model's probability landscape, mitigating interference with prior knowledge. Moreover, supervised training on correct RFT-simulated rollouts allows SFT to preserve knowledge while rapidly learning new tasks. These findings suggest that data distribution, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT's potential for stable continual learning in multimodal large language models.
中文: 后训练方法中,SFT能快速学习新任务但导致严重遗忘先验知识,而RFT学习较慢却能更好保留知识,研究表明训练数据分布而非算法差异是影响遗忘的关键因素。
English: Post-training methods like SFT and RFT show a trade-off where SFT quickly learns new tasks but causes severe forgetting of prior knowledge, while RFT learns slower but preserves knowledge better, with training data distribution being key to mitigating forgetting in multimodal models.
Authors:Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, Kai Chen
Abstract:
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on the open-source Qwen2.5-VL series of multimodal models. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly but maintains prior knowledge. We study this phenomenon through learning dynamics by examining both the magnitude and direction of how training data influence prior knowledge. Our analysis shows that RFT mainly reinforces correct samples naturally aligned with the base model's probability landscape, leading to weaker interference with prior knowledge. Moreover, training on RFT-simulated rollouts, which exert a small magnitude of influence and are well aligned in direction with prior knowledge, allows SFT to preserve prior knowledge better while rapidly learning new tasks. These findings suggest that the distribution of training data, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT's potential for stable continual learning in multimodal large language models.
中文: 后训练方法中,SFT能快速学习新任务但导致严重遗忘先验知识,而RFT学习较慢却能更好保留知识,研究表明训练数据分布而非算法差异是影响遗忘的关键因素。
English: Post-training methods like SFT and RFT show a trade-off where SFT quickly learns new tasks but causes severe forgetting of prior knowledge, while RFT learns slower but preserves knowledge better, with training data distribution being key to mitigating forgetting in multimodal models.
Authors:Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Abstract:
As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but there is no clear guidance on how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to maintain task challenge, and manually verify the data to ensure its integrity. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available on GitHub.
Chinese: Agent-RewardBench 是一个旨在评估多模态大语言模型中奖励建模能力的基准,通过多维场景和步骤级评估来解决智能体自我纠正和泛化能力不足的问题。
English: Agent-RewardBench is a benchmark designed to evaluate reward modeling in Multimodal Large Language Models, featuring multi-dimensional scenarios and step-level assessments to address agents' self-correction and generalization challenges.
Authors:Ao Chang, Tong Zhou, Yubo Chen, Delai Qiu, Shengping Liu, Kang Liu, Jun Zhao
Abstract:
Legal Judgment Prediction (LJP) aims to predict judicial outcomes, including relevant legal charges, terms, and fines, and is a crucial application of large language models (LLMs). However, LJP faces two key challenges: (1) Long-Tail Distribution: Current datasets, derived from authentic cases, suffer from high human annotation costs and imbalanced distributions, leading to model performance degradation. (2) Lawyer's Improvement: Existing systems focus on enhancing judges' decision-making but neglect the critical role of lawyers in refining arguments, which limits overall judicial accuracy. To address these issues, we propose an Adversarial Self-Play Lawyer Augmented Legal Judgment Framework, called ASP2LJ, which integrates a case generation module to tackle long-tailed data distributions and an adversarial self-play mechanism to enhance lawyers' argumentation skills. Our framework enables a judge to reference evolved lawyers' arguments, improving the objectivity, fairness, and rationality of judicial decisions. We also introduce RareCases, a dataset of rare legal cases in China containing 120 tail-end cases. We demonstrate the effectiveness of our approach on the SimuCourt dataset and our RareCases dataset. Experimental results show that our framework brings consistent improvements, demonstrating its utility. Our contributions include an integrated framework, a rare-case dataset, and publicly released datasets and code to support further research in automated judicial systems.
中文: ASP2LJ框架通过生成罕见案例解决数据分布不平衡问题,并采用对抗性自我博弈强化律师论证能力,从而提升司法决策的公平性与合理性,在专业数据集上验证了其有效性。
English: The ASP2LJ framework addresses Legal Judgment Prediction challenges by generating rare cases to balance data distribution and using adversarial self-play to enhance lawyer arguments, ultimately improving judicial decision fairness and effectiveness, as validated on specialized datasets.
Authors:Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract:
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12$\times$ faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
中文摘要:语音语言模型通过解耦分词和多令牌预测技术,显著提升了跨模态对齐与语音生成质量,同时说话人感知生成范式增强了知识理解与说话人一致性。
English Summary: Speech-language models are advanced by decoupled tokenization and multi-token prediction, which improve alignment, synthesis quality, and decoding speed, while a speaker-aware paradigm enhances knowledge understanding and speaker consistency.
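A minimal sketch of multi-token prediction: several parallel heads decode multiple tokens from one hidden state, which is where the decoding speedup comes from. Sizes and head design are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Toy multi-token prediction: one hidden state decodes n_pred speech
    tokens via parallel heads, so decoding needs roughly n_pred x fewer
    forward passes (illustrative of MTP, not the paper's code)."""
    def __init__(self, hidden=512, vocab=1024, n_pred=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(n_pred))

    def forward(self, h):                # h: (B, T, hidden)
        # logits: (B, T, n_pred, vocab) -> n_pred tokens per position
        return torch.stack([head(h) for head in self.heads], dim=2)

h = torch.randn(2, 10, 512)
logits = MTPHead()(h)
tokens = logits.argmax(-1)               # (2, 10, 4) speech tokens
print(tokens.shape)
```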
Authors:Ziyang Luo, Nian Liu, Xuguang Yang, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Junwei Han
Abstract:
Audio-Visual Segmentation (AVS) faces a fundamental challenge of effectively aligning audio and visual modalities. While recent approaches leverage foundation models to address data scarcity, they often rely on single-modality knowledge or combine foundation models in an off-the-shelf manner, failing to address the cross-modal alignment challenge. In this paper, we present TAViS, a novel framework that couples the knowledge of multimodal foundation models (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. However, effectively combining these models poses two key challenges: the difficulty in transferring the knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision. To address these challenges, we introduce a text-bridged design with two key components: (1) a text-bridged hybrid prompting mechanism where pseudo text provides class prototype information while retaining modality-specific details from both audio and visual inputs, and (2) an alignment supervision strategy that leverages text as a bridge to align shared semantic concepts within audio-visual modalities. Our approach achieves superior performance on single-source, multi-source, semantic datasets, and excels in zero-shot settings.
Chinese: TAViS框架通过文本桥接设计,结合混合提示机制和对齐监督策略,有效解决了音视频分割中的跨模态对齐难题,在多种数据集和零样本场景下均表现出卓越性能。
English: The TAViS framework effectively addresses cross-modal alignment challenges in Audio-Visual Segmentation by integrating multimodal foundation models through a text-bridged design that combines hybrid prompting and alignment supervision for superior performance across various settings.
Authors:Chenlong Zhang, Zhuoran Jin, Hongbang Yuan, Jiaheng Wei, Tong Zhou, Kang Liu, Jun Zhao, Yubo Chen
Abstract:
The widespread deployment of Large Language Models (LLMs) trained on massive, uncurated corpora has raised growing concerns about the inclusion of sensitive, copyrighted, or illegal content. This has led to increasing interest in LLM unlearning: the task of selectively removing specific information from a model without retraining from scratch or degrading overall utility. However, existing methods often rely on large-scale forget and retain datasets, and suffer from unnatural responses, poor generalization, or catastrophic utility loss. In this work, we propose Reinforcement UnLearning (RULE), an efficient framework that formulates unlearning as a refusal boundary optimization problem. RULE is trained with a small portion of the forget set and synthesized boundary queries, using a verifiable reward function that encourages safe refusal on forget-related queries while preserving helpful responses on permissible inputs. We provide both theoretical and empirical evidence demonstrating the effectiveness of RULE in achieving targeted unlearning without compromising model utility. Experimental results show that, with only 12% of the forget set and 8% synthesized boundary data, RULE outperforms existing baselines by up to 17.5% in forget quality and 16.3% in response naturalness while maintaining general utility, achieving forget-retain Pareto optimality. Remarkably, we further observe that RULE improves the naturalness of model outputs, enhances training efficiency, and exhibits strong generalization ability, generalizing refusal behavior to semantically related but unseen queries.
中文: 本文提出强化反学习框架,通过优化拒绝边界实现大语言模型定向信息消除,仅需少量数据即可在保持模型性能的同时提升回答自然度。
English: This paper introduces Reinforcement UnLearning (RULE), an efficient framework that achieves targeted information removal in Large Language Models by optimizing refusal boundaries, using minimal data while preserving model utility and improving response naturalness.
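The verifiable reward can be sketched as a simple rule: reward refusals on forget-related queries and helpfulness elsewhere. The refusal detector and scores below are hypothetical stand-ins for whatever checker the paper uses.

```python
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable")  # toy detector

def rule_reward(query_is_forget: bool, response: str,
                helpful_score: float) -> float:
    """Verifiable reward in the spirit of RULE: reward refusal on
    forget-related queries and helpfulness on permissible ones.
    (Hypothetical scoring; the paper's reward function may differ.)"""
    refused = response.lower().startswith(REFUSAL_MARKERS)
    if query_is_forget:
        return 1.0 if refused else -1.0
    return -1.0 if refused else helpful_score

print(rule_reward(True, "I cannot share that information.", 0.0))   # 1.0
print(rule_reward(False, "Here is how it works: ...", 0.8))         # 0.8
```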
Authors:Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin'ichi Satoh, Michael Felsberg, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan
Abstract:
Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are limited to the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories ranging from lifestyles and festivals to foods and rituals, and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM, along with a large-scale multilingual video training set, will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM, and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.
中文:本文提出了ViMUL-Bench多语言视频基准测试,用于评估涵盖14种语言和多元文化类别的大型多模态模型,同时开发的新型多语言视频LMM显著提升了语言包容性。
English: This paper introduces ViMUL-Bench, a multilingual video benchmark evaluating large multimodal models across 14 languages and diverse cultural categories, along with a new multilingual video LMM that improves language inclusivity.
Authors:Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Shahbaz Khan
Abstract:
Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA
中文摘要:VideoMathQA是一个评估模型跨模态数学推理能力的基准,通过整合视频中的视觉、音频和文本信息,涵盖10个领域,采用专家标注的多步骤问题来应对现实场景中的数学推理挑战。
English Summary: VideoMathQA is a benchmark designed to evaluate models' ability to perform cross-modal mathematical reasoning by integrating visual, audio, and textual information from videos across 10 domains, addressing real-world challenges through expert-annotated questions that require multi-step problem-solving.
Authors:Tianjiao Li, Mengran Yu, Chenyu Shi, Yanjun Zhao, Xiaojing Liu, Qiang Zhang, Qi Zhang, Xuanjing Huang, Jiayin Wang
Abstract:
Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to improve its translations to close this gap. To stabilize training and improve generalizability, we also incorporate a quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.
中文: 本研究提出RIVAL对抗训练框架,通过迭代更新奖励模型与语言模型并融合定性与定量奖励,解决了口语字幕翻译中的性能退化问题,实现了稳定且符合人类评估的翻译质量提升。
English: The study introduces RIVAL, an adversarial training framework that addresses the performance decline in colloquial subtitle translation by aligning the reward model and language model through iterative updates and integrating qualitative and quantitative rewards for stable, human-aligned results.
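Viewed abstractly, the min-max game alternates a reward-model update with an LLM update; the sketch below shows that loop with stub callables standing in for real training steps (all names are hypothetical).

```python
import random

def train_rival(llm_translate, rm_score, update_rm, update_llm,
                corpus, rounds=3):
    """One view of RIVAL's min-max loop: alternately (1) retrain the reward
    model to separate strong from weak translations and (2) update the LLM
    to close that gap. All four callables are hypothetical stand-ins."""
    for _ in range(rounds):
        pairs = []
        for src, strong_ref in corpus:
            weak = llm_translate(src)      # current model output
            pairs.append((src, strong_ref, weak))
        update_rm(pairs)                   # max step: sharpen the RM
        update_llm(corpus, rm_score)       # min step: close the gap

# Toy stubs so the sketch runs end to end
corpus = [("hola", "hello"), ("adios", "goodbye")]
train_rival(lambda s: s[::-1], lambda s, t: random.random(),
            lambda pairs: None, lambda c, rm: None, corpus)
print("completed min-max rounds")
```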
Authors:Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Abstract:
The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT demanded for multi-modal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.
中文: MMR-V基准测试旨在通过远距离多帧推理任务挑战多模态大语言模型,超越简单感知,揭示了当前模型的困难以及现有推理策略带来的有限提升。
English: The MMR-V benchmark is introduced to challenge multimodal large language models with long-range, multi-frame reasoning tasks that go beyond simple perception, revealing current models' struggles and limited gains from existing reasoning strategies.
Authors:Shihan Dou, Ming Zhang, Chenhao Huang, Jiayi Chen, Feng Chen, Shichun Liu, Yan Liu, Chenxiao Liu, Cheng Zhong, Zongzhang Zhang, Tao Gui, Chao Xin, Wei Chengzhi, Lin Yan, Qi Zhang, Yonghui Wu, Xuanjing Huang
Abstract:
We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.
中文: EvaLearn是一个创新基准,通过让大语言模型按顺序解决648项挑战性任务来评估其学习能力和效率,揭示了模型间的表现差异,并强调了超越静态能力的新评估维度。
English: EvaLearn is a novel benchmark that assesses large language models' learning capability and efficiency through sequential problem-solving across 648 challenging tasks, revealing varied performance among models and highlighting a new dimension of evaluation beyond static abilities.
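The sequential protocol can be sketched as a loop in which the model sees its earlier attempts and verdicts before each new problem; the interface below is illustrative, not the benchmark's actual API.

```python
def sequential_eval(model_answer, problems, check):
    """Sequential evaluation in the spirit of EvaLearn: the model sees its
    own earlier attempts and their verdicts before each new problem in a
    sequence (interface is illustrative)."""
    history, correct = [], 0
    for problem in problems:
        answer = model_answer(problem, history)    # may exploit experience
        ok = check(problem, answer)
        history.append((problem, answer, ok))
        correct += ok
    return correct / len(problems)

GOLD = {"2+2": "4", "3+3": "6"}

def toy_model(problem, history):
    # Reuse any answer already verified as correct for the same problem
    for q, a, ok in history:
        if ok and q == problem:
            return a
    return "4"                                      # naive first guess

print(sequential_eval(toy_model, ["2+2", "3+3", "2+2"],
                      lambda q, a: a == GOLD[q]))   # 0.666...
```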
Authors:Fangyu Lei, Jinxiang Meng, Yiming Huang, Tinghong Chen, Yun Zhang, Shizhu He, Jun Zhao, Kang Liu
Abstract:
Table reasoning, encompassing tasks such as table question answering, fact verification, and text-to-SQL, requires precise understanding of structured tabular data, coupled with numerical computation and code manipulation for effective inference. Supervised fine-tuning (SFT) approaches have achieved notable success but often struggle with generalization and robustness due to biases inherent in imitative learning. We introduce Reasoning-Table, the first application of reinforcement learning (RL) to table reasoning, achieving state-of-the-art performance. Through rigorous data preprocessing, reward design, and tailored training strategies, our method leverages simple rule-based outcome rewards to outperform SFT across multiple benchmarks. Unified training across diverse tasks enables Reasoning-Table to emerge as a robust table reasoning large language model, surpassing larger proprietary models like Claude-3.7-Sonnet by 4.0% on table reasoning benchmarks. The approach also achieves excellent performance on text-to-SQL tasks, reaching 68.3% performance on the BIRD dev dataset with a 7B model. Further experiments demonstrate that Reasoning-Table enhances the model's generalization capabilities and robustness.
中文: Reasoning-Table首次将强化学习应用于表格推理,通过基于规则的奖励设计和跨任务统一训练,在多个基准测试中超越监督学习方法及更大模型,实现了最先进的性能。
English: Reasoning-Table introduces the first reinforcement learning approach to table reasoning, achieving state-of-the-art performance by using rule-based rewards and unified training across tasks to surpass supervised methods and larger models in benchmarks.
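A rule-based outcome reward of the kind described is easy to sketch: normalize the predicted answer and compare it with the reference. The normalization details below are illustrative.

```python
import re

def normalize(ans: str) -> str:
    """Lowercase, strip, and canonicalize numbers for comparison."""
    ans = ans.strip().lower().rstrip(".")
    try:
        return str(float(ans.replace(",", "")))
    except ValueError:
        return re.sub(r"\s+", " ", ans)

def outcome_reward(prediction: str, gold: str) -> float:
    """Simple rule-based outcome reward of the kind the abstract describes:
    1 if the final answer matches the reference after normalization, else 0
    (details are illustrative)."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0

print(outcome_reward("1,024", "1024.0"))   # 1.0
print(outcome_reward("Paris ", "paris"))   # 1.0
print(outcome_reward("42", "41"))          # 0.0
```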
Authors:Yuhuan Yang, Chaofan Ma, Zhenjie Mao, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Abstract:
Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba's selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost.
中文: 本文提出MoMa高效适配器框架,通过将Mamba的选择性状态空间建模集成到图像基础模型中,利用新颖的SeqMod操作和分治调制架构实现完整的时空视频理解,在多项基准测试中以更低计算成本展现出优越性能。
English: The paper introduces MoMa, an efficient adapter framework that integrates Mamba's selective state space modeling into image foundation models to achieve full spatial-temporal video understanding through a novel SeqMod operation and Divide-and-Modulate architecture, demonstrating superior performance with reduced computational cost in experiments.
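A rough sketch of modulation-style adaptation: a lightweight temporal model emits scale/shift parameters that modulate frozen per-frame features without overwriting them. MoMa uses Mamba state-space layers; a GRU is substituted here purely so the sketch runs, and this block is not the paper's SeqMod.

```python
import torch
import torch.nn as nn

class SeqModBlock(nn.Module):
    """Toy stand-in for modulation-style adaptation: a lightweight temporal
    model produces scale/shift that modulate frozen per-frame features.
    (MoMa uses Mamba layers; a GRU is substituted so the sketch runs.)"""
    def __init__(self, dim=256):
        super().__init__()
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, frame_feats):                # (B, T, dim) frozen features
        h, _ = self.temporal(frame_feats)
        scale, shift = self.to_scale_shift(h).chunk(2, dim=-1)
        return frame_feats * (1 + scale) + shift   # modulated, residual-style

feats = torch.randn(2, 8, 256)                     # e.g., 8 frames of IFM features
print(SeqModBlock()(feats).shape)                  # torch.Size([2, 8, 256])
```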
Authors:Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Xin Sun, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie
Abstract:
Rare diseases collectively affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains a pervasive challenge. This is largely due to their clinical heterogeneity, low individual prevalence, and the limited familiarity most clinicians have with rare conditions. Here, we introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM), capable of processing heterogeneous clinical inputs. The system generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning that links intermediate analytic steps to verifiable medical evidence.
DeepRare comprises three key components: a central host with a long-term memory module; specialized agent servers responsible for domain-specific analytical tasks, integrating over 40 specialized tools; and web-scale, up-to-date medical knowledge sources, ensuring access to the most current clinical information. This modular and scalable design enables complex diagnostic reasoning while maintaining traceability and adaptability. We evaluate DeepRare on eight datasets. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1,013 diseases. In HPO-based evaluations, DeepRare significantly outperforms 15 other methods, such as traditional bioinformatics diagnostic tools, LLMs, and other agentic systems, achieving an average Recall@1 score of 57.18% and surpassing the second-best method (Reasoning LLM) by a substantial margin of 23.79 percentage points. For multi-modal input scenarios, DeepRare achieves 70.60% Recall@1 compared to Exomiser's 53.20% in 109 cases. Manual verification of reasoning chains by clinical experts yields 95.40% agreement. Furthermore, the DeepRare system has been implemented as a user-friendly web application: http://raredx.cn/doctor
中文摘要:DeepRare是首个基于大语言模型的罕见病诊断智能系统,通过处理多源临床数据生成带透明推理链的诊断假设,在多项评估中展现出卓越的诊断性能并显著优于现有方法。
English Summary: DeepRare is a pioneering LLM-powered diagnostic system that addresses rare disease challenges by generating ranked hypotheses with transparent reasoning chains, demonstrating superior accuracy and outperforming existing methods across multiple evaluations.
Authors:Ziheng Zhao, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie
Abstract:
Automated interpretation of CT images, particularly localizing and describing abnormal findings across multi-plane and whole-body scans, remains a significant challenge in clinical radiology. This work aims to address this challenge through four key contributions: (i) On taxonomy, we collaborate with senior radiologists to propose a comprehensive hierarchical classification system, with 404 representative abnormal findings across all body regions; (ii) On data, we contribute a dataset containing over 14.5K CT images from multiple planes and all human body regions, and meticulously provide grounding annotations for over 19K abnormalities, each linked to the detailed description and cast into the taxonomy; (iii) On model development, we propose OminiAbnorm-CT, which can automatically ground and describe abnormal findings on multi-plane and whole-body CT images based on text queries, while also allowing flexible interaction through visual prompts; (iv) On benchmarks, we establish three representative evaluation tasks based on real clinical scenarios. Through extensive experiments, we show that OminiAbnorm-CT can significantly outperform existing methods on all the tasks and metrics.
中文摘要:本研究通过建立全面分类体系、贡献大规模标注数据集、开发可自动定位描述异常的全能异常CT模型,并设立三项临床评估任务,有效解决了CT图像自动解读难题,实验证明其性能显著优于现有方法。
English Summary: This work tackles the challenge of automated CT image interpretation by introducing a comprehensive taxonomy, a large annotated dataset, the OminiAbnorm-CT model for grounding and describing abnormalities, and three clinical benchmarks, demonstrating superior performance over existing methods.
Authors:Tianjiao Zhang, Fei Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Abstract:
This paper considers the problem of utilizing a large-scale text-to-image diffusion model to tackle the challenging Inexact Segmentation (IS) task. Unlike traditional approaches that rely heavily on discriminative-model-based paradigms or dense visual representations derived from internal attention mechanisms, our method focuses on the intrinsic generative priors in Stable Diffusion (SD). Specifically, we exploit the pattern discrepancies between original images and mask-conditional generated images to facilitate a coarse-to-fine segmentation refinement by establishing a semantic correspondence alignment and updating the foreground probability. Comprehensive quantitative and qualitative experiments validate the effectiveness and superiority of our plug-and-play design, underscoring the potential of leveraging generation discrepancies to model dense representations and encouraging further exploration of generative approaches for solving discriminative tasks.
中文摘要:本文提出一种利用稳定扩散生成先验解决不精确分割问题的新方法,通过分析原始图像与生成图像间的差异实现从粗到细的分割优化,综合实验验证了该方法的有效性和优越性。
English Summary: This paper introduces a novel method that leverages generative priors in Stable Diffusion to address Inexact Segmentation by analyzing discrepancies between original and generated images for coarse-to-fine refinement, demonstrating superior performance through extensive experiments.
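The core intuition, that low generation discrepancy supports the foreground hypothesis while high discrepancy weakens it, can be sketched as a toy per-pixel update; the paper's semantic-correspondence alignment is considerably more involved.

```python
import numpy as np

def refine_mask(image, generated, prob, thresh=0.1, lr=0.5):
    """Coarse-to-fine refinement from generation discrepancy: where the
    mask-conditional generation reproduces the image closely, foreground
    belief rises; large discrepancy lowers it. (A toy update rule; the
    paper's alignment procedure is more involved.)"""
    d = np.abs(image - generated).mean(axis=-1)   # per-pixel discrepancy
    return np.clip(prob + lr * (thresh - d), 0.0, 1.0)

img = np.random.rand(16, 16, 3)
gen = img + 0.05 * np.random.rand(16, 16, 3)      # a faithful generation
p0 = np.full((16, 16), 0.5)
print(refine_mask(img, gen, p0).mean() > 0.5)     # True: belief increased
```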
Authors:Junying Wang, Wenzhe Li, Yalun Wu, Yingji Liang, Yijin Guo, Chunyi Li, Haodong Duan, Zicheng Zhang, Guangtao Zhai
Abstract:
Affordance theory suggests that environments inherently provide action possibilities shaping perception and behavior. While Multimodal Large Language Models (MLLMs) achieve strong performance in vision-language tasks, their ability to perceive affordance, which is crucial for intuitive and safe interactions, remains underexplored. To address this, we introduce A4Bench, a novel benchmark designed to evaluate the affordance perception abilities of MLLMs across two dimensions: 1) Constitutive Affordance, assessing understanding of inherent object properties through 1,282 question-answer pairs spanning nine sub-disciplines, and 2) Transformative Affordance, probing dynamic and contextual nuances (e.g., misleading, time-dependent, cultural, or individual-specific affordance) with 718 challenging question-answer pairs. We evaluate 17 MLLMs (nine proprietary and eight open-source) and compare them to human performance. Results show that proprietary models generally outperform open-source ones, yet all models perform far below humans, especially in transformative affordance. Furthermore, even top-performing models, such as Gemini-2.0-Pro (18.05% overall exact match accuracy), significantly lag behind human performance (best: 85.34%, worst: 81.25%). These findings highlight critical gaps in environmental understanding of MLLMs and provide a foundation for advancing AI systems toward more robust, context-aware interactions.
中文摘要:本研究提出A4Bench基准测试,评估多模态大语言模型的可供性感知能力,发现所有模型表现远逊于人类,尤其在转化型可供性方面,揭示了AI系统环境理解能力的重大缺陷。
English Summary: This study introduces A4Bench, a benchmark evaluating multimodal large language models' affordance perception, revealing they significantly underperform humans, especially in transformative scenarios, highlighting critical gaps in environmental understanding.
Authors:Farong Wen, Yijin Guo, Junying Wang, Jiaohao Xiao, Yingjie Zhou, Chunyi Li, Zicheng Zhang, Guangtao Zhai
Abstract:
The rapid development of Multimodal Large Language Models (MLLMs) has led to a wide range of MLLM applications, and a number of benchmark datasets have sprung up to assess MLLM abilities. However, full-coverage Q&A testing on large-scale data is resource-intensive and time-consuming. To address this issue, we propose the MLLM Interview (MITV) strategy, which aims to quickly obtain MLLM performance metrics by asking fewer questions. First, we construct the interview dataset by augmenting an existing MLLM assessment dataset with difficulty labels based on the performance of several typical MLLMs. Second, we propose an MLLM Interview strategy that obtains an initial estimate of the model's performance by quizzing it on a small number of topics and then continuously probes the model's limits. Extensive experiments show that the MITV strategy performs well on MLLM benchmark datasets and can estimate a model's capability quickly with only a small number of questions and answers.
中文: 本文提出的MLLM Interview (MITV)策略通过构建带难度标注的访谈数据集和渐进式测试方法,能用少量问题快速评估多模态大语言模型的性能极限。
English: The MLLM Interview (MITV) strategy is proposed to efficiently evaluate Multimodal Large Language Models by testing fewer questions with difficulty labels, enabling faster performance assessment while maintaining accuracy.
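The interview strategy can be sketched as an adaptive probe that climbs difficulty levels until the model fails; the one-question-per-level protocol below is a simplification of the paper's strategy.

```python
def interview(model_solves, banks):
    """Adaptive interview in the spirit of MITV: probe one question per
    difficulty level and stop at the first failure, returning the highest
    level passed. (Toy protocol; the paper asks several questions per step.)"""
    level = 0
    for level_bank in banks:             # banks[i]: questions at difficulty i
        if not model_solves(level_bank[0]):
            return level                 # the model's limit is found
        level += 1
    return level

banks = [["easy q"], ["medium q"], ["hard q"], ["expert q"]]
difficulty = {"easy q": 0, "medium q": 1, "hard q": 2, "expert q": 3}
toy_model = lambda q: difficulty[q] < 2  # solves easy and medium only
print(interview(toy_model, banks))       # 2
```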
Authors:Kento Kawaharazuka, Takahiro Hattori, Keita Yoneda, Kei Okada
Abstract:
Musculoskeletal humanoids are robots that closely mimic the human musculoskeletal system, offering various advantages such as variable stiffness control, redundancy, and flexibility. However, their body structure is complex, and muscle paths often significantly deviate from geometric models. To address this, numerous studies have been conducted to learn body schema, particularly the relationships among joint angles, muscle tension, and muscle length. These studies typically rely solely on data collected from the actual robot, but this data collection process is labor-intensive, and learning becomes difficult when the amount of data is limited. Therefore, in this study, we propose a method that applies the concept of Physics-Informed Neural Networks (PINNs) to the learning of body schema in musculoskeletal humanoids, enabling high-accuracy learning even with a small amount of data. By utilizing not only data obtained from the actual robot but also the physical laws governing the relationship between torque and muscle tension under the assumption of correct joint structure, more efficient learning becomes possible. We apply the proposed method to both simulation and an actual musculoskeletal humanoid and discuss its effectiveness and characteristics.
中文: 本研究提出一种物理信息神经网络方法,通过结合实际机器人数据与扭矩-肌肉张力的物理规律,能够在少量数据下高效学习肌骨仿人机器人的身体模式,实现关节与肌肉关系的高精度建模。
English: This study introduces a Physics-Informed Neural Network approach to efficiently learn body schema for musculoskeletal humanoids, enabling high-accuracy modeling of joint-muscle relationships with minimal data by incorporating physical laws alongside robot-collected measurements.
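The physics-informed idea can be sketched as a two-term loss: a data term on measured muscle lengths plus a residual enforcing the virtual-work relation tau = -G(theta)^T f, with the muscle Jacobian G = dl/dtheta obtained by autograd. Network sizes and weighting are illustrative assumptions.

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 3))  # 2 joints -> 3 muscle lengths

def pinn_loss(theta, length_meas, tension, torque, w_phys=1.0):
    """Data term fits measured muscle lengths; physics term enforces the
    virtual-work relation tau = -G(theta)^T f with G = dl/dtheta, obtained
    by autograd. (Network sizes and weighting are illustrative.)"""
    theta = theta.clone().requires_grad_(True)
    length_pred = net(theta)                        # (3,)
    data = ((length_pred - length_meas) ** 2).mean()
    G = torch.stack([torch.autograd.grad(length_pred[i], theta,
                                         create_graph=True)[0]
                     for i in range(3)])            # (3 muscles, 2 joints)
    physics = ((torque + G.T @ tension) ** 2).mean()
    return data + w_phys * physics

theta = torch.tensor([0.3, -0.1])
loss = pinn_loss(theta, torch.randn(3), torch.rand(3), torch.randn(2))
loss.backward()
print(float(loss))
```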
Authors:Yongyu Mu, Jiali Zeng, Bei Li, Xinyan Guan, Fandong Meng, Jie Zhou, Tong Xiao, Jingbo Zhu
Abstract:
Despite recent progress in training long-context reasoning models via reinforcement learning (RL), several open questions and counterintuitive behaviors remain. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in RL, revealing that positive samples mainly facilitate data fitting, whereas negative samples significantly enhance generalization and robustness. Interestingly, training solely on negative samples can rival standard RL training performance. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two straightforward strategies, including relative length rewards and offline sample injection, to better leverage these data and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes, and demonstrate that multiple evaluation runs mitigate this issue.
中文: 本研究揭示强化学习中负样本对泛化性和鲁棒性的关键作用,发现群体相对策略优化的数据低效问题并提出改进方案,同时将模型不稳定性归因于结果模糊的未确定问题,建议通过多次评估缓解该问题。
English: This study reveals that negative samples in reinforcement learning significantly boost generalization and robustness, identifies data inefficiency in group relative policy optimization with solutions to enhance reasoning, and attributes model instability to uncertain problems while recommending multiple evaluations for mitigation.
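The zero-advantage problem and the relative-length fix can be seen in a few lines of group-relative advantage computation; whether shorter or longer responses should be favored is a design choice, and the direction below is a toy assumption.

```python
import numpy as np

def grpo_advantages(rewards, lengths, len_bonus=0.0):
    """Group-relative advantages: standardize rewards within a group of
    rollouts for the same prompt. When all rewards tie (all-correct or
    all-wrong), advantages are zero and the group teaches nothing; a small
    relative-length reward breaks such ties (illustrative implementation,
    and favoring shorter answers is a toy choice)."""
    r = np.asarray(rewards, dtype=float)
    if len_bonus:
        ranks = np.argsort(np.argsort(lengths))          # 0 = shortest
        r = r + len_bonus * (1 - ranks / max(len(r) - 1, 1))
    std = r.std()
    return np.zeros_like(r) if std < 1e-8 else (r - r.mean()) / std

# All four rollouts correct -> zero advantage without the length bonus
print(grpo_advantages([1, 1, 1, 1], [120, 80, 200, 95]))
print(grpo_advantages([1, 1, 1, 1], [120, 80, 200, 95], len_bonus=0.1))
```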
Authors:Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
Abstract:
Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often produces biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. RewardAnything achieves SotA performance on a traditional RM benchmark simply by being given a well-defined principle, and results on RABench show it excels at adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we demonstrate through a case study how to automatically and efficiently align LLMs using only natural language principles.
Chinese Summary: 传统奖励模型依赖固定数据集,难以适应多样任务需求,而新型RewardAnything模型通过动态遵循自然语言原则,无需重新训练即可实现卓越的泛化能力。
English Summary: Reward Models traditionally rely on fixed datasets, limiting their adaptability to varying real-world tasks, but the new RewardAnything model dynamically follows natural language principles, achieving superior generalization without retraining.
Authors:Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Abstract:
Long story generation remains a challenge for existing large language models (LLMs), primarily due to two main factors: (1) discourse coherence, which requires plot consistency, logical coherence, and completeness in long-form generation, and (2) narrative complexity, which requires an interwoven and engaging narrative. To address these challenges, we propose StoryWriter, a multi-agent story generation framework, which consists of three main modules: (1) outline agent, which generates event-based outlines containing rich event plots, characters, and event-event relationships; (2) planning agent, which further details events and plans which events should be written in each chapter to maintain an interwoven and engaging story; and (3) writing agent, which dynamically compresses the story history based on the current event to generate and reflect new plots, ensuring the coherence of the generated story. We conduct both human and automated evaluation, and StoryWriter significantly outperforms existing story generation baselines in both story quality and length. Furthermore, we use StoryWriter to generate LongStory, a dataset containing about 6,000 high-quality long stories with an average length of 8,000 words. We train Llama3.1-8B and GLM4-9B with supervised fine-tuning on LongStory and develop StoryWriter_Llama and StoryWriter_GLM, which demonstrate advanced performance in long story generation.
中文:StoryWriter是一个多智能体框架,通过提纲、规划和写作代理解决生成长篇故事时的连贯性和叙事复杂性难题,在质量和长度上显著优于现有方法。
English: StoryWriter is a multi-agent framework designed to overcome challenges in long story generation by employing outline, planning, and writing agents to ensure coherence and narrative complexity, significantly outperforming existing methods in quality and length.
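A minimal sketch of the three-module pipeline described above; the agent interfaces and the history-compression step are schematic assumptions for illustration, not the released implementation.

    def compress(text, budget=4000):
        """Toy stand-in for dynamic story-history compression: keep the most
        recent `budget` characters so the writing context stays bounded."""
        return text[-budget:]

    def storywriter(outline_agent, planning_agent, writing_agent, premise):
        """Schematic three-stage pipeline: outline -> per-chapter plan -> writing.
        Each agent argument is any callable wrapping an LLM."""
        outline = outline_agent(premise)          # events, characters, relations
        chapter_plans = planning_agent(outline)   # which events go in each chapter
        story, history = [], ""
        for plan in chapter_plans:
            chapter = writing_agent(plan, history)
            story.append(chapter)
            history = compress(history + chapter)  # coherence via compressed history
        return "\n\n".join(story)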
Authors:Zhuo Chen, Jialing He, Jiacheng Wang, Zehui Xiong, Tao Xiang, Liehuang Zhu, Dusit Niyato
Abstract:
Blockchain-based steganography enables data hiding by encoding covert data into specific blockchain transaction fields. However, previous works focus on methods for embedding into specific fields while overlooking embedding during the generation of required fields. In this paper, we propose a generic blockchain-based steganography framework (GBSF). The sender generates required fields such as the amount and fees, into which additional covert data is embedded to enhance channel capacity. Based on GBSF, we design a reversible generative adversarial network (R-GAN) that uses a GAN with a reversible generator to generate the required fields and encodes additional covert data into the input noise of the reversible generator. We then analyze a performance flaw of R-GAN. To further improve performance, we propose R-GAN with Counter-intuitive data preprocessing and Custom activation functions, namely CCR-GAN. The counter-intuitive data preprocessing (CIDP) mechanism reduces decoding errors in the covert data, but it incurs gradient explosion that hinders model convergence. A custom activation function named ClipSigmoid is devised to overcome this problem. Theoretical justification for CIDP and ClipSigmoid is also provided. We further develop a mechanism named T2C, which balances capacity and concealment. We conduct experiments using the transaction amount on the Bitcoin mainnet as the required field to verify feasibility, then apply the proposed schemes to other transaction fields and blockchains to demonstrate scalability. Finally, we evaluate capacity and concealment across various blockchains and transaction fields and explore the trade-off between the two. The results demonstrate that R-GAN and CCR-GAN effectively enhance channel capacity and outperform state-of-the-art works.
中文摘要:本文提出了一种通用区块链隐写框架及其改进版本,通过生成式对抗网络和定制化预处理机制,在保持隐蔽性的同时有效提升了区块链交易中的数据隐藏容量,性能优于现有方法。
English Summary: This paper introduces a generic blockchain-based steganography framework (GBSF) and its enhanced version CCR-GAN, which effectively increase data hiding capacity in blockchain transactions while maintaining concealment, outperforming existing methods.
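The abstract names a custom activation, ClipSigmoid, devised to suppress the gradient explosion introduced by CIDP, but does not define it here. The PyTorch sketch below shows one plausible form (a sigmoid clipped away from its saturating extremes) purely as an illustrative guess.

    import torch
    import torch.nn as nn

    class ClipSigmoid(nn.Module):
        """Illustrative guess at ClipSigmoid: a sigmoid whose output is clamped
        away from 0 and 1, so downstream (e.g., log-based) terms never see
        fully saturated values."""
        def __init__(self, eps: float = 1e-3):
            super().__init__()
            self.eps = eps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.clamp(torch.sigmoid(x), self.eps, 1.0 - self.eps)

    # act = ClipSigmoid(); act(torch.randn(4)) stays inside (eps, 1 - eps).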
Authors:Sha Zhang, Suorong Yang, Tong Xie, Xiangyuan Xue, Zixuan Hu, Rui Li, Wenxi Qu, Zhenfei Yin, Tianfan Fu, Di Hu, Andres M Bran, Nian Ran, Bram Hoex, Wangmeng Zuo, Philippe Schwaller, Wanli Ouyang, Lei Bai, Yanyong Zhang, Lingyu Duan, Shixiang Tang, Dongzhan Zhou
Abstract:
Scientific discovery has long been constrained by human limitations in expertise, physical capability, and sleep cycles. The recent rise of AI scientists and automated laboratories has accelerated both the cognitive and operational aspects of research. However, key limitations persist: AI systems are often confined to virtual environments, while automated laboratories lack the flexibility and autonomy to adaptively test new hypotheses in the physical world. Recent advances in embodied AI, such as generalist robot foundation models, diffusion-based action policies, fine-grained manipulation learning, and sim-to-real transfer, highlight the promise of integrating cognitive and embodied intelligence. This convergence opens the door to closed-loop systems that support iterative, autonomous experimentation and the possibility of serendipitous discovery. In this position paper, we propose the paradigm of Intelligent Science Laboratories (ISLs): a multi-layered, closed-loop framework that deeply integrates cognitive and embodied intelligence. ISLs unify foundation models for scientific reasoning, agent-based workflow orchestration, and embodied agents for robust physical experimentation. We argue that such systems are essential for overcoming the current limitations of scientific discovery and for realizing the full transformative potential of AI-driven science.
Chinese: 智能科学实验室(ISL)的提出融合了认知与具身智能,构建闭环系统以实现自主、自适应的物理实验,从而突破当前科学发现的瓶颈。
English: The emergence of Intelligent Science Laboratories (ISLs) integrates cognitive and embodied AI to create closed-loop systems that enable autonomous, adaptive physical experimentation, overcoming current limitations in scientific discovery.
Authors:Lintao Wang, Encheng Su, Jiaqi Liu, Pengze Li, Peng Xia, Jiabei Xiao, Wenlong Zhang, Xinnan Dai, Xi Chen, Yuan Meng, Mingyu Ding, Lei Bai, Wanli Ouyang, Shixiang Tang, Aoran Wang, Xinzhu Ma
Abstract:
Physics problem-solving is a challenging domain for large AI models, requiring integration of conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. Current evaluation methodologies show notable limitations in capturing the breadth and complexity of undergraduate-level physics, underscoring the need for more rigorous assessments. To this end, we present PhysUniBench, a large-scale multimodal benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs) specifically on undergraduate-level physics problems. PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by a visual diagram. The benchmark includes both open-ended and multiple-choice questions, systematically curated and difficulty-rated through an iterative model-in-the-loop process. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels. Through extensive experiments, we observe that current state-of-the-art models encounter substantial challenges in physics reasoning. For example, GPT-4o mini achieves only about 34.2% accuracy on the proposed PhysUniBench. These results highlight that current MLLMs struggle with advanced physics reasoning, especially on multi-step problems and those requiring precise diagram interpretation. By providing a broad and rigorous assessment tool, PhysUniBench aims to drive progress in AI for Science, encouraging the development of models with stronger physical reasoning, problem-solving skills, and multimodal understanding. The benchmark and evaluation scripts are available at https://prismax-team.github.io/PhysUniBenchmark/.
中文: PhysUniBench作为综合性多模态基准测试,揭示了现有AI模型在物理推理上的明显不足——即便是GPT-4o mini模型在其3304道本科物理题中也仅达34.2%的正确率,凸显了提升多模态理解能力的迫切需求。
English: PhysUniBench is a comprehensive multimodal benchmark for evaluating AI models' physics reasoning, revealing significant limitations in current systems as even top models like GPT-4o mini achieve only 34.2% accuracy on its 3,304 undergraduate-level problems.
Authors:Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang
Abstract:
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves state-of-the-art performance on both the ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536 x 1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.
Chinese: 我们提出了原生分辨率图像合成,这是一种利用原生分辨率扩散变换器(NiT)生成任意分辨率和宽高比图像的新范式,实现了最先进的性能及超越训练数据的零样本泛化能力。
English: We introduce native-resolution image synthesis, a novel paradigm using the Native-resolution diffusion Transformer (NiT) to generate images at any resolution and aspect ratio, achieving state-of-the-art performance and zero-shot generalization beyond training data.
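At the tokenization level, handling native resolutions reduces to turning an image of arbitrary size into a variable-length patch sequence. The NumPy sketch below shows that step alone (patch size and layout are assumptions) and omits the diffusion Transformer itself.

    import numpy as np

    def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
        """Turn an (H, W, C) image into an (H/patch * W/patch, patch*patch*C)
        token sequence; the sequence length varies with input resolution."""
        h, w, c = image.shape
        assert h % patch == 0 and w % patch == 0, "pad to a multiple of patch"
        tokens = image.reshape(h // patch, patch, w // patch, patch, c)
        tokens = tokens.transpose(0, 2, 1, 3, 4)
        return tokens.reshape(-1, patch * patch * c)

    # A 512x512 image and a 768x1360 image yield different sequence lengths,
    # which a native-resolution denoiser must accept instead of resizing to a square.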
Authors:Jiyao Wei, Saiping Guan, Da Li, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
Abstract:
N-ary Knowledge Graphs (NKGs) are a specialized type of knowledge graph designed to efficiently represent complex real-world facts. Unlike traditional knowledge graphs, where a fact typically involves two entities, NKGs can capture n-ary facts containing more than two entities. Link prediction in NKGs aims to predict missing elements within these n-ary facts, which is essential for completing NKGs and improving the performance of downstream applications. This task has recently gained significant attention. In this paper, we present the first comprehensive survey of link prediction in NKGs, providing an overview of the field, systematically categorizing existing methods, and analyzing their performance and application scenarios. We also outline promising directions for future research.
中文: 本文首次对N元知识图谱中的链接预测进行全面综述,系统分类现有方法并展望未来研究方向,旨在完善NKG结构并提升下游应用性能。
English: This paper offers the first comprehensive survey on link prediction in N-ary Knowledge Graphs, systematically categorizing methods and outlining future research directions to enhance NKG completion and application performance.
Authors:Zixuan Li, Wenxuan Liu, Long Bai, Chunmao Zhang, Wei Li, Fenghui Zhang, Quanxin Jin, Ruoyun He, Zhuo Chen, Zhilei Hu, Fei Wang, Bingbing Xu, Xuhui Jiang, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
Abstract:
Deep knowledge analysis tasks always involve the systematic extraction and association of knowledge from large volumes of data, followed by logical reasoning to discover insights. However, to solve such complex tasks, existing deep research frameworks face three major challenges: 1) They lack systematic organization and management of knowledge; 2) They operate purely online, making it inefficient for tasks that rely on shared and large-scale knowledge; 3) They cannot perform complex knowledge computation, limiting their abilities to produce insightful analytical results. Motivated by these, in this paper, we propose a Knowledgeable Deep Research (KDR) framework that empowers deep research with deep knowledge analysis capability. Specifically, it introduces an independent knowledge organization phase to preprocess large-scale, domain-relevant data into systematic knowledge offline. Based on this knowledge, it extends deep research with an additional kind of reasoning step that performs complex knowledge computation in an online manner. To enhance the abilities of LLMs to solve knowledge analysis tasks in the above framework, we further introduce KCII, an LLM that bridges knowledge organization and reasoning via unified code generation. For knowledge organization, it generates instantiation code for predefined classes, transforming data into knowledge objects. For knowledge computation, it generates analysis code and executes it on the above knowledge objects to obtain deep analysis results. Experimental results on more than thirty datasets across six knowledge analysis tasks demonstrate the effectiveness of KCII. Moreover, when integrated into the KDR framework, KCII can generate high-quality reports with insightful analytical results compared to the mainstream deep research framework.
中文摘要:本文提出知识化深度研究(KDR)框架,通过离线知识组织与在线复杂计算解决现有系统缺陷,并引入KCII大语言模型以统一代码生成桥接知识处理环节,从而生成具有深度洞察力的分析报告。
English Summary: The paper introduces a Knowledgeable Deep Research (KDR) framework that addresses limitations in existing systems by organizing knowledge offline and enabling complex computations, enhanced by the KCII LLM which unifies knowledge processing through code generation to produce insightful analytical results.
Authors:Long Bai, Zixuan Li, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng, Tat-Seng Chua
Abstract:
Forecasting over Temporal Knowledge Graphs (TKGs), which predicts future facts based on historical ones, has received much attention. Recent studies have introduced Large Language Models (LLMs) for this task to enhance the models' generalization abilities. However, these models perform forecasting by simultaneously learning two kinds of entangled knowledge in the TKG: (1) general patterns, i.e., invariant temporal structures shared across different scenarios; and (2) scenario information, i.e., factual knowledge involved in a specific scenario, such as entities and relations. As a result, the learning processes of these two kinds of knowledge may interfere with each other, which can impair the generalization abilities of the models. To enhance the generalization ability of LLMs on this task, in this paper, we propose a General-to-Specific learning framework (G2S) that disentangles the learning processes of the above two kinds of knowledge. In the general learning stage, we mask the scenario information in different TKGs and convert it into anonymous temporal structures. After training on these structures, the model is able to capture the general patterns across different TKGs. In the specific learning stage, we inject the scenario information into the structures via either in-context learning or fine-tuning. Experimental results show that G2S effectively improves the generalization abilities of LLMs.
中文总结:本文提出了一种从通用到特定(G2S)的学习框架,通过将时序知识图谱中的通用模式与场景信息解耦学习,显著提升了大型语言模型在时序知识图谱预测任务中的泛化能力。
English Summary: This paper introduces a General-to-Specific (G2S) learning framework that separates the learning of general temporal patterns from scenario-specific information in temporal knowledge graph forecasting, effectively enhancing large language models' generalization capabilities.
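The general-learning stage masks scenario information by renaming entities and relations to anonymous placeholders while preserving the temporal structure. A minimal sketch of that masking step (the quadruple data format and placeholder scheme are assumptions):

    def anonymize(quadruples):
        """Map (subject, relation, object, time) facts to anonymous ids,
        hiding scenario-specific names while keeping the temporal
        co-occurrence structure intact."""
        ent_ids, rel_ids = {}, {}
        def eid(e):
            return ent_ids.setdefault(e, f"ENT_{len(ent_ids)}")
        def rid(r):
            return rel_ids.setdefault(r, f"REL_{len(rel_ids)}")
        return [(eid(s), rid(r), eid(o), t) for s, r, o, t in quadruples]

    # anonymize([("Obama", "visit", "France", 1), ("France", "host", "Obama", 2)])
    # -> [("ENT_0", "REL_0", "ENT_1", 1), ("ENT_1", "REL_1", "ENT_0", 2)]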
Authors:Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
Abstract:
Although end-to-end autonomous driving has made remarkable progress, its performance degrades significantly in rare and long-tail scenarios. Recent approaches attempt to address this challenge by leveraging the rich world knowledge of Vision-Language Models (VLMs), but these methods suffer from several limitations: (1) a significant domain gap between the pre-training data of VLMs and real-world driving data, (2) a dimensionality mismatch between the discrete language space and the continuous action space, and (3) imitation learning that tends to capture the average behavior present in the dataset, which may be suboptimal or even dangerous. In this paper, we propose ReCogDrive, an autonomous driving system that integrates VLMs with a diffusion planner and adopts a three-stage training paradigm. In the first stage, we use a large-scale driving question-answering dataset to train the VLM, mitigating the domain discrepancy between generic content and real-world driving scenarios. In the second stage, we employ a diffusion-based planner to perform imitation learning, mapping representations from the latent language space to continuous driving actions. Finally, we fine-tune the diffusion planner using reinforcement learning with the NAVSIM non-reactive simulator, enabling the model to generate safer, more human-like driving trajectories. We evaluate our approach on the planning-oriented NAVSIM benchmark, achieving a PDMS of 89.6 and setting a new state-of-the-art that surpasses the previous vision-only SOTA by 5.6 PDMS.
Chinese: ReCogDrive是一个将视觉语言模型与扩散规划器相结合的自动驾驶系统,通过驾驶问答预训练、模仿学习和强化学习微调的三阶段训练,在NAVSIM基准上以89.6 PDMS创下新的最优纪录。
English: ReCogDrive is an autonomous driving system that combines VLMs with a diffusion planner through a three-stage pipeline of driving QA pre-training, imitation learning, and reinforcement learning fine-tuning, achieving a new state-of-the-art 89.6 PDMS on the NAVSIM benchmark.
Authors:Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
Abstract:
Recent studies have explored leveraging the world knowledge and cognitive capabilities of Vision-Language Models (VLMs) to address the long-tail problem in end-to-end autonomous driving. However, existing methods typically formulate trajectory planning as a language modeling task, where physical actions are output in the language space, potentially leading to issues such as format-violating outputs, infeasible actions, and slow inference speeds. In this paper, we propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control. Building on this cognitive foundation, we then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner to efficiently generate continuous and stable trajectories. Furthermore, to enhance driving safety and reduce collisions, we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance. Additionally, qualitative results across diverse driving scenarios and DriveBench highlight the model's scene comprehension. All code, model weights, and datasets will be made publicly available to facilitate subsequent research.
Chinese: ReCogDrive提出了一种强化认知框架,通过将视觉语言模型与扩散规划器相结合,统一驾驶理解与规划,解决了语言-动作不匹配问题,并通过优化轨迹生成提升安全性,在自动驾驶中实现了最先进的性能。
English: ReCogDrive introduces a reinforced cognitive framework that integrates vision-language models with a diffusion planner to unify driving understanding and planning, achieving state-of-the-art performance in autonomous driving by addressing language-action mismatches and enhancing safety through optimized trajectory generation.
Authors:Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
Abstract:
We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.
Chinese: Genesis是一个统一框架,能够联合生成具有时空和跨模态一致性的多视角驾驶视频与激光雷达序列,在基准测试中达到最优性能,并有效提升了分割和3D检测等下游任务的效果。
English: Genesis is a unified framework that jointly generates multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency, achieving state-of-the-art performance on benchmarks and enhancing downstream tasks like segmentation and 3D detection.
Authors:Peng Shu, Junhao Chen, Zhengliang Liu, Huaqin Zhao, Xinliang Li, Tianming Liu
Abstract:
The rapid growth of AI, data-intensive science, and digital twin technologies has driven an unprecedented demand for high-performance computing (HPC) across the research ecosystem. While national laboratories and industrial hyperscalers have invested heavily in exascale and GPU-centric architectures, university-operated HPC systems remain comparatively under-resourced. This survey presents a comprehensive assessment of the HPC landscape across U.S. universities, benchmarking their capabilities against Department of Energy (DOE) leadership-class systems and industrial AI infrastructures. We examine over 50 premier research institutions, analyzing compute capacity, architectural design, governance models, and energy efficiency. Our findings reveal that university clusters, though vital for academic research, exhibit significantly lower growth trajectories (CAGR ≈ 18%) than their national (≈ 43%) and industrial (≈ 78%) counterparts. The increasing skew toward GPU-dense AI workloads has widened the capability gap, highlighting the need for federated computing, idle-GPU harvesting, and cost-sharing models. We also identify emerging paradigms, such as decentralized reinforcement learning, as promising opportunities for democratizing AI training within campus environments. Ultimately, this work provides actionable insights for academic leaders, funding agencies, and technology partners to ensure more equitable and sustainable HPC access in support of national research priorities.
中文: 本调查评估了美国高校的高性能计算能力,发现由于GPU密集型人工智能需求,其增长速度远低于国家和工业系统,差距持续扩大,并提出联合计算和成本分摊等解决方案以实现公平访问。
English: This survey assesses the HPC capabilities at U.S. universities, revealing their slower growth and widening gap compared to national and industrial systems due to GPU-intensive AI demands, while proposing solutions like federated computing and cost-sharing for equitable access.
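For reference, the growth figures above are compound annual growth rates; a quick arithmetic check of what the surveyed rates imply over five years:

    def cagr(start: float, end: float, years: float) -> float:
        """Compound annual growth rate: (end/start)**(1/years) - 1."""
        return (end / start) ** (1.0 / years) - 1.0

    # Capacity multiplier after 5 years at the surveyed rates:
    for rate in (0.18, 0.43, 0.78):
        print(f"{rate:.0%} CAGR -> x{(1 + rate) ** 5:.1f} in 5 years")
    # 18% -> x2.3, 43% -> x6.0, 78% -> x17.9

The gap compounds quickly: at these rates, a university cluster grows to roughly 2.3x its capacity in five years while an industrial fleet grows to nearly 18x.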
Authors:Dingzirui Wang, Xuanliang Zhang, Rongyu Cao, Longxu Dou, Xianzhen Luo, Yingwei Ma, Qingfu Zhu, Wanxiang Che, Binhua Li, Fei Huang, Yongbin Li
Abstract:
Generating multiple answers and voting over them is an effective method to mitigate reasoning inconsistencies of large language models (LLMs). Prior works have shown that multiple reasoning formats outperform a single format when generating multiple answers. However, previous works using multiple formats rely on formats labeled by humans, which may be unsuitable for some tasks and incur high labeling costs. To address this issue, we adapt suitable formats to the given tasks by generating and selecting formats. We first propose how to measure the reasoning error when generating multiple answers. Then, we introduce Format-Adapter, which utilizes LLMs to generate and select suitable reasoning formats by minimizing the error measurement we present. We conduct experiments on math and commonsense reasoning tasks, where Format-Adapter achieves a 4.3% performance improvement on average over previous works, demonstrating its effectiveness.
Chinese: Format-Adapter通过生成和选择适合任务的推理格式来最小化错误,在数学和常识推理任务中平均性能提升4.3%,有效改进了大语言模型的推理能力。
English: Format-Adapter enhances LLM reasoning by generating and selecting task-specific reasoning formats to minimize errors, achieving a 4.3% average performance improvement in math and commonsense tasks.
Authors:Yichen Tang, Weihang Su, Yujia Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai
Abstract:
Multi-agent techniques such as role playing or multi-turn debates have been shown to be effective in improving the performance of large language models (LLMs) in downstream tasks. Despite their differences in workflows, existing multi-agent systems constructed from a single base LLM mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss, as one model must down-sample its continuous state vectors to discrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts, but reasoning logic or abstract thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and the token-wise state transition trajectory from one agent to another. In particular, we find that the sequence of state changes after generating each token reflects the information hidden behind the inference process better than the actual state values. We propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi-agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning.
中文: 针对多智能体系统中自然语言通信导致信息损失的问题,我们提出状态差分编码协议,通过同时传递语言标记和状态转移轨迹,在复杂推理任务中实现了最优性能。
English: Multi-agent systems using natural language for communication often suffer from information loss, so we propose a State Delta Encoding protocol that transfers both tokens and state transition trajectories, achieving state-of-the-art performance in complex reasoning tasks.
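The protocol's key quantity is the sequence of state changes after each generated token rather than the raw states themselves. A NumPy sketch of that delta trajectory (the per-token hidden-state layout is an assumption):

    import numpy as np

    def state_deltas(hidden_states: np.ndarray) -> np.ndarray:
        """Given per-token hidden states of shape (seq_len, d_model), return
        the token-wise transitions h_t - h_{t-1}, shape (seq_len - 1, d_model).
        In SDE, these deltas are transferred alongside the natural-language
        tokens instead of the raw state values."""
        return np.diff(hidden_states, axis=0)

    # h = np.random.randn(10, 768); state_deltas(h).shape == (9, 768)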
Authors:Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Reasoning Models to tackle complex tasks. However, the current RLVR paradigm is still not efficient enough, as it works in a trial-and-error manner. To perform better, the model needs to explore the reward space by generating numerous responses and learning from fragmented reward signals, blind to the overall reward patterns. Fortunately, verifiable rewards make a natural language description of the reward function possible, and LLMs have demonstrated strong in-context learning ability. This motivates us to explore whether Large Reasoning Models can benefit from a motivation for the task, i.e., awareness of the reward function, during the reinforcement finetuning process, as humans sometimes do when learning. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning (MeRF), an intuitive yet effective method that enhances reinforcement finetuning of LLMs by "telling LLMs the rules of the game". Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to be aware of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization and thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations demonstrate that MeRF achieves substantial performance gains over the RLVR baseline. Moreover, ablation studies show that MeRF performs better with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement finetuning.
Chinese: 本文提出动机增强强化微调(MeRF)方法,通过将奖励函数描述融入提示来改进大型推理模型的强化微调,利用上下文学习能力使模型生成与优化目标对齐,相比现有方法实现了显著性能提升。
English: The paper introduces Motivation-enhanced Reinforcement Finetuning (MeRF), a method that improves reinforcement fine-tuning of large reasoning models by incorporating reward function descriptions into prompts, leveraging in-context learning to align model generation with optimization objectives and achieve significant performance gains over existing approaches.
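Mechanically, MeRF's modification is prompt-level: prepend the verifiable reward specification to each training query. A schematic sketch (the template wording is an assumption):

    def merf_prompt(reward_spec: str, question: str) -> str:
        """Prepend the natural-language reward specification ("the rules of
        the game") so the model sees its optimization objective in context."""
        return (
            "You will be scored by the following verifiable reward:\n"
            f"{reward_spec}\n\n"
            f"Question: {question}\n"
            "Answer so as to maximize the reward above."
        )

    # RLVR fine-tuning then proceeds unchanged on these motivation-augmented prompts.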
Authors:Zheli Zhou, Chenxu Zhu, Jianghao Lin, Bo Chen, Ruiming Tang, Weinan Zhang, Yong Yu
Abstract:
Developing a single foundation model with the capability to excel across diverse tasks has been a long-standing objective in the field of artificial intelligence. As the wave of general-purpose foundation models sweeps across various domains, their influence has significantly extended to the field of recommendation systems. While recent efforts have explored recommendation foundation models for various generative tasks, they often overlook crucial embedding tasks and struggle with the complexities of multi-task learning, including knowledge sharing & conflict resolution, and convergence speed inconsistencies. To address these limitations, we introduce RecFound, a generative representational learning framework for recommendation foundation models. We construct the first comprehensive dataset for recommendation foundation models covering both generative and embedding tasks across diverse scenarios. Based on this dataset, we propose a novel multi-task training scheme featuring a Task-wise Mixture of Low-rank Experts (TMoLE) to handle knowledge sharing & conflict, a Step-wise Convergence-oriented Sample Scheduler (S2Sched) to address inconsistent convergence, and a Model Merge module to balance the performance across tasks. Experiments demonstrate that RecFound achieves state-of-the-art performance across various recommendation tasks, outperforming existing baselines.
中文摘要:RecFound是一种创新的生成式表征学习框架,通过构建首个覆盖多场景的推荐基础模型综合数据集,并提出包含任务级专家混合、收敛导向调度和模型融合的多任务训练方案,有效解决了知识共享与冲突、收敛不一致等关键问题,在各种推荐任务中实现了最优性能。
English Summary: RecFound is a novel generative representational learning framework that addresses limitations in recommendation foundation models by introducing a comprehensive dataset and a multi-task training scheme with specialized components for knowledge management and convergence optimization, achieving state-of-the-art performance across diverse recommendation tasks.
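A minimal PyTorch sketch of the task-wise mixture-of-low-rank-experts idea: several LoRA branches share a frozen base layer, and a task-conditioned router mixes their updates. Dimensions, router design, and naming are illustrative assumptions rather than the paper's exact TMoLE module.

    import torch
    import torch.nn as nn

    class TMoLELinear(nn.Module):
        """Frozen base linear layer plus a task-routed mixture of LoRA experts."""
        def __init__(self, d_in, d_out, n_experts=4, rank=8, n_tasks=6):
            super().__init__()
            self.base = nn.Linear(d_in, d_out)
            self.base.weight.requires_grad_(False)  # only experts are trained
            self.down = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
            self.up = nn.Parameter(torch.zeros(n_experts, rank, d_out))
            self.router = nn.Embedding(n_tasks, n_experts)  # task id -> expert logits

        def forward(self, x, task_id):
            gates = torch.softmax(self.router(task_id), dim=-1)    # (B, E)
            low = torch.einsum("bd,edr->ber", x, self.down)        # (B, E, r)
            upd = torch.einsum("ber,erd->bed", low, self.up)       # (B, E, d_out)
            return self.base(x) + (gates.unsqueeze(-1) * upd).sum(dim=1)

    # layer = TMoLELinear(512, 512)
    # y = layer(torch.randn(2, 512), torch.tensor([0, 3]))  # per-sample task routing

Because the up-projections start at zero, every expert's update is initially a no-op, the standard LoRA initialization, so training begins from the frozen base behavior.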
Authors:Jiachen Zhu, Menghui Zhu, Renting Rui, Rong Shan, Congmin Zheng, Bo Chen, Yunjia Xi, Jianghao Lin, Weiwen Liu, Ruiming Tang, Yong Yu, Weinan Zhang
Abstract:
The advent of large language models (LLMs), such as GPT, Gemini, and DeepSeek, has significantly advanced natural language processing, giving rise to sophisticated chatbots capable of diverse language-related tasks. The transition from these traditional LLM chatbots to more advanced AI agents represents a pivotal evolutionary step. However, existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. To bridge this gap, this paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective. We provide a detailed analytical framework that clearly differentiates AI agents from LLM chatbots along five key aspects: complex environment, multi-source instructor, dynamic feedback, multi-modal perception, and advanced capability. Further, we categorize existing evaluation benchmarks based on the driving forces of external environments and the resulting advanced internal capabilities. For each category, we delineate relevant evaluation attributes, presented comprehensively in practical reference tables. Finally, we synthesize current trends and outline future evaluation methodologies through four critical lenses: environment, agent, evaluator, and metrics. Our findings offer actionable guidance for researchers, facilitating the informed selection and application of benchmarks in AI agent evaluation, thus fostering continued advancement in this rapidly evolving research domain.
中文: 本文提出了一个系统框架,从五个关键维度区分AI智能体与LLM聊天机器人,并对评估基准进行分类,为研究人员选择合适的评估方法以推动AI智能体发展提供指导。
English: This paper introduces a systematic framework to differentiate AI agents from LLM chatbots across five key dimensions and categorizes evaluation benchmarks to guide researchers in selecting appropriate methodologies for advancing AI agent development.
Authors:Tao Zou, Xinghua Zhang, Haiyang Yu, Minzheng Wang, Fei Huang, Yongbin Li
Abstract:
With the development and widespread application of large language models (LLMs), the new paradigm of "Model as Product" is rapidly evolving and demands higher capabilities to address complex user needs, often requiring precise workflow execution that involves the accurate understanding of multiple tasks. However, existing benchmarks, which focus on single-task environments with limited constraints, lack the complexity required to fully reflect real-world scenarios. To bridge this gap, we present the Extremely Complex Instruction Following Benchmark (EIFBENCH), meticulously crafted to facilitate a more realistic and robust evaluation of LLMs. EIFBENCH not only includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently, but also integrates a variety of constraints, replicating complex operational environments. Furthermore, we propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflows. Evaluations on EIFBENCH have unveiled considerable performance discrepancies in existing LLMs when challenged with these extremely complex instructions. This finding underscores the necessity of ongoing optimization to navigate the intricate challenges posed by LLM applications.
中文:EIFBENCH基准测试旨在通过多任务场景和多样化约束来评估大语言模型在复杂环境下的表现,弥补了现有单任务基准的不足,同时提出的SegPO算法旨在提升模型执行复杂工作流程的能力。
English: EIFBENCH is a benchmark designed to evaluate large language models in complex multi-task scenarios with various constraints, addressing the limitations of existing single-task benchmarks, while the SegPO algorithm is proposed to enhance model performance in executing intricate workflows.
Authors:Weihang Su, Qingyao Ai, Jingtao Zhan, Qian Dong, Yiqun Liu
Abstract:
Retrieval-Augmented Generation (RAG) has become a foundational paradigm for equipping large language models (LLMs) with external knowledge, playing a critical role in information retrieval and knowledge-intensive applications. However, conventional RAG systems typically adopt a static retrieve-then-generate pipeline and rely on in-context knowledge injection, which can be suboptimal for complex tasks that require multihop reasoning, adaptive information access, and deeper integration of external knowledge. Motivated by these limitations, the research community has moved beyond static retrieval and in-context knowledge injection. Among the emerging directions, this tutorial delves into two rapidly growing and complementary research areas on RAG: Dynamic RAG and Parametric RAG. Dynamic RAG adaptively determines when and what to retrieve during the LLM's generation process, enabling real-time adaptation to the LLM's evolving information needs. Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness. This tutorial offers a comprehensive overview of recent advances in these emerging research areas. It also shares theoretical foundations and practical insights to support and inspire further research in RAG.
中文摘要:本教程深入探讨动态RAG(在生成过程中自适应检索)和参数化RAG(将知识注入从输入层转向参数层)这两个新兴研究方向,以解决传统静态RAG系统在复杂推理任务中的局限性。
English Summary: This tutorial explores Dynamic RAG, which adaptively retrieves information during generation, and Parametric RAG, which shifts knowledge injection from input to parameter level, addressing limitations of traditional static RAG systems for complex reasoning tasks.
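Dynamic RAG's defining loop, retrieving only when the generator's own uncertainty spikes mid-generation, can be sketched as follows. The confidence threshold and the stub interfaces are assumptions for illustration, not a specific system from the tutorial.

    def dynamic_rag(llm_step, retriever, query, max_tokens=256, tau=0.5):
        """Generate token by token; when the model's confidence in its next
        token drops below tau, pause, retrieve for the current context, and
        resume generation with the fresh evidence prepended.

        llm_step: callable(context_str) -> (next_token, confidence)
        retriever: callable(query_str) -> list of document strings
        """
        context, output = query, []
        for _ in range(max_tokens):
            token, confidence = llm_step(context + "".join(output))
            if confidence < tau:                       # evolving information need
                docs = retriever(query + "".join(output))
                context = query + "\n" + "\n".join(docs)
                token, _ = llm_step(context + "".join(output))
            output.append(token)
            if token == "<eos>":
                break
        return "".join(output)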
Authors:Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang
Abstract:
In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores to choose demonstrations, incur high computational costs due to repeated retrieval from large-scale datasets for each query. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel pre-selection framework that identifies a compact subset containing the most representative examples in the training data, tailored to specific LLMs. To construct this subset, we introduce "sufficiency" and "necessity" metrics in the pre-selection stage and design a tree-based algorithm to identify representative examples efficiently. Once pre-selected, this representative subset can effectively replace the full training data, improving efficiency while maintaining comparable performance in ICL. Additionally, our pre-selected subset also benefits fine-tuning LLMs, where we introduce a bi-level optimization method that enhances training efficiency without sacrificing performance. Experiments with LLMs ranging from 300M to 8B parameters show that FEEDER can reduce training data size by over 20% while maintaining performance and seamlessly integrating with various downstream demonstration selection strategies in ICL.
中文摘要:FEEDER框架通过预选少量但具有代表性的演示样本,有效降低了上下文学习的计算成本,同时在不同规模的大语言模型中保持性能表现。
English Summary: The FEEDER framework efficiently pre-selects a small yet representative subset of demonstrations for in-context learning, reducing computational costs while maintaining performance across various large language models.
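A schematic rendering of the pre-selection idea: keep a demonstration only when it is "necessary", i.e., the running subset does not already suffice for the LLM to handle it. The covers() test below is a crude stand-in for the paper's sufficiency/necessity metrics and tree-based algorithm.

    def preselect(train_set, covers):
        """Greedy sketch of demonstration pre-selection.

        covers(subset, example) -> True if the current subset already suffices
        for the target LLM to handle `example` (a stand-in for the paper's
        sufficiency test). An example is kept only when it adds information.
        """
        subset = []
        for example in train_set:
            if not covers(subset, example):   # necessary: not yet covered
                subset.append(example)
        return subset

Downstream, ICL demonstration selection then retrieves from this small pre-selected pool instead of the full training set.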
Authors:Thai Hoang, Kung-Hsiang Huang, Shirley Kokane, Jianguo Zhang, Zuxin Liu, Ming Zhu, Jake Grigsby, Tian Lan, Michael S Ryoo, Chien-Sheng Wu, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
Abstract:
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data, especially for multi-step tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high-quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self-generated datasets using our framework achieve significant performance gains, up to a 49.3% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, highlighting its efficiency and effectiveness in speeding up the development of AI agents.
中文: LAM SIMULATOR框架让AI智能体通过实时反馈自主探索任务,生成高质量训练数据,在最少人工干预下将性能提升高达49.3%。
English: LAM SIMULATOR is a framework that enables AI agents to autonomously explore tasks with real-time feedback, generating high-quality training data that boosts performance by up to 49.3% with minimal human input.
Authors:Hang Su, Jun Luo, Chang Liu, Xiao Yang, Yichi Zhang, Yinpeng Dong, Jun Zhu
Abstract:
Recent advances in large language models (LLMs) have catalyzed the rise of autonomous AI agents capable of perceiving, reasoning, and acting in dynamic, open-ended environments. These large-model agents mark a paradigm shift from static inference systems to interactive, memory-augmented entities. While these capabilities significantly expand the functional scope of AI, they also introduce qualitatively novel security risks - such as memory poisoning, tool misuse, reward hacking, and emergent misalignment - that extend beyond the threat models of conventional systems or standalone LLMs. In this survey, we first examine the structural foundations and key capabilities that underpin increasing levels of agent autonomy, including long-term memory retention, modular tool use, recursive planning, and reflective reasoning. We then analyze the corresponding security vulnerabilities across the agent stack, identifying failure modes such as deferred decision hazards, irreversible tool chains, and deceptive behaviors arising from internal state drift or value misalignment. These risks are traced to architectural fragilities that emerge across perception, cognition, memory, and action modules. To address these challenges, we systematically review recent defense strategies deployed at different autonomy layers, including input sanitization, memory lifecycle control, constrained decision-making, structured tool invocation, and introspective reflection. We introduce the Reflective Risk-Aware Agent Architecture (R2A2), a unified cognitive framework grounded in Constrained Markov Decision Processes (CMDPs), which incorporates risk-aware world modeling, meta-policy adaptation, and joint reward-risk optimization to enable principled, proactive safety across the agent's decision-making loop.
中文摘要:大型语言模型的最新进展催生了具备感知、推理和行动能力的自主AI智能体,但这些智能体也带来了记忆污染、工具滥用等新型安全风险,需通过防御策略和风险感知架构加以应对。
English Summary: Recent advances in large language models have enabled autonomous AI agents with enhanced capabilities, but also introduced novel security risks like memory poisoning and tool misuse, which are addressed through defense strategies and a proposed risk-aware architecture.
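For reference, R2A2's grounding in Constrained Markov Decision Processes corresponds to the standard CMDP objective that joint reward-risk optimization instantiates:

    \max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t)\right]
    \quad \text{subject to} \quad
    \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, c_i(s_t,a_t)\right] \le d_i,
    \qquad i=1,\dots,k,

where r is the task reward, each c_i is a cost signal attached to a class of unsafe behavior (e.g., irreversible tool calls), and each d_i is its risk budget; the policy search is restricted to policies feasible under every constraint.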
Authors:Youze Wang, Zijun Chen, Ruoyu Chen, Shishen Gu, Wenbo Hu, Jiayang Liu, Yinpeng Dong, Hang Su, Jun Zhu, Meng Wang, Richang Hong
Abstract:
Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency, and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience, and real-world risk mitigation. While open-source models occasionally outperform them, proprietary models generally exhibit superior credibility, though scaling does not consistently improve performance. These findings underscore the need for greater training data diversity and more robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and the demands of robustness, safety, fairness, and privacy.
中文: Trust-videoLLMs基准首次全面评估23个视频大语言模型在五个可信维度上的表现,揭示了动态场景理解与抗干扰能力的不足,并强调需增强训练数据多样性和多模态对齐。
English: The Trust-videoLLMs benchmark evaluates 23 video large language models across five trustworthiness dimensions, revealing limitations in dynamic scene comprehension and resilience while highlighting the need for improved training data and multimodal alignment.
Authors:Xiao Yang, Jiawei Chen, Jun Luo, Zhengwei Fang, Yinpeng Dong, Hang Su, Jun Zhu
Abstract:
The emergence of multimodal LLM-based agents (MLAs) has transformed interaction paradigms by seamlessly integrating vision, language, action and dynamic environments, enabling unprecedented autonomous capabilities across GUI applications ranging from web automation to mobile systems. However, MLAs introduce critical trustworthiness challenges that extend far beyond traditional language models' limitations, as they can directly modify digital states and trigger irreversible real-world consequences. Existing benchmarks inadequately tackle these unique challenges posed by MLAs' actionable outputs, long-horizon uncertainty and multimodal attack vectors. In this paper, we introduce MLA-Trust, the first comprehensive and unified framework that evaluates the MLA trustworthiness across four principled dimensions: truthfulness, controllability, safety and privacy. We utilize websites and mobile applications as realistic testbeds, designing 34 high-risk interactive tasks and curating rich evaluation datasets. Large-scale experiments involving 13 state-of-the-art agents reveal previously unexplored trustworthiness vulnerabilities unique to multimodal interactive scenarios. For instance, proprietary and open-source GUI-interacting MLAs pose more severe trustworthiness risks than static MLLMs, particularly in high-stakes domains; the transition from static MLLMs into interactive MLAs considerably compromises trustworthiness, enabling harmful content generation in multi-step interactions that standalone MLLMs would typically prevent; multi-step execution, while enhancing the adaptability of MLAs, involves latent nonlinear risk accumulation across successive interactions, circumventing existing safeguards and resulting in unpredictable derived risks. Moreover, we present an extensible toolbox to facilitate continuous evaluation of MLA trustworthiness across diverse interactive environments.
中文: 多模态大语言模型代理在提升数字环境自主能力的同时,引发了严重的可信赖性风险,为此提出的MLA-Trust框架揭示了交互场景中在真实性、可控性、安全性和隐私方面的关键漏洞。
English: Multimodal LLM-based agents (MLAs) enhance autonomous capabilities in digital environments but introduce severe trustworthiness risks, leading to the development of MLA-Trust, a framework that uncovers critical vulnerabilities in interactive scenarios across truthfulness, controllability, safety, and privacy.
Authors:Weijie Yuan, Yuanhao Cui, Jiacheng Wang, Fan Liu, Geng Sun, Tao Xiang, Jie Xu, Shi Jin, Dusit Niyato, Sinem Coleri, Sumei Sun, Shiwen Mao, Abbas Jamalipour, Dong In Kim, Mohamed-Slim Alouini, Xuemin Shen
Abstract:
In this article, we introduce a novel low-altitude wireless network (LAWN), which is a reconfigurable, three-dimensional (3D) layered architecture. In particular, the LAWN integrates connectivity, sensing, control, and computing across aerial and terrestrial nodes that enable seamless operation in complex, dynamic, and mission-critical environments. Different from the conventional aerial communication systems, LAWN's distinctive feature is its tight integration of functional planes in which multiple functionalities continually reshape themselves to operate safely and efficiently in the low-altitude sky. With the LAWN, we discuss several enabling technologies, such as integrated sensing and communication (ISAC), semantic communication, and fully-actuated control systems. Finally, we identify potential applications and key cross-layer challenges. This article offers a comprehensive roadmap for future research and development in the low-altitude airspace.
中文: 本文介绍了一种新型低空无线网络(LAWN),其采用可重构的三维分层架构,集成了连接、感知、控制与计算功能,适用于动态复杂环境,并探讨了相关支撑技术与未来研究方向。
English: This article presents a novel low-altitude wireless network (LAWN) with a reconfigurable 3D layered architecture that integrates connectivity, sensing, control, and computing for dynamic environments, discussing enabling technologies and future research directions.
Authors:Changyuan Zhao, Ruichen Zhang, Jiacheng Wang, Gaosheng Zhao, Dusit Niyato, Geng Sun, Shiwen Mao, Dong In Kim
Abstract:
World models are emerging as a transformative paradigm in artificial intelligence, enabling agents to construct internal representations of their environments for predictive reasoning, planning, and decision-making. By learning latent dynamics, world models provide a sample-efficient framework that is especially valuable in data-constrained or safety-critical scenarios. In this paper, we present a comprehensive overview of world models, highlighting their architecture, training paradigms, and applications across prediction, generation, planning, and causal reasoning. We compare and distinguish world models from related concepts such as digital twins, the metaverse, and foundation models, clarifying their unique role as embedded cognitive engines for autonomous agents. We further propose Wireless Dreamer, a novel world model-based reinforcement learning framework tailored for wireless edge intelligence optimization, particularly in low-altitude wireless networks (LAWNs). Through a weather-aware UAV trajectory planning case study, we demonstrate the effectiveness of our framework in improving learning efficiency and decision quality.
中文: 世界模型正成为人工智能领域的变革性范式,使智能体能够构建环境内部表征以实现高效预测与决策,所提出的Wireless Dreamer框架通过无人机案例研究在无线网络优化中展现出卓越性能。
English: World models are emerging as a transformative AI paradigm that enables agents to build internal environmental representations for efficient prediction and decision-making, with the proposed Wireless Dreamer framework demonstrating superior performance in wireless network optimization through a UAV case study.
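The sample-efficiency argument rests on learning latent dynamics that can be rolled out without touching the real environment. A minimal PyTorch sketch of the core training step (architecture sizes, linear layers, and loss weighting are placeholder assumptions, not the Wireless Dreamer design):

    import torch
    import torch.nn as nn

    class TinyWorldModel(nn.Module):
        """Encoder + latent dynamics + reward head: the minimal world-model core."""
        def __init__(self, obs_dim, act_dim, latent=32):
            super().__init__()
            self.enc = nn.Linear(obs_dim, latent)
            self.dyn = nn.Linear(latent + act_dim, latent)   # z' = f(z, a)
            self.rew = nn.Linear(latent, 1)

        def loss(self, obs, act, next_obs, reward):
            z, z_next = self.enc(obs), self.enc(next_obs)
            z_pred = self.dyn(torch.cat([z, act], dim=-1))
            dyn_loss = (z_pred - z_next.detach()).pow(2).mean()
            rew_loss = (self.rew(z_pred).squeeze(-1) - reward).pow(2).mean()
            return dyn_loss + rew_loss

    # Planning (e.g., weather-aware UAV trajectories) then rolls out dyn/rew in
    # latent space instead of sampling the real, data-constrained environment.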
Authors:Yuchen Li, Hengyi Cai, Rui Kong, Xinran Chen, Jiamin Chen, Jun Yang, Haojie Zhang, Jiayi Li, Jiayi Wu, Yiqun Chen, Changle Qu, Keyi Kong, Wenwen Ye, Lixin Su, Xinyu Ma, Long Xia, Daiting Shi, Jiashu Zhao, Haoyi Xiong, Shuaiqiang Wang, Dawei Yin
Abstract:
In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex multi-stage reasoning tasks. These agents collaborate dynamically through coordinated workflows to evaluate query complexity, decompose problems into executable plans, and orchestrate tool usage, task execution, and content synthesis. We systematically present key methodologies for realizing this paradigm, including task planning and tool integration, execution strategies, aligned and robust retrieval-augmented generation, and efficient LLM inference, spanning both algorithmic techniques and infrastructure-level optimizations. By providing an in-depth guide to these foundational components, this work aims to inform the development of trustworthy, adaptive, and scalable AI search systems.
中文: 本文提出AI搜索范式,通过四个大语言模型智能体的模块化架构动态协作处理各类查询,并系统阐述了实现可信赖、可扩展智能搜索系统的关键技术方法。
English: This paper presents the AI Search Paradigm, a modular framework using four LLM-powered agents that collaboratively handle diverse queries through dynamic workflows, while detailing methodologies for building trustworthy and scalable search systems.
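The four-agent workflow reads as a small orchestration program. The sketch below fixes one plausible control flow (complexity gating by the Master, plan steps executed with tools, synthesis by the Writer); all interfaces are assumptions, not the paper's API.

    def ai_search(master, planner, executor, writer, query):
        """Master gates complexity; Planner decomposes; Executor runs tool
        calls; Writer synthesizes. Each argument is a callable LLM agent."""
        if master(query) == "simple":           # factual query: answer directly
            return writer(query, evidence=[])
        plan = planner(query)                    # list of executable sub-tasks
        evidence = [executor(step) for step in plan]
        return writer(query, evidence=evidence)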
Authors:Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu, Yueqi Duan
Abstract:
The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: https://chen-wl20.github.io/GenWorld
中文摘要:视频生成技术的飞速发展加大了对AI生成视频检测器的需求,为此我们提出了高质量真实世界模拟数据集GenWorld和利用多视角一致性的SpannDetector检测模型,显著提升了检测性能。
English Summary: The rapid advancement of video generation technology has increased the need for effective AI-generated video detectors, leading to the creation of GenWorld, a high-quality real-world simulation dataset, and SpannDetector, a model that uses multi-view consistency to improve detection accuracy.
Authors:Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, Jiwen Lu
Abstract:
Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.
中文摘要:SpectralAR框架通过谱视角实现视觉序列的因果性,采用嵌套谱标记化将图像转化为有序频谱标记进行自回归生成,以少量标记和参数取得了优异性能。
English Summary: The SpectralAR framework introduces a novel autoregressive visual generation method using spectral tokens to ensure causality and efficiency, achieving state-of-the-art results with minimal tokens and parameters.
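The coarse-to-fine causality comes from ordering tokens by frequency. The sketch below mimics that ordering with a plain 2D DCT, grouping coefficients into bands of increasing frequency; the actual Nested Spectral Tokenization is learned, so this fixed transform is only an analogy.

    import numpy as np
    from scipy.fft import dctn

    def spectral_sequence(image: np.ndarray, bands: int = 8):
        """Order the 2D-DCT coefficients of an (H, W) image into `bands`
        groups of increasing frequency, yielding a causal coarse-to-fine
        sequence of coefficient groups."""
        coef = dctn(image, norm="ortho")
        h, w = coef.shape
        u, v = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        freq = u / h + v / w                    # normalized frequency radius
        edges = np.linspace(0, freq.max() + 1e-9, bands + 1)
        return [coef[(freq >= lo) & (freq < hi)]
                for lo, hi in zip(edges[:-1], edges[1:])]

    # Early entries hold low-frequency (coarse) content; an autoregressive model
    # conditions each band on all coarser ones, giving sequence causality.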
Authors:Ziyi Wang, Yongming Rao, Shuofeng Sun, Xinrun Liu, Yi Wei, Xumin Yu, Zuyan Liu, Yanbo Wang, Hongmin Liu, Jie Zhou, Jiwen Lu
Abstract:
Recently, we have witnessed the great success of the generalist model in natural language processing. The generalist model is a general framework trained with massive data and is able to process various downstream tasks simultaneously. Encouraged by their impressive performance, an increasing number of researchers are venturing into the realm of applying these models to computer vision tasks. However, the inputs and outputs of vision tasks are more diverse, and it is difficult to summarize them as a unified representation. In this paper, we provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field. First, we review the background, including the datasets, tasks, and benchmarks. Then, we dig into the design of frameworks that have been proposed in existing research, while also introducing the techniques employed to enhance their performance. To better help the researchers comprehend the area, we take a brief excursion into related domains, shedding light on their interconnections and potential synergies. To conclude, we provide some real-world application scenarios, undertake a thorough examination of the persistent challenges, and offer insights into possible directions for future research endeavors.
中文: 本文全面综述了视觉通用模型,探讨了其框架设计、性能提升技术及实际应用场景,同时分析了当前挑战并展望了未来研究方向。
English: This paper provides a comprehensive overview of vision generalist models, exploring their frameworks, performance enhancement techniques, and real-world applications while addressing current challenges and future research directions.
Authors:Xingzhu Wang, Erhan Zhang, Yiqun Chen, Jinghan Xuan, Yucheng Hou, Yitong Xu, Ying Nie, Shuaiqiang Wang, Dawei Yin, Jiaxin Mao
Abstract:
The conventional Cranfield paradigm struggles to effectively capture user satisfaction due to its weak correlation between relevance and satisfaction, alongside the high costs of relevance annotation in building test collections. To tackle these issues, our research explores the potential of leveraging large language models (LLMs) to generate multilevel usefulness labels for evaluation. We introduce a new user-centric evaluation framework that integrates users' search context and behavioral data into LLMs. This framework uses a cascading judgment structure designed for multilevel usefulness assessments, drawing inspiration from ordinal regression techniques. Our study demonstrates that when well-guided with context and behavioral information, LLMs can accurately evaluate usefulness, allowing our approach to surpass third-party labeling methods. Furthermore, we conduct ablation studies to investigate the influence of key components within the framework. We also apply the labels produced by our method to predict user satisfaction, with real-world experiments indicating that these labels substantially improve the performance of satisfaction prediction models.
Chinese: 本研究提出了一种以用户为中心的评价框架,利用大语言模型结合搜索上下文和行为数据生成多级有用性标签,其准确性优于第三方标注方法,并显著提升了用户满意度预测的性能。
English: The study introduces a user-centric evaluation framework that uses large language models to generate multilevel usefulness labels by incorporating search context and behavioral data, achieving superior accuracy over third-party methods and significantly enhancing user satisfaction prediction.
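The cascading judgment structure can be read as the classic ordinal-regression reduction: a level-k label is decided by a chain of binary "is usefulness at least k?" questions. A toy sketch under that reading follows, with `judge` standing in for an LLM call; the paper's actual prompts also carry search context and behavioral data.

```python
def multilevel_usefulness(judge, query, doc, levels=4):
    """Cascade binary judgments into an ordinal usefulness label.

    `judge(prompt) -> bool` is a stand-in for an LLM call; the prompt
    wording and feature set here are invented for illustration.
    """
    label = 0
    for k in range(1, levels):
        # Ordinal-regression style: ask "usefulness >= k?" at each step.
        if judge(f"Is this result's usefulness at least level {k}?\n"
                 f"Query: {query}\nDoc: {doc}"):
            label = k
        else:
            break  # the cascade stops at the first failed threshold
    return label

# toy judge that accepts levels 1 and 2 only
print(multilevel_usefulness(
    lambda p: "level 1" in p or "level 2" in p, "best laptop", "review"))  # 2
```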
Authors:Yanbo Wang, Ziyi Wang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Abstract:
Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two view images captured directly from a smartphone camera.
中文摘要:OGGSplat提出了一种开放高斯生长方法,利用语义属性进行图像外推,并通过扩散模型的双向控制实现从稀疏视图的语义感知三维场景重建,有效克服了现有方法的局限性。
English Summary: OGGSplat introduces an open Gaussian growing method that uses semantic attributes for image extrapolation and bidirectional control between diffusion models to achieve semantic-aware 3D scene reconstruction from sparse views, addressing limitations of existing approaches.
Authors:Linxuan He, Qing-Shan Jia, Ang Li, Hongyan Sang, Ling Wang, Jiwen Lu, Tao Zhang, Jie Zhou, Yi Zhang, Yisen Wang, Peng Wei, Zhongyuan Wang, Henry X. Liu, Shuo Feng
Abstract:
Embodied AI systems, comprising AI models and physical plants, are increasingly prevalent across various applications. Due to the rarity of system failures, ensuring their safety in complex operating environments remains a major challenge, which severely hinders their large-scale deployment in safety-critical domains, such as autonomous vehicles, medical devices, and robotics. While achieving provable deterministic safety--verifying system safety across all possible scenarios--remains theoretically ideal, the rarity and complexity of corner cases make this approach impractical for scalable embodied AI systems. Instead, empirical safety evaluation is employed as an alternative, but the absence of provable guarantees imposes significant limitations. To address these issues, we argue for a paradigm shift to provable probabilistic safety that integrates provable guarantees with progressive achievement toward a probabilistic safety boundary on overall system performance. The new paradigm better leverages statistical methods to enhance feasibility and scalability, and a well-defined probabilistic safety boundary enables embodied AI systems to be deployed at scale. In this Perspective, we outline a roadmap for provable probabilistic safety, along with corresponding challenges and potential solutions. By bridging the gap between theoretical safety assurance and practical deployment, this Perspective offers a pathway toward safer, large-scale adoption of embodied AI systems in safety-critical applications.
中文: 本文主张具身AI系统的安全验证应从确定性转向可证明的概率性安全,通过结合统计方法与明确的安全边界,为关键领域的大规模应用提供可行路径。
English: This abstract advocates for a shift from deterministic to provable probabilistic safety in embodied AI systems, proposing a framework that combines statistical methods with defined safety boundaries to enable scalable deployment in critical applications.
Authors:Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao, Min Zhang
Abstract:
Large Visual Language Models (LVLMs) have demonstrated impressive capabilities across multiple tasks. However, their trustworthiness is often challenged by hallucinations, which can be attributed to the modality misalignment and the inherent hallucinations of their underlying Large Language Models (LLMs) backbone. Existing preference alignment methods focus on aligning model responses with human preferences while neglecting image-text modality alignment, resulting in over-reliance on LLMs and hallucinations. In this paper, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves enhanced modality alignment compared to existing human preference alignment methods. Besides, to overcome the scarcity of high-quality multimodal preference data, we utilize open-source instruction datasets to automatically construct high-quality preference data across three aspects: image, instruction, and response. Experiments on two human preference datasets and five multimodal hallucination benchmarks demonstrate the effectiveness of EMPO, e.g., reducing hallucination rates by 85.9% on Object-HalBench and 49.8% on MM-HalBench.
Chinese: EMPO作为一种新颖方法,通过自动构建多模态偏好数据增强模态对齐,显著降低大型视觉语言模型的幻觉现象,在多个基准测试中大幅降低了幻觉率。
English: EMPO is a novel method that enhances modality alignment and reduces hallucinations in Large Visual Language Models by automatically constructing multimodal preference data, achieving significant reductions in hallucination rates on benchmarks.
Authors:Lu Wang, Di Zhang, Fangkai Yang, Pu Zhao, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qingwei Lin, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang
Abstract:
User profiling is pivotal for recommendation systems, as it transforms raw user interaction data into concise and structured representations that drive personalized recommendations. While traditional embedding-based profiles lack interpretability and adaptability, recent advances with large language models (LLMs) enable text-based profiles that are semantically richer and more transparent. However, existing methods often adhere to fixed formats that limit their ability to capture the full diversity of user behaviors. In this paper, we introduce LettinGo, a novel framework for generating diverse and adaptive user profiles. By leveraging the expressive power of LLMs and incorporating direct feedback from downstream recommendation tasks, our approach avoids the rigid constraints imposed by supervised fine-tuning (SFT). Instead, we employ Direct Preference Optimization (DPO) to align the profile generator with task-specific performance, ensuring that the profiles remain adaptive and effective. LettinGo operates in three stages: (1) exploring diverse user profiles via multiple LLMs, (2) evaluating profile quality based on their impact in recommendation systems, and (3) aligning the profile generation through pairwise preference data derived from task performance. Experimental results demonstrate that our framework significantly enhances recommendation accuracy, flexibility, and contextual awareness. This work establishes profile generation as a key innovation for next-generation recommendation systems.
中文摘要:LettinGo提出了一种创新框架,利用大语言模型和直接偏好优化技术,通过探索多样用户画像、评估推荐系统影响及基于任务表现对齐生成的三阶段流程,有效提升推荐系统的准确性、灵活性和上下文感知能力。
English Summary: LettinGo introduces a novel framework using LLMs and Direct Preference Optimization to create diverse, adaptive user profiles that significantly improve recommendation accuracy and flexibility by aligning profile generation with task performance through a three-stage process.
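The alignment stage relies on the standard Direct Preference Optimization objective over chosen/rejected profile pairs. A minimal sketch follows, where the log-probabilities are plain tensors and LettinGo's task-derived pairing is abstracted away; only the loss itself is standard.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one preference pair.

    logp_* are summed token log-probs of the chosen (w) and rejected (l)
    profiles under the policy and a frozen reference model. In LettinGo
    the pairs come from downstream recommendation performance; here they
    are just numbers.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# toy usage with fake log-probabilities
lw, ll = torch.tensor([-12.3]), torch.tensor([-15.8])
rw, rl = torch.tensor([-13.0]), torch.tensor([-14.9])
print(dpo_loss(lw, ll, rw, rl))
```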
Authors:Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao
Abstract:
One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
中文:GUI-Actor提出了一种无需坐标的视觉定位方法,通过基于注意力的动作头和验证器在多个GUI基准测试中超越现有模型,仅需微调少量参数即可赋予视觉语言模型精准的定位能力。
English: GUI-Actor introduces a coordinate-free visual grounding method using an attention-based action head and verifier, outperforming prior models on GUI benchmarks while enabling efficient fine-tuning without compromising the VLM's general capabilities.
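The attention-based action head can be pictured as a single dot-product attention between a dedicated action token and all visual patch tokens, yielding a distribution over screen regions in one forward pass. The sketch below is a toy reconstruction of that idea; the dimensions, module structure, and omission of the verifier are our assumptions, not GUI-Actor's implementation.

```python
import torch
import torch.nn as nn

class AttentionActionHead(nn.Module):
    """Toy attention head scoring visual patches against an action token.

    Illustrates the coordinate-free idea: instead of generating x/y text,
    the head produces a distribution over patch tokens directly.
    """
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects the dedicated action token
        self.k = nn.Linear(dim, dim)   # projects the visual patch tokens

    def forward(self, act_token, patch_tokens):
        # act_token: (B, D); patch_tokens: (B, N, D)
        q = self.q(act_token).unsqueeze(1)               # (B, 1, D)
        k = self.k(patch_tokens)                         # (B, N, D)
        scores = (q * k).sum(-1) / k.shape[-1] ** 0.5    # (B, N)
        return scores.softmax(-1)                        # region proposal

head = AttentionActionHead()
probs = head(torch.randn(2, 256), torch.randn(2, 196, 256))
print(probs.shape, probs.sum(-1))  # (2, 196), rows sum to 1
```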
Authors:Ruiyang Xu, Jialun Cao, Mingyuan Wu, Wenliang Zhong, Yaojie Lu, Ben He, Xianpei Han, Shing-Chi Cheung, Le Sun
Abstract:
Large Language Models (LLMs) have shown promise in various tasks, yet few benchmarks assess their capabilities in embedded system development. In this paper, we introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development, such as Embedded System Programmer, Architect, and Integrator. This paradigm enables LLMs to be tested in tasks that bridge the gap between digital and physical systems, allowing for a more comprehensive assessment of their capabilities. To evaluate LLMs on these tasks, we propose Embedbench, the first comprehensive benchmark for embedded system programming, circuit design, and cross-platform migration. Embedbench consists of 126 cases, covering 9 electronic components across 3 hardware platforms. Through extensive experiments on 10 mainstream LLMs, we uncover several key findings. Surprisingly, despite the simplicity of the cases, DeepSeek-R1 achieves only a 55.6% pass@1 rate when provided with schematic information, and 50.0% when tasked with generating the schematics itself. In the cross-platform migration tasks, LLMs show relatively strong performance with MicroPython on the Raspberry Pi Pico (with the top model achieving 73.8% pass@1), but perform poorly on ESP-IDF, where the best model reaches only 29.4% pass@1. Interestingly, we observe that general-purpose chat LLMs like DeepSeek-V3 often fail to utilize relevant pre-trained knowledge in this domain, while reasoning LLMs tend to overthink and overlook efficient knowledge acquired during pretraining. Based on these insights, we propose two strategies, retrieval-augmented generation and compiler feedback, to enhance LLM performance. These strategies result in significant improvements, with DeepSeek-R1 reaching a 65.1% pass@1 with correct schematics, and 53.1% without. Additionally, the accuracy of the Arduino to ESP32 migration task improves from 21.4% to 27.8%.
中文摘要:本文提出了首个嵌入式系统开发综合基准EmbedAgent和Embedbench,通过评估主流大语言模型发现其在嵌入式任务中的性能局限,并提出的增强策略显著提升了模型表现。
English Summary: This paper introduces EmbedAgent and Embedbench, the first comprehensive benchmark for evaluating LLMs in embedded system development, revealing performance gaps and proposing enhancement strategies that significantly improve results.
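For readers unfamiliar with the pass@1 numbers quoted above, such figures are conventionally computed with the unbiased pass@k estimator of Chen et al. (2021). A sketch follows; this is the standard estimator, not necessarily Embedbench's exact harness.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples drawn, c of them correct.

    Returns the probability that at least one of k randomly chosen
    samples (out of the n) passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(10, 5, 1))  # 0.5: half the samples pass, so pass@1 = 0.5
```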
Authors:Jiawei Chen, Xinyan Guan, Qianhao Yuan, Guozhao Mo, Weixiang Zhou, Yaojie Lu, Hongyu Lin, Ben He, Le Sun, Xianpei Han
Abstract:
Current instruction data synthesis methods primarily focus on single-turn instructions and often neglect cross-turn coherence, resulting in context drift and reduced task completion rates in extended conversations. To address this limitation, we propose Skeleton-Guided Multi-Turn Dialogue Generation, a framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It operates in two stages: (1) Intent Modeling, which captures the global structure of human dialogues by assigning each conversation to one of nine well-defined intent trajectories, ensuring a coherent and goal-oriented information flow; and (2) Skeleton Generation, which constructs a structurally grounded sequence of user queries aligned with the modeled intent, thereby serving as a scaffold that constrains and guides the downstream instruction synthesis process. Based on this process, we construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20-30% improvement in chat consistency and up to a 15% increase in task success rate, significantly outperforming models trained on existing single-turn and multi-turn instruction datasets.
中文: 现有指令合成方法在多轮对话中常缺乏跨轮次连贯性,因此我们提出了一种骨架引导的多轮对话生成框架,通过建模对话意图并生成结构化骨架来确保连贯且目标导向的对话,从而显著提升了对话一致性和任务完成率。
English: Existing instruction synthesis methods often fail to maintain cross-turn coherence in multi-turn dialogues, so we propose a Skeleton-Guided Multi-Turn Dialogue Generation framework that models conversational intent and generates structured skeletons to ensure coherent and goal-oriented conversations, resulting in significantly improved consistency and task success rates.
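The two-stage recipe reduces to a simple data shape: an intent trajectory plus an ordered list of user queries, which a model then fills with assistant turns. A schematic sketch with invented names, shown only to make the scaffold concrete; the nine intent trajectories and the synthesis prompts are the paper's own.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueSkeleton:
    """Minimal container for the two-stage process described above."""
    intent: str                                        # one of nine trajectories
    queries: list = field(default_factory=list)        # ordered user turns

def synthesize(skeleton, respond):
    """Fill a skeleton with assistant turns; `respond` stands in for an LLM."""
    dialogue = []
    for q in skeleton.queries:
        dialogue.append({"role": "user", "content": q})
        dialogue.append({"role": "assistant",
                         "content": respond(q, dialogue)})
    return dialogue

sk = DialogueSkeleton(intent="problem-solving",
                      queries=["How do I parse a CSV?", "And with headers?"])
print(synthesize(sk, lambda q, hist: f"(answer to: {q})"))
```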
Authors:Qiming Zhu, Jialun Cao, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung
Abstract:
Current research on large language models (LLMs) with retrieval-augmented code generation (RACG) mainly focuses on single-language settings, leaving cross-lingual effectiveness and security unexplored. Multi-lingual RACG systems are valuable for migrating codebases across programming languages (PLs), yet face risks from error propagation (e.g., adversarial data corruption) in cross-lingual transfer. We construct a dataset spanning 13 PLs with nearly 14k instances to explore the utility and robustness of multi-lingual RACG systems. Our investigation reveals four key insights: (1) Effectiveness: multi-lingual RACG significantly enhances code generation by multi-lingual code LLMs; (2) Inequality: Java demonstrates superior cross-lingual utility over Python in RACG; (3) Robustness: adversarial attacks degrade performance significantly in mono-lingual RACG but have mitigated impact in cross-lingual scenarios; counterintuitively, perturbed code may even improve RACG in cross-lingual scenarios; (4) Specialization: domain-specific code retrievers significantly outperform general text retrievers. These findings establish a foundation for developing effective and secure multi-lingual code assistants.
中文摘要:当前检索增强代码生成研究主要关注单语言环境,而我们的研究发现多语言系统虽显著提升代码生成效果,却面临独特的鲁棒性挑战,尤其在不同编程语言间存在效用差异和安全风险。
English Summary: Current research on retrieval-augmented code generation primarily examines single-language environments, leaving cross-lingual effectiveness and security unaddressed, while our study reveals that multilingual systems significantly enhance code generation but face unique robustness challenges.
Authors:Zhucun Xue, Jiangning Zhang, Teng Hu, Haoyang He, Yinan Chen, Yuxuan Cai, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao
Abstract:
The quality of the video dataset (image quality, resolution, and fine-grained captions) greatly influences the performance of video generation models. The growing demand for video applications sets higher requirements for high-quality video generation models, for example, the generation of movie-level Ultra-High Definition (UHD) videos and the creation of 4K short video content. However, existing public datasets cannot support related research and applications. In this paper, we first propose a high-quality open-sourced UHD-4K (22.4% of which are 8K) text-to-video dataset named UltraVideo, which contains a wide range of topics (more than 100 kinds), and each video has 9 structured captions with one summarized caption (average of 824 words). Specifically, we carefully design a highly automated curation process with four stages to obtain the final high-quality dataset: (i) collection of diverse and high-quality video clips; (ii) statistical data filtering; (iii) model-based data purification; (iv) generation of comprehensive, structured captions. In addition, we expand Wan to UltraWan-1K/-4K, which can natively generate high-quality 1K/4K videos with more consistent text controllability, demonstrating the effectiveness of our data curation. We believe that this work can make a significant contribution to future research on UHD video generation. The UltraVideo dataset and UltraWan models are available at https://xzc-zju.github.io/projects/UltraVideo.
中文: UltraVideo数据集通过提供带结构化描述的超高清4K视频填补了高质量公开视频数据的空白,同时UltraWan模型展现了更强的文本到视频生成能力。
English: The UltraVideo dataset addresses the lack of high-quality public video data by providing UHD-4K videos with structured captions, while UltraWan models demonstrate enhanced text-to-video generation capabilities.
Authors:Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan
Abstract:
As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
中文: 为克服以文本为中心的多模态推理的局限,我们提出了一种新范式,使大型视觉语言模型能通过基本绘图操作进行空间推理,在多种基准测试中实现了显著的性能提升。
English: To overcome the limitations of text-centric multimodal reasoning, we introduce a novel paradigm that enables large vision-language models to perform spatial reasoning through elementary drawing operations, achieving significant performance gains across diverse benchmarks.
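Concretely, the "elementary drawing operations" amount to rendering annotations onto the frame before the next reasoning step. Below is a minimal sketch with Pillow, where the operation format is our invention rather than VILASR's actual action space.

```python
from PIL import Image, ImageDraw

def draw_ops(image, ops):
    """Apply elementary drawing operations to a copy of the image.

    Mirrors the two primitives the abstract names (bounding boxes and
    auxiliary lines); the op dict format is hypothetical.
    """
    canvas = image.copy()
    d = ImageDraw.Draw(canvas)
    for op in ops:
        if op["type"] == "bbox":
            d.rectangle(op["xyxy"], outline="red", width=3)   # annotate a region
        elif op["type"] == "line":
            d.line(op["xy"], fill="blue", width=2)            # auxiliary line
    return canvas

img = Image.new("RGB", (224, 224), "white")
out = draw_ops(img, [{"type": "bbox", "xyxy": (20, 30, 120, 140)},
                     {"type": "line", "xy": (0, 0, 223, 223)}])
out.save("reasoning_step.png")
```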
Authors:Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Bohan Zeng, Yang Shi, Sihan Yang, Pengfei Wan, Qiang Liu, Liang Wang, Tieniu Tan
Abstract:
Recent advancements in multimodal large language models have successfully extended the Reason-Then-Respond paradigm to image-based reasoning, yet video-based reasoning remains an underdeveloped frontier, primarily due to the scarcity of high-quality reasoning-oriented data and effective training methodologies. To bridge this gap, we introduce DarkEventInfer and MixVidQA, two novel datasets specifically designed to stimulate the model's advanced video understanding and reasoning abilities. DarkEventInfer presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues. MixVidQA, on the other hand, presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. Leveraging these carefully curated training samples together with reinforcement learning guided by diverse reward functions, we develop VersaVid-R1, the first versatile video understanding and reasoning model under the Reason-Then-Respond paradigm capable of handling multiple-choice and open-ended question answering, as well as video captioning tasks. Extensive experiments demonstrate that VersaVid-R1 significantly outperforms existing models across a broad spectrum of benchmarks, covering video general understanding, cognitive reasoning, and captioning tasks.
中文摘要:该研究构建了DarkEventInfer与MixVidQA两个数据集以激发模型的高级视频理解与推理能力,并结合多样奖励函数引导的强化学习训练出VersaVid-R1,这是"推理-再回答"范式下首个通用视频理解与推理模型,在视频理解、认知推理和描述任务上显著优于现有模型。
English Summary: The study introduces DarkEventInfer and MixVidQA, two datasets designed to stimulate advanced video understanding and reasoning, and leverages them with reinforcement learning to develop VersaVid-R1, the first versatile Reason-Then-Respond video model, which significantly outperforms existing models across general understanding, cognitive reasoning, and captioning benchmarks.
Authors:Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Weihong Lin, Zekun Wang, Bohan Zeng, Yang Shi, Sihan Yang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan
Abstract:
The "Reason-Then-Respond" paradigm, enhanced by Reinforcement Learning, has shown great promise in advancing Multimodal Large Language Models. However, its application to the video domain has led to specialized models that excel at either question answering (QA) or captioning tasks, but struggle to master both. Naively combining reward signals from these tasks results in mutual performance degradation, which we attribute to a conflict between their opposing task natures. To address this challenge, we propose a novel training framework built upon two intermediate proxy tasks: DarkEventInfer, which presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues; and MixVidQA, which presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. These proxy tasks compel the model to simultaneously develop both holistic, divergent understanding and precise, convergent reasoning capabilities. Embodying this framework, we present VidBridge-R1, the first versatile video reasoning model that effectively bridges the paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within one model, demonstrating the efficacy of our approach in fostering more generalizable and powerful video understanding models.
中文摘要:该研究提出VidBridge-R1训练框架,通过设计中间代理任务解决视频问答与描述任务间的性能冲突,首次在单一模型中实现了对两种任务的有效兼顾与性能提升。
English Summary: The study introduces VidBridge-R1, a novel training framework using proxy tasks to resolve performance conflicts between question answering and captioning in video reasoning models, achieving superior results in both tasks within a single model.
Authors:Xiaochong Lan, Jie Feng, Yizhou Sun, Chen Gao, Jiahuan Lei, Xinlei Shi, Hengliang Luo, Yong Li
Abstract:
Living needs are the needs people generate in their daily lives for survival and well-being. On life service platforms like Meituan, user purchases are driven by living needs, making accurate living need predictions crucial for personalized service recommendations. Traditional approaches treat this prediction as a closed-set classification problem, severely limiting their ability to capture the diversity and complexity of living needs. In this work, we redefine living need prediction as an open-set classification problem and propose PIGEON, a novel system leveraging large language models (LLMs) for unrestricted need prediction. PIGEON first employs a behavior-aware record retriever to help LLMs understand user preferences, then incorporates Maslow's hierarchy of needs to align predictions with human living needs. For evaluation and application, we design a recall module based on a fine-tuned text embedding model that links flexible need descriptions to appropriate life services. Extensive experiments on real-world datasets demonstrate that PIGEON significantly outperforms closed-set approaches on need-based life service recall by an average of 19.37%. Human evaluation validates the reasonableness and specificity of our predictions. Additionally, we employ instruction tuning to enable smaller LLMs to achieve competitive performance, supporting practical deployment.
中文: 本研究将生活需求预测重新定义为开放集分类问题,提出PIGEON系统,利用大语言模型通过理解用户偏好和马斯洛需求层次理论来提升个性化服务推荐效果,实验显示其比传统方法平均提升19.37%。
English: This study redefines living need prediction as an open-set classification problem and introduces PIGEON, a system using large language models to enhance personalized service recommendations by better capturing diverse user needs, achieving a 19.37% improvement over traditional methods.
Authors:Jialei Chen, Xu Zheng, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi
Abstract:
Zero-shot Semantic Segmentation (ZSS) aims to segment both seen and unseen classes using supervision from only seen classes. Beyond adaptation-based methods, distillation-based approaches transfer the vision-language alignment of a vision-language model, e.g., CLIP, to segmentation models. However, such knowledge transfer remains challenging due to: (1) the difficulty of aligning vision-based features with the textual space, which requires combining spatial precision with vision-language alignment; and (2) the semantic gap between CLIP's global representations and the local, fine-grained features of segmentation models. To address challenge (1), we propose Chimera-Seg, which integrates a segmentation backbone as the body and a CLIP-based semantic head as the head, like the Chimera in Greek mythology, combining spatial precision with vision-language alignment. Specifically, Chimera-Seg comprises a trainable segmentation model and a CLIP Semantic Head (CSH), which maps dense features into the CLIP-aligned space. The CSH incorporates a frozen subnetwork and fixed projection layers from the CLIP visual encoder, along with lightweight trainable components. The partial module from the CLIP visual encoder, paired with the segmentation model, retains segmentation capability while easing the mapping to CLIP's semantic space. To address challenge (2), we propose Selective Global Distillation (SGD), which distills knowledge from dense features exhibiting high similarity to the CLIP CLS token, while gradually reducing the number of features used for alignment as training progresses. In addition, we use a Semantic Alignment Module (SAM) to further align dense visual features with semantic embeddings extracted from the frozen CLIP text encoder. Experiments on two benchmarks show improvements of 0.9% and 1.2% in hIoU.
中文: Chimera-Seg通过将分割主干与基于CLIP的语义头结合,解决了零样本语义分割中对齐空间精度与视觉语言特征的难题,并采用选择性全局蒸馏和语义对齐模块来弥合局部与全局表征之间的差距。
English: Chimera-Seg addresses zero-shot semantic segmentation challenges by integrating a segmentation backbone with a CLIP-based semantic head to align spatial precision with vision-language features, while employing selective global distillation and semantic alignment to bridge the gap between local and global representations.
Authors:Mengjiao Ma, Qi Ma, Yue Li, Jiahuan Cheng, Runyi Yang, Bin Ren, Nikola Popovic, Mingqiang Wei, Nicu Sebe, Luc Van Gool, Theo Gevers, Martin R. Oswald, Danda Pani Paudel
Abstract:
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current Language Gaussian Splatting work falls into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches. However, most methods are evaluated only on rendered 2D views of a handful of scenes, at viewpoints close to the training views, limiting insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate that the generalizable approach can harness strong data priors. Our codes, benchmark, and datasets will be made public to accelerate research in generalizable 3DGS scene understanding.
Chinese: 本文提出了首个大规模基准测试,直接在三维空间系统评估三类语言高斯溅射方法,证明了通用化方法的优势,并引入GaussianWorld-49K数据集以推动三维场景理解研究的发展。
English: This paper introduces the first large-scale benchmark to systematically evaluate three categories of Language Gaussian Splatting methods directly in 3D space, revealing the superiority of generalizable approaches and presenting GaussianWorld-49K, a curated dataset to advance 3D scene understanding research.
Authors:Yu Li, Xingyu Qiu, Yuqian Fu, Jie Chen, Tianwen Qian, Xu Zheng, Danda Pani Paudel, Yanwei Fu, Xuanjing Huang, Luc Van Gool, Yu-Gang Jiang
Abstract:
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.
中文: Domain-RAG是一种无需训练的框架,通过背景检索与合成生成领域对齐的图像,有效提升了跨域少样本目标检测的性能,并在多个任务中取得了领先成果。
English: Domain-RAG is a training-free framework that enhances Cross-Domain Few-Shot Object Detection by generating domain-aligned images through background retrieval and composition, achieving state-of-the-art results across various tasks.
Authors:Jialei Chen, Xu Zheng, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi
Abstract:
Utilizing multi-modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality's ability to fully leverage its strengths in different situations. We reformulate multi-modal semantic segmentation as a mask-level classification task and propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA) to maximize modality effectiveness and handle missing modalities. Specifically, BiXFormer first categorizes multi-modal inputs into RGB and X, where X represents any non-RGB modalities, e.g., depth, allowing separate processing for each. This design leverages the well-established pretraining for RGB, while addressing the relative lack of attention to X modalities. Then, we propose UMM, which includes Modality Agnostic Matching (MAM) and Complementary Matching (CM). MAM assigns labels to features from all modalities without considering modality differences, leveraging each modality's strengths. CM then reassigns unmatched labels to remaining unassigned features within their respective modalities, ensuring that each available modality contributes to the final prediction and mitigating the impact of missing modalities. Moreover, to further facilitate UMM, we introduce CMA, which enhances the weaker queries assigned in CM by aligning them with optimally matched queries from MAM. Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method, achieving significant improvements in mIoU of +2.75% and +22.74% over prior art.
中文摘要:提出的BiXFormer将多模态语义分割重构为掩码级分类任务,通过统一模态匹配和跨模态对齐机制充分发挥各模态优势并处理模态缺失问题,在多个基准测试中实现了显著性能提升。
English Summary: The proposed BiXFormer reformulates multi-modal semantic segmentation as a mask-level classification task, integrating Unified Modality Matching and Cross Modality Alignment to maximize modality effectiveness and handle missing modalities, achieving significant performance improvements over existing methods.
Authors:Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, Paolo Rota
Abstract:
Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data understanding. EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space and adaptively reweighs tokens based on their information density for effective fusion. To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs, covering a wide range of perception and reasoning tasks. Extensive experiments demonstrate the effectiveness of EarthMind. It achieves state-of-the-art performance on EarthMind-Bench, surpassing GPT-4o despite being only 4B in scale. Moreover, EarthMind outperforms existing methods on multiple public EO benchmarks, showcasing its potential to handle both multi-granular and multi-sensor challenges in a unified framework.
Chinese: EarthMind提出了一种面向多粒度、多传感器地球观测数据理解的视觉语言框架,通过空间注意力提示增强像素级理解,并以跨模态融合自适应对齐异构模态,以仅4B的规模在EarthMind-Bench上超越GPT-4o,取得最先进性能。
English: EarthMind introduces a vision-language framework for multi-granular and multi-sensor Earth Observation understanding, combining Spatial Attention Prompting for pixel-level understanding with adaptive Cross-modal Fusion, achieving state-of-the-art results on the proposed EarthMind-Bench and surpassing GPT-4o at only 4B scale.
Authors:Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begüm Demir, Nicu Sebe, Paolo Rota
Abstract:
Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs via an innovative hierarchical cross-modal attention (i.e., HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate FusionEO, a 30K-pair dataset with diverse annotations, and establish EarthMind-Bench, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive experiments show that EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.
Chinese: EarthMind提出了一种统一的视觉语言框架,通过分层跨模态注意力整合单传感器和跨传感器地球观测数据,在其构建的FusionEO数据集和EarthMind-Bench基准测试中实现了最先进的性能。
English: EarthMind introduces a unified vision-language framework with hierarchical cross-modal attention to integrate single- and cross-sensor Earth Observation data, achieving state-of-the-art performance on benchmarks like its curated FusionEO dataset and EarthMind-Bench.
Authors:Xue Jiang, Yihong Dong, Zheng Fang, Yingwei Ma, Tangxinyu Wang, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Yongbin Li, Ge Li
Abstract:
LLM4SE has demonstrated significant success, but LLMs' potential memorization of sensitive or outdated training data introduces critical risks to legal compliance, software security, and code quality. LLM unlearning techniques, which can eliminate the influence of undesired data from LLMs in a post-training manner, present a promising solution to address these concerns. While recent efforts in LLM unlearning show effectiveness in natural language, their applicability to source code remains underexplored. Our empirical study reveals that existing LLM unlearning approaches, when applied to source code, cause severe model utility degradation, rendering models practically unusable for code generation. In this paper, we propose PROD, a novel unlearning approach that enables LLMs to forget undesired code content while effectively preserving their code generation capabilities. PROD suppresses the probability of forget data in LLMs' output distribution while promoting candidate distributional components, enabling the model to jointly learn to forget specific content and retain its general capabilities. To facilitate this study, we establish a benchmark for code unlearning evaluation, which includes three critical downstream tasks: copyrighted code unlearning, insecure code unlearning, and deprecated API unlearning. Our evaluation demonstrates that PROD achieves a superior balance between forget quality and model utility compared to existing unlearning approaches across three downstream tasks, while consistently exhibiting improvements when applied to LLMs of varying series. PROD also exhibits superior robustness against adversarial attacks without generating or exposing the data to be forgotten. The results underscore that our approach not only extends the application boundary of unlearning techniques to source code, but also holds significant implications for advancing reliable code generation.
中文: LLM4SE虽成效显著,但大语言模型对敏感数据的记忆可能引发法律、安全及代码质量风险,遗忘技术可解决此问题;然而现有方法应用于代码时会严重损害模型效用,为此提出PROD方法,能在遗忘不良代码内容的同时有效保持代码生成能力,并在各项任务中表现卓越。
English: LLM4SE has shown success, yet LLMs' potential memorization of sensitive data poses legal, security, and quality risks, which can be mitigated by unlearning techniques; however, existing methods degrade code generation utility, prompting the proposal of PROD, a novel approach that effectively forgets undesired code while preserving model capabilities and demonstrating superior performance across tasks.
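As a rough schematic of "suppress the forget data while promoting the remaining candidates", one can penalize the probability mass the model assigns to forget tokens, which implicitly redistributes mass over the rest of the vocabulary. The sketch below is our reading of the abstract, not PROD's actual objective.

```python
import torch
import torch.nn.functional as F

def suppress_promote_loss(logits, forget_ids, mask):
    """Toy forget objective: maximize log(1 - p(forget token)) at flagged
    positions, pushing probability mass onto the remaining candidates.

    A schematic illustration only; PROD's loss differs in its details.
    """
    logp = F.log_softmax(logits, dim=-1)                        # (B, T, V)
    forget_lp = logp.gather(-1, forget_ids.unsqueeze(-1)).squeeze(-1)
    keep_mass = torch.log1p(-forget_lp.exp().clamp(max=1 - 1e-6))
    return -(keep_mass * mask).sum() / mask.sum().clamp(min=1)

logits = torch.randn(2, 5, 100)
ids = torch.randint(0, 100, (2, 5))      # tokens marked for forgetting
mask = torch.ones(2, 5)                  # which positions to forget
print(suppress_promote_loss(logits, ids, mask))
```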
Authors:Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models--encompassing both open-source and proprietary architectures--reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
Chinese: 多模态大语言模型在视觉语言任务中取得进展,但解决现实世界中的歧义仍具挑战,为此引入MUCAR基准,专门评估多模态歧义消解能力,并揭示现有模型与人类水平间的显著差距。
English: MLLMs have advanced in vision-language tasks but struggle with real-world ambiguities, prompting the creation of MUCAR, a benchmark that evaluates multimodal ambiguity resolution and reveals significant performance gaps between models and humans.
Authors:Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. MLLMs have shown promising capability in aligning visual and textual modalities, allowing them to process image-text pairs with clear and explicit meanings. However, resolving the inherent ambiguities present in real-world language and visual contexts remains a challenge. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes first a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and second a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models--encompassing both open-source and proprietary architectures--reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
Chinese: 多模态大语言模型在视觉语言任务中取得进展,但解决现实世界中的歧义仍具挑战,为此引入MUCAR基准,专门评估多模态歧义消解能力,并揭示现有模型与人类水平间的显著差距。
English: MLLMs have advanced in vision-language tasks but struggle with real-world ambiguities, prompting the creation of MUCAR, a benchmark that evaluates multimodal ambiguity resolution and reveals significant performance gaps between models and humans.
Authors:Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Yixiao Ge, Xiu Li, Ying Shan
Abstract:
Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for edge-side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge-side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA-Gen outperforms conventional LoRA fine-tuning, achieving competitive accuracy and a 2.1x speedup with TinyLLaMA-1.1B on reasoning tasks. In addition, our method delivers a compression ratio of 10.1x with Gemma-2B on intelligent agent tasks.
中文: LoRA-Gen框架通过云端模型生成LoRA参数并集成到边缘模型中,无需专门训练即可提升精度和效率。
English: The LoRA-Gen framework enhances edge-side models by generating and integrating LoRA parameters from a cloud-side model, improving both accuracy and efficiency without specialized training.
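The reparameterization step is the standard LoRA merge, W' = W + (alpha/r) B A; the twist in LoRA-Gen is that the low-rank factors come from the cloud-side generator rather than gradient training. A sketch of the merge itself, which is standard:

```python
import torch

def merge_lora(weight, lora_A, lora_B, alpha=16.0):
    """Fold LoRA factors into a base weight: W' = W + (alpha/r) * B @ A.

    In LoRA-Gen the A/B factors would be produced by a cloud-side model
    from a task description; here they are plain tensors.
    """
    r = lora_A.shape[0]                        # LoRA rank
    return weight + (alpha / r) * (lora_B @ lora_A)

W = torch.randn(512, 512)
A, B = torch.randn(8, 512), torch.zeros(512, 8)   # rank-8 factors
W_merged = merge_lora(W, A, B)
assert torch.allclose(W, W_merged)                # B = 0 -> no change
```

After merging, the specialized model runs with no extra adapter latency and no task prompt in the context, which is where the quoted inference speedup comes from.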
Authors:Xuanyu Lei, Chenliang Li, Yuning Wu, Kaiming Liu, Weizhou Shen, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Abstract:
Recent advances in Large Language Models (LLMs) have enabled strong performance in long-form writing, yet existing supervised fine-tuning (SFT) approaches suffer from limitations such as data saturation and restricted learning capacity bounded by teacher signals. In this work, we present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond SFT. The framework consists of three key components: Margin-aware Data Selection strategy that prioritizes samples with high learning potential, Pairwise Comparison Reward mechanism that provides discriminative learning signals in the absence of verifiable rewards, and Dynamic Reference Scheduling approach, which plays a particularly critical role by adaptively adjusting task difficulty based on evolving model performance. Experiments on 7B-scale writer models show that our RL framework largely improves long-form writing performance over strong SFT baselines. Furthermore, we observe that models trained with long-output RL generalize surprisingly well to long-input reasoning tasks, potentially offering a promising perspective for rethinking long-context training.
中文摘要:Writing-RL框架通过自适应课程强化学习,结合三项创新组件显著提升了长文本写作能力,突破了监督微调的限制,同时意外展现出对长输入推理任务的良好泛化能力。
English Summary: The Writing-RL framework uses adaptive curriculum reinforcement learning with three novel components to significantly enhance long-form writing performance beyond supervised fine-tuning limitations, while also demonstrating unexpected generalization to long-input reasoning tasks.
Authors:Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo
Abstract:
This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.
Chinese: 本文提出了一种利用基于流的生成模型作为先验来对齐可学习潜在空间与目标分布的新框架,无需昂贵似然计算和求解常微分方程,并通过理论证明和实验验证了其有效性。
English: This paper introduces a framework that aligns learnable latent spaces with target distributions using flow-based models as priors, eliminating expensive likelihood computations and ODE solving while achieving effective alignment validated through theoretical proofs and experiments.
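For orientation, the conditional flow-matching objective such a prior is typically pretrained with, and the abstract's reformulation with the frozen model and the latent as optimization target, can be written as follows. The linear probability path is our assumption; the paper's exact parameterization may differ.

```latex
% Pretraining: flow matching on target features x_1 ~ p_target,
% with x_0 ~ N(0, I) and the linear path x_t = (1 - t) x_0 + t x_1:
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \bigl\| v_\theta\bigl((1-t)x_0 + t\,x_1,\; t\bigr) - (x_1 - x_0) \bigr\|^2 .
% Alignment: freeze \theta^* and treat the learnable latent z as the
% path endpoint, giving a tractable surrogate without ODE solving:
\mathcal{L}_{\mathrm{align}}(z)
  = \mathbb{E}_{t,\,x_0}
    \bigl\| v_{\theta^*}\bigl((1-t)x_0 + t\,z,\; t\bigr) - (z - x_0) \bigr\|^2 .
```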
Authors:Lu Qiu, Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu
Abstract:
Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by MLLM to produce representations aware of both reference and context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, which highlight the value of our dataset for coherent animated video generation.
Chinese: 近期人工智能生成内容的进展显著加快了动画制作速度,为此我们推出了AnimeShooter数据集,该数据集通过分层标注和自动化流程实现了参考引导的多镜头动画生成,而基于此开发的AnimeShooterGen模型在跨镜头视觉一致性和参考遵循方面表现出色。
English: Recent AI advancements have accelerated animation production, leading to the creation of AnimeShooter, a dataset that provides reference-guided multi-shot animations with hierarchical annotations and strong visual consistency, while AnimeShooterGen establishes a baseline model demonstrating superior coherence and adherence to references.
Authors:Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Jingjing Xu, Hao Zhu, Lecheng Wang, Jia Li, Yihong Dong, Jing Mai, Bin Gu, Zhi Jin
Abstract:
While large language models (LLMs) have demonstrated remarkable reasoning capabilities, they often struggle with complex tasks that require specific thinking paradigms, such as divide-and-conquer and procedural deduction. Previous research integrates external, reliable tools to alleviate logical inconsistencies and hallucinations in LLMs' problem-solving processes. However, we argue that the root challenge is more profound: LLMs lack complex thinking paradigms (i.e., computational thinking) during reasoning. In this paper, we propose the Computational Thinking Model (CTM), a novel framework that incorporates computational thinking paradigms into LLMs. This framework enables LLMs to reformulate complex problems through decomposition, abstraction, reduction, and simulation, among other techniques. Specifically, live code execution is seamlessly integrated into the reasoning process, allowing CTM to think by computing. CTM directly instills computational thinking objectives into LLMs through tailored reinforcement learning rewards, which encourage problem simplification, modular planning, and iterative verification. We conduct extensive evaluations on multiple code generation and mathematical benchmarks. The results demonstrate that CTM outperforms conventional reasoning models and tool-augmented baselines in terms of accuracy, interpretability, and generalizability. We hope this study offers valuable insights for AI reasoning, where LLMs can transform problems into robust, verifiable, and scalable computational workflows, much like computer scientists do.
中文: 计算思维模型(CTM)通过融入分解、抽象和实时代码执行等计算思维范式,显著提升了大语言模型在复杂任务中的准确性、可解释性和泛化能力。
English: The Computational Thinking Model (CTM) enhances large language models by embedding computational thinking paradigms, such as decomposition and live code execution, to improve accuracy and interpretability in solving complex tasks.
Authors:Zheng Zhang, Donglin Yang, Yaqi Xia, Liang Ding, Dacheng Tao, Xiaobo Zhou, Dazhao Cheng
Abstract:
Recently, Mixture-of-Experts (MoE) has become one of the most popular techniques for scaling pre-trained models to extraordinarily large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite existing system and algorithm optimizations, significant challenges remain in communication efficiency and memory consumption.
In this paper, we present the design and implementation of MPipeMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Inspired by that the MoE training procedure can be divided into multiple independent sub-stages, we design adaptive pipeline parallelism with an online algorithm to configure the granularity of the pipelining. Further, we analyze the memory footprint breakdown of MoE training and identify that activations and temporary buffers are the primary contributors to the overall memory footprint. Toward memory efficiency, we propose memory reusing strategies to reduce memory requirements by eliminating memory redundancies, and develop an adaptive selection component to determine the optimal strategy that considers both hardware capacities and model characteristics at runtime. We implement MPipeMoE upon PyTorch and evaluate it with common MoE models in a physical cluster consisting of 8 NVIDIA DGX A100 servers. Compared with the state-of-art approach, MPipeMoE achieves up to 2.8x speedup and reduces memory footprint by up to 47% in training large models.
中文: MPipeMoE是一个高性能库,通过自适应流水线并行和内存优化技术提升专家混合模型的训练效率,实现了显著的加速和内存占用降低。
English: MPipeMoE is a high-performance library that enhances Mixture-of-Experts training through adaptive pipeline parallelism and memory optimization, achieving significant speedup and reduced memory usage.
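The adaptive pipelining can be pictured as splitting each MoE batch into micro-chunks so that the all-to-all communication of one chunk can overlap the expert computation of another; the chunk count is the granularity the online algorithm tunes. A schematic sketch follows; it runs sequentially, since real overlap requires separate CUDA streams of the kind MPipeMoE manages, and the dispatch/combine callables stand in for all-to-all exchanges.

```python
import torch

def pipelined_moe_forward(tokens, dispatch, expert, combine, n_chunks=4):
    """Process an MoE layer in micro-chunks (pipeline granularity = n_chunks).

    In a distributed system, `dispatch`/`combine` would be all-to-all
    sends/receives that overlap `expert` compute across chunks.
    """
    outs = []
    for chunk in tokens.chunk(n_chunks, dim=0):
        routed = dispatch(chunk)        # stand-in for the all-to-all send
        computed = expert(routed)       # expert FFN on the local shard
        outs.append(combine(computed))  # stand-in for all-to-all receive
    return torch.cat(outs, dim=0)

x = torch.randn(64, 16)
y = pipelined_moe_forward(x, lambda t: t, torch.nn.Linear(16, 16), lambda t: t)
print(y.shape)  # (64, 16): same batch, processed in 4 pipeline chunks
```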
Authors:Junting Zhou, Tingjia Miao, Yiyan Liao, Qichao Wang, Zhoufutu Wen, Yanqin Wang, Yunjie Huang, Ge Yan, Leqi Wang, Yucheng Xia, Hongwan Gao, Yuansong Zeng, Renjie Zheng, Chen Dun, Yitao Liang, Tong Yang, Wenhao Huang, Ge Zhang
Abstract:
Advancements in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems with enhanced efficacy. A high-quality benchmark for comprehensive and appropriate assessment is therefore significant, yet existing benchmarks either face the risk of data contamination or cover too few disciplines. Specifically, due to overlap between LLM training data and static benchmarks, answer keys or numerical patterns can be inadvertently memorized (i.e., data contamination), leading to systematic overestimation of reasoning capabilities, especially numerical reasoning. We propose SciDA, a multidisciplinary benchmark that consists exclusively of over 1k Olympiad-level numerical computation problems, allowing randomized numerical initializations for each inference round to avoid reliance on fixed numerical patterns. We conduct a series of experiments with both closed-source and open-source top-performing LLMs, and observe that LLM performance drops significantly under random numerical initialization. Thus, we provide truthful and unbiased assessments of the numerical reasoning capabilities of LLMs. The data is available at https://huggingface.co/datasets/m-a-p/SciDA
中文: SciDA基准通过采用奥林匹克级数值问题和随机初始化设置,解决了大语言模型评估中的数据污染问题,揭示了模型数值推理能力的显著下降。
English: The SciDA benchmark addresses data contamination in LLM evaluation by using Olympic-level numerical problems with randomized initializations, revealing significant performance drops in LLMs' numerical reasoning capabilities.
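To make the randomized-initialization idea concrete, here is a toy sketch (not the SciDA pipeline): a problem template is re-instantiated with fresh numbers each inference round and its ground-truth answer is computed programmatically, so no fixed answer key or number pattern can be memorized.

```python
import random

# Illustrative only: a randomized numerical instance of a template problem.

def sample_instance(seed=None):
    rng = random.Random(seed)
    m = rng.randint(2, 20)          # mass in kg
    v = rng.randint(3, 30)          # speed in m/s
    question = f"A {m} kg object moves at {v} m/s. What is its kinetic energy in joules?"
    answer = 0.5 * m * v ** 2       # ground truth computed, never stored
    return question, answer

q, a = sample_instance(seed=42)
print(q, "->", a)
```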
Authors:Yuyang Wanyan, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Jiabo Ye, Yutong Kou, Ming Yan, Fei Huang, Xiaoshan Yang, Weiming Dong, Changsheng Xu
Abstract:
In recent years, Multimodal Large Language Models (MLLMs) have been extensively utilized for multimodal reasoning tasks, including Graphical User Interface (GUI) automation. Unlike general offline multimodal tasks, GUI automation is executed in online interactive environments, necessitating step-by-step decision-making based on the real-time status of the environment. This task has a lower tolerance for decision-making errors at each step, as any mistake may cumulatively disrupt the process and potentially lead to irreversible outcomes like deletions or payments. To address these issues, we introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution, by reasoning about the potential outcome and correctness of actions. Specifically, we propose a Suggestion-aware Gradient Relative Policy Optimization (S-GRPO) strategy to construct our pre-operative critic model GUI-Critic-R1, incorporating a novel suggestion reward to enhance the reliability of the model's feedback. Furthermore, we develop a reasoning-bootstrapping based data collection pipeline to create the GUI-Critic-Train and GUI-Critic-Test datasets, filling existing gaps in GUI critic data. Static experiments on GUI-Critic-Test across both mobile and web domains reveal that our GUI-Critic-R1 offers significant advantages in critic accuracy compared to current MLLMs. Dynamic evaluation on a GUI automation benchmark further highlights the effectiveness and superiority of our model, as evidenced by improved success rates and operational efficiency.
中文: 本文提出一种术前评判机制,通过S-GRPO策略构建GUI-Critic-R1模型来预判操作结果以提升图形界面自动化可靠性,静态测试与动态评估均验证了其在准确率和执行效率上的显著优势。
English: This paper introduces a pre-operative critic mechanism with the S-GRPO strategy to enhance GUI automation reliability by predicting action outcomes, and demonstrates its superiority through improved accuracy and success rates in both static and dynamic evaluations.
Authors:Guangji Chen, Qingqing Wu, Shihang Lu, Meng Hua, Wen Chen
Abstract:
This paper investigates a multi-intelligent reflecting surface (IRS) aided integrated sensing and communication (ISAC) system, where multiple IRSs are strategically deployed not only to assist the communication from a multi-antenna base station (BS) to a multi-antenna communication user (CU), but also to enable sensing service for a point target in the non-line-of-sight (NLoS) region of the BS. First, we propose a hybrid multi-IRS architecture, which consists of several passive IRSs and one semi-passive IRS equipped with both active sensors and reflecting elements. Specifically, the active sensors are exploited to receive the echo signals for estimating the target's angle information, and the multiple reflecting paths provided by the multi-IRS deployment are employed to improve the degrees of freedom (DoFs) of communication. Under a given budget on the total number of IRS elements, we theoretically show that increasing the number of deployed IRSs is beneficial for improving the DoFs of spatial multiplexing for communication while increasing the Cramer-Rao bound (CRB) of target estimation, which unveils a fundamental tradeoff between sensing and communication performance. To characterize the rate-CRB tradeoff, we study a rate maximization problem, by optimizing the BS transmit covariance matrix, the IRS phase shifts, and the number of deployed IRSs, subject to a maximum CRB constraint. Analytical results reveal that the communication-oriented design becomes optimal when the total number of IRS elements exceeds a certain threshold, wherein the relationships of the rate and CRB with the number of IRS elements/sensors, transmit power, and the number of deployed IRSs are theoretically derived and demystified. Simulation results validate our theoretical findings and also demonstrate the superiority of our proposed designs over the benchmark schemes.
中文摘要:本文研究了一种多智能反射面辅助的集成感知与通信系统,揭示了增加反射面部署在提升通信自由度与提高目标估计误差下界之间的基本权衡关系。
English Summary: This paper explores a multi-intelligent reflecting surface (IRS) system that enhances both communication and sensing capabilities, revealing a fundamental tradeoff where increasing IRS deployment improves communication degrees of freedom but raises target estimation error bounds.
Authors:Yanzhou Mu, Rong Wang, Juan Zhai, Chunrong Fang, Xiang Chen, Zhiyuan Peng, Peiran Yang, Ruixiang Qian, Shaoyu Yang, Zhenyu Chen
Abstract:
Deep Learning (DL) frameworks are a fundamental component of DL development. Therefore, the detection of DL framework defects is important and challenging. As one of the most widely adopted DL testing techniques, model mutation has recently gained significant attention. In this study, we revisit the defect detection ability of existing mutation-based testing methods and investigate the factors that influence their effectiveness. To begin with, we reviewed existing methods and observed that many of them mutate DL models (e.g., changing their parameters) without any customization, ignoring the unique challenges in framework testing. Another issue with these methods is their limited effectiveness, characterized by a high rate of false positives caused by illegal mutations arising from the use of generic, non-customized mutation operators. Moreover, we tracked the defects identified by these methods and discovered that most of them were ignored by developers. Motivated by these observations, we investigate the effectiveness of existing mutation-based testing methods in detecting important defects that have been authenticated by framework developers. We begin by collecting defect reports from three popular frameworks and classifying them based on framework developers' ratings to build a comprehensive dataset. We then perform an in-depth analysis to uncover valuable insights. Based on our findings, we propose optimization strategies to address the shortcomings of existing approaches. Following these optimizations, we identified seven new defects, four of which were confirmed by developers as high-priority issues, with three resolved. In summary, we identified 39 unique defects across just 23 models, of which 31 were confirmed by developers, and eight have been fixed.
中文: 本研究评估了现有基于变异的测试方法在检测深度学习框架缺陷方面的有效性,揭示了其高误报率和开发者忽视等局限性,并提出优化策略,成功发现39个独特缺陷且获得开发者高度确认。
English: This study evaluates the effectiveness of existing mutation-based testing methods for detecting defects in deep learning frameworks, identifies their limitations such as high false positives and developer neglect, and proposes optimizations that successfully uncovered 39 unique defects with high developer confirmation rates.
Authors:Yanzhou Mu, Rong Wang, Juan Zhai, Chunrong Fang, Xiang Chen, Jiacong Wu, An Guo, Jiawei Shen, Bingzhuo Li, Zhenyu Chen
Abstract:
Large language models (LLMs) have driven significant progress across a wide range of real-world applications. Realizing such models requires substantial system-level support. Deep learning (DL) frameworks provide this foundation by enabling efficient model construction, distributed execution, and optimized deployment. The large parameter scale and extended execution cycles impose exacting demands on deep learning frameworks, particularly in terms of scalability, stability, and efficiency. Therefore, poor usability, limited functionality, and subtle bugs in DL frameworks may hinder development efficiency and cause severe failures or resource waste. However, a fundamental question has not been thoroughly investigated in previous studies, i.e., what challenges do DL frameworks face in supporting LLMs? To answer this question, we analyze issue reports from three major DL frameworks (i.e., MindSpore, PyTorch, and TensorFlow) and eight associated LLM toolkits such as Megatron. Based on a manual review of these reports, we construct a taxonomy that captures LLM-centric framework bugs, user requirements, and user questions. We then refine and enrich this taxonomy through interviews with 11 LLM users and eight DL framework developers. Based on the constructed taxonomy and findings summarized from interviews, our study further reveals key technical challenges and mismatches between LLM user needs and developer priorities.
中文: 大型语言模型(LLMs)需要强大的深度学习框架来支持高效开发和部署,然而通过分析问题报告及与用户和开发者的访谈发现,可用性问题、系统缺陷以及用户需求与开发者优先级之间的不匹配等挑战影响了其效能。
English: Large language models (LLMs) require robust deep learning frameworks for efficient development and deployment, yet challenges like usability issues, bugs, and mismatches between user needs and developer priorities hinder their effectiveness, as revealed through an analysis of issue reports and interviews with users and developers.
Authors:An Guo, Xinyu Gao, Chunrong Fang, Haoxiang Tian, Weisong Sun, Yanzhou Mu, Shuncheng Tang, Lei Ma, Zhenyu Chen
Abstract:
Accurately perceiving complex driving environments is essential for ensuring the safe operation of autonomous vehicles. With the tremendous progress in deep learning and communication technologies, cooperative perception with Vehicle-to-Everything (V2X) technologies has emerged as a solution to overcome the limitations of single-agent perception systems in perceiving distant objects and occlusions. Despite the considerable advancements, V2X cooperative perception systems require thorough testing and continuous enhancement of system performance. Given that V2X driving scenes entail intricate communications with multiple vehicles across various geographic locations, creating V2X test scenes for these systems poses a significant challenge. Moreover, current testing methodologies rely on manual data collection and labeling, which are both time-consuming and costly.
In this paper, we design and implement V2XGen, an automated testing generation tool for V2X cooperative perception systems. V2XGen utilizes a high-fidelity approach to generate realistic cooperative object instances and strategically place them within the background data in crucial positions. Furthermore, V2XGen adopts a fitness-guided V2X scene generation strategy for the transformed scene generation process and improves testing efficiency. We conduct experiments on V2XGen using multiple cooperative perception systems with different fusion schemes to assess its performance on various tasks. The experimental results demonstrate that V2XGen is capable of generating realistic test scenes and effectively detecting erroneous behaviors in different V2X-oriented driving conditions. Furthermore, the results validate that retraining systems under test with the generated scenes can enhance average detection precision while reducing occlusion and long-range perception errors.
中文:V2XGen是一种自动化工具,旨在高效生成V2X协同感知系统的真实测试场景,通过利用这些场景重新训练系统,能有效识别错误并提升检测精度。
English: V2XGen is an automated tool designed to efficiently generate realistic test scenes for V2X cooperative perception systems, effectively identifying errors and improving detection accuracy by retraining with these scenes.
Authors:Shengcheng Yu, Yuchen Ling, Chunrong Fang, Quan Zhou, Chunyang Chen, Shaomin Zhu, Zhenyu Chen
Abstract:
Quality assurance for mobile app GUIs is increasingly significant. Automated GUI testing approaches of different strategies have been developed, yet huge gaps remain between these approaches and the app business logic: they do not take the completion of specific testing scenarios as the exploration target, causing the exploration to miss critical app functionalities. Learning from manual testing, which takes testing scenarios with app business logic as the basic granularity, in this paper we utilize LLMs to understand the semantics presented in the app GUI and how they map to the testing context of specific testing scenarios. Scenario-based GUI tests are then generated with the guidance of multi-agent collaboration. Specifically, we propose ScenGen, a novel LLM-guided scenario-based GUI testing approach involving five agents that respectively take responsibility for different phases of the manual testing process. The Observer perceives the app GUI state by extracting GUI widgets and forming GUI layouts, understanding the expressed semantics. The app GUI info is then sent to the Decider, which makes decisions on target widgets based on the target testing scenarios, taking the completion of specific testing scenarios as the exploration target. The Executor then executes the required operations on the apps. The execution results are checked by the Supervisor for consistency with the completion target of the testing scenarios, ensuring the traceability of test generation and execution. Furthermore, the corresponding GUI test operations are recorded to the context memory by the Recorder as an important basis for further decision-making, while monitoring for runtime bug occurrences. ScenGen is evaluated, and the results show that it can effectively generate scenario-based GUI tests guided by LLMs.
中文: 本文提出了ScenGen,一种基于大语言模型引导的多智能体协作方法,通过理解应用语义并以特定测试场景为目标生成基于场景的GUI测试,有效弥合了自动化测试与业务逻辑之间的差距。
English: This paper introduces ScenGen, a novel LLM-guided approach that employs multi-agent collaboration to generate scenario-based GUI tests by understanding app semantics and targeting specific testing scenarios, effectively bridging the gap between automated testing and business logic.
Authors:Yugeng Liu, Zheng Li, Hai Huang, Michael Backes, Yang Zhang
Abstract:
Machine learning (ML) models are proving to be vulnerable to a variety of attacks that allow the adversary to learn sensitive information, cause mispredictions, and more. While these attacks have been extensively studied, current research predominantly focuses on analyzing each attack type individually. In practice, however, adversaries may employ multiple attack strategies simultaneously rather than relying on a single approach. This prompts a crucial yet underexplored question: When the adversary has multiple attacks at their disposal, are they able to mount or amplify the effect of one attack with another? In this paper, we take the first step in studying the strategic interactions among different attacks, which we define as attack compositions. Specifically, we focus on four well-studied attacks during the model's inference phase: adversarial examples, attribute inference, membership inference, and property inference. To facilitate the study of their interactions, we propose a taxonomy based on three stages of the attack pipeline: preparation, execution, and evaluation. Using this taxonomy, we identify four effective attack compositions, such as property inference assisting attribute inference at its preparation level and adversarial examples assisting property inference at its execution level. We conduct extensive experiments on the attack compositions using three ML model architectures and three benchmark image datasets. Empirical results demonstrate the effectiveness of these four attack compositions. We implement and release a modular reusable toolkit, COAT. Arguably, our work serves as a call for researchers and practitioners to consider advanced adversarial settings involving multiple attack strategies, aiming to strengthen the security and robustness of AI systems.
中文: 本文开创性地研究了机器学习中多种攻击策略的组合使用,通过提出分类框架和工具包揭示了攻击间协同增强的威胁,呼吁学界关注复合攻击以提升人工智能系统的安全防护。
English: This paper pioneers the study of attack compositions in machine learning, where adversaries strategically combine multiple inference-phase attacks to amplify their effectiveness, and introduces a taxonomy and toolkit to advance research on securing AI systems against such multi-strategy threats.
Authors:Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang
Abstract:
Recent reasoning large language models (LLMs), such as OpenAI o1 and DeepSeek-R1, exhibit strong performance on complex tasks through test-time inference scaling. However, prior studies have shown that these models often incur significant computational costs due to excessive reasoning, such as frequent switching between reasoning trajectories (e.g., underthinking) or redundant reasoning on simple questions (e.g., overthinking). In this work, we expose a novel threat: adversarial inputs can be crafted to exploit excessive reasoning behaviors and substantially increase computational overhead without compromising model utility. Therefore, we propose a novel loss framework consisting of three components: (1) Priority Cross-Entropy Loss, a modification of the standard cross-entropy objective that emphasizes key tokens by leveraging the autoregressive nature of LMs; (2) Excessive Reasoning Loss, which encourages the model to initiate additional reasoning paths during inference; and (3) Delayed Termination Loss, which is designed to extend the reasoning process and defer the generation of final outputs. We optimize and evaluate our attack for the GSM8K and ORCA datasets on DeepSeek-R1-Distill-LLaMA and DeepSeek-R1-Distill-Qwen. Empirical results demonstrate a 3x to 9x increase in reasoning length with comparable utility performance. Furthermore, our crafted adversarial inputs exhibit transferability, inducing computational overhead in o3-mini, o1-mini, DeepSeek-R1, and QWQ models.
中文: 近期推理大语言模型如OpenAI o1和DeepSeek-R1因过度推理导致计算效率低下,对抗性输入可被设计来利用此缺陷,在不影响输出质量的情况下显著增加处理时间;通过提出的新型损失框架,推理长度可延长3至9倍且保持性能。
English: Recent reasoning LLMs like OpenAI o1 and DeepSeek-R1 face computational inefficiencies from excessive reasoning, which adversarial inputs can exploit to significantly increase processing time without affecting output quality, as demonstrated by a novel loss framework that extends reasoning length by 3x to 9x while maintaining utility.
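The three loss components lend themselves to a short sketch. The PyTorch fragment below is a hedged reconstruction from the abstract's descriptions, not the paper's exact formulation: the shapes, the key-token mask, and the loss weights are assumptions, and in a real attack the gradients would flow into adversarial input embeddings rather than raw logits.

```python
import torch
import torch.nn.functional as F

def priority_ce(logits, targets, key_mask, key_weight=2.0):
    # logits: (T, V), targets: (T,), key_mask: (T,) bool marking emphasized tokens.
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = key_mask.float() * (key_weight - 1.0) + 1.0  # key tokens weighted up
    return (weights * per_token).mean()

def delayed_termination(logits, eos_id):
    # Mean probability assigned to EOS across steps; minimizing it as part of
    # the adversarial objective defers termination and extends the trace.
    return F.log_softmax(logits, dim=-1)[:, eos_id].exp().mean()

T, V, eos_id = 16, 100, 0
logits = torch.randn(T, V, requires_grad=True)   # stand-in for model outputs
targets = torch.randint(1, V, (T,))
key_mask = torch.zeros(T, dtype=torch.bool)
key_mask[:4] = True                              # assume the first tokens are "key"
loss = priority_ce(logits, targets, key_mask) + 0.1 * delayed_termination(logits, eos_id)
loss.backward()
```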
Authors:Yugeng Liu, Tianshuo Cong, Michael Backes, Zheng Li, Yang Zhang
Abstract:
Large Language Models (LLMs) have experienced rapid advancements, with applications spanning a wide range of fields, including sentiment classification, review generation, and question answering. Due to their efficiency and versatility, researchers and companies increasingly employ LLM-generated data to train their models. However, the inability to track content produced by LLMs poses a significant challenge, potentially leading to copyright infringement for the LLM owners. In this paper, we propose a method for injecting watermarks into LLM-generated datasets, enabling the tracking of downstream tasks to detect whether these datasets were produced using the original LLM. These downstream tasks can be divided into two categories. The first involves using the generated datasets at the input level, commonly for training classification tasks. The other is the output level, where model trainers use LLM-generated content as output for downstream tasks, such as question-answering tasks. We design a comprehensive set of experiments to evaluate both watermark methods. Our results indicate the high effectiveness of our watermark approach. Additionally, regarding model utility, we find that classifiers trained on the generated datasets achieve a test accuracy exceeding 0.900 in many cases, suggesting that the utility of such models remains robust. For the output-level watermark, we observe that the quality of the generated text is comparable to that produced using real-world datasets. Through our research, we aim to advance the protection of LLM copyrights, taking a significant step forward in safeguarding intellectual property in this domain.
中文: 本文提出一种针对大语言模型生成数据集的水印技术,可在下游任务中追踪数据来源以保护版权,同时确保模型性能稳定且生成文本质量优良。
English: This paper introduces a watermarking technique for LLM-generated datasets to track their use in downstream tasks, effectively protecting copyright while maintaining high model utility and text quality.
Authors:Rui Wen, Yiyong Liu, Michael Backes, Yang Zhang
Abstract:
Data reconstruction attacks, which aim to recover the training dataset of a target model with limited access, have gained increasing attention in recent years. However, there is currently no consensus on a formal definition of data reconstruction attacks or appropriate evaluation metrics for measuring their quality. This lack of rigorous definitions and universal metrics has hindered further advancement in this field. In this paper, we address this issue in the vision domain by proposing a unified attack taxonomy and formal definitions of data reconstruction attacks. We first propose a set of quantitative evaluation metrics that consider important criteria such as quantifiability, consistency, precision, and diversity. Additionally, we leverage large language models (LLMs) as a substitute for human judgment, enabling visual evaluation with an emphasis on high-quality reconstructions. Using our proposed taxonomy and metrics, we present a unified framework for systematically evaluating the strengths and limitations of existing attacks and establishing a benchmark for future research. Empirical results, primarily from a memorization perspective, not only validate the effectiveness of our metrics but also offer valuable insights for designing new attacks.
中文: 本文针对视觉领域的数据重建攻击提出了统一分类和形式化定义,通过引入量化评估指标和框架系统评估现有方法,并为未来研究建立基准。
English: This paper proposes a unified taxonomy and formal definitions for data reconstruction attacks in computer vision, introducing quantitative metrics and a framework to systematically evaluate existing methods while establishing a benchmark for future research.
Authors:Xinyue Shen, Yun Shen, Michael Backes, Yang Zhang
Abstract:
Knowledge files have been widely used in large language model (LLM) agents, such as GPTs, to improve response quality. However, concerns about the potential leakage of knowledge files have grown significantly. Existing studies demonstrate that adversarial prompts can induce GPTs to leak knowledge file content. Yet, it remains uncertain whether additional leakage vectors exist, particularly given the complex data flows across clients, servers, and databases in GPTs. In this paper, we present a comprehensive risk assessment of knowledge file leakage, leveraging a novel workflow inspired by Data Security Posture Management (DSPM). Through the analysis of 651,022 GPT metadata, 11,820 flows, and 1,466 responses, we identify five leakage vectors: metadata, GPT initialization, retrieval, sandboxed execution environments, and prompts. These vectors enable adversaries to extract sensitive knowledge file data such as titles, content, types, and sizes. Notably, the activation of the built-in tool Code Interpreter leads to a privilege escalation vulnerability, enabling adversaries to directly download original knowledge files with a 95.95% success rate. Further analysis reveals that 28.80% of leaked files are copyrighted, including digital copies from major publishers and internal materials from a listed company. In the end, we provide actionable solutions for GPT builders and platform providers to secure the GPT data supply chain.
中文: 本研究在GPT等大型语言模型代理中发现了五种知识文件泄露途径,揭示出代码解释器存在高危漏洞,能以95.95%成功率直接下载原始文件,且28.80%泄露文件涉及版权材料。
English: This study identifies five leakage vectors in LLM agents like GPTs that expose sensitive knowledge file data, revealing a critical vulnerability with the Code Interpreter enabling direct file downloads at a 95.95% success rate and exposing 28.80% of leaked files as copyrighted material.
Authors:Yanwei Gong, Junchao Fan, Ruichen Zhang, Dusit Niyato, Yingying Yao, Xiaolin Chang
Abstract:
The rapid growth of the low-altitude economy has driven the widespread adoption of unmanned aerial vehicles (UAVs). This growing deployment presents new challenges for UAV trajectory planning in complex urban environments. However, existing studies often overlook key factors, such as urban airspace constraints and economic efficiency, which are essential in low-altitude economy contexts. Deep reinforcement learning (DRL) is regarded as a promising solution to these issues, while its practical adoption remains limited by low learning efficiency. To overcome this limitation, we propose a novel UAV trajectory planning framework that combines DRL with large language model (LLM) reasoning to enable safe, compliant, and economically viable path planning. Experimental results demonstrate that our method significantly outperforms existing baselines across multiple metrics, including data collection rate, collision avoidance, successful landing, regulatory compliance, and energy efficiency. These results validate the effectiveness of our approach in addressing the key challenges of UAV trajectory planning under the constraints of low-altitude economy networking.
中文: 本研究提出了一种结合深度强化学习与大语言模型推理的新型无人机轨迹规划框架,在低空经济网络约束下显著提升了路径规划的安全性、合规性和经济性,各项性能指标均优于现有方法。
English: The proposed framework integrates deep reinforcement learning with large language model reasoning to significantly enhance UAV trajectory planning, achieving superior performance in safety, compliance, and efficiency metrics within low-altitude urban environments.
Authors:William Ljungbergh, Bernardo Taveira, Wenzhao Zheng, Adam Tonderski, Chensheng Peng, Fredrik Kahl, Christoffer Petersson, Michael Felsberg, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan
Abstract:
Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creating photorealistic digital twins of real-world driving scenes. However, they struggle with dynamic object manipulation and reusability, as their per-scene optimization-based methodology tends to result in incomplete object models with integrated illumination effects. This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome these limitations and enable realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects, such as shadows and consistent lighting, in real time. This is achieved by training R3D2 on a novel dataset: 3DGS object assets are generated from in-the-wild AD data using an image-conditioned 3D generative model, and then synthetically placed into neural rendering-based virtual environments, allowing R3D2 to learn realistic integration. Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer, allowing for true scalability in AD validation. To promote further research in scalable and realistic AD simulation, we will release our dataset and code, see https://research.zenseact.com/publications/R3D2/.
中文: 本文提出R3D2轻量级扩散模型,通过实时生成渲染效果实现逼真的3D资产插入驾驶场景,显著提升自动驾驶验证的可扩展仿真真实性。
English: This paper introduces R3D2, a lightweight diffusion model that enables realistic 3D asset insertion into driving scenes by generating rendering effects in real time, significantly enhancing simulation realism for scalable autonomous driving validation.
Authors:Tianxin Hu, Xinhang Xu, Thien-Minh Nguyen, Fen Liu, Shenghai Yuan, Lihua Xie
Abstract:
Multi-axle Swerve-drive Autonomous Mobile Robots (MS-AGVs) equipped with independently steerable wheels are commonly used for high-payload transportation. In this work, we present a novel model predictive control (MPC) method for MS-AGV trajectory tracking that takes tire wear minimization into consideration in the objective function. To speed up the problem-solving process, we propose a hierarchical controller design and simplify the dynamic model by integrating the magic formula tire model and a simplified tire wear model. In experiments, the proposed method can be solved by simulated annealing in real time on an ordinary personal computer; by incorporating tire wear into the objective function, tire wear is reduced by 19.19% while maintaining tracking accuracy in curve-tracking experiments. In a more challenging scene, where the desired trajectory is offset by 60 degrees from the vehicle's heading, the reduction in tire wear increases to 65.20% compared to the kinematic model without tire wear optimization.
中文: 本研究提出了一种用于多轴独立转向移动机器人的新型模型预测控制方法,通过在目标函数中考虑轮胎磨损最小化,并采用简化的分层控制器设计,在保持轨迹跟踪精度的同时将轮胎磨损降低了高达65.20%。
English: This study introduces a novel model predictive control method for multi-axle swerve-drive autonomous mobile robots that incorporates tire wear minimization into the objective function, achieving up to 65.20% reduction in tire wear while maintaining trajectory tracking accuracy through a simplified hierarchical controller design.
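For reference, the magic formula tire model mentioned above is commonly written in the following Pacejka form, where F is the tire force, κ the slip quantity, and B, C, D, E are fitted stiffness, shape, peak, and curvature factors; the paper integrates a simplified variant, so its exact expression may differ.

```latex
F(\kappa) = D \sin\!\Big( C \arctan\!\big( B\kappa - E\,(B\kappa - \arctan(B\kappa)) \big) \Big)
```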
Authors:Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Bei Li, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu
Abstract:
In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
中文摘要:本文提出了一种通过无监督预训练和有监督微调相结合的生成式奖励模型,该模型将生成与判别方法统一于同一目标下,在多项任务中展现出优异的泛化能力并显著超越基线模型。
English Summary: This paper introduces a generative reward model trained through unsupervised pre-training and supervised fine-tuning, which effectively generalizes across multiple tasks and outperforms baseline models by linking generative and discriminative approaches under a unified objective.
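The label-smoothing claim admits a one-line derivation. Writing Δ = r(y_w) − r(y_l) for the reward margin of a preference pair and ε for the smoothing coefficient (notation assumed here, not taken from the paper), the smoothed pairwise objective decomposes into the standard ranking loss plus a margin penalty:

```latex
\mathcal{L}_{\varepsilon}
  = -(1-\varepsilon)\log\sigma(\Delta) - \varepsilon\log\sigma(-\Delta)
  = -\log\sigma(\Delta) + \varepsilon\,\Delta,
\qquad\text{using } \log\tfrac{\sigma(\Delta)}{\sigma(-\Delta)} = \Delta .
```

The extra εΔ term discourages unboundedly large margins, which is one way to read the abstract's "regularized pairwise ranking loss".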
Authors:En Xu, Huandong Wang, Yunke Zhang, Sibo Li, Yinzhou Tang, Zhilun Zhou, Yuming Lin, Yuan Yuan, Xiaochen Fan, Jingtao Ding, Yong Li
Abstract:
Urban systems are typical examples of complex systems, where the integration of physics-based modeling with artificial intelligence (AI) presents a promising paradigm for enhancing predictive accuracy, interpretability, and decision-making. In this context, AI excels at capturing complex, nonlinear relationships, while physics-based models ensure consistency with real-world laws and provide interpretable insights. We provide a comprehensive review of physics-informed AI methods in urban applications. The proposed taxonomy categorizes existing approaches into three paradigms (Physics-Integrated AI, Physics-AI Hybrid Ensemble, and AI-Integrated Physics) and further details seven representative methods. This classification clarifies the varying degrees and directions of physics-AI integration, guiding the selection and development of appropriate methods based on application needs and data availability. We systematically examine their applications across eight key urban domains: energy, environment, economy, transportation, information, public services, emergency management, and the urban system as a whole. Our analysis highlights how these methodologies leverage physical laws and data-driven models to address urban challenges, enhancing system reliability, efficiency, and adaptability. By synthesizing existing methodologies and their urban applications, we identify critical gaps and outline future research directions, paving the way toward next-generation intelligent urban system modeling.
中文摘要:物理模型与人工智能的融合通过结合数据驱动洞察和现实物理规律提升城市系统预测能力,本文通过三种融合范式的分类框架及其在八大城市领域的应用综述了这一方法。
English Summary: The integration of physics-based modeling with AI enhances urban system predictions by combining data-driven insights with real-world physical laws, as reviewed through a taxonomy of three integration paradigms and their applications across eight urban domains.
Authors:Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang
Abstract:
Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from 200-word captions, we inject a single, subtle visual description error, altering a few words describing objects, attributes, counts, or spatial relations, and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
中文摘要:通过ViCrit任务,强化学习被有效应用于视觉语言模型中,训练模型检测描述中的细微合成幻觉,从而在不依赖记忆的情况下,显著提升了多种基准测试中的视觉感知能力。
English Summary: Reinforcement learning is effectively applied to vision-language models through the ViCrit task, which trains models to detect subtle synthetic hallucinations in captions, leading to improved visual perception across various benchmarks without relying on memorization.
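The ViCrit reward is easy to state in code. The sketch below is illustrative only, with hypothetical helper names: it injects one subtle description error into a caption and scores a prediction with a binary exact-match reward.

```python
# Illustrative sketch of a ViCrit-style task instance; names are assumptions.

def corrupt(caption: str, original: str, replacement: str) -> str:
    """Inject a single subtle description error (e.g., 'red car' -> 'blue car')."""
    assert original in caption
    return caption.replace(original, replacement, 1)

def exact_match_reward(predicted_span: str, injected_span: str) -> int:
    """Binary, unambiguous reward: 1 iff the predicted span matches exactly."""
    return int(predicted_span.strip() == injected_span.strip())

caption = "A red car is parked beside two bicycles near the entrance."
modified = corrupt(caption, "red car", "blue car")
print(modified)
print(exact_match_reward("blue car", "blue car"))  # -> 1
```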
Authors:Ming Li, Zhengyuan Yang, Xiyao Wang, Dianqi Li, Kevin Lin, Tianyi Zhou, Lijuan Wang
Abstract:
Large reasoning models (LRMs) achieve strong reasoning performance by emitting long chains of thought. Yet, these verbose traces slow down inference and often drift into unnecessary detail, known as the overthinking phenomenon. To better understand LRMs' behavior, we systematically analyze the token-level misalignment between reasoning and non-reasoning models. While it is expected that their primary difference lies in the stylistic "thinking cues", LRMs uniquely exhibit two pivotal, previously under-explored phenomena: a Global Misalignment Rebound, where their divergence from non-reasoning models persists or even grows as response length increases, and more critically, a Local Misalignment Diminish, where the misalignment concentrates at the "thinking cues" each sentence starts with but rapidly declines in the remainder of the sentence. Motivated by the Local Misalignment Diminish, we propose FoReaL-Decoding, a collaborative fast-slow thinking decoding method for cost-quality trade-off. In FoReaL-Decoding, a Leading model leads the first few tokens for each sentence, and then a weaker draft model completes the following tokens to the end of each sentence. FoReaL-Decoding adopts a stochastic gate to smoothly interpolate between the small and the large model. On four popular math-reasoning benchmarks (AIME24, GPQA-Diamond, MATH500, AMC23), FoReaL-Decoding reduces theoretical FLOPs by 30 to 50% and trims CoT length by up to 40%, while preserving 86 to 100% of model performance. These results establish FoReaL-Decoding as a simple, plug-and-play route to controllable cost-quality trade-offs in reasoning-centric tasks.
中文摘要:大型推理模型存在过度思考问题,我们提出的FoReaL-Decoding方法通过主导模型与草稿模型的协作解码,在保持86%-100%性能的同时将计算成本降低30-50%、推理链长度缩减达40%。
English Summary: Large reasoning models exhibit overthinking with verbose chains of thought, but our proposed FoReaL-Decoding method reduces computational costs by 30-50% and chain length by up to 40% while maintaining performance through collaborative fast-slow model switching.
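A minimal sketch of the fast-slow decoding loop, under stated assumptions: `lead_step` and `draft_step` stand in for the large and small models' next-token functions, decoding runs over whole words rather than token ids, and the stochastic gate is a simple Bernoulli draw per sentence.

```python
import random

def foreal_decode(lead_step, draft_step, prompt, k_lead=3, gate_p=0.8,
                  max_sentences=4, eos="."):
    out = list(prompt)
    for _ in range(max_sentences):
        use_lead = random.random() < gate_p      # stochastic gate per sentence
        for i in range(64):                      # safety cap per sentence
            # Leading model steers the first k_lead tokens; draft finishes the sentence.
            step = lead_step if (use_lead and i < k_lead) else draft_step
            tok = step(out)
            out.append(tok)
            if tok == eos:
                break
    return out

# Toy next-token functions standing in for the large and small models.
lead = lambda ctx: "step" if len(ctx) % 3 else "."
draft = lambda ctx: "." if len(ctx) % 5 == 0 else "so"
print(" ".join(foreal_decode(lead, draft, ["Q:"])))
```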
Authors:Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
Abstract:
Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as an automatic judge to assess the speaking styles of speeches. We use ALLM judges to evaluate the speeches generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking styles we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as a judge to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.
中文摘要:音频感知大语言模型(ALLMs)能够有效评估语音风格,其中Gemini-2.5-pro与人类评估者的一致性相当,但当前语音模型在风格控制和自然对话生成方面仍需改进。
English Summary: Audio-aware large language models (ALLMs) can effectively evaluate speech styles, with Gemini-2.5-pro showing human-level agreement, though current spoken language models still need improvement in style control and natural dialogue generation.
Authors:Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, Bo Han
Abstract:
Causal reasoning capability is critical in advancing large language models (LLMs) toward strong artificial intelligence. While versatile LLMs appear to have demonstrated capabilities in understanding contextual causality and providing responses that obey the laws of causality, it remains unclear whether they perform genuine causal reasoning akin to humans. However, current evidence indicates the contrary. Specifically, LLMs are only capable of performing shallow (level-1) causal reasoning, primarily attributed to the causal knowledge embedded in their parameters, but they lack the capacity for genuine human-like (level-2) causal reasoning. To support this hypothesis, methodologically, we delve into the autoregression mechanism of transformer-based LLMs, revealing that it is not inherently causal. Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G^2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs' causal reasoning processes. Experiments demonstrate that G^2-Reasoner significantly enhances LLMs' causal reasoning capability, particularly in fresh and counterfactual contexts. This work sheds light on a new path for LLMs to advance towards genuine causal reasoning, going beyond level-1 and making strides towards level-2.
中文: 当前大语言模型仅能基于参数中嵌入的知识进行浅层因果推理,但通过融入通用知识和目标导向提示的G^2-Reasoner方法,可显著提升其在新颖与反事实场景中实现类人深度因果推理的能力。
English: Large language models currently perform only shallow causal reasoning based on embedded knowledge, but the proposed G^2-Reasoner method significantly enhances their capability toward genuine human-like reasoning by incorporating general knowledge and goal-oriented prompts.
Authors:Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, Wenhai Wang
Abstract:
Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a novel rotation angle prediction task that has not been explored in prior work. Experimental results show that models trained on InternSpatial achieve 12.1% improvement on InternSpatial-Bench and 10.7% on VSI-Bench, while maintaining strong performance on general-purpose benchmarks. We hope these resources will support the development of spatially capable VLMs in practical applications such as robotics and embodied AI.
中文摘要:本文提出了InternSpatial——最大的视觉语言模型空间推理开源数据集,包含1200万个跨多样化视觉场景和指令格式的问答对,其配套基准测试显示经过训练的模型性能显著提升。
English Summary: This paper introduces InternSpatial, the largest open-source dataset for spatial reasoning in vision-language models, featuring 12 million QA pairs across diverse visual settings and instruction formats, along with a benchmark that demonstrates significant performance improvements in trained models.
Authors:Mingrui Zhu, Xiru Chen, Xin Wei, Nannan Wang, Xinbo Gao
Abstract:
Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities, producing more informative and comprehensive outputs. Recently, text-guided IVF has shown great potential due to its flexibility and versatility. However, the effective integration and utilization of textual semantic information remains insufficiently studied. To tackle these challenges, we introduce textual semantics at two levels: the mask semantic level and the text semantic level, both derived from textual descriptions extracted by large Vision-Language Models (VLMs). Building on this, we propose Textual Semantic Guidance for infrared and visible image fusion, termed TeSG, which guides the image synthesis process in a way that is optimized for downstream tasks such as detection and segmentation. Specifically, TeSG consists of three core components: a Semantic Information Generator (SIG), a Mask-Guided Cross-Attention (MGCA) module, and a Text-Driven Attentional Fusion (TDAF) module. The SIG generates mask and text semantics based on textual descriptions. The MGCA module performs initial attention-based fusion of visual features from both infrared and visible images, guided by mask semantics. Finally, the TDAF module refines the fusion process with gated attention driven by text semantics. Extensive experiments demonstrate the competitiveness of our approach, particularly in terms of performance on downstream tasks, compared to existing state-of-the-art methods.
中文: 提出的TeSG框架通过融合视觉语言模型生成的掩码和文本语义,借助语义引导和注意力融合模块,有效提升了红外与可见光图像融合在下游任务中的性能表现。
English: The proposed TeSG framework enhances infrared and visible image fusion by integrating mask and text semantics from vision-language models, improving performance in downstream tasks through specialized modules for semantic guidance and attention-based fusion.
Authors:Huaijie Wang, De Cheng, Lingfeng He, Yan Li, Jie Li, Nannan Wang, Xinbo Gao
Abstract:
Class-Incremental Learning (CIL) aims to enable AI models to continuously learn from sequentially arriving data of different classes over time while retaining previously acquired knowledge. Recently, Parameter-Efficient Fine-Tuning (PEFT) methods, like prompt pool-based approaches and adapter tuning, have attracted great attention in CIL. However, these methods either introduce additional parameters that increase memory usage, or rely on rigid regularization techniques which reduce forgetting but compromise model flexibility. To overcome these limitations, we propose the Elastic Knowledge Preservation and Compensation (EKPC) method, integrating Importance-aware Parameter Regularization (IPR) and Trainable Semantic Drift Compensation (TSDC) for CIL. Specifically, the IPR method assesses the sensitivity of network parameters to prior tasks using a novel parameter-importance algorithm. It then selectively constrains updates within the shared adapter according to these importance values, thereby preserving previously acquired knowledge while maintaining the model's flexibility. However, slight semantic drift in previous knowledge still occurs when accommodating new incremental tasks, leading to confusion of decision boundaries in the classifier. To eliminate this confusion, TSDC trains a unified classifier by compensating prototypes with trainable semantic drift. Extensive experiments on five CIL benchmarks demonstrate the effectiveness of the proposed method, showing superior performance to existing state-of-the-art methods.
中文: 提出的弹性知识保持与补偿方法通过重要性感知参数正则化和可训练语义漂移补偿,使AI模型能在持续学习新类别时有效保留已有知识,在多个基准测试中优于现有先进方法。
English: The proposed Elastic Knowledge Preservation and Compensation (EKPC) method combines Importance-aware Parameter Regularization and Trainable Semantic Drift Compensation to enable AI models to continuously learn new classes while effectively preserving previous knowledge, outperforming state-of-the-art methods on multiple benchmarks.
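The IPR component resembles an importance-weighted quadratic penalty. The sketch below assumes an EWC-style form for illustration; the paper's parameter-importance algorithm and its restriction to a shared adapter are more involved.

```python
import torch

def ipr_penalty(params, anchors, importance, lam=1.0):
    """Penalize drift of adapter weights, scaled per-parameter by how
    sensitive earlier tasks were to each weight (EWC-style assumption)."""
    total = 0.0
    for p, p0, omega in zip(params, anchors, importance):
        total = total + (omega * (p - p0) ** 2).sum()
    return lam * total

w = torch.randn(4, 4, requires_grad=True)
w0 = w.detach().clone()             # weights frozen after the previous task
omega = torch.rand(4, 4)            # estimated importance to prior tasks
loss = ipr_penalty([w], [w0], [omega])
loss.backward()
```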
Authors:Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, Wentao Zhang
Abstract:
Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscores its significant potential.
Chinese: 近期研究提出ReLIFT混合训练方法,通过交替使用强化学习和监督微调来增强大语言模型的推理能力,在仅需少量演示数据的情况下,于多个基准测试中实现了显著性能提升。
English: Recent research introduces ReLIFT, a hybrid training method that alternates between reinforcement learning and supervised fine-tuning to enhance LLM reasoning, achieving notable performance gains across benchmarks while using minimal demonstration data.
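A high-level sketch of the interleaving schedule, under assumptions: `rl_update`, `sft_update`, and `collect_solutions` are hypothetical callables standing in for an RL step, a supervised step, and the collection of high-quality demonstrations for questions the model failed.

```python
# Hedged sketch of a ReLIFT-style round; not the paper's exact training loop.

def relift_round(model, questions, rl_update, sft_update, collect_solutions,
                 pass_threshold=0.0):
    hard = []
    for q in questions:
        reward = rl_update(model, q)          # train mainly with RL
        if reward <= pass_threshold:          # the model failed this question
            hard.append(q)
    if hard:                                  # learn what RL could not
        demos = collect_solutions(hard)       # gather high-quality solutions
        sft_update(model, demos)              # interleave a fine-tuning step
    return hard
```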
Authors:Vineet Bhat, Naman Patel, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Abstract:
Robotic manipulation of unseen objects via natural language commands remains challenging. Language-driven robotic grasping (LDRG) predicts stable grasp poses from natural language queries and RGB-D images. We propose MapleGrasp, a novel framework that leverages mask-guided feature pooling for efficient vision-language driven grasping. Our two-stage training first predicts segmentation masks from CLIP-based vision-language features. The second stage pools features within these masks to generate pixel-level grasp predictions, improving efficiency and reducing computation. Incorporating mask pooling results in a 7% improvement over prior approaches on the OCID-VLG benchmark. Furthermore, we introduce RefGraspNet, an open-source dataset eight times larger than existing alternatives, significantly enhancing model generalization for open-vocabulary grasping. MapleGrasp achieves a strong grasping accuracy of 89% when compared with competing methods on the RefGraspNet benchmark. Our method achieves comparable performance to larger Vision-Language-Action models on the LIBERO benchmark, and shows significantly better generalization to unseen tasks. Real-world experiments on a Franka arm demonstrate a 73% success rate with unseen objects, surpassing competitive baselines by 11%. Code is provided in our GitHub repository.
中文:MapleGrasp提出了一种利用掩码引导特征池化的新型语言驱动抓取框架,在多个基准测试中显著提升了准确性和效率,并在真实世界实验中展现出卓越的泛化能力。
English: MapleGrasp introduces a novel framework using mask-guided feature pooling for language-driven robotic grasping, achieving significant improvements in accuracy and efficiency across benchmarks while demonstrating superior generalization in real-world experiments.
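Mask-guided feature pooling itself is a small operation. The sketch below shows one plausible form, a masked average over a feature map; the tensor shapes are assumptions, not MapleGrasp's exact interface.

```python
import torch

def mask_pool(features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """features: (B, C, H, W); mask: (B, 1, H, W) in [0, 1].
    Returns one pooled descriptor per image, shape (B, C)."""
    weighted = (features * mask).sum(dim=(2, 3))
    area = mask.sum(dim=(2, 3)).clamp_min(1e-6)   # avoid division by zero
    return weighted / area

feats = torch.randn(2, 256, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.5).float()
print(mask_pool(feats, mask).shape)  # torch.Size([2, 256])
```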
Authors:Zhen Hao Wong, Jingwen Deng, Runming He, Zirong Chen, Qijie You, Hejun Dong, Hao Liang, Chengyu Shen, Bin Cui, Wentao Zhang
Abstract:
Large language models (LLMs) excel at many supervised tasks but often struggle with structured reasoning in unfamiliar settings. This discrepancy suggests that standard fine-tuning pipelines may instill narrow, domain-specific heuristics rather than fostering general-purpose thinking strategies. In this work, we propose a "play to learn" framework that fine-tunes LLMs through reinforcement learning on a suite of seven custom logic puzzles, each designed to cultivate distinct reasoning skills such as constraint propagation, spatial consistency, and symbolic deduction. Using a reinforcement learning setup with verifiable rewards, models receive binary feedback based on puzzle correctness, encouraging iterative, hypothesis-driven problem solving. We demonstrate that this training approach significantly improves out-of-distribution performance on a range of mathematical benchmarks, especially for mid-difficulty problems that require multi-step reasoning. Analyses across problem categories and difficulty levels reveal that puzzle training promotes transferable reasoning routines, strengthening algebraic manipulation, geometric inference, and combinatorial logic, while offering limited gains on rote or highly specialized tasks. These findings show that reinforcement learning over logic puzzles reshapes the internal reasoning of LLMs, enabling more robust and compositional generalization without relying on task-specific symbolic tools.
中文: 该研究提出一种“边玩边学”的强化学习框架,通过逻辑谜题训练大语言模型的通用推理能力,显著提升了其在数学任务中的泛化表现,培养了约束传播、符号推理等可迁移的思维策略。
English: The study introduces a "play to learn" reinforcement learning framework using logic puzzles to enhance large language models' general reasoning skills, significantly improving their performance on diverse mathematical tasks by fostering transferable strategies like constraint propagation and symbolic deduction.
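Verifiable rewards are the key ingredient here: puzzle correctness can be checked programmatically and converted to binary feedback. The toy checker below (a Latin-square validity test, chosen for illustration and not taken from the paper's seven puzzles) shows the pattern.

```python
# Illustrative verifiable reward: 1 iff the proposed grid is a valid Latin square.

def latin_square_reward(grid: list[list[int]]) -> int:
    """Return 1 iff every row and column is a permutation of 1..n."""
    n = len(grid)
    want = set(range(1, n + 1))
    rows_ok = all(set(row) == want for row in grid)
    cols_ok = all(set(col) == want for col in zip(*grid))
    return int(rows_ok and cols_ok)

print(latin_square_reward([[1, 2], [2, 1]]))  # 1: valid solution
print(latin_square_reward([[1, 2], [1, 2]]))  # 0: column constraint violated
```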
Authors:Xue Wu, Jingwei Xin, Zhijun Tu, Jie Hu, Jie Li, Nannan Wang, Xinbo Gao
Abstract:
Diffusion-based models have been widely used in various visual generation tasks, showing promising results in image super-resolution (SR), while typically being limited by dozens or even hundreds of sampling steps. Although existing methods aim to accelerate the inference speed of multi-step diffusion-based SR methods through knowledge distillation, their generated images exhibit insufficient semantic alignment with real images, resulting in suboptimal perceptual quality reconstruction, specifically reflected in the CLIPIQA score. These methods still face many challenges in perceptual quality and semantic fidelity. Motivated by these challenges, we propose VPD-SR, a novel visual perception diffusion distillation framework specifically designed for SR, aiming to construct an effective and efficient one-step SR model. Specifically, VPD-SR consists of two components: Explicit Semantic-aware Supervision (ESS) and a High-Frequency Perception (HFP) loss. First, the ESS leverages the powerful visual perceptual understanding capabilities of the CLIP model to extract explicit semantic supervision, thereby enhancing semantic consistency. Then, considering that high-frequency information contributes to the visual perception quality of images, in addition to the vanilla distillation loss, the HFP loss guides the student model to restore the missing high-frequency details in degraded images that are critical for enhancing perceptual quality. Lastly, we extend VPD-SR with adversarial training to further enhance the authenticity of the generated content. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed VPD-SR achieves superior performance compared to both previous state-of-the-art methods and the teacher model with just one-step sampling.
中文:VPD-SR是一种新颖的视觉感知扩散蒸馏框架,通过显式语义监督和高频感知损失提升语义一致性和图像重建质量,仅需单步采样即可实现优于现有方法的超分辨率性能。
English: VPD-SR is a novel diffusion distillation framework for image super-resolution that enhances semantic fidelity and perceptual quality through explicit semantic supervision and high-frequency perception loss, achieving state-of-the-art performance with just one-step sampling.
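The HFP loss can be pictured as comparing only the high-frequency content of the student output and the target. Below is a rough sketch using an FFT high-pass mask; the cutoff radius and the L1 comparison are assumptions, not the paper's exact formulation.

```python
import torch

def high_frequency_loss(student, target, cutoff=0.25):
    """L1 distance between high-frequency spectra of two image batches (B, C, H, W).

    Low frequencies near the spectrum center are masked out, so the loss
    only penalizes missing fine detail. `cutoff` is an assumed radius.
    """
    fs = torch.fft.fftshift(torch.fft.fft2(student), dim=(-2, -1))
    ft = torch.fft.fftshift(torch.fft.fft2(target), dim=(-2, -1))
    h, w = student.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    highpass = ((yy ** 2 + xx ** 2).sqrt() > cutoff).to(student.dtype)
    return ((fs - ft) * highpass).abs().mean()

# Usage during distillation: total = distill_loss + lam * high_frequency_loss(sr, hr)
loss = high_frequency_loss(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```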
Authors:Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu, Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, Xizhou Zhu
Abstract:
The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extending them to physical entities like legged robots. This typically requires MLLMs to not only grasp multimodal understanding abilities, but also integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities due to their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. Then, a novel robotic adapter is proposed to convert textual control signals from MLLMs to motion policies of real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset encompassing various capabilities of VeBrain. In VeBrain-600k, we spent hundreds of hours collecting, curating, and annotating the data, and adopt multimodal chain-of-thought (CoT) to mix the different capabilities into a single conversation. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate the superior performance of VeBrain compared to existing MLLMs like Qwen2.5-VL. When deployed to legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capabilities compared to existing methods. For example, compared to Qwen2.5-VL, VeBrain not only achieves substantial gains on MMVet by +5.6%, but also excels in legged robot tasks with +50% average gains.
中文: VeBrain是一个统一框架,通过将机器人控制任务转化为基于文本的视觉任务,使多模态大语言模型具备现实世界的感知、推理与控制能力,并在各类基准测试和实体机器人应用中展现出卓越性能。
English: VeBrain is a unified framework that enables multimodal large language models to perform perception, reasoning, and control for real-world robots by converting robotic tasks into text-based visual tasks and demonstrating superior performance across benchmarks and physical deployments.
Authors:Jingyao Li, Hao Sun, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Hong Xu, Jiaya Jia
Abstract:
Traditional benchmarks for large language models (LLMs) typically rely on static evaluations through storytelling or opinion expression, which fail to capture the dynamic requirements of real-time information processing in contemporary applications. To address this limitation, we present DynamicBench, a benchmark designed to evaluate the proficiency of LLMs in storing and processing up-to-the-minute data. DynamicBench utilizes a dual-path retrieval pipeline, integrating web searches with local report databases. It necessitates domain-specific knowledge, ensuring accurate report generation within specialized fields. By evaluating models in scenarios that either provide or withhold external documents, DynamicBench effectively measures their capability to independently process recent information or leverage contextual enhancements. Additionally, we introduce an advanced report generation system adept at managing dynamic information synthesis. Our experimental results confirm the efficacy of our approach, with our method achieving state-of-the-art performance, surpassing GPT4o in document-free and document-assisted scenarios by 7.0% and 5.8%, respectively. The code and data will be made publicly available.
中文: DynamicBench是一种新型基准测试,通过双路径检索系统评估大语言模型处理实时信息的能力,在无文档和文档辅助场景下均展现出优于GPT4o的性能表现。
English: DynamicBench is a novel benchmark that evaluates large language models' ability to process real-time information through a dual-path retrieval system, demonstrating superior performance over GPT4o in both document-free and document-assisted scenarios.
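A schematic of the dual-path retrieval idea: fresh web hits and curated local reports are fetched separately and merged before generation. The interleaving and deduplication policy here is an assumption for illustration.

```python
def dual_path_retrieve(query, web_search, local_db, k=5):
    """Merge up-to-the-minute web results with a local report database.

    `web_search` and `local_db` are placeholders for a search API client
    and a domain-specific retriever, each returning ranked text snippets.
    """
    web_hits = web_search(query)[:k]    # recent information path
    local_hits = local_db(query)[:k]    # curated domain-report path
    merged, seen = [], set()
    for doc in web_hits + local_hits:   # prefer web freshness, drop duplicates
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged[:k]

# Toy usage with in-memory stubs:
web = lambda q: [f"web:{q}:latest-news"]
db = lambda q: [f"report:{q}:background", f"web:{q}:latest-news"]
print(dual_path_retrieve("chip exports", web, db))
```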
Authors:Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li
Abstract:
Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker's voice from others. We know that humans leverage linguistic knowledge, such as syntax and semantics, to support speech perception. Inspired by this, we explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without introducing any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility. Furthermore, we evaluate our method in multi-language settings and visual cue-impaired scenarios and show robust performance gains.
中文: 本研究通过引入预训练语音语言模型或语言模型的语义约束作为额外监督信号,提升了视听目标说话人提取的性能,在不增加推理成本的情况下改善了语音质量和清晰度,并在多语言及视觉线索缺失场景中展现出稳健的改进效果。
English: This study enhances audio-visual target speaker extraction by integrating linguistic constraints from pre-trained speech-language or language models as additional supervision, improving speech quality and intelligibility without extra inference cost, while demonstrating robust performance in multilingual and visual-impaired scenarios.
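One way to read the proposed supervision: during training, the TSE model's features for the estimated speech are pulled toward frozen PSLM/PLM features, and the auxiliary branch is discarded at inference. The cosine form and the shared time grid are assumptions for this sketch.

```python
import torch
import torch.nn.functional as F

def linguistic_constraint_loss(est_feats, plm_feats):
    """Cosine alignment between TSE-internal features of the estimated speech
    and frozen PSLM/PLM features (batch, time, dim) on a shared time grid.
    A sketch of using linguistic knowledge as extra supervision."""
    est = F.normalize(est_feats, dim=-1)
    ref = F.normalize(plm_feats, dim=-1)
    return 1.0 - (est * ref).sum(-1).mean()

# Training-time only; no PSLM/PLM is needed at inference:
# total = separation_loss + lam * linguistic_constraint_loss(h_est, h_plm)
loss = linguistic_constraint_loss(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
```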
Authors:Yuejiao Wang, Xianmin Gong, Xixin Wu, Patrick Wong, Hoi-lam Helene Fung, Man Wai Mak, Helen Meng
Abstract:
Early detection is crucial for timely intervention aimed at preventing and slowing the progression of neurocognitive disorder (NCD), a common and significant health problem among the aging population. Recent evidence has suggested that language-related functional magnetic resonance imaging (fMRI) may be a promising approach for detecting cognitive decline and early NCD. In this paper, we propose a novel, naturalistic language-related fMRI task for this purpose. We examined the effectiveness of this task among 97 non-demented Chinese older adults from Hong Kong. The results showed that machine-learning classification models based on fMRI features extracted from the task and demographics (age, gender, and education year) achieved an average area under the curve of 0.86 when classifying participants' cognitive status (labeled as NORMAL vs DECLINE based on their scores on a standard neurocognitive test). Feature localization revealed that the fMRI features most frequently selected by the data-driven approach came primarily from brain regions associated with language processing, such as the superior temporal gyrus, middle temporal gyrus, and right cerebellum. The study demonstrated the potential of the naturalistic language-related fMRI task for early detection of aging-related cognitive decline and NCD.
中文: 早期发现神经认知障碍至关重要,本研究提出一种新型语言相关fMRI任务,结合机器学习与人口统计数据,能高效识别老年人认知衰退且准确率高。
English: Early detection of neurocognitive disorder is vital, and this study introduces a novel language-related fMRI task that, combined with machine learning and demographic data, effectively identifies cognitive decline in older adults with high accuracy.
Authors:Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
Abstract:
Recent advancements in Large Vision-Language Models built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo - a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. Project page is available at https://lalbj.github.io/projects/CoMemo/.
中文摘要:CoMemo模型通过双路径架构和新型位置编码机制RoPE-DHR,有效解决了现有大视觉语言模型中的视觉信息忽略和空间关系保持问题,在多项基准测试中表现优异。
English Summary: The CoMemo model introduces a dual-path architecture and RoPE-DHR positional encoding to overcome limitations in current Large Vision-Language Models, such as visual neglect and poor spatial awareness, achieving superior performance across multiple benchmarks.
Authors:Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou
Abstract:
In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.
Chinese: Qwen3 Embedding系列基于Qwen3模型,通过多阶段训练流程和模型融合策略,在文本嵌入和重排序方面取得显著进步,在多语言基准测试中达到顶尖水平,并提供多种模型规模以适应不同部署需求。
English: The Qwen3 Embedding series, built on Qwen3 models, significantly advances text embedding and reranking through a multi-stage training pipeline and model merging, achieving state-of-the-art results in multilingual benchmarks and offering scalable model sizes for diverse deployment needs.
Authors:Yue Yang, MingKang Chen, Qihua Liu, Mengkang Hu, Qiguang Chen, Gengrui Zhang, Shuyue Hu, Guangtao Zhai, Yu Qiao, Yu Wang, Wenqi Shao, Ping Luo
Abstract:
Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including both general LLMs (GPT-4o, Claude 3.7) and reasoning LLMs (o1, DeepSeek-R1, QwQ, Skywork-OR1). Experimental results reveal that although most LLMs achieve competent and robust performance in low-level cognition, they struggle with high-level cognition and exhibit limited generalization as task complexity grows. Our findings highlight the gap between current LLMs and true human-like fluid intelligence and offer a new path for systematically tracking reasoning progress in LLMs.
Chinese: 最新大型语言模型在推理能力上取得显著进展,但其真正的流体智力仍存疑问;为此提出的DRE-Bench动态评估基准表明,尽管模型在低阶认知任务中表现稳健,但在高阶认知和泛化能力方面仍存在明显不足。
English: Recent advances in large language models show impressive reasoning abilities, but their true fluid intelligence remains uncertain, leading to the development of DRE-Bench, a dynamic benchmark that reveals LLMs struggle with high-level cognition and generalization despite competence in simpler tasks.
Authors:Xueyuan Chen, Dongchao Yang, Wenxuan Wu, Minglin Wu, Jing Xu, Xixin Wu, Zhiyong Wu, Helen Meng
Abstract:
Dysarthric speech reconstruction (DSR) aims to convert dysarthric speech into comprehensible speech while maintaining the speaker's identity. Despite significant advancements, existing methods often struggle with low speech intelligibility and poor speaker similarity. In this study, we introduce a novel diffusion-based DSR system that leverages a latent diffusion model to enhance the quality of speech reconstruction. Our model comprises: (i) a speech content encoder for phoneme embedding restoration via pre-trained self-supervised learning (SSL) speech foundation models; (ii) a speaker identity encoder for speaker-aware identity preservation by in-context learning mechanism; (iii) a diffusion-based speech generator to reconstruct the speech based on the restored phoneme embedding and preserved speaker identity. Through evaluations on the widely-used UASpeech corpus, our proposed model shows notable enhancements in speech intelligibility and speaker similarity.
中文: 本研究提出了一种基于扩散模型的构音障碍语音重建系统,通过潜在扩散模型结合内容和身份编码器,显著提高了语音可懂度和说话人相似性,并在UASpeech语料库上验证了其有效性。
English: This study introduces a novel diffusion-based dysarthric speech reconstruction system that enhances speech intelligibility and speaker similarity by utilizing a latent diffusion model with content and identity encoders, demonstrating significant improvements on the UASpeech corpus.
Authors:Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin
Abstract:
Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotation and training costs, or significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly stronger when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern in response to caption queries to enhance LVLMs' visual perception capability. Extensive experimental results across four benchmarks covering both discriminative and generative tasks demonstrate that CAI achieves state-of-the-art (SOTA) hallucination mitigation performance with only minimal additional inference cost.
Chinese: 大型视觉语言模型常产生与视觉信息不符的内容,而本研究提出的描述敏感注意力干预方法通过利用描述查询的注意力模式,以最小的额外推理成本有效缓解了这种幻觉现象。
English: Large Vision-Language Models often generate content inconsistent with visual data, but the proposed Caption-sensitive Attention Intervention method effectively reduces this hallucination with minimal added inference cost by leveraging attention patterns from caption queries.
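The mechanics of a training-free attention intervention can be sketched as biasing pre-softmax attention toward image tokens. In CAI the boost is derived from attention patterns observed under caption queries; the fixed scalar below is a simplifying assumption.

```python
import torch

def visual_attention_boost(attn_logits, visual_mask, alpha=0.5):
    """Add a bias toward visual tokens before softmax.

    attn_logits: (batch, heads, q_len, k_len) pre-softmax scores.
    visual_mask: (k_len,) bool marking image-token positions.
    alpha: intervention strength (assumed constant for illustration).
    """
    bias = visual_mask.to(attn_logits.dtype) * alpha
    return attn_logits + bias  # broadcasts over batch, heads, queries

logits = torch.randn(1, 8, 10, 20)
mask = torch.zeros(20, dtype=torch.bool); mask[4:14] = True
boosted = visual_attention_boost(logits, mask)
```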
Authors:Weixiang Zhao, Jiahe Guo, Yang Deng, Xingyu Sui, Yulin Hu, Yanyan Zhao, Wanxiang Che, Bing Qin, Tat-Seng Chua, Ting Liu
Abstract:
Recent advancements in large reasoning models (LRMs) have significantly enhanced language models' capabilities in complex problem-solving by emulating human-like deliberative thinking. However, these models often exhibit overthinking (i.e., the generation of unnecessarily verbose and redundant content), which hinders efficiency and inflates inference cost. In this work, we explore the representational and behavioral origins of this inefficiency, revealing that LRMs inherently possess the capacity for more concise reasoning. Empirical analyses show that correct reasoning paths vary significantly in length, and the shortest correct responses often suffice, indicating untapped efficiency potential. Exploiting these findings, we propose two lightweight methods to enhance LRM efficiency. First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction in the model's representation space. Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity by rewarding concise correct solutions. Extensive experiments on seven LRM backbones across multiple mathematical reasoning benchmarks demonstrate that our methods significantly reduce reasoning length while preserving or improving task performance. Our results highlight that reasoning efficiency can be improved by leveraging and guiding the intrinsic capabilities of existing models in a self-guided manner.
中文: 本研究针对大型推理模型过度思考的问题,提出了效率引导和自奖励效率强化学习两种轻量方法,在多个数学推理基准测试中显著缩短了推理长度,同时保持或提升了任务表现。
English: This study identifies the issue of overthinking in large reasoning models and introduces two lightweight techniques—Efficiency Steering and Self-Rewarded Efficiency RL—that significantly reduce reasoning length while maintaining or enhancing performance across multiple benchmarks.
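Efficiency Steering as described is a single-direction activation edit, which can be sketched as a forward hook. The layer index and the way the direction is estimated (e.g., mean difference between activations of long and short correct traces) are assumptions.

```python
import torch

def efficiency_steering_hook(direction, strength=-1.0):
    """Forward hook nudging hidden states along a 'verbosity' direction.

    direction: (hidden_dim,) unit vector; negative strength steers toward
    more concise reasoning. A sketch of single-direction activation steering.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a HF-style decoder (layer path is an assumption):
# handle = model.model.layers[15].register_forward_hook(
#     efficiency_steering_hook(direction, strength=-2.0))
```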
Authors:Wenting Chen, Yi Dong, Zhaojun Ding, Yucheng Shi, Yifan Zhou, Fang Zeng, Yijun Luo, Tianyu Lin, Yihang Su, Yichen Wu, Kai Zhang, Zhen Xiang, Tianming Liu, Ninghao Liu, Lichao Sun, Yixuan Yuan, Xiang Li
Abstract:
Chest X-ray (CXR) imaging remains a critical diagnostic tool for thoracic conditions, but current automated systems face limitations in pathology coverage, diagnostic accuracy, and integration of visual and textual reasoning. To address these gaps, we propose RadFabric, a multi-agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. RadFabric is built on the Model Context Protocol (MCP), enabling modularity, interoperability, and scalability for seamless integration of new diagnostic agents. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent and evidence-based diagnoses. RadFabric achieves significant performance improvements, with near-perfect detection of challenging pathologies like fractures (1.000 accuracy) and superior overall diagnostic accuracy (0.799) compared to traditional systems (0.229 to 0.527). By integrating cross-modal feature alignment and preference-driven reasoning, RadFabric advances AI-driven radiology toward transparent, anatomically precise, and clinically actionable CXR analysis.
中文: RadFabric是一种多模态框架,通过整合视觉与文本推理提升胸部X光分析能力,相比传统系统实现了更优的诊断准确性和解剖学精确度。
English: RadFabric is a multimodal framework that enhances chest X-ray analysis by integrating visual and textual reasoning, achieving superior diagnostic accuracy and anatomical precision compared to conventional systems.
Authors:Yifei Sun, Daniel Chahine, Qinghao Wen, Tianming Liu, Xiang Li, Yixuan Yuan, Fernando Calamante, Jinglei Lv
Abstract:
Understanding brain dynamics is important for neuroscience and mental health. Functional magnetic resonance imaging (fMRI) enables the measurement of neural activities through blood-oxygen-level-dependent (BOLD) signals, which represent brain states. In this study, we aim to predict future human resting brain states with fMRI. Due to the 3D voxel-wise spatial organization and temporal dependencies of the fMRI data, we propose a novel architecture which employs a 4D Shifted Window (Swin) Transformer as encoder to efficiently learn spatio-temporal information and a convolutional decoder to enable brain state prediction at the same spatial and temporal resolution as the input fMRI data. We used 100 unrelated subjects from the Human Connectome Project (HCP) for model training and testing. Our novel model has shown high accuracy when predicting 7.2s resting-state brain activities based on the prior 23.04s fMRI time series. The predicted brain states highly resemble BOLD contrast and dynamics. This work shows promising evidence that the spatiotemporal organization of the human brain can be learned by a Swin Transformer model, at high resolution, which provides a potential for reducing the fMRI scan time and the development of brain-computer interfaces in the future.
中文: 本研究提出了一种新型4D Swin Transformer模型,能够基于fMRI数据准确预测未来静息态大脑活动,展现了减少扫描时间和推动脑机接口发展的潜力。
English: This study introduces a novel 4D Swin Transformer model that accurately predicts future resting-state brain activity from fMRI data, demonstrating potential for reducing scan times and advancing brain-computer interfaces.
Authors:Cheng Wang, Yu Jiang, Zhihao Peng, Chenxin Li, Changbae Bang, Lin Zhao, Jinglei Lv, Jorge Sepulcre, Carl Yang, Lifang He, Tianming Liu, Daniel Barron, Quanzheng Li, Randy Hirschtick, Byung-Hoon Kim, Xiang Li, Yixuan Yuan
Abstract:
Functional Magnetic Resonance Imaging (fMRI) is essential for studying brain function and diagnosing neurological disorders, but current analysis methods face reproducibility and transferability issues due to complex pre-processing and task-specific models. We introduce NeuroSTORM (Neuroimaging Foundation Model with Spatial-Temporal Optimized Representation Modeling), a generalizable framework that directly learns from 4D fMRI volumes and enables efficient knowledge transfer across diverse applications. NeuroSTORM is pre-trained on 28.65 million fMRI frames (>9,000 hours) from over 50,000 subjects across multiple centers and ages 5 to 100. Using a Mamba backbone and a shifted scanning strategy, it efficiently processes full 4D volumes. We also propose a spatial-temporal optimized pre-training approach and task-specific prompt tuning to improve transferability. NeuroSTORM outperforms existing methods across five tasks: age/gender prediction, phenotype prediction, disease diagnosis, fMRI-to-image retrieval, and task-based fMRI classification. It demonstrates strong clinical utility on datasets from hospitals in the U.S., South Korea, and Australia, achieving top performance in disease diagnosis and cognitive phenotype prediction. NeuroSTORM provides a standardized, open-source foundation model to improve reproducibility and transferability in fMRI-based clinical research.
中文: NeuroSTORM是一种开创性的通用框架,可直接从4D fMRI数据中学习,在多种应用中实现高效知识迁移,在多项任务中超越现有方法,同时提升临床研究的可重复性和迁移性。
English: NeuroSTORM is a groundbreaking generalizable framework that learns directly from 4D fMRI volumes, enabling efficient knowledge transfer across diverse applications and outperforming existing methods in multiple tasks while improving reproducibility and transferability in clinical research.
Authors:Yubo Ma, Jinsong Li, Yuhang Zang, Xiaobao Wu, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Jiaqi Wang, Yixin Cao, Aixin Sun
Abstract:
Despite the strong performance of ColPali/ColQwen2 in Visualized Document Retrieval (VDR), it encodes each page into multiple patch-level embeddings and leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page at minimum performance degradation. We evaluate two token-reduction strategies: token pruning and token merging. Regarding token pruning, we surprisingly observe that a simple random strategy outperforms other sophisticated pruning methods, though still far from satisfactory. Further analysis reveals that pruning is inherently unsuitable for VDR as it requires removing certain page embeddings without query-specific information. Turning to token merging (more suitable for VDR), we search for the optimal combinations of merging strategy across three dimensions and develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance with only 11.8% of original memory usage, and preserves 94.6% effectiveness at 2.8% memory footprint. We expect our empirical findings and resulting Light-ColPali/ColQwen2 offer valuable insights and establish a competitive baseline for future research towards efficient VDR.
中文: 本研究开发的Light-ColPali/ColQwen2通过令牌合并策略,在可视化文档检索中将内存占用降至原有的2.8%同时保持94.6%的检索效能,为高效视觉文档检索建立了竞争优势基准。
English: This study introduces Light-ColPali/ColQwen2, which employs token merging to drastically cut memory usage in Visualized Document Retrieval by up to 97.2% while preserving over 94% of retrieval performance, establishing an efficient baseline for future research.
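Token merging in this setting amounts to collapsing many patch embeddings into fewer pooled ones. A deliberately simple window-mean baseline is sketched below; Light-ColPali/ColQwen2 searches over merging strategies rather than fixing one like this.

```python
import torch
import torch.nn.functional as F

def merge_patch_embeddings(patches, window=9):
    """Reduce per-page patch embeddings (n_patches, dim) by window-wise
    mean pooling, then re-normalize for late-interaction scoring."""
    n, d = patches.shape
    pad = (-n) % window
    if pad:
        patches = torch.cat([patches, patches[-pad:]], dim=0)  # repeat-pad tail
    merged = patches.view(-1, window, d).mean(dim=1)
    return F.normalize(merged, dim=-1)

# 1030 patches at window=9 -> 115 embeddings, roughly 11% of the original memory.
print(merge_patch_embeddings(torch.randn(1030, 128)).shape)
```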
Authors:Zixiang Li, Haoyu Wang, Wei Wang, Chuangchuang Tan, Yunchao Wei, Yao Zhao
Abstract:
Diffusion models have achieved remarkable success in image generation and editing tasks. Inversion within these models aims to recover the latent noise representation for a real or generated image, enabling reconstruction, editing, and other downstream tasks. However, to date, most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. This limitation arises from the difficulty of maintaining both semantic alignment and structural consistency during the inversion process. In this work, we introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image to guide the inversion process. Specifically, DCI formulates the inversion process as a dual-condition fixed-point optimization problem, minimizing both the latent noise gap and the reconstruction error under the joint guidance. This design anchors the inversion trajectory in both semantic and visual space, leading to more accurate and editable latent representations. Our novel setup brings new understanding to the inversion process. Extensive experiments demonstrate that DCI achieves state-of-the-art performance across multiple editing tasks, significantly improving both reconstruction quality and editing precision. Furthermore, we also demonstrate that our method achieves strong results in reconstruction tasks, implying a degree of robustness and generalizability approaching the ultimate goal of the inversion process.
English Summary: The proposed Dual-Conditional Inversion (DCI) framework overcomes the trade-off between reconstruction accuracy and editing flexibility in diffusion models by jointly conditioning on source prompts and reference images through dual-condition fixed-point optimization.
Authors:Zijian Li, Xiaocheng Feng, Huixin Liu, Yichong Huang, Ting Liu, Bing Qin
Abstract:
With the development of large language models, fine-tuning has emerged as an effective method to enhance performance in specific scenarios by injecting domain-specific knowledge. In this context, model merging techniques provide a solution for fusing knowledge from multiple fine-tuning models by combining their parameters. However, traditional methods often encounter task interference when merging full fine-tuning models, and this problem becomes even more evident in parameter-efficient fine-tuning scenarios. In this paper, we introduce an improvement to the RegMean method, which indirectly leverages the training data to approximate the outputs of the linear layers before and after merging. We propose an adaptive merging method called FroM, which directly measures the model parameters using the Frobenius norm, without any training data. By introducing an additional hyperparameter for control, FroM outperforms baseline methods across various fine-tuning scenarios, alleviating the task interference problem.
中文: 改进的FroM方法通过弗罗贝尼乌斯范数自适应合并微调模型,无需训练数据即可有效缓解任务干扰,并在多种场景中优于基线方法。
English: The improved FroM method adaptively merges fine-tuned models using the Frobenius norm without training data, effectively mitigating task interference and outperforming baseline approaches across diverse scenarios.
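A rough sketch of data-free, norm-guided merging in the spirit of FroM: task vectors are combined with weights derived from their Frobenius norms, controlled by one hyperparameter. The inverse-norm rule below is an illustrative choice, not the paper's exact formula.

```python
import torch

def norm_guided_merge(base, finetuned, gamma=1.0):
    """Merge fine-tuned checkpoints using Frobenius-norm-derived weights.

    base: dict of parameter tensors; finetuned: list of such dicts.
    No training data is touched; gamma controls how strongly small-norm
    task vectors are favored (an assumed weighting rule)."""
    merged = {}
    for name, w0 in base.items():
        deltas = [ft[name] - w0 for ft in finetuned]
        norms = torch.stack([d.norm(p="fro") for d in deltas]) + 1e-8
        weights = 1.0 / norms ** gamma
        weights = weights / weights.sum()
        merged[name] = w0 + sum(wi * d for wi, d in zip(weights, deltas))
    return merged

base = {"w": torch.zeros(2, 2)}
fts = [{"w": torch.eye(2)}, {"w": 3 * torch.eye(2)}]
print(norm_guided_merge(base, fts)["w"])
```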
Authors:Weitao Ma, Xiyuan Du, Xiaocheng Feng, Lei Huang, Yichong Huang, Huiyi Zhang, Xiaoliang Yang, Baohang Li, Xiachong Feng, Ting Liu, Bing Qin
Abstract:
Large language models (LLMs) encode vast world knowledge but struggle to stay up-to-date, often leading to errors and hallucinations. Knowledge editing offers an efficient alternative to retraining, enabling targeted modifications by updating specific model parameters. However, existing methods primarily focus on individual models, posing challenges in efficiently updating multiple models and adapting to new models. To address this, we propose OnceEdit, a novel ensemble-based approach that employs a plug-in model as the editing module, enabling stable knowledge updates across multiple models. Building on the model ensemble, OnceEdit introduces two key mechanisms to enhance its effectiveness. First, we introduce a dynamic weight mechanism through a weight token for distinguishing between edit-related and non-edit-related instances, ensuring the appropriate utilization of knowledge from integrated models. Second, we incorporate an ensemble enhancement mechanism to mitigate the excessive reliance on the central model inherent in the model ensemble technique, making it more suitable for knowledge editing. Extensive experiments on diverse LLMs demonstrate that OnceEdit consistently outperforms existing methods while achieving superior editing efficiency. Further analysis confirms its adaptability and stability in multi-model editing scenarios. Our code will be available.
中文摘要:OnceEdit提出了一种基于集成的创新方法,通过动态权重和增强机制,实现在多个大语言模型间稳定高效地更新知识,其性能优于现有方法。
English Summary: OnceEdit introduces an ensemble-based approach with dynamic weighting and enhancement mechanisms to efficiently update knowledge across multiple large language models, outperforming existing methods in stability and editing efficiency.
Authors:Can Cui, Xindong Zheng, Ruining Deng, Quan Liu, Tianyuan Yao, Keith T Wilson, Lori A Coburn, Bennett A Landman, Haichun Yang, Yaohong Wang, Yuankai Huo
Abstract:
Anomaly detection has been widely studied in the context of industrial defect inspection, with numerous methods developed to tackle a range of challenges. In digital pathology, anomaly detection holds significant potential for applications such as rare disease identification, artifact detection, and biomarker discovery. However, the unique characteristics of pathology images, such as their large size, multi-scale structures, stain variability, and repetitive patterns, introduce new challenges that current anomaly detection algorithms struggle to address. In this quantitative study, we benchmark over 20 classical and prevalent anomaly detection methods through extensive experiments. We curated five digital pathology datasets, both real and synthetic, to systematically evaluate these approaches. Our experiments investigate the influence of image scale, anomaly pattern types, and training epoch selection strategies on detection performance. The results provide a detailed comparison of each method's strengths and limitations, establishing a comprehensive benchmark to guide future research in anomaly detection for digital pathology images.
中文: 本研究通过五个数字病理数据集对20多种异常检测方法进行基准测试,评估了它们在图像尺度、异常类型和训练策略方面的表现,为未来研究建立了全面的指导。
English: This study benchmarks over 20 anomaly detection methods using five digital pathology datasets, evaluating their performance across image scales, anomaly types, and training strategies to establish a comprehensive guide for future research.
Authors:Xiaofeng Cong, Yu-Xin Zhang, Haoran Wei, Yeying Jin, Junming Hou, Jie Gui, Jing Zhang, Dacheng Tao
Abstract:
While nighttime image dehazing has been extensively studied, converting nighttime hazy images to daytime-equivalent brightness remains largely unaddressed. Existing methods face two critical limitations: (1) datasets overlook the brightness relationship between day and night, resulting in the brightness mapping being inconsistent with the real world during image synthesis; and (2) models do not explicitly incorporate daytime brightness knowledge, limiting their ability to reconstruct realistic lighting. To address these challenges, we introduce the Diffusion-Based Nighttime Dehazing (DiffND) framework, which excels in both data synthesis and lighting reconstruction. Our approach starts with a data synthesis pipeline that simulates severe distortions while enforcing brightness consistency between synthetic and real-world scenes, providing a strong foundation for learning night-to-day brightness mapping. Next, we propose a restoration model that integrates a pre-trained diffusion model guided by a brightness perception network. This design harnesses the diffusion model's generative ability while adapting it to nighttime dehazing through brightness-aware optimization. Experiments validate our dataset's utility and the model's superior performance in joint haze removal and brightness mapping.
中文摘要:DiffND框架通过确保亮度一致性的数据合成流程和结合预训练扩散模型与亮度感知优化的修复模型,解决了将夜间雾霾图像转换为等效白天亮度的关键难题。
English Summary: The DiffND framework addresses the overlooked challenge of converting nighttime hazy images to daytime-equivalent brightness through a data synthesis pipeline ensuring brightness consistency and a restoration model combining a pre-trained diffusion model with brightness-aware optimization.
Authors:Kien Nguyen, Clinton Fookes, Sridha Sridharan, Huy Nguyen, Feng Liu, Xiaoming Liu, Arun Ross, Dana Michalski, Tamás Endrei, Ivan DeAndres-Tame, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Zijing Gong, Yuhao Wang, Xuehu Liu, Pingping Zhang, Md Rashidunnabi, Hugo Proença, Kailash A. Hambarde, Saeid Rezaei
Abstract:
Person re-identification (ReID) across aerial and ground vantage points has become crucial for large-scale surveillance and public safety applications. Although significant progress has been made in ground-only scenarios, bridging the aerial-ground domain gap remains a formidable challenge due to extreme viewpoint differences, scale variations, and occlusions. Building upon the achievements of the AG-ReID 2023 Challenge, this paper introduces the AG-VPReID 2025 Challenge - the first large-scale video-based competition focused on high-altitude (80-120m) aerial-ground ReID. Constructed on the new AG-VPReID dataset with 3,027 identities, over 13,500 tracklets, and approximately 3.7 million frames captured from UAVs, CCTV, and wearable cameras, the challenge featured four international teams. These teams developed solutions ranging from multi-stream architectures to transformer-based temporal reasoning and physics-informed modeling. The leading approach, X-TFCLIP from UAM, attained 72.28% Rank-1 accuracy in the aerial-to-ground ReID setting and 70.77% in the ground-to-aerial ReID setting, surpassing existing baselines while highlighting the dataset's complexity. For additional details, please refer to the official website at https://agvpreid25.github.io.
中文摘要:AG-VPReID 2025挑战赛首次推出大规模视频跨视角行人重识别竞赛,基于新型数据集并通过多流架构与Transformer等先进方法,在空地跨域识别中实现了超过70%的首位命中率。
English Summary: The AG-VPReID 2025 Challenge introduces the first large-scale video-based aerial-ground person re-identification competition, featuring a novel dataset and achieving over 70% Rank-1 accuracy through advanced multi-stream and transformer-based solutions.
Authors:Alvaro Becerra, Roberto Daza, Ruth Cobos, Aythami Morales, Mutlu Cukurova, Julian Fierrez
Abstract:
This work investigates the use of multimodal biometrics to detect distractions caused by smartphone use during tasks that require sustained attention, with a focus on computer-based online learning. Although the methods are applicable to various domains, such as autonomous driving, we concentrate on the challenges learners face in maintaining engagement amid internal (e.g., motivation), system-related (e.g., course design) and contextual (e.g., smartphone use) factors. Traditional learning platforms often lack detailed behavioral data, but Multimodal Learning Analytics (MMLA) and biosensors provide new insights into learner attention. We propose an AI-based approach that leverages physiological signals and head pose data to detect phone use. Our results show that single biometric signals, such as brain waves or heart rate, offer limited accuracy, while head pose alone achieves 87%. A multimodal model combining all signals reaches 91% accuracy, highlighting the benefits of integration. We conclude by discussing the implications and limitations of deploying these models for real-time support in online learning environments.
中文: 本研究开发了一种多模态生物特征系统,利用生理信号和头部姿态数据检测在线学习中的手机使用分心行为,集成多信号后准确率达91%,而仅用头部姿态数据时为87%。
English: This study develops a multimodal biometric system using physiological signals and head pose data to detect smartphone-induced distractions during online learning, achieving 91% accuracy by integrating multiple signals compared to 87% with head pose alone.
Authors:Botao Zhu, Xianbin Wang, Lei Zhang, Xuemin Shen
Abstract:
In collaborative systems with complex tasks relying on distributed resources, trust evaluation of potential collaborators has emerged as an effective mechanism for task completion. However, due to the network dynamics and varying information gathering latencies, it is extremely challenging to observe and collect all trust attributes of a collaborating device concurrently for a comprehensive trust assessment. In this paper, a novel progressive trust evaluation framework, namely chain-of-trust, is proposed to make better use of misaligned device attribute data. This framework, designed for effective task completion, divides the trust evaluation process into multiple chained stages based on task decomposition. At each stage, based on the task completion process, the framework only gathers the latest device attribute data relevant to that stage, leading to reduced trust evaluation complexity and overhead. By leveraging advanced in-context learning, few-shot learning, and reasoning capabilities, generative AI is then employed to analyze and interpret the collected data to produce correct evaluation results quickly. Only devices deemed trustworthy at this stage proceed to the next round of trust evaluation. The framework ultimately determines devices that remain trustworthy across all stages. Experimental results demonstrate that the proposed framework achieves high accuracy in trust evaluation.
中文摘要:本文提出了一种链式信任框架,通过将信任评估分解为多个阶段并利用生成式AI分析各阶段相关属性,在显著降低评估复杂度的同时保持了高精度的信任判断。
English Summary: This paper introduces a chain-of-trust framework that progressively evaluates devices through multiple stages using generative AI to analyze relevant attributes at each phase, significantly reducing complexity while maintaining high accuracy in trust assessment.
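The staged structure is easy to express as a progressive filter: each stage gathers only its own attributes and drops devices that fail, as in the sketch below (the attribute table and checker are toy stand-ins for the GenAI-based evaluator).

```python
def chain_of_trust(devices, stages, evaluate):
    """Progressively filter collaborators stage by stage.

    stages: ordered stage names from task decomposition.
    evaluate(device, stage): True if the device's latest stage-relevant
    attributes pass the trust check. Only devices trusted at every
    stage survive to the end."""
    trusted = list(devices)
    for stage in stages:
        trusted = [d for d in trusted if evaluate(d, stage)]
        if not trusted:
            break  # no trustworthy collaborator for this task
    return trusted

attrs = {"dev-a": {"latency", "integrity"}, "dev-b": {"latency"}}
ok = lambda d, s: s in attrs[d]
print(chain_of_trust(["dev-a", "dev-b"], ["latency", "integrity"], ok))  # ['dev-a']
```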
Authors:Xinting Liao, Weiming Liu, Jiaming Qian, Pengyang Zhou, Jiahe Xu, Wenjie Wang, Chaochao Chen, Xiaolin Zheng, Tat-Seng Chua
Abstract:
Federated prompt learning (FPL) for vision-language models is a powerful approach to collaboratively adapt models across distributed clients while preserving data privacy. However, existing FPL approaches suffer from a trade-off between performance and robustness, particularly in out-of-distribution (OOD) shifts, limiting their reliability in real-world scenarios. The inherent in-distribution (ID) data heterogeneity among different clients makes it more challenging to maintain this trade-off. To fill this gap, we introduce a Federated OOD-aware Context Optimization (FOCoOp) framework, which captures diverse distributions among clients using ID global prompts, local prompts, and OOD prompts. Specifically, FOCoOp leverages three sets of prompts to create both class-level and distribution-level separations, which adapt to OOD shifts through bi-level distributionally robust optimization. Additionally, FOCoOp improves the discrimination consistency among clients, i.e., calibrating global prompts, seemingly OOD prompts, and OOD prompts by semi-unbalanced optimal transport. The extensive experiments on real-world datasets demonstrate that FOCoOp effectively captures decentralized heterogeneous distributions and enhances robustness of different OOD shifts. The project is available at GitHub.
中文:联邦提示学习在视觉语言模型中面临分布外偏移下性能与鲁棒性的权衡,而提出的FOCoOp框架通过多组提示和优化技术,有效提升了分布式异构数据环境中的鲁棒性。
English: Federated prompt learning for vision-language models faces a trade-off between performance and robustness in out-of-distribution shifts, which is addressed by the proposed FOCoOp framework using multiple prompts and optimization techniques to enhance robustness across decentralized data.
Authors:Alejandro Peña, Julian Fierrez, Aythami Morales, Gonzalo Mancera, Miguel Lopez, Ruben Tolosana
Abstract:
The use of language technologies in high-stakes settings has increased in recent years, mostly motivated by the success of Large Language Models (LLMs). However, despite the great performance of LLMs, they are susceptible to ethical concerns, such as demographic biases, accountability, or privacy. This work seeks to analyze the capacity of Transformer-based systems to learn demographic biases present in the data, using a case study on AI-based automated recruitment. We propose a privacy-enhancing framework to reduce gender information from the learning pipeline as a way to mitigate biased behaviors in the final tools. Our experiments analyze the influence of data biases on systems built on two different LLMs, and how the proposed framework effectively prevents trained systems from reproducing the bias in the data.
中文: 本研究分析了基于Transformer的系统在AI招聘中学习数据中人口统计偏见的能力,并提出了一种隐私增强框架来减少性别信息,从而有效防止训练后的系统在两个不同大型语言模型中再现数据偏见。
English: This study examines how Transformer-based systems can learn demographic biases from data in AI recruitment tools and proposes a privacy-enhancing framework to reduce gender information, effectively mitigating biased outcomes in two different LLMs.
Authors:Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang
Abstract:
In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
中文摘要:本研究提出了一种通过视频直接学习上下文图像编辑的可扩展方法,采用块因果扩散变换器,在多轮编辑基准测试中取得最优性能,并在多概念组合与故事生成等应用中展现出强大潜力。
English Summary: This study introduces a scalable method for learning in-context image editing directly from videos using a block-causal diffusion transformer, achieving state-of-the-art performance on multi-turn editing benchmarks while demonstrating versatile capabilities in composition and story generation.
Authors:Imanol Solano, Julian Fierrez, Aythami Morales, Alejandro Peña, Ruben Tolosana, Francisco Zamora-Martinez, Javier San Agustin
Abstract:
Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI's superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.
中文: 全面公平指数(CEI)是一种新颖的指标,通过分别分析真实和冒用分数分布来检测人脸识别系统中的细微人口统计偏差,在识别微妙差异方面优于现有方法。
English: The Comprehensive Equity Index (CEI) is a novel metric that detects subtle demographic biases in face recognition systems by separately analyzing genuine and impostor score distributions, outperforming existing methods in identifying nuanced disparities.
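The core idea, separate tail-aware comparisons of genuine and impostor distributions across groups, can be sketched numerically as below; the specific mixing of tail quantiles and medians is an illustrative stand-in for the published CEI definition.

```python
import numpy as np

def tail_aware_gap(genuine, impostor, tail=0.05, w_tail=0.7):
    """Pairwise, tail-focused distribution gap across demographic groups.

    genuine/impostor: dicts mapping group name -> 1-D score arrays.
    Genuine scores are compared at the low tail (misses), impostor scores
    at the high tail (false matches); medians capture overall shape."""
    names = list(genuine)
    gaps = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = names[i], names[j]
            g = (w_tail * abs(np.quantile(genuine[a], tail) - np.quantile(genuine[b], tail))
                 + (1 - w_tail) * abs(np.median(genuine[a]) - np.median(genuine[b])))
            im = (w_tail * abs(np.quantile(impostor[a], 1 - tail) - np.quantile(impostor[b], 1 - tail))
                  + (1 - w_tail) * abs(np.median(impostor[a]) - np.median(impostor[b])))
            gaps.append(g + im)
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
gen = {"g1": rng.normal(0.8, 0.1, 5000), "g2": rng.normal(0.8, 0.15, 5000)}
imp = {"g1": rng.normal(0.2, 0.1, 5000), "g2": rng.normal(0.2, 0.1, 5000)}
print(tail_aware_gap(gen, imp))  # nonzero: the g2 genuine tail is heavier
```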
Authors:Yuanhao Pu, Defu Lian, Xiaolong Chen, Xu Huang, Jin Chen, Enhong Chen
Abstract:
Ranking tasks constitute fundamental components of extreme similarity learning frameworks, where extremely large corpora of objects are modeled through relative similarity relationships adhering to predefined ordinal structures. Among various ranking surrogates, Softmax (SM) Loss has been widely adopted due to its natural capability to handle listwise ranking via global negative comparisons, along with its flexibility across diverse application scenarios. However, despite its effectiveness, SM Loss often suffers from significant computational overhead and scalability limitations when applied to large-scale object spaces. To address this challenge, we propose novel loss formulations that align directly with ranking metrics: the Ranking-Generalizable squared (RG²) Loss and the Ranking-Generalizable interactive (RG×) Loss, both derived through Taylor expansions of the SM Loss. Notably, RG² reveals the intrinsic mechanisms underlying weighted squared losses (WSL) in ranking methods and uncovers fundamental connections between sampling-based and non-sampling-based loss paradigms. Furthermore, we integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method, providing both generalization guarantees and convergence rate analyses. Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance relative to SM Loss, while significantly accelerating convergence. This framework offers the similarity learning community both theoretical insights and practically efficient tools, with methodologies applicable to a broad range of tasks where balancing ranking quality and computational efficiency is essential.
Chinese: 针对Softmax损失在大规模排序任务中的计算和可扩展性限制,我们提出了基于泰勒展开的RG²和RG×损失函数,结合ALS优化方法,在保证排序性能的同时显著提升了收敛速度。
English: To address the computational and scalability limitations of Softmax Loss in large-scale ranking tasks, we propose the RG² and RG× losses derived from Taylor expansions, which integrate with ALS optimization to achieve comparable or superior performance with faster convergence.
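The derivation route, Taylor-expanding the SM loss into a squared surrogate, can be checked numerically. For scores s and positive index p, a second-order expansion of log-sum-exp around the zero vector gives -s_p + log n + mean(s) + 0.5 * var(s), which is quadratic in the scores; this reproduces the generic mechanism only, not the paper's full RG² with its ranking-specific terms.

```python
import numpy as np

def softmax_loss(scores, pos):
    return -scores[pos] + np.log(np.exp(scores).sum())

def quadratic_surrogate(scores, pos):
    # Second-order Taylor expansion of log-sum-exp around the zero vector:
    # LSE(s) ~ log n + mean(s) + 0.5 * (mean(s^2) - mean(s)^2)
    n, mean = len(scores), scores.mean()
    return -scores[pos] + np.log(n) + mean + 0.5 * ((scores ** 2).mean() - mean ** 2)

rng = np.random.default_rng(0)
s = 0.1 * rng.standard_normal(1000)                   # small scores: expansion is tight
print(softmax_loss(s, 0), quadratic_surrogate(s, 0))  # nearly identical values
```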
Authors:Zhuo Chen, Zhenya Ma, Yan Zhang, Donghua Cai, Ye Zhang, Qiushi Li, Yongheng Deng, Ye Guo, Ju Ren, Xuemin Shen
Abstract:
Although federated learning preserves training data within local privacy domains, the aggregated model parameters may still reveal private characteristics. This vulnerability stems from clients' limited training data, which predisposes models to overfitting. Such overfitting enables models to memorize distinctive patterns from training samples, thereby amplifying the success probability of privacy attacks like membership inference. To enhance visual privacy protection in FL, we present CSVAR (Channel-Wise Spatial Image Shuffling with Variance-Guided Adaptive Region Partitioning), a novel image shuffling framework that generates obfuscated images for secure data transmission in each training epoch, addressing both overfitting-induced privacy leaks and raw image transmission risks. CSVAR adopts region-variance as the metric to measure visual privacy sensitivity across image regions. Guided by this, CSVAR adaptively partitions each region into multiple blocks, applying fine-grained partitioning to privacy-sensitive regions with high region-variances to enhance visual privacy protection and coarse-grained partitioning to privacy-insensitive regions to balance model utility. In each region, CSVAR then shuffles blocks across both the spatial domain and the chromatic channels to hide visual spatial features and disrupt color distribution. Experimental evaluations conducted on diverse real-world datasets demonstrate that CSVAR is capable of generating visually obfuscated images that exhibit high perceptual ambiguity to human eyes, simultaneously mitigating the effectiveness of adversarial data reconstruction attacks and achieving a good trade-off between visual privacy protection and model utility.
English Summary: Federated learning models risk privacy breaches through overfitting, but the CSVAR framework counters this by adaptively shuffling image regions and channels to obscure visual features, balancing privacy and model performance.
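The partition-then-shuffle pipeline can be sketched directly: measure variance per region, pick a finer block grid where variance is high, and permute blocks and color channels. Thresholds, grid sizes, and the quadrant layout are illustrative assumptions.

```python
import numpy as np

def variance_guided_shuffle(img, rng, var_thresh=500.0, fine=8, coarse=2):
    """Shuffle an (H, W, 3) uint8 image region by region, finer where
    variance (privacy sensitivity) is high; blocks and channels are permuted."""
    h, w, _ = img.shape
    out = img.copy()
    for y0 in (0, h // 2):
        for x0 in (0, w // 2):
            region = out[y0:y0 + h // 2, x0:x0 + w // 2]
            g = fine if region.var() > var_thresh else coarse  # adaptive grid
            bh, bw = region.shape[0] // g, region.shape[1] // g
            blocks = [region[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].copy()
                      for i in range(g) for j in range(g)]
            for k, src in enumerate(rng.permutation(len(blocks))):  # spatial shuffle
                i, j = divmod(k, g)
                region[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw] = \
                    blocks[src][..., rng.permutation(3)]            # channel shuffle
    return out

shuffled = variance_guided_shuffle(
    np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8), np.random.default_rng(0))
```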
Authors:Wentao Hu, Shunkai Li, Ziqiao Peng, Haoxian Zhang, Fan Shi, Xiaoqiang Liu, Pengfei Wan, Di Zhang, Hui Tian
Abstract:
Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.
中文: GGTalker通过两阶段的先验适应训练策略,结合通用先验和身份特定适应,有效解决了大角度头部旋转和分布外音频的挑战,合成了具有高质量和真实感的3D说话头部。
English: GGTalker overcomes the limitations of previous methods by using a two-stage training strategy with generalizable priors and identity-specific adaptation to synthesize high-quality, photorealistic 3D talking heads that handle large head rotations and out-of-distribution audio effectively.
Authors:Kaiyi Huang, Yukun Huang, Xintao Wang, Zinan Lin, Xuefei Ning, Pengfei Wan, Di Zhang, Yu Wang, Xihui Liu
Abstract:
AI-driven content creation has shown potential in film production. However, existing film generation systems struggle to implement cinematic principles and thus fail to generate professional-quality films, particularly lacking diverse camera language and cinematic rhythm. This results in templated visuals and unengaging narratives. To address this, we introduce FilMaster, an end-to-end AI system that integrates real-world cinematic principles for professional-grade film generation, yielding editable, industry-standard outputs. FilMaster is built on two key principles: (1) learning cinematography from extensive real-world film data and (2) emulating professional, audience-centric post-production workflows. Inspired by these principles, FilMaster incorporates two stages: a Reference-Guided Generation Stage which transforms user input to video clips, and a Generative Post-Production Stage which transforms raw footage into audiovisual outputs by orchestrating visual and auditory elements for cinematic rhythm. Our generation stage highlights a Multi-shot Synergized RAG Camera Language Design module to guide the AI in generating professional camera language by retrieving reference clips from a vast corpus of 440,000 film clips. Our post-production stage emulates professional workflows by designing an Audience-Centric Cinematic Rhythm Control module, including Rough Cut and Fine Cut processes informed by simulated audience feedback, for effective integration of audiovisual elements to achieve engaging content. The system is empowered by generative AI models like (M)LLMs and video generation models. Furthermore, we introduce FilmEval, a comprehensive benchmark for evaluating AI-generated films. Extensive experiments show FilMaster's superior performance in camera language design and cinematic rhythm control, advancing generative AI in professional filmmaking.
中文:FilMaster是一种端到端AI系统,通过融合真实电影原则与专业制作流程,生成具备多样化镜头语言和电影节奏的高质量影片,其创新的两阶段设计与全面评估机制显著优于现有方法。
English: FilMaster is an end-to-end AI system that integrates cinematic principles and professional workflows to generate high-quality films with diverse camera language and engaging cinematic rhythm, outperforming existing methods through its innovative two-stage design and extensive evaluation.
Authors:He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, Wangchunshu Zhou
Abstract:
Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we conduct a systematic empirical study on the GAIA benchmark and BrowseComp to examine the impact of popular design choices in key agent components in a fair and rigorous manner. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents, while others are redundant, despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.
中文摘要:本研究批评了智能体AI研究缺乏标准化的问题,提出了解决可复现性问题的稳健评估方案,并发布了模块化设计的开源OAgents框架以推动领域发展。
English Summary: This study critiques the lack of standardization in Agentic AI research, introduces a robust evaluation protocol to address reproducibility issues, and presents the open-source OAgents framework with modular design for advancing the field.
Authors:King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, Wangchunshu Zhou
Abstract:
Scaling test time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which it improves their effectiveness. Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts. We carefully analyze and ablate the impact of different design strategies on applying test-time scaling to language agents, and report the following findings: 1. Scaling test time compute could improve the performance of agents. 2. Knowing when to reflect is important for agents. 3. Among different verification and result merging approaches, the list-wise method performs best. 4. Increasing diversified rollouts exerts a positive effect on the agent's task performance.
中文:扩展测试时计算可提升语言智能体的性能,其中并行采样、顺序修正、验证器及多样化推演等策略效果显著,尤以列表式验证和适时反思最为关键。
English: Scaling test-time compute enhances language agents' performance, with key strategies including parallel sampling, sequential revision, verifiers, and diversified rollouts, where list-wise verification and reflection timing prove most effective.
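To make the strategies above concrete, here is a minimal sketch of parallel sampling combined with list-wise verification, the combination the abstract reports as strongest. The `generate` and `list_wise_select` functions are hypothetical stand-ins for LLM calls, not the paper's implementation.

```python
def generate(task: str, seed: int) -> str:
    """Hypothetical stand-in for sampling one agent rollout with a given seed."""
    return f"[rollout {seed}] answer for: {task}"

def list_wise_select(candidates: list[str]) -> str:
    """List-wise verification (sketch): a judge sees ALL candidates in one
    prompt and returns the best, rather than scoring each in isolation.
    A toy criterion replaces the LLM judge here."""
    return max(candidates, key=len)

def scale_test_time(task: str, n_rollouts: int = 8) -> str:
    # (1) parallel sampling: draw several diverse rollouts
    candidates = [generate(task, seed=s) for s in range(n_rollouts)]
    # (3) verification and merging, done list-wise over the whole set
    return list_wise_select(candidates)

print(scale_test_time("find the cheapest flight from SFO to NRT"))
```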
Authors:Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Liu, Jian Yang, Ge Zhang, Jiaheng Liu, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Abstract:
Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce \textsc{TaskCraft}, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.
中文: TaskCraft提出了一种自动化工作流程,用于生成可扩展、多工具的代理任务,通过包含3.6万个任务的数据集,提升了提示优化和基础模型的微调效果。
English: TaskCraft introduces an automated workflow for generating scalable, multi-tool agentic tasks, enhancing prompt optimization and fine-tuning of foundation models through a dataset of 36,000 tasks.
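The depth-based extension idea lends itself to a short illustration: an atomic task is wrapped so that another task's answer must be resolved first, while the final answer stays verifiable. Everything below (task contents, field names) is a hypothetical sketch, not TaskCraft's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AtomicTask:
    question: str   # what the agent must find
    answer: str     # verifiable ground truth
    tool: str       # tool needed to solve it

def depth_extend(base: AtomicTask, sub: AtomicTask) -> AtomicTask:
    """Depth-based extension (sketch): hide a literal in the base question
    behind a subtask, so solving now takes two chained tool calls while the
    final answer, and hence verifiability, is unchanged."""
    composed = base.question.replace(sub.answer, f"({sub.question})")
    return AtomicTask(question=composed, answer=base.answer,
                      tool=f"{sub.tool} -> {base.tool}")

capital = AtomicTask("What is the capital of France?", "Paris", "search")
river = AtomicTask("Which river flows through Paris?", "Seine", "search")
print(depth_extend(river, capital).question)
# Which river flows through (What is the capital of France?)?
```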
Authors:Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, Wenhan Luo
Abstract:
Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging, leading to severe token collisions and task confusion due to the varying video lengths and diverse condition modalities across tasks. To address these issues, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly differentiate different editing tasks. This allows our approach to adaptively perform different video editing tasks by referring to the source video and varying condition tokens "in context", and support flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.
中文:UNIC框架通过将源视频、噪声潜变量和多模态条件标记整合为单一序列,利用DiT的注意力机制进行联合建模,并引入任务感知RoPE和条件偏置解决标记冲突,从而统一多种视频编辑任务并支持灵活的任务组合。
English: The UNIC framework unifies diverse video editing tasks by integrating source video, noisy latent, and multi-modal condition tokens into a single sequence modeled with DiT's attention, enhanced by task-aware RoPE and condition bias to resolve token collisions and support flexible task composition.
Authors:Xuanhua He, Quande Liu, Zixuan Ye, Weicai Ye, Qiulin Wang, Xintao Wang, Qifeng Chen, Pengfei Wan, Di Zhang, Kun Gai
Abstract:
Fine-grained and efficient controllability of video diffusion transformers is increasingly desired for practical applications. Recently, In-context Conditioning emerged as a powerful paradigm for unified conditional video generation, which enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full-attention, e.g., FullDiT. Despite their effectiveness, these methods face quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning video generation framework. We begin with a systematic analysis to identify two key sources of the computation inefficiencies: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. Firstly, to address the token redundancy, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full-attention. Additionally, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3 times speedup in averaged time cost per diffusion step, with minimal degradation or even higher performance in video generation quality. The project page is at \href{https://fulldit2.github.io/}{https://fulldit2.github.io/}.
中文: FullDiT2通过动态令牌选择和选择性上下文缓存机制,有效解决了视频生成中上下文条件化的计算效率瓶颈,在保持高质量生成效果的同时实现了显著加速。
English: FullDiT2 addresses the computational inefficiency of in-context conditioning video generation by introducing dynamic token selection and selective context caching, achieving significant speed improvements while maintaining high-quality output.
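A minimal sketch of the dynamic token selection idea: score the context tokens, keep only a top fraction, and run unified full attention over the shortened sequence. The L2-norm importance score below is an assumption; FullDiT2 learns to identify important tokens adaptively.

```python
import torch

def select_context_tokens(ctx: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Dynamic token selection (sketch): keep only 'important' context tokens
    before unified full attention. The L2-norm importance score is a
    placeholder assumption, not the paper's learned mechanism."""
    scores = ctx.norm(dim=-1)                         # (B, N) importance per token
    k = max(1, int(ctx.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices               # top-k token positions
    return torch.gather(ctx, 1, idx.unsqueeze(-1).expand(-1, -1, ctx.shape[-1]))

B, N, D = 2, 1024, 64
context = torch.randn(B, N, D)                        # condition tokens
latent = torch.randn(B, 256, D)                       # noisy video latent tokens
unified = torch.cat([select_context_tokens(context), latent], dim=1)
print(unified.shape)                                  # torch.Size([2, 512, 64])
```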
Authors:Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
Abstract:
Recent advances in interactive video generation have shown promising results, yet existing approaches struggle with scene-consistent memory capabilities in long video generation due to limited use of historical context. In this work, we propose Context-as-Memory, which utilizes historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating context and frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module to select truly relevant context frames by determining FOV (Field of View) overlap between camera poses, which significantly reduces the number of candidate frames without substantial information loss. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs, even generalizing effectively to open-domain scenarios not seen during training. The link of our project page is https://context-as-memory.github.io/.
中文: 本文提出Context-as-Memory方法,通过将历史上下文作为记忆进行帧存储和拼接,并利用基于相机视场重叠的记忆检索模块高效选择相关帧,在交互式长视频生成中实现了优于现有技术的记忆能力。
English: This paper introduces Context-as-Memory, a novel approach that enhances interactive long video generation by using historical context as memory through frame storage and concatenation, with a memory retrieval module to efficiently select relevant frames based on camera pose overlap, achieving superior performance over state-of-the-art methods.
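The Memory Retrieval module's FOV-overlap test can be sketched with a crude angular proxy: treat two camera poses as overlapping when their forward axes point within a field-of-view of each other. The real module presumably intersects view frusta; the proxy below is an assumption.

```python
import numpy as np

def views_overlap(pose_a: np.ndarray, pose_b: np.ndarray, fov_deg: float = 90.0) -> bool:
    """Angular FOV-overlap proxy (sketch): camera-to-world 4x4 poses are
    assumed, with the forward axis in the third rotation column."""
    fwd_a, fwd_b = pose_a[:3, 2], pose_b[:3, 2]
    cos = np.dot(fwd_a, fwd_b) / (np.linalg.norm(fwd_a) * np.linalg.norm(fwd_b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))) < fov_deg

def retrieve_memory(current_pose, history_poses, max_frames=8):
    """Select only history frames whose view plausibly overlaps the current
    one, shrinking the context that must be fed to the generator."""
    hits = [i for i, p in enumerate(history_poses) if views_overlap(current_pose, p)]
    return hits[-max_frames:]          # cap the number of retrieved frames

history = [np.eye(4) for _ in range(20)]               # identical toy poses
print(retrieve_memory(np.eye(4), history))             # -> [12, ..., 19]
```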
Authors:Yawen Luo, Jianhong Bai, Xiaoyu Shi, Menghan Xia, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Tianfan Xue
Abstract:
Camera control is crucial for generating expressive and cinematic videos. Existing methods rely on explicit sequences of camera parameters as control conditions, which can be cumbersome for users to construct, particularly for intricate camera movements. To provide a more intuitive camera control method, we propose CamCloneMaster, a framework that enables users to replicate camera movements from reference videos without requiring camera parameters or test-time fine-tuning. CamCloneMaster seamlessly supports reference-based camera control for both Image-to-Video and Video-to-Video tasks within a unified framework. Furthermore, we present the Camera Clone Dataset, a large-scale synthetic dataset designed for camera clone learning, encompassing diverse scenes, subjects, and camera movements. Extensive experiments and user studies demonstrate that CamCloneMaster outperforms existing methods in terms of both camera controllability and visual quality.
中文: CamCloneMaster提供了一个直观框架,用户无需相机参数或微调即可从参考视频中复制摄像机运动,在控制性和视觉质量上均优于现有方法。
English: CamCloneMaster offers an intuitive framework that allows users to replicate camera movements from reference videos without needing camera parameters or fine-tuning, outperforming existing methods in controllability and visual quality.
Authors:Chaoqun Du, Yulin Wang, Shiji Song, Gao Huang
Abstract:
Bayesian decision theory advocates the Bayes classifier as the optimal approach for minimizing the risk in machine learning problems. Current deep learning algorithms usually solve for the optimal classifier by \emph{implicitly} estimating the posterior probabilities, \emph{e.g.}, by minimizing the Softmax cross-entropy loss. This simple methodology has been proven effective for meticulously balanced academic benchmark datasets. However, it is not applicable to the long-tailed data distributions in the real world, where it leads to the gradient imbalance issue and fails to ensure the Bayes optimal decision rule. To address these challenges, this paper presents a novel approach (BAPE) that provides a more precise theoretical estimation of the data distributions by \emph{explicitly} modeling the parameters of the posterior probabilities and solving them with point estimation. Consequently, our method directly learns the Bayes classifier without gradient descent based on Bayes' theorem, simultaneously alleviating the gradient imbalance and ensuring the Bayes optimal decision rule. Furthermore, we propose a straightforward yet effective \emph{distribution adjustment} technique. This method enables the Bayes classifier trained from the long-tailed training set to effectively adapt to the test data distribution with an arbitrary imbalance factor, thereby enhancing performance without incurring additional computational costs. In addition, we demonstrate the gains of our method are orthogonal to existing learning approaches for long-tailed scenarios, as they are mostly designed under the principle of \emph{implicitly} estimating the posterior probabilities. Extensive empirical evaluations on CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist demonstrate that our method significantly improves the generalization performance of popular deep networks, despite its simplicity.
中文: 本文提出的BAPE方法通过显式建模后验概率直接学习贝叶斯分类器,解决了长尾数据中的梯度失衡问题,并确保最优决策,无需依赖梯度下降。
English: This paper introduces BAPE, a method that explicitly models posterior probabilities to directly learn the Bayes classifier, addressing gradient imbalance and ensuring optimal decisions for long-tailed data without relying on gradient descent.
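The core move, estimating posterior parameters explicitly by point estimation and then applying Bayes' theorem without gradient descent, can be sketched in a few lines. Gaussian class-conditionals with identity covariance are my simplifying assumption, as is the toy 'distribution adjustment' via a swapped-in test-time prior.

```python
import numpy as np

class ExplicitBayes:
    """Bayes classifier fit by point estimation, not gradient descent (sketch).
    Gaussian class-conditionals with shared isotropic covariance are an
    assumption; the paper's actual parameterization may differ."""
    def fit(self, X: np.ndarray, y: np.ndarray):
        self.classes = np.unique(y)
        self.means = np.stack([X[y == c].mean(axis=0) for c in self.classes])
        counts = np.array([(y == c).sum() for c in self.classes], dtype=float)
        self.log_prior = np.log(counts / counts.sum())   # long-tailed train prior
        return self

    def predict(self, X, test_log_prior=None):
        # Bayes rule: argmax_c log p(x|c) + log p(c); swapping in a test-time
        # prior mimics the paper's 'distribution adjustment' idea.
        log_prior = self.log_prior if test_log_prior is None else test_log_prior
        d2 = ((X[:, None, :] - self.means[None]) ** 2).sum(-1)  # (N, C)
        return self.classes[np.argmax(-0.5 * d2 + log_prior, axis=1)]

X = np.vstack([np.random.randn(90, 2), np.random.randn(10, 2) + 3])
y = np.array([0] * 90 + [1] * 10)                    # imbalanced toy data
clf = ExplicitBayes().fit(X, y)
balanced = np.log(np.array([0.5, 0.5]))              # adjust to a balanced test set
print((clf.predict(X, test_log_prior=balanced) == y).mean())
```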
Authors:Shreyas Dixit, Ashhar Aziz, Shashwat Bajpai, Vasu Sharma, Aman Chadha, Vinija Jain, Amitava Das
Abstract:
A report by the European Union Law Enforcement Agency predicts that by 2026, up to 90 percent of online content could be synthetically generated, raising concerns among policymakers, who cautioned that "Generative AI could act as a force multiplier for political disinformation. The combined effect of generative text, images, videos, and audio may surpass the influence of any single modality." In response, California's Bill AB 3211 mandates the watermarking of AI-generated images, videos, and audio. However, concerns remain regarding the vulnerability of invisible watermarking techniques to tampering and the potential for malicious actors to bypass them entirely. Generative AI-powered de-watermarking attacks, especially the newly introduced visual paraphrase attack, have shown an ability to fully remove watermarks, resulting in a paraphrase of the original image. This paper introduces PECCAVI, the first visual paraphrase attack-safe and distortion-free image watermarking technique. In visual paraphrase attacks, an image is altered while preserving its core semantic regions, termed Non-Melting Points (NMPs). PECCAVI strategically embeds watermarks within these NMPs and employs multi-channel frequency domain watermarking. It also incorporates noisy burnishing to counter reverse-engineering efforts aimed at locating NMPs to disrupt the embedded watermark, thereby enhancing durability. PECCAVI is model-agnostic. All relevant resources and codes will be open-sourced.
中文: 欧盟预测到2026年90%的在线内容可能由AI生成,加剧了虚假信息风险,而本文提出的PECCAVI水印技术通过在多通道频域中嵌入非消融点水印,能有效抵御视觉转述攻击。
English: The EU predicts that by 2026, 90% of online content may be AI-generated, heightening disinformation risks, while this paper introduces PECCAVI, a robust watermarking technique that resists visual paraphrase attacks by embedding watermarks in non-melting points and using multi-channel frequency domain methods.
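A toy version of multi-channel frequency-domain embedding: write bits into mid-band DCT coefficients of one semantic region (an NMP). The coefficient band, strength, and read-out rule are assumptions for illustration, not PECCAVI's actual scheme, which also adds noisy burnishing.

```python
import numpy as np
from scipy.fft import dctn, idctn

BAND = lambda h, w: (h // 4, w // 4)      # hypothetical mid-band anchor

def embed_bits(region: np.ndarray, bits: np.ndarray, strength: float = 4.0):
    """Set mid-band DCT coefficients to +/- strength per bit (sketch)."""
    coef = dctn(region, norm="ortho")
    r, c = BAND(*region.shape)
    for i, b in enumerate(bits):
        coef[r, c + i] = strength if b else -strength
    return idctn(coef, norm="ortho")

def read_bits(region: np.ndarray, n_bits: int) -> np.ndarray:
    coef = dctn(region, norm="ortho")
    r, c = BAND(*region.shape)
    return (coef[r, c:c + n_bits] > 0).astype(int)

patch = np.random.rand(64, 64) * 255       # stand-in for one NMP channel
bits = np.array([1, 0, 1, 1, 0, 0, 1, 0])
print(read_bits(embed_bits(patch, bits), len(bits)))   # -> [1 0 1 1 0 0 1 0]
```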
Authors:Danush Khanna, Aditya Kumar Guru, Srivarshinee Sridhar, Zidan Ahmed, Rubhav Bahirwani, Meetu Malhotra, Vinija Jain, Aman Chadha, Amitava Das, Kripabandhu Ghosh
Abstract:
Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches -- such as pruning, quantization, early exits, and speculative decoding -- often require retraining, architectural changes, or disrupt decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates three synergistic mechanisms:
(i) Dynamic Token Halting, which halts computation for tokens with converged representations; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; and (iii) Contextual Token Fusion, which collapses redundant tokens into shared paths to shrink sequence length.
Unlike speculative decoding or MoE routing, QuickSilver operates entirely on frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with negligible perplexity degradation (<=0.2).
中文摘要:QuickSilver是一种模块化推理框架,通过动态停止已收敛的令牌计算、跳过冗余的KV缓存写入以及融合上下文令牌,在不改变模型权重或结构的情况下显著降低大语言模型的计算开销。
English Summary: QuickSilver is a modular inference framework that reduces computational costs in large language models by dynamically halting converged tokens, skipping redundant KV cache writes, and fusing contextual tokens without altering model weights or structure.
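Dynamic Token Halting, the first mechanism, can be sketched as follows: after each layer, tokens whose hidden state barely moved are frozen and excluded from later updates. The relative-change criterion and threshold are assumptions; a real implementation would also skip the halted tokens' computation and KV-cache writes to actually save FLOPs.

```python
import torch
import torch.nn as nn

def forward_with_halting(layers: nn.ModuleList, h: torch.Tensor, tau: float = 1e-2):
    """Dynamic Token Halting (sketch). NOTE: for clarity this still computes
    every token at every layer; a real kernel would gather only active tokens."""
    active = torch.ones(h.shape[:2], dtype=torch.bool, device=h.device)  # (B, T)
    for layer in layers:
        new_h = layer(h)
        change = (new_h - h).norm(dim=-1) / (h.norm(dim=-1) + 1e-6)
        h = torch.where(active.unsqueeze(-1), new_h, h)   # halted tokens keep h
        active = active & (change > tau)                  # converged -> halt
    return h, active

layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(6)])
h, active = forward_with_halting(layers, torch.randn(1, 16, 32))
print(int(active.sum()), "of 16 tokens still active after 6 layers")
```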
Authors:Chang Liu, Hongkai Chen, Yujun Cai, Hang Wu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang
Abstract:
Document understanding remains a significant challenge for multimodal large language models (MLLMs). While previous research has primarily focused on locating evidence pages through precise multimodal queries, our work investigates a fundamental yet overlooked aspect: how input format influences document comprehension performance. Through systematic analysis, we discover that raw OCR text often impairs rather than improves MLLMs' performance, which is a counterintuitive finding we attribute to attention dispersion and structure loss. To further substantiate our hypothesis, we propose a novel structure-preserving approach that encodes document elements using the LaTeX paradigm, maintaining the hierarchical organization and spatial relationships critical for comprehension. Our attention analysis reveals that structured text induces structured attention patterns on both textual and visual content, directing models to focus on semantically meaningful regions while reducing attention waste. This approach significantly enhances MLLMs' document question answering performance across diverse document types without requiring architectural modifications or additional training.
中文: 本研究揭示了原始OCR文本会因注意力分散而损害多模态大语言模型的文档理解能力,并提出基于LaTex的结构保持方法,在不改变模型的情况下显著提升性能。
English: This study reveals that raw OCR text surprisingly hinders multimodal large language models' document understanding due to attention dispersion, and proposes a LaTeX-based structure-preserving method that significantly improves performance without model changes.
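A small example of the structure-preserving idea: instead of feeding an MLLM the flat OCR string of a table, serialize it as LaTeX so column alignment survives. The exact encoding the paper uses is not given here, so this tabular form is an assumption.

```python
def table_to_latex(header: list[str], rows: list[list[str]]) -> str:
    """Encode a table in the LaTeX paradigm (sketch): unlike raw OCR text,
    the tabular markup preserves the column structure that the abstract
    argues structured attention depends on."""
    cols = "l" * len(header)
    lines = [rf"\begin{{tabular}}{{{cols}}}", " & ".join(header) + r" \\ \hline"]
    lines += [" & ".join(r) + r" \\" for r in rows]
    lines.append(r"\end{tabular}")
    return "\n".join(lines)

print(table_to_latex(["Model", "F1"], [["BERT", "88.5"], ["GPT-4", "92.1"]]))
```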
Authors:Yang Wu, Yifan Zhang, Yiwei Wang, Yujun Cai, Yurong Wu, Yuran Wang, Ning Xu, Jian Cheng
Abstract:
While Large Language Models (LLMs) demonstrate impressive reasoning capabilities, growing evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference. In this work, we investigate a central question: are LLMs primarily anchored to final answers or to the textual pattern of reasoning chains? We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis. Experiments across state-of-the-art LLMs reveal a strong and consistent reliance on explicit answers. The performance drops by 26.90\% when answer cues are masked, even with complete reasoning chains. These findings suggest that much of the reasoning exhibited by LLMs may reflect post-hoc rationalization rather than true inference, calling into question their inferential depth. Our study uncovers the answer-anchoring phenomenon with rigorous empirical validation and underscores the need for a more nuanced understanding of what constitutes reasoning in LLMs.
Chinese: 大语言模型表现出对显性答案线索的强烈依赖,当这些线索被屏蔽时性能显著下降,表明其推理可能只是事后合理化而非真正的推理能力。
English: Large Language Models exhibit a strong reliance on explicit answer cues, with performance dropping significantly when these are masked, suggesting their reasoning may be post-hoc rationalization rather than genuine inference.
Authors:Xingjian Tao, Yiwei Wang, Yujun Cai, Zhicheng Yang, Jing Tang
Abstract:
Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations-systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics. To better quantify model uncertainty, we introduce the Peak Sharpness Score (PSS), a metric that evaluates the alignment between semantic continuity and logits distribution in coordinate prediction. Building on this insight, we further propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. Extensive experiments demonstrate that our framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior.
中文摘要:本研究针对多模态大语言模型在图形界面代理中的幻觉问题,提出了细粒度评估框架和峰值锐度评分指标,并通过无需训练的上下文感知裁剪技术有效提升了模型性能。
English Summary: This study addresses hallucinations in multimodal large language models for GUI agents by introducing a fine-grained evaluation framework and the Peak Sharpness Score metric, while proposing Context-Aware Cropping to enhance performance without additional training.
Authors:Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das
Abstract:
Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking.
To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding invariant tool for behavior agnostic safety auditing.
Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
中文摘要:本文提出对齐质量指数(AQI)这一几何度量方法,通过分析潜在空间中安全与不安全激活的聚类分离情况,能够检测传统行为指标所忽略的隐藏错位和越狱风险。
English Summary: The Alignment Quality Index (AQI) is introduced as a geometric metric to assess LLM alignment by analyzing activation clustering in latent space, detecting hidden misalignments and jailbreak risks beyond traditional behavioral proxies.
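Two of the four clustering indices the abstract names are available directly in scikit-learn, which makes a hedged AQI-style proxy easy to sketch. The aggregation below is my assumption, not the paper's formula, which also folds in the Dunn and Xie-Beni indices.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def aqi_proxy(acts: np.ndarray, labels: np.ndarray) -> float:
    """AQI-style score (sketch): how separable are safe vs. unsafe activations
    in latent space? Uses the two indices available in scikit-learn; the
    aggregate below is one plausible combination, not the paper's."""
    dbs = davies_bouldin_score(acts, labels)     # lower = better separation
    chi = calinski_harabasz_score(acts, labels)  # higher = better separation
    return chi / (1.0 + dbs)

rng = np.random.default_rng(0)
safe = rng.normal(0, 1, size=(200, 16))          # pooled 'safe' activations
unsafe = rng.normal(3, 1, size=(200, 16))        # well-separated -> high score
acts = np.vstack([safe, unsafe])
labels = np.array([0] * 200 + [1] * 200)
print(f"AQI proxy: {aqi_proxy(acts, labels):.1f}")
```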
Authors:Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das
Abstract:
Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent thereby evading surface level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open and closed source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE - Geometric Representation Aware Contrastive Enhancement, an alignment framework coupling preference learning with latent space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model, and achieve up to 39% ASR reduction. Moreover, we introduce AVQI, a geometry aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at https://anonymous.4open.science/r/alkali-B416/README.md.
中文摘要:该研究揭示了大型语言模型中存在的一种关键漏洞——潜在伪装,即对抗性提示通过模仿安全表征来逃避检测,并提出了GRACE框架,通过在潜在空间中施加几何约束来缓解这一问题。
English Summary: The study reveals a critical vulnerability in LLMs called latent camouflage, where adversarial prompts evade detection by mimicking safe representations, and introduces the GRACE framework to mitigate this by enforcing geometric constraints in the latent space.
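GRACE's two constraints, latent separation and adversarial cohesion, map naturally onto a small contrastive regularizer over pooled embeddings. The margin, pooling, and exact loss form below are assumptions for illustration, not the paper's objective.

```python
import torch
import torch.nn.functional as F

def grace_style_loss(safe: torch.Tensor, adv: torch.Tensor, margin: float = 1.0):
    """GRACE-style regularizer (sketch): (1) push safe and adversarial pooled
    embeddings at least `margin` apart, (2) pull unsafe/jailbreak embeddings
    together. Margin and mean-distance forms are assumptions."""
    separation = F.relu(margin - torch.cdist(safe, adv).mean())  # latent separation
    cohesion = torch.pdist(adv).mean()                           # adversarial cohesion
    return separation + cohesion

safe = torch.randn(8, 64, requires_grad=True)    # pooled safe completions
adv = torch.randn(8, 64, requires_grad=True)     # pooled adversarial completions
loss = grace_style_loss(safe, adv)
loss.backward()        # would be added to the preference-learning objective
print(loss.item())
```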
Authors:Liyang Chen, Yujun Cai, Jieqiong Dong, Yiwei Wang
Abstract:
Retrieval-Augmented Generation (RAG) systems require corpora that are both structurally clean and semantically coherent. BRIGHT is a recent and influential benchmark designed to evaluate complex multi-hop retrieval across diverse, high-reasoning domains. However, its practical effectiveness is limited by common web-crawled artifacts - such as content redundancy and semantic discontinuity - that impair retrieval accuracy and downstream reasoning. Notably, we find that such issues are concentrated in seven StackExchange-derived subdomains, while other domains (e.g., Coding and Theorem-based content) remain relatively clean.
In this study, we present MARCUS, a multi-agent pipeline that leverages large language models (LLMs) to systematically clean and re-chunk BRIGHT into a higher-quality corpus: BRIGHT-Plus. MARCUS applies dedicated agents for structural noise removal and semantic segmentation, preserving answer-bearing spans while improving contextual integrity. Experimental evaluations demonstrate that BRIGHT-Plus yields consistent and significant improvements in both retrieval accuracy and multi-hop reasoning across a diverse set of retrievers. We release both the BRIGHT-Plus corpus and the MARCUS pipeline to support future research on robust, reasoning-centric retrieval.
中文摘要:本研究提出MARCUS多智能体流程,通过清理BRIGHT基准创建BRIGHT-Plus语料库,有效解决网络爬取数据中的结构噪声和语义断裂问题,显著提升了检索准确性和多跳推理能力。
English Summary: The study introduces MARCUS, a multi-agent pipeline that cleans the BRIGHT benchmark to create BRIGHT-Plus, enhancing retrieval accuracy and multi-hop reasoning by addressing structural and semantic issues in web-crawled data.
Authors:Sifan Li, Yujun Cai, Yiwei Wang
Abstract:
Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%), even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. Strikingly, we propose SemVink (Semantic Visual Thinking) by simply scaling images to low resolutions (32-128 pixels), which unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
中文: 视觉语言模型在检测图像隐藏内容方面表现不佳,在HC-Bench基准测试中准确率接近零,但通过名为SemVink的简单图像缩放技术,将准确率提升至99%以上,揭示了需要结合多尺度处理的混合模型来弥合与人类认知的差距。
English: Vision-language models struggle with detecting hidden content in images, achieving near-zero accuracy on the HC-Bench benchmark, but a simple scaling technique called SemVink boosts accuracy to over 99% by reducing visual noise, highlighting the need for hybrid models that combine multi-scale processing to bridge the gap with human cognition.
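Since SemVink amounts to aggressive downscaling before inference, the preprocessing fits in a few lines. The 32-128 pixel range comes from the abstract; the resampling filter choice is an assumption.

```python
from PIL import Image

def semvink_preprocess(path: str, size: int = 64) -> Image.Image:
    """SemVink-style preprocessing (sketch): shrink the image to 32-128 px so
    high-frequency noise masking the hidden content is averaged away, then
    hand the small image to the VLM."""
    img = Image.open(path).convert("RGB")
    return img.resize((size, size), Image.LANCZOS)

# small = semvink_preprocess("illusion.png")   # hypothetical file; pass `small` to the VLM
```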
Authors:Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
English Summary: RLVR enhances LLM reasoning by primarily optimizing high-entropy "forking tokens" that guide reasoning pathways, with experiments showing that updating just 20% of tokens achieves superior performance compared to full-gradient updates.
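The 20% forking-token recipe reduces to an entropy mask over the policy's own logits: compute per-token entropy and keep gradients only where entropy falls in the top fifth of each sequence. Batching details and the tie-breaking below are assumptions.

```python
import torch

def forking_token_mask(logits: torch.Tensor, top_frac: float = 0.2) -> torch.Tensor:
    """Restrict policy-gradient updates to high-entropy 'forking' tokens
    (sketch). Per-token entropy comes from the policy's logits; the top 20%
    are kept, per the abstract."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)      # (B, T)
    k = max(1, int(entropy.shape[-1] * top_frac))
    thresh = entropy.topk(k, dim=-1).values[..., -1:]          # per-sequence cutoff
    return entropy >= thresh                                   # bool mask (B, T)

logits = torch.randn(4, 128, 32000)            # (batch, seq, vocab)
mask = forking_token_mask(logits)
# per_token_pg_loss = ...                      # standard PPO/GRPO token loss
# loss = (per_token_pg_loss * mask).sum() / mask.sum()
print(mask.float().mean().item())              # ~0.2 of tokens get gradients
```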
Authors:Yang Dai, Jianxiang An, Tianwei Lin, Hongyang He, Hongzhe Huang, Wenqiao Zhang, Zheqi Lv, Siliang Tang, Yueting Zhuang
Abstract:
Multimodal Large Language Models (MLLMs) have achieved success across various domains. However, their applicability tends to degrade when confronted with different types of data inputs, especially for MLLMs that have been fine-tuned for specific tasks. Despite its importance, the study of knowledge sharing among domain-specific MLLMs--such as those trained for mathematics or code--remains largely underexplored. To address the fragmentation of knowledge across domain-specialized MLLMs, we propose a unified parameter integration framework that enables modular composition of expert capabilities. Our method is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which leverages both local functional attribution and global information-theoretic signals to guide selective parameter fusion. By extending this mechanism to the low-rank adaptation layer granularity, we ensure efficient integration with minimal inference overhead. Furthermore, we introduce a domain compatibility scoring mechanism that quantifies inter-expert alignment at the activation level and correlates with downstream task utility. This principled fusion protocol allows the final model to synergize heterogeneous expertise while preserving structural modularity. Extensive evaluations across diverse multimodal benchmarks validate the effectiveness of our framework, offering a scalable path toward compositional, domain-adaptive MLLMs.
中文: 本研究提出了一种基于兼容性感知参数拼接的统一参数集成框架,能够实现领域专用多模态大语言模型的模块化能力组合,在保持结构模块化的同时以最小开销完成高效知识融合,并在多模态基准测试中验证了其有效性。
English: This study introduces a unified parameter integration framework using Compatibility-Aware Parameter Splicing to enable modular composition of domain-specific multimodal large language models, achieving efficient knowledge fusion with minimal overhead while preserving structural modularity across diverse benchmarks.
Authors:Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren
Abstract:
Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, facilitating surgeons' understanding of surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct SVU-31K, a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop the Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset will be publicly accessible soon.
Chinese: SurgVidLM是一种创新的视频语言模型,通过两阶段聚焦机制和大规模训练数据实现了精细化的手术视频理解,在整体和细节手术场景分析方面显著优于现有方法。
English: SurgVidLM is a novel video language model that addresses fine-grained surgical video comprehension through a two-stage StageFocus mechanism and large-scale training data, significantly outperforming existing methods in both holistic and detailed surgical scene analysis.
Authors:Xiting He, Mingwu Su, Xinqi Jiang, Long Bai, Jiewen Lai, Hongliang Ren
Abstract:
Vision-Language-Action (VLA) models have emerged as a prominent research area, showcasing significant potential across a variety of applications. However, their performance in endoscopy robotics, particularly endoscopy capsule robots that perform actions within the digestive system, remains unexplored. The integration of VLA models into endoscopy robots allows more intuitive and efficient interactions between human operators and medical devices, improving both diagnostic accuracy and treatment outcomes. In this work, we design CapsDT, a Diffusion Transformer model for capsule robot manipulation in the stomach. By processing interleaved visual inputs and textual instructions, CapsDT can infer corresponding robotic control signals to facilitate endoscopy tasks. In addition, we developed a capsule endoscopy robot system, a capsule robot controlled by a robotic arm-held magnet, addressing four endoscopy tasks of different difficulty levels and creating corresponding capsule robot datasets within a stomach simulator. Comprehensive evaluations on various robotic tasks indicate that CapsDT can serve as a robust vision-language generalist, achieving state-of-the-art performance in various levels of endoscopy tasks while achieving a 26.25% success rate in real-world simulation manipulation.
中文:视觉-语言-动作模型在医疗应用中展现出人机直观交互的潜力,CapsDT模型通过26.25%的真实场景模拟成功率,在内镜任务中实现了最先进的性能表现。
English: Vision-Language-Action models show promise for intuitive human-robot interaction in medical applications, with CapsDT demonstrating state-of-the-art performance in endoscopy tasks through a 26.25% success rate in real-world simulations.
Authors:Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang
Abstract:
As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation with limited scenarios, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91\% human acceptance rate. Training on our graph-structured data shows that it can more efficiently guide agents compared to manually annotated data. We conduct multidimensional evaluations for various open-source and closed-source models, revealing their performance across various capabilities and paving the way for future advancements. Our project is available at https://omni-bench.github.io/.
中文:OmniBench提出了一种自生成、跨平台的图结构基准,通过子任务组合自动合成可控复杂度的任务,而OmniEval则提供多维度评估框架,其图结构数据在高效指导智能体和人类高接受度方面表现卓越。
English: OmniBench introduces a self-generating, cross-platform benchmark with automated task synthesis for controllable complexity, while OmniEval provides a multidimensional evaluation framework, achieving high human acceptance and efficient agent training through graph-structured data.
Authors:Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Siyuan Li, Yufei Huang, Stan Z. Li
Abstract:
The AlphaFold Protein Structure Database (AFDB) offers unparalleled structural coverage at near-experimental accuracy, positioning it as a valuable resource for data-driven protein design. However, its direct use in training deep models that are sensitive to fine-grained atomic geometry, such as inverse folding, exposes a critical limitation. Comparative analysis of structural feature distributions reveals that AFDB structures exhibit distinct statistical regularities, reflecting a systematic geometric bias that deviates from the conformational diversity found in experimentally determined structures from the Protein Data Bank (PDB). While AFDB structures are cleaner and more idealized, PDB structures capture the intrinsic variability and physical realism essential for generalization in downstream tasks. To address this discrepancy, we introduce a Debiasing Structure AutoEncoder (DeSAE) that learns to reconstruct native-like conformations from intentionally corrupted backbone geometries. By training the model to recover plausible structural states, DeSAE implicitly captures a more robust and natural structural manifold. At inference, applying DeSAE to AFDB structures produces debiased structures that significantly improve inverse folding performance across multiple benchmarks. This work highlights the critical impact of subtle systematic biases in predicted structures and presents a principled framework for debiasing, significantly boosting the performance of structure-based learning tasks like inverse folding.
中文:AlphaFold数据库提供高精度蛋白质结构,但存在系统性几何偏差,影响其在逆向折叠等任务中的效果;为此引入去偏结构自编码器(DeSAE),通过重构类天然构象显著提升结构学习任务的性能。
English: The AlphaFold Database provides high-accuracy protein structures but contains systematic geometric biases that limit its effectiveness for inverse folding tasks, which is addressed by introducing a Debiasing Structure AutoEncoder (DeSAE) to reconstruct native-like conformations and improve performance in structure-based learning.
Authors:Guankun Wang, Rui Tang, Mengya Xu, Long Bai, Huxin Gao, Hongliang Ren
Abstract:
Endoscopic surgery is the gold standard for robotic-assisted minimally invasive surgery, offering significant advantages in early disease detection and precise interventions. However, the complexity of surgical scenes, characterized by high variability in different surgical activity scenarios and confused image features between targets and the background, presents challenges for surgical environment understanding. Traditional deep learning models often struggle with cross-activity interference, leading to suboptimal performance in each downstream task. To address this limitation, we explore multi-task learning, which utilizes the interrelated features between tasks to enhance overall task performance. In this paper, we propose EndoARSS, a novel multi-task learning framework specifically designed for endoscopy surgery activity recognition and semantic segmentation. Built upon the DINOv2 foundation model, our approach integrates Low-Rank Adaptation to facilitate efficient fine-tuning while incorporating Task Efficient Shared Low-Rank Adapters to mitigate gradient conflicts across diverse tasks. Additionally, we introduce the Spatially-Aware Multi-Scale Attention that enhances feature representation discrimination by enabling cross-spatial learning of global information. In order to evaluate the effectiveness of our framework, we present three novel datasets, MTLESD, MTLEndovis and MTLEndovis-Gen, tailored for endoscopic surgery scenarios with detailed annotations for both activity recognition and semantic segmentation tasks. Extensive experiments demonstrate that EndoARSS achieves remarkable performance across multiple benchmarks, significantly improving both accuracy and robustness in comparison to existing models. These results underscore the potential of EndoARSS to advance AI-driven endoscopic surgical systems, offering valuable insights for enhancing surgical safety and efficiency.
中文: 本文提出EndoARSS框架,基于DINOv2模型结合低秩适应与空间感知注意力机制,通过多任务学习提升内窥镜手术中的活动识别与语义分割效果,并在新数据集上验证了其卓越性能。
English: The paper introduces EndoARSS, a multi-task learning framework based on DINOv2 that enhances activity recognition and semantic segmentation in endoscopic surgery by integrating low-rank adaptation and spatially-aware attention, validated through novel datasets and superior benchmark performance.
Authors:Kaihang Pan, Wendong Bu, Yuruo Wu, Yang Wu, Kai Shen, Yunfei Li, Hang Zhao, Juncheng Li, Siliang Tang, Yueting Zhuang
Abstract:
Recent studies extend the autoregression paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment thus failing to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, further introducing a novel reinforcement learning algorithm to emphasize such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.
中文:近期自回归文本生成图像模型在细粒度语义对齐上存在不足,为此提出FocusDiff方法,通过配对数据训练和强化学习增强精准控制,实现了最优性能表现。
English: Recent autoregressive text-to-image models struggle with fine-grained semantic alignment, prompting the development of FocusDiff, which uses paired training data and reinforcement learning to enhance precision and achieve state-of-the-art results.
Authors:Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, Yueting Zhuang
Abstract:
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM with the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.
中文: 本文提出一种多模态大语言模型中视觉理解与生成的协同进化方法,通过两阶段训练实现迭代式内省图像生成,显著提升了文本到图像生成、图像编辑及语义评估的综合能力。
English: This paper introduces a collaborative co-evolution approach for MLLMs that integrates visual comprehension and generation through a two-stage training process, enabling iterative introspective image generation and enhancing performance in text-to-image tasks, image editing, and semantic evaluation.
Authors:Zhiyuan Liang, Dongwen Tang, Yuhao Zhou, Xuanlei Zhao, Mingjia Shi, Wangbo Zhao, Zekai Li, Peihao Wang, Konstantin Schürholt, Damian Borth, Michael M. Bronstein, Yang You, Zhangyang Wang, Kai Wang
Abstract:
Modern Parameter-Efficient Fine-Tuning (PEFT) methods such as low-rank adaptation (LoRA) reduce the cost of customizing large language models (LLMs), yet still require a separate optimization run for every downstream dataset. We introduce \textbf{Drag-and-Drop LLMs (\textit{DnD})}, a prompt-conditioned parameter generator that eliminates per-task training by mapping a handful of unlabeled task prompts directly to LoRA weight updates. A lightweight text encoder distills each prompt batch into condition embeddings, which are then transformed by a cascaded hyper-convolutional decoder into the full set of LoRA matrices. Once trained on a diverse collection of prompt-checkpoint pairs, DnD produces task-specific parameters in seconds, yielding i) up to \textbf{12,000$\times$} lower overhead than full fine-tuning, ii) average gains up to \textbf{30\%} in performance over the strongest training LoRAs on unseen common-sense reasoning, math, coding, and multimodal benchmarks, and iii) robust cross-domain generalization despite never seeing the target data or labels. Our results demonstrate that prompt-conditioned parameter generation is a viable alternative to gradient-based adaptation for rapidly specializing LLMs. Our project is available at \href{https://jerryliang24.github.io/DnD}{https://jerryliang24.github.io/DnD}.
中文: 提出的拖拽式大语言模型(DnD)通过从任务提示直接生成LoRA参数,无需逐任务训练,比全参数微调快12,000倍,并在推理、数学和编程任务中实现性能提升。
English: The proposed Drag-and-Drop LLMs (DnD) method eliminates per-task training by generating LoRA parameters directly from task prompts, achieving up to 12,000× faster adaptation than fine-tuning while improving performance across reasoning, math, and coding tasks.
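The prompt-to-weights mapping can be sketched as a tiny hypernetwork: pool prompt embeddings into a condition vector and decode it into one layer's LoRA pair. The MLP decoder stands in for the paper's cascaded hyper-convolutional decoder, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class PromptToLoRA(nn.Module):
    """Drag-and-Drop sketch: map a batch of task-prompt embeddings straight
    to LoRA matrices for a single layer. A mean-pooled embedding and an MLP
    decoder replace the paper's text encoder and hyper-convolutional decoder."""
    def __init__(self, emb_dim=384, d_model=768, rank=8):
        super().__init__()
        self.rank, self.d = rank, d_model
        self.decoder = nn.Sequential(
            nn.Linear(emb_dim, 1024), nn.GELU(),
            nn.Linear(1024, 2 * d_model * rank),   # flattened A and B
        )

    def forward(self, prompt_embs: torch.Tensor):
        cond = prompt_embs.mean(dim=0)             # distill the prompt batch
        flat = self.decoder(cond)
        A, B = flat.split(self.d * self.rank)
        return A.view(self.rank, self.d), B.view(self.d, self.rank)

gen = PromptToLoRA()
prompts = torch.randn(16, 384)                     # embeddings of unlabeled prompts
A, B = gen(prompts)                                # one LoRA layer, in one pass
print(A.shape, B.shape)                            # no per-task gradient steps
```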
Authors:Ji Zhang, Jingkuan Song, Lianli Gao, Nicu Sebe, Heng Tao Shen
Abstract:
Recent advances in model pre-training give rise to task adaptation-based few-shot learning (FSL), where the goal is to adapt a pre-trained task-agnostic model for capturing task-specific knowledge with a few labeled support samples of the target task. Nevertheless, existing approaches may still fail in the open world due to the inevitable in-distribution (ID) and out-of-distribution (OOD) noise from both support and query samples of the target task. With limited support samples available, i) the adverse effect of the dual noises can be severely amplified during task adaptation, and ii) the adapted model can produce unreliable predictions on query samples in the presence of the dual noises. In this work, we propose DEnoised Task Adaptation (DETA++) for reliable FSL. DETA++ uses a Contrastive Relevance Aggregation (CoRA) module to calculate image and region weights for support samples, based on which a clean prototype loss and a noise entropy maximization loss are proposed to achieve noise-robust task adaptation. Additionally, DETA++ employs a memory bank to store and refine clean regions for each inner-task class, based on which a Local Nearest Centroid Classifier (LocalNCC) is devised to yield noise-robust predictions on query samples. Moreover, DETA++ utilizes an Intra-class Region Swapping (IntraSwap) strategy to rectify ID class prototypes during task adaptation, enhancing the model's robustness to the dual noises. Extensive experiments demonstrate the effectiveness and flexibility of DETA++.
中文: 预训练模型的最新进展通过任务适应实现了少样本学习,但现有方法难以应对分布内和分布外噪声,而DETA++采用对比相关性聚合、记忆库和类内区域交换策略,实现了鲁棒的任务适应和预测。
English: Recent advances in pre-trained models enable few-shot learning through task adaptation, but existing methods struggle with in-distribution and out-of-distribution noise, which DETA++ addresses using contrastive relevance aggregation, a memory bank, and intra-class region swapping to achieve robust adaptation and predictions.
Authors:Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng
Abstract:
This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, language-specific prompts and model averaging techniques were instrumental in boosting system performance across diverse languages. Compared to the initial baseline system, our final model reduced the average Mix Error Rate from 20.2% to 10.6%, representing an absolute improvement of 9.6% (a relative improvement of 48%) on the evaluation set. Our results demonstrate the effectiveness of our approach and offer practical insights for future Speech Large Language Models.
Chinese: NTU Speechlab系统在Interspeech 2025多语言对话语音与语言模型挑战赛中荣获第五名,通过模型架构优化和训练策略创新,将混合错误率从20.2%显著降低至10.6%。
English: The NTU Speechlab system secured 5th place in the Interspeech 2025 MLC-SLM Challenge by reducing the average Mix Error Rate from 20.2% to 10.6% through innovations in model architecture, data selection, and training strategies.
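Of the techniques the abstract credits, model averaging is simple enough to sketch directly: an element-wise mean over the final few checkpoints' weights. The file names are hypothetical.

```python
import torch

def average_checkpoints(paths: list[str]) -> dict:
    """Model averaging (sketch): element-wise mean of several checkpoints'
    state dicts, a common recipe for stabilizing ASR models."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# merged = average_checkpoints([f"epoch_{i}.pt" for i in range(25, 30)])
# torch.save(merged, "averaged.pt")
```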
Authors:Jingwei Wang, Zai Zhang, Hao Qian, Chunjing Gan, Binbin Hu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Bin Shi, Bo Dong
Abstract:
Teaching large language models (LLMs) to use tools is crucial for improving their problem-solving abilities and expanding their applications. However, effectively using tools is challenging because it requires a deep understanding of tool functionalities and user intentions. Previous methods relied mainly on LLMs to generate instruction data, but the quality of these data was often insufficient. In this paper, we propose a new method that uses knowledge graphs to generate high-quality instruction data for LLMs. Knowledge graphs are manually curated datasets rich in semantic information. We begin by extracting various query pathways from a given knowledge graph, which are transformed into a broad spectrum of user queries. We then translate the relationships between entities into actionable tools and parse the pathways of each query into detailed solution steps, thereby creating high-quality instruction data. Our experiments show that fine-tuning on just a small sample of this synthetic data can significantly improve the tool utilization and overall capabilities of LLMs.
Chinese: 本文提出一种利用知识图谱生成高质量指令数据的方法,通过对此合成数据进行微调,显著提升大语言模型的工具使用能力和整体性能。
English: This paper introduces a method using knowledge graphs to generate high-quality instruction data for large language models, enhancing their tool utilization and problem-solving abilities through fine-tuning on this synthetic data.
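To make the pipeline concrete, here is a toy sketch of turning knowledge-graph paths into tool-call instruction data. The graph, relation names, and output schema are hypothetical; the paper's actual query and tool templates are richer.

```python
import networkx as nx

# hypothetical toy knowledge graph; relations play the role of callable tools
kg = nx.DiGraph()
kg.add_edge("Paris", "France", relation="capital_of")
kg.add_edge("France", "Euro", relation="currency_of")

def paths_to_instructions(g, source, target):
    """Turn each relation path into a user query plus solution steps."""
    samples = []
    for path in nx.all_simple_paths(g, source, target):
        steps = [{"tool": g[u][v]["relation"], "input": u, "output": v}
                 for u, v in zip(path, path[1:])]
        samples.append({
            "query": f"Starting from {source}, what entity is reached in {len(steps)} hops?",
            "solution": steps,
            "answer": target,
        })
    return samples

print(paths_to_instructions(kg, "Paris", "Euro"))
```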
Authors:Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao
Abstract:
We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG), and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.
中文:SciVer是首个专为评估基础模型在多模态科学语境中验证声明能力而设计的基准,通过对21个前沿模型的测试发现其与人类专家存在显著性能差距。
English: SciVer is the first benchmark designed to assess foundation models' ability to verify claims in multimodal scientific contexts, revealing a significant performance gap between current models and human experts through evaluation of 21 leading systems.
Authors:Ling Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen
Abstract:
We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput via an algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed datasets. We will release the model, dataset, and code.
中文摘要:Ring-lite是一个基于专家混合架构并通过强化学习优化的大语言模型,通过创新的训练方法解决了优化稳定性与领域冲突问题,在仅激活三分之一参数的情况下实现了最先进的推理性能。
English Summary: Ring-lite is a reinforcement learning-optimized Mixture-of-Experts model that achieves state-of-the-art reasoning performance while activating only one-third of parameters, through novel training techniques addressing optimization stability and domain conflicts.
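The checkpoint-selection idea can be sketched as scoring candidate distillation checkpoints by the entropy of their predictions on a probe set. The abstract does not give the exact criterion, so the "lower entropy wins" rule and the data layout below are assumptions.

```python
import math

def mean_entropy(distributions):
    """Mean entropy of a list of per-token probability distributions."""
    ent = lambda d: -sum(p * math.log(p) for p in d if p > 0)
    return sum(ent(d) for d in distributions) / len(distributions)

# hypothetical: checkpoint name -> predictive distributions on a fixed probe set
probe_outputs = {
    "ckpt_1000": [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]],
    "ckpt_2000": [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],
}
best = min(probe_outputs, key=lambda k: mean_entropy(probe_outputs[k]))
print(best)  # 'ckpt_2000', the more confident starting point for RL
```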
Authors:Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadopoulos, Polydoros Giannouris, Efstathia Soufleri, Nuo Chen, Guojun Xiong, Zhiyang Deng, Yijia Zhao, Mingquan Lin, Meikang Qiu, Kaleb E Smith, Arman Cohan, Xiao-Yang Liu, Jimin Huang, Alejandro Lopez-Lira, Xi Chen, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou, Qianqian Xie
Abstract:
Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two sets of novel tasks: PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than a simple aggregation of existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.
中文摘要:MultiFinBen是首个面向全球金融领域的多语言多模态基准测试,通过引入跨语言推理和多模态整合的新型任务,揭示了即使最先进的模型在复杂金融场景中仍存在显著不足。
English Summary: MultiFinBen is the first multilingual and multimodal benchmark designed for the global financial domain, introducing novel tasks that challenge models with complex cross-lingual reasoning and multimodal integration, where even top-performing models show significant limitations.
Authors:Linlin Wang, Tianqing Zhu, Laiqiao Qin, Longxiang Gao, Wanlei Zhou
Abstract:
Retrieval-Augmented Generation (RAG) can significantly enhance the performance of large language models by integrating external knowledge. However, RAG also introduces new security risks. Existing research focuses mainly on how poisoning attacks in RAG systems affect model output quality, overlooking their potential to amplify model biases. For example, when querying about domestic violence victims, a compromised RAG system might preferentially retrieve documents depicting women as victims, causing the model to generate outputs that perpetuate gender stereotypes even when the original query is gender neutral. To show the impact of this bias, this paper proposes a Bias Retrieval and Reward Attack (BRRA) framework, which systematically investigates attack pathways that amplify language model biases through RAG system manipulation. We design an adversarial document generation method based on multi-objective reward functions, employ subspace projection techniques to manipulate retrieval results, and construct a cyclic feedback mechanism for continuous bias amplification. Experiments on multiple mainstream large language models demonstrate that BRRA attacks can significantly amplify model biases across multiple dimensions. In addition, we explore a dual-stage defense mechanism to effectively mitigate the impacts of the attack. This study reveals that poisoning attacks in RAG systems directly amplify model output biases and clarifies the relationship between RAG system security and model fairness. This novel attack vector underscores the need for continued attention to the fairness of RAG systems.
中文摘要:本文提出偏置检索与奖励攻击框架,揭示检索增强生成系统中的投毒攻击会加剧语言模型偏见,并通过双阶段防御机制有效缓解此类攻击影响。
English Summary: This paper introduces the Bias Retrieval and Reward Attack (BRRA) framework, demonstrating how poisoning attacks in Retrieval-Augmented Generation systems can amplify language model biases, and proposes a dual-stage defense mechanism to counter these attacks.
Authors:Hongjun Liu, Yilun Zhao, Arman Cohan, Chen Zhao
Abstract:
Automatic fact-checking has recently received more attention as a means of combating misinformation. Despite significant advancements, fact-checking systems based on retrieval-augmented language models still struggle to tackle adversarial claims, which are intentionally designed by humans to challenge fact-checking systems. To address these challenges, we propose a training-free method designed to rephrase the original claim, making it easier to locate supporting evidence. Our modular framework, SUCEA, decomposes the task into three steps: 1) Claim Segmentation and Decontextualization that segments adversarial claims into independent sub-claims; 2) Iterative Evidence Retrieval and Claim Editing that iteratively retrieves evidence and edits the subclaim based on the retrieved evidence; 3) Evidence Aggregation and Label Prediction that aggregates all retrieved evidence and predicts the entailment label. Experiments on two challenging fact-checking datasets demonstrate that our framework significantly improves on both retrieval and entailment label accuracy, outperforming four strong claim-decomposition-based baselines.
中文: 本研究提出SUCEA模块化框架,通过分解对抗性声明、迭代检索证据和聚合信息来优化事实核查,显著提升了检索和标签预测的准确性,优于现有基准方法。
English: This study introduces SUCEA, a modular framework that enhances fact-checking by rephrasing adversarial claims through segmentation, iterative evidence retrieval, and aggregation, significantly improving accuracy in retrieval and label prediction over existing methods.
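The three-stage decomposition maps naturally onto a small pipeline skeleton. The callables below (segment, retrieve, edit, aggregate) stand in for the paper's LLM-backed components and are assumptions, not its API.

```python
def sucea_pipeline(claim, segment, retrieve, edit, aggregate, max_iters=3):
    """Skeleton of SUCEA's three stages with caller-supplied components."""
    sub_claims = segment(claim)              # 1) segmentation + decontextualization
    evidence = []
    for sub in sub_claims:                   # 2) iterative retrieval + claim editing
        for _ in range(max_iters):
            docs = retrieve(sub)
            evidence.extend(docs)
            sub = edit(sub, docs)            # rephrase the sub-claim using evidence
    return aggregate(claim, evidence)        # 3) aggregation + label prediction

# trivial stand-ins show the control flow end to end
label = sucea_pipeline(
    "adversarial claim",
    segment=lambda c: [c],
    retrieve=lambda s: ["retrieved doc"],
    edit=lambda s, d: s,
    aggregate=lambda c, e: "SUPPORTED" if e else "NOT ENOUGH INFO",
)
print(label)  # SUPPORTED
```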
Authors:Yinuo Wang, Robert E. Mercer, Frank Rudzicz, Sudipta Singha Roy, Pengjie Ren, Zhumin Chen, Xindi Wang
Abstract:
Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges-such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies-and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.
中文摘要:本综述系统研究基于大语言模型的医疗问答系统在六个可信维度上的表现,分析评估标准和改进方法,并为实现安全可靠的临床应用指明未来研究方向。
English Summary: This survey systematically explores six key dimensions of trustworthiness in medical QA systems using large language models, analyzing evaluation benchmarks and improvement techniques while identifying future research directions for reliable clinical deployment.
Authors:Menglin Zhao, Zhuorui Yong, Ruijia Guan, Kai-Wei Chang, Adrian Haimovich, Kei Ouchi, Timothy Bickmore, Bingsheng Yao, Dakuo Wang, Smit Desai
Abstract:
Serious illness conversations (SICs), discussions between clinical care teams and patients with serious, life-limiting illnesses about their values, goals, and care preferences, are critical for patient-centered care. Without these conversations, patients often receive aggressive interventions that may not align with their goals. Clinical care teams face significant barriers when conducting serious illness conversations with older adult patients in Emergency Department (ED) settings, where most older adult patients lack documented treatment goals. To understand current practices and identify AI support opportunities, we conducted interviews with two domain experts and nine ED clinical care team members. Through thematic analysis, we characterized a four-phase serious illness conversation workflow (identification, preparation, conduction, documentation) and identified key needs and challenges at each stage. Clinical care teams struggle with fragmented EHR data access, time constraints, emotional preparation demands, and documentation burdens. While participants expressed interest in AI tools for information synthesis, conversational support, and automated documentation, they emphasized preserving human connection and clinical autonomy. We present design guidelines for AI tools supporting SIC workflows that fit within existing clinical practices. This work contributes empirical understanding of ED-based serious illness conversations and provides design considerations for AI in high-stakes clinical environments.
中文摘要:本研究探讨了在急诊科开展重病对话的挑战及人工智能工具的辅助机遇,强调在解决数据碎片化和文书负担等流程障碍的同时,必须保持人文关怀与临床自主性。
English Summary: This study explores the challenges and opportunities for AI tools in supporting serious illness conversations in emergency departments, emphasizing the need to maintain human connection while addressing workflow barriers like fragmented data and documentation burdens.
Authors:Anh Bui, Trang Vu, Trung Le, Junae Kim, Tamas Abraham, Rollin Omari, Amar Kaur, Dinh Phung
Abstract:
In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V^*$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only reduces the semantic richness of complex input prompts like "a photo of $V^*$ wearing glasses and playing guitar" into simpler, less contextually rich forms such as "a photo of $V^*$" but also leads to simplified output images that fail to capture the intended concept.
We identify the root cause as unconstrained optimisation, which allows the learned embedding $V^*$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at https://anonymous.4open.science/r/Embedding-Adjustment.
中文摘要:本文针对生成式个性化中的语义坍缩问题,提出无需训练的嵌入调整方法,有效防止学习到的视觉概念主导多概念提示,从而保持语义丰富性并提升图文对齐效果。
English Summary: This paper addresses the semantic collapsing issue in generative personalization where learned visual concepts dominate multi-concept prompts, proposing a training-free embedding adjustment method to maintain semantic richness and improve text-image alignment.
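One way to picture the inference-time adjustment is to pull the drifted embedding's direction back toward an anchor for its original textual meaning and rescale it to a typical token norm. The interpolation rule and alpha below are illustrative assumptions, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def adjust_embedding(v_star, anchor, target_norm, alpha=0.5):
    """Training-free magnitude/direction correction of a learned concept.

    v_star: drifted concept embedding; anchor: embedding of its original
    textual meaning; target_norm: typical norm of regular token embeddings.
    """
    direction = F.normalize((1 - alpha) * F.normalize(v_star, dim=-1)
                            + alpha * F.normalize(anchor, dim=-1), dim=-1)
    return target_norm * direction

v_star = torch.randn(768) * 10        # over-scaled, drifted embedding
anchor = torch.randn(768)
fixed = adjust_embedding(v_star, anchor, target_norm=1.0)
print(round(fixed.norm().item(), 3))  # 1.0
```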
Authors:Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
Abstract:
In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.
中文: KaLM-Embedding-V2模型系列通过先进的训练技术和高品质数据实现了最先进的性能,为十亿参数以下的紧凑型嵌入模型设立了新标杆。
English: The KaLM-Embedding-V2 model series achieves state-of-the-art performance through superior training techniques and high-quality data, setting a new standard for compact embedding models under 1B parameters.
Authors:Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
Abstract:
Recent advancements in Large Language Model (LLM)-based text embedding models primarily focus on data scaling or synthesis, with limited exploration of training techniques and data quality, thereby constraining performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models, systematically incentivizing advanced embedding capability in LLMs by superior training techniques and high-quality data. For model architecture, we implement the models on a 0.5B compact size with simple mean-pooling to produce fixed-length embeddings and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning with supervised high-quality datasets, and contrastive distillation with fine-grained soft signals, integrated with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives, respectively. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation, to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters. The code, data, and models will be publicly available to facilitate academic research.
中文: KaLM-Embedding-V2模型系列通过先进的训练技术和高品质数据实现了最先进的性能,为十亿参数以下的紧凑型嵌入模型设立了新标杆。
English: The KaLM-Embedding-V2 model series achieves state-of-the-art performance through superior training techniques and high-quality data, setting a new standard for compact embedding models under 1B parameters.
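Both versions of the abstract describe bidirectional encoding with mean pooling into fixed-length embeddings. A standard mask-aware mean pooling looks like the sketch below; this is the common recipe, not necessarily KaLM-Embedding-V2's exact code.

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Mask-aware mean pooling of token states into fixed-length embeddings."""
    mask = attention_mask.unsqueeze(-1).float()       # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)        # (B, D)
    counts = mask.sum(dim=1).clamp(min=1e-9)          # (B, 1)
    return summed / counts

h = torch.randn(2, 5, 8)                              # (batch, tokens, dim)
m = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])  # padding mask
print(mean_pool(h, m).shape)                          # torch.Size([2, 8])
```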
Authors:Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe
Abstract:
This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate that our OpusLMs achieve comparable (or even superior) performance to existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection. The OpusLMs are all built from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.
中文: 本文介绍了OpusLMs系列开放基础语音语言模型,这些模型在语音识别、语音合成和文本任务中表现出可比甚至更优的性能,且完全基于公开材料构建,具有完全透明性。
English: This paper introduces OpusLMs, a family of open foundational speech language models up to 7B parameters, which achieve comparable or superior performance in speech recognition, synthesis, and text tasks while being fully transparent and built from publicly available materials.
Authors:Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei
Abstract:
Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing large language model (LLM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LLMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LLMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LLM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LLM reasoning.
中文: 本研究将强化学习中的熵与大语言模型的探索性推理相关联,提出一种基于熵的优势函数简单修改方法,显著提升了推理深度和Pass@K指标性能。
English: This study links entropy in reinforcement learning to exploratory reasoning in large language models, proposing a simple entropy-based advantage modification that significantly enhances reasoning depth and performance on the Pass@K metric.
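The "one line of code" change can be pictured as adding an entropy bonus to the advantage before the policy-gradient update. The coefficient, the detach, and the exact placement below are assumptions for illustration, not the paper's released code.

```python
import torch

def entropy_augmented_advantage(advantages, logits, beta=0.01):
    """Augment per-token advantages with an entropy term (illustrative)."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)   # (B, T)
    return advantages + beta * entropy.detach()

adv = torch.zeros(2, 4)                  # (batch, tokens)
logits = torch.randn(2, 4, 32)           # (batch, tokens, vocab)
print(entropy_augmented_advantage(adv, logits).shape)  # torch.Size([2, 4])
```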
Authors:Aishan Liu, Zonghao Ying, Le Wang, Junjie Mu, Jinyang Guo, Jiakai Wang, Yuqing Ma, Siyuan Liang, Mingchuan Zhang, Xianglong Liu, Dacheng Tao
Abstract:
The rapid advancement of vision-language models (VLMs) and their integration into embodied agents have unlocked powerful capabilities for decision-making. However, as these systems are increasingly deployed in real-world environments, they face mounting safety concerns, particularly when responding to hazardous instructions. In this work, we propose AGENTSAFE, the first comprehensive benchmark for evaluating the safety of embodied VLM agents under hazardous instructions. AGENTSAFE simulates realistic agent-environment interactions within a simulation sandbox and incorporates a novel adapter module that bridges the gap between high-level VLM outputs and low-level embodied controls. Specifically, it maps recognized visual entities to manipulable objects and translates abstract planning into executable atomic actions in the environment. Building on this, we construct a risk-aware instruction dataset inspired by Asimov's Three Laws of Robotics, including base risky instructions and mutated jailbroken instructions. The benchmark includes 45 adversarial scenarios, 1,350 hazardous tasks, and 8,100 hazardous instructions, enabling systematic testing under adversarial conditions across the perception, planning, and action execution stages.
中文摘要:AGENTSAFE作为首个全面评估具身视觉语言模型智能体在危险指令下安全性的基准,通过模拟环境和创新适配模块连接高级决策与可执行动作,构建了包含对抗场景与风险指令的系统测试体系。
English Summary: The AGENTSAFE benchmark is introduced as the first comprehensive framework for evaluating embodied vision-language model agents' safety against hazardous instructions, featuring simulated environments and a novel adapter module to bridge high-level decisions with executable actions.
Authors:Xuan Wang, Siyuan Liang, Zhe Liu, Yi Yu, Aishan Liu, Yuliang Lu, Xitong Gao, Ee-Chien Chang
Abstract:
Mobile agents powered by vision-language models (VLMs) are increasingly adopted for tasks such as UI automation and camera-based assistance. These agents are typically fine-tuned using small-scale, user-collected data, making them susceptible to stealthy training-time threats. This work introduces VIBMA, the first clean-text backdoor attack targeting VLM-based mobile agents. The attack injects malicious behaviors into the model by modifying only the visual input while preserving textual prompts and instructions, achieving stealth through the complete absence of textual anomalies. Once the agent is fine-tuned on this poisoned data, adding a predefined visual pattern (trigger) at inference time activates the attacker-specified behavior (backdoor). Our attack aligns the training gradients of poisoned samples with those of an attacker-specified target instance, effectively embedding backdoor-specific features into the poisoned data. To ensure the robustness and stealthiness of the attack, we design three trigger variants that better resemble real-world scenarios: static patches, dynamic motion patterns, and low-opacity blended content. Extensive experiments on six Android applications and three mobile-compatible VLMs demonstrate that our attack achieves high success rates (ASR up to 94.67%) while preserving clean-task behavior (FSR up to 95.85%). We further conduct ablation studies to understand how key design factors impact attack reliability and stealth. These findings are the first to reveal the security vulnerabilities of mobile agents and their susceptibility to backdoor injection, underscoring the need for robust defenses in mobile agent adaptation pipelines.
中文: 本研究提出VIBMA,一种针对视觉语言模型移动代理的隐形文本清洁后门攻击,通过仅修改视觉输入注入恶意行为,在保持正常功能的同时实现高达94.67%的攻击成功率。
English: This study presents VIBMA, a stealthy clean-text backdoor attack that manipulates visual inputs to compromise VLM-based mobile agents, achieving high attack success while maintaining normal function through imperceptible triggers.
Authors:Yongrui Chen, Zhiqiang Liu, Jing Yu, Lin Ren, Nan Hu, Xinbang Dai, Jiajun Liu, Jiazhen Kang, Shenyu Zhang, Xinda Wang, Keyan Ding, Pengfei Shen, Haolei Zhu, Hongjie Deng, Yisong Wang, Tongtong Wu, Sheng Bi, Wen Zhang, Tianxing Wu, Qiu Ji, Haofen Wang, Wenliang Chen, Huajun Chen, Guilin Qi
Abstract:
Large Language Models (LLMs) have demonstrated substantial progress on reasoning tasks involving unstructured text, yet their capabilities significantly deteriorate when reasoning requires integrating structured external knowledge such as knowledge graphs, code snippets, or formal logic. This limitation is partly due to the absence of benchmarks capable of systematically evaluating LLM performance across diverse structured knowledge modalities. To address this gap, we introduce \textbf{\textsc{OneEval}}, a comprehensive benchmark explicitly designed to assess the knowledge-intensive reasoning capabilities of LLMs across four structured knowledge modalities, unstructured text, knowledge graphs, code, and formal logic, and five critical domains (general knowledge, government, science, law, and programming). \textsc{OneEval} comprises 4,019 carefully curated instances and includes a challenging subset, \textsc{OneEval}\textsubscript{Hard}, consisting of 1,285 particularly difficult cases. Through extensive evaluation of 18 state-of-the-art open-source and proprietary LLMs, we establish three core findings: a) \emph{persistent limitations in structured reasoning}, with even the strongest model achieving only 32.2\% accuracy on \textsc{OneEval}\textsubscript{Hard}; b) \emph{performance consistently declines as the structural complexity of the knowledge base increases}, with accuracy dropping sharply from 53\% (textual reasoning) to 25\% (formal logic); and c) \emph{diminishing returns from extended reasoning chains}, highlighting the critical need for models to adapt reasoning depth appropriately to task complexity. We release the \textsc{OneEval} datasets, evaluation scripts, and baseline results publicly, accompanied by a leaderboard to facilitate ongoing advancements in structured knowledge reasoning.
中文: 大语言模型在需要整合结构化外部知识的推理任务中表现不佳,为此我们推出了OneEval基准测试,系统评估其在不同知识模态下的表现,结果揭示了模型在结构化推理方面存在持续局限。
English: Large Language Models struggle with reasoning tasks that require integrating structured external knowledge, prompting the creation of the OneEval benchmark to systematically evaluate their performance across diverse knowledge modalities and revealing persistent limitations in structured reasoning.
Authors:Tung-Long Vuong, Hoang Phan, Vy Vo, Anh Bui, Thanh-Toan Do, Trung Le, Dinh Phung
Abstract:
Recent approaches leveraging multi-modal pre-trained models like CLIP for Unsupervised Domain Adaptation (UDA) have shown significant promise in bridging domain gaps and improving generalization by utilizing rich semantic knowledge and robust visual representations learned through extensive pre-training on diverse image-text datasets. While these methods achieve state-of-the-art performance across benchmarks, much of the improvement stems from base pseudo-labels (CLIP zero-shot predictions) and self-training mechanisms. Thus, the training mechanism exhibits a key limitation wherein the visual embedding distribution in target domains can deviate from the visual embedding distribution in the pre-trained model, leading to misguided signals from class descriptions. This work introduces a fresh solution to reinforce these pseudo-labels and facilitate target-prompt learning, by exploiting the geometry of visual and text embeddings - an aspect that is overlooked by existing methods. We first propose to directly leverage the reference predictions (from source prompts) based on the relationship between source and target visual embeddings. We then show that visual and text embeddings in pre-trained multi-modal models exhibit strong clustering behavior. Building on optimal transport theory, we transform this insight into a novel strategy to enforce the clustering property in text embeddings, further enhancing the alignment in the target domain. Our experiments and ablation studies validate the effectiveness of the proposed approach, demonstrating superior performance and improved quality of target prompts in terms of representation.
中文: 本文提出了一种新颖方法,通过利用视觉和文本嵌入之间的几何关系来增强无监督域适应中的伪标签和目标提示学习,解决了多模态模型中嵌入分布偏差的问题。
English: This paper introduces a novel method to enhance pseudo-labels and target-prompt learning in unsupervised domain adaptation by leveraging the geometric relationships between visual and text embeddings, addressing the limitation of embedding distribution deviations in multi-modal models like CLIP.
Authors:Eduardo Baena, Paolo Testolina, Michele Polese, Sergi Aliaga, Andrew Benincasa, Dimitrios Koutsonikolas, Josep Jornet, Tommaso Melodia
Abstract:
Lunar surface operations impose stringent requirements on wireless communication systems, including autonomy, robustness to disruption, and the ability to adapt to environmental and mission-driven context. While Space-O-RAN provides a distributed orchestration model aligned with 3GPP standards, its decision logic is limited to static policies and lacks semantic integration. We propose a novel extension incorporating a semantic agentic layer enabled by the Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication protocols, allowing context-aware decision making across real-time, near-real-time, and non-real-time control layers. Distributed cognitive agents deployed in rovers, landers, and lunar base stations implement wireless-aware coordination strategies, including delay-adaptive reasoning and bandwidth-aware semantic compression, while interacting with multiple MCP servers to reason over telemetry, locomotion planning, and mission constraints.
中文: 该方案在Space-O-RAN中引入基于MCP和A2A协议的语义智能体层,使月球设备能通过分布式认知代理实现动态的情境感知无线协调与自适应通信。
English: The proposed extension to Space-O-RAN introduces a semantic agentic layer using MCP and A2A protocols, enabling dynamic, context-aware wireless coordination across lunar assets for adaptive communication strategies.
Authors:Feiyu Yang, Siyuan Liang, Aishan Liu, Dacheng Tao
Abstract:
The ability of generative diffusion models (DMs) such as Stable Diffusion (SD) to replicate training data can be exploited by attackers to launch Copyright Infringement Attacks using duplicated poisoned image-text pairs. SilentBadDiffusion (SBD), a recently proposed method, showed outstanding performance in attacking SD on text-to-image tasks. However, usable data resources in this area remain limited, and some are constrained or prohibited due to copyright ownership or inappropriate content; not all images in current datasets are suitable for the proposed attack methods; and the state-of-the-art (SoTA) performance of SBD degrades sharply when only a few generated poisoning samples can be used for attacks. In this paper, we introduce new datasets accessible for research on attacks like SBD, and propose a Multi-Element (ME) attack method based on SBD that increases the number of poisonous visual-text elements per poisoned sample to strengthen the attack, while applying the Discrete Cosine Transform (DCT) to the poisoned samples to preserve stealthiness. The Copyright Infringement Rate (CIR) / First Attack Epoch (FAE) obtained on the two new datasets were 16.78% / 39.50 and 51.20% / 23.60, respectively, close to or even outperforming the benchmark Pokemon and Midjourney datasets. Under a low subsampling ratio (5%, 6 poisoned samples), MESI and DCT achieved CIR / FAE of 0.23% / 84.00 and 12.73% / 65.50, both better than the original SBD, which failed to attack at all.
中文: 本文提出了新的数据集和基于多元素攻击的方法,结合离散余弦变换增强了对生成扩散模型的版权侵权攻击效果与隐蔽性,在少量投毒样本下仍取得了优于基准的性能。
English: This paper introduces new datasets and a Multi-Element attack method enhanced with Discrete Cosine Transform to improve the effectiveness and stealthiness of copyright infringement attacks on generative diffusion models, achieving competitive results even with limited poisoned samples.
Authors:Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang
Abstract:
Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
中文: AniMaker是一个多智能体框架,通过高效的多候选片段生成和故事感知选择克服视频生成中的挑战,利用专门智能体和创新技术组件,仅从文本输入即可创建全局一致且故事连贯的动画。
English: AniMaker is a multi-agent framework that overcomes challenges in video generation by enabling efficient multi-candidate clip generation and storytelling-aware selection, ensuring globally consistent animations from text input through specialized agents and novel technical components.
Authors:Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
Abstract:
Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
Chinese: 本文系统性地回顾和评估了离散音频标记器在语音、音乐和通用音频领域的表现,通过重构、下游任务和声学语言建模的基准测试,揭示了当前方法的局限性并为未来研究提供了指导方向。
English: This paper provides a systematic review and benchmark of discrete audio tokenizers across speech, music, and general audio domains, evaluating their performance on reconstruction, downstream tasks, and acoustic language modeling while highlighting limitations and future research directions.
Authors:Zhe Wang, Jiayi Zhang, Bokai Xu, Wenhui Yi, Emil Björnson, Bo Ai
Abstract:
To enable next-generation wireless communication networks with modest spectrum availability, multiple-input multiple-output (MIMO) technology needs to undergo further evolution. In this paper, we introduce a promising next-generation wireless communication concept: flexible MIMO technology. This technology represents a MIMO technology with flexible physical configurations and integrated applications. We categorize twelve representative flexible MIMO technologies into three major classifications: flexible deployment characteristics-based, flexible geometry characteristics-based, and flexible real-time modifications-based. Then, we provide a comprehensive overview of their fundamental characteristics, potential, and challenges. Furthermore, we demonstrate three vital enablers for the flexible MIMO technology, including efficient channel state information (CSI) acquisition schemes, low-complexity beamforming design, and explainable artificial intelligence (AI)-enabled optimization. Within these areas, eight critical sub-enabling technologies are discussed in detail. Finally, we present two case studies-pre-optimized irregular arrays and cell-free movable antennas-where significant potential for flexible MIMO technologies to enhance the system capacity is showcased.
中文摘要:本文提出灵活MIMO技术作为传统MIMO系统的演进方向,将十二种代表性技术分为三大类,通过案例研究展示了其在系统容量提升方面的潜力,并详细讨论了信道信息获取、波束成形设计及人工智能优化等关键使能技术。
English Summary: This paper introduces flexible MIMO technology as an evolution of traditional MIMO systems, categorizing twelve variants into three classifications and analyzing their characteristics, enablers like AI optimization, and capacity-enhancing potential through case studies.
Authors:Ngoc-Quan Pham, Tuan Truong, Quyen Tran, Tan Nguyen, Dinh Phung, Trung Le
Abstract:
We introduce Interactive Bayesian Distributional Robustness (IBDR), a novel Bayesian inference framework that allows modeling the interactions between particles, thereby enhancing ensemble quality through increased particle diversity. IBDR is grounded in a generalized theoretical framework that connects the distributional population loss with the approximate posterior, motivating a practical dual optimization procedure that enforces distributional robustness while fostering particle diversity. We evaluate IBDR's performance against various baseline methods using the VTAB-1K benchmark and the common reasoning language task. The results consistently show that IBDR outperforms these baselines, underscoring its effectiveness in real-world applications.
中文摘要:IBDR是一种新型贝叶斯推理框架,通过建模粒子间相互作用并增强多样性来提升集成质量,在基准测试中始终优于现有方法。
English Summary: IBDR is a new Bayesian inference framework that improves ensemble quality by modeling particle interactions and promoting diversity, demonstrating superior performance over baselines in benchmark tests.
Authors:Yangqin Jiang, Xubin Ren, Lianghao Xia, Da Luo, Kangyi Lin, Chao Huang
Abstract:
This work addresses a fundamental barrier in recommender systems: the inability to generalize across domains without extensive retraining. Traditional ID-based approaches fail entirely in cold-start and cross-domain scenarios where new users or items lack sufficient interaction history. Inspired by foundation models' cross-domain success, we develop a foundation model for sequential recommendation that achieves genuine zero-shot generalization capabilities. Our approach fundamentally departs from existing ID-based methods by deriving item representations exclusively from textual features. This enables immediate embedding of any new item without model retraining. We introduce unified item tokenization with Finite Scalar Quantization that transforms heterogeneous textual descriptions into standardized discrete tokens. This eliminates domain barriers that plague existing systems. Additionally, the framework features hybrid bidirectional-causal attention that captures both intra-item token coherence and inter-item sequential dependencies. An efficient catalog-aware beam search decoder enables real-time token-to-item mapping. Unlike conventional approaches confined to their training domains, RecGPT naturally bridges diverse recommendation contexts through its domain-invariant tokenization mechanism. Comprehensive evaluations across six datasets and industrial scenarios demonstrate consistent performance advantages.
中文: 本研究提出了一种序列推荐基础模型,通过基于文本的物品表征和统一标记化实现零样本泛化,无需重新训练即可跨越领域障碍。
English: This study introduces a foundation model for sequential recommendation that achieves zero-shot generalization by using text-based item representations and unified tokenization, overcoming domain barriers without retraining.
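Finite Scalar Quantization, as commonly defined, bounds each latent dimension and rounds it to a small uniform grid with a straight-through gradient. The sketch below shows that generic recipe; RecGPT's level counts and dimensions are not given in the abstract.

```python
import torch

def fsq(z, levels=5):
    """Finite Scalar Quantization: per-dimension rounding to a fixed grid."""
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half                        # squash into (-half, half)
    quantized = torch.round(bounded)                      # nearest grid point
    quantized = bounded + (quantized - bounded).detach()  # straight-through grad
    return quantized / half                               # grid values in [-1, 1]

z = torch.randn(2, 4, requires_grad=True)
tokens = fsq(z)
print(tokens)            # entries drawn from {-1.0, -0.5, 0.0, 0.5, 1.0}
tokens.sum().backward()  # gradients flow thanks to the straight-through trick
```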
Authors:Ze Yu Zhang, Zitao Li, Yaliang Li, Bolin Ding, Bryan Kian Hsiang Low
Abstract:
Retrieval-augmented generation (RAG) based on large language models often falters on narrative documents with inherent temporal structures. Standard unstructured RAG methods rely solely on embedding-similarity matching and lack any general mechanism to encode or exploit chronological information, while knowledge graph RAG (KG-RAG) frameworks collapse every mention of an entity into a single node, erasing the evolving context that drives many queries. To formalize this challenge and draw the community's attention, we construct ChronoQA, a robust and discriminative QA benchmark that measures temporal, causal, and character consistency understanding in narrative documents (e.g., novels) under the RAG setting. We then introduce Entity-Event RAG (E^2RAG), a dual-graph framework that keeps separate entity and event subgraphs linked by a bipartite mapping, thereby preserving the temporal and causal facets needed for fine-grained reasoning. Across ChronoQA, our approach outperforms state-of-the-art unstructured and KG-based RAG baselines, with notable gains on causal and character consistency queries. E^2RAG therefore offers a practical path to more context-aware retrieval for tasks that require precise answers grounded in chronological information.
中文: 针对现有检索增强生成技术难以处理时序叙事的问题,我们构建了ChronoQA基准测试并提出实体-事件双图框架E²RAG,通过保持时间与因果关系显著提升了因果推理与角色一致性查询的准确性。
English: Current retrieval-augmented generation struggles with temporal narratives, so we created ChronoQA benchmark and Entity-Event RAG, a dual-graph framework that outperforms existing methods by preserving chronological context for precise reasoning.
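A toy version of the dual-graph store helps fix ideas: entities and events live in separate structures joined by a bipartite participation map, so an entity's evolving context can be read off in story order. The class and method names here are illustrative, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class EntityEventStore:
    """Entities and events kept apart, linked by a bipartite mapping."""
    event_order: list = field(default_factory=list)     # events in story time
    participation: dict = field(default_factory=dict)   # entity -> [event ids]

    def add_event(self, event_id, entities):
        self.event_order.append(event_id)
        for e in entities:
            self.participation.setdefault(e, []).append(event_id)

    def timeline(self, entity):
        """Events involving an entity, in chronological order."""
        ids = set(self.participation.get(entity, []))
        return [ev for ev in self.event_order if ev in ids]

store = EntityEventStore()
store.add_event("meets_mentor", ["hero", "mentor"])
store.add_event("betrayal", ["mentor"])
print(store.timeline("mentor"))  # ['meets_mentor', 'betrayal']
```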
Authors:Tanmay Parekh, Kartik Mehta, Ninareh Mehrabi, Kai-Wei Chang, Nanyun Peng
Abstract:
Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4-7% average F1 gains over the best baseline -- establishing DiCoRe as a strong zero-shot ED framework.
中文摘要:DiCoRe通过发散-收敛推理框架,将开放式事件发现与有限状态机约束解码相结合,在零样本事件检测任务中实现了跨领域和语言模型的显著性能提升。
English Summary: DiCoRe is a divergent-convergent reasoning framework that enhances zero-shot event detection by combining open-ended event discovery with constrained decoding, achieving significant performance improvements across multiple domains and language models.
Authors:Shuhan Xu, Siyuan Liang, Hongling Zheng, Yong Luo, Aishan Liu, Dacheng Tao
Abstract:
Vision-Language Models (VLMs) have achieved remarkable performance in image captioning, but recent studies show they are vulnerable to backdoor attacks. Attackers can inject imperceptible perturbations-such as local pixel triggers or global semantic phrases-into the training data, causing the model to generate malicious, attacker-controlled captions for specific inputs. These attacks are hard to detect and defend due to their stealthiness and cross-modal nature. By analyzing attack samples, we identify two key vulnerabilities: (1) abnormal attention concentration on specific image regions, and (2) semantic drift and incoherence in generated captions. To counter this, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations (e.g., occlusion, color masking) to sensitive image regions, aiming to disrupt the activation of malicious pathways. We design a semantic fidelity score as the reward signal, which jointly evaluates semantic consistency and linguistic fluency of the output, guiding the agent toward generating robust yet faithful captions. Experiments across mainstream VLMs and datasets show SRD reduces attack success rates to 5.6%, while preserving caption quality on clean inputs with less than 10% performance drop. SRD offers a trigger-agnostic, interpretable defense paradigm against stealthy backdoor threats in multimodal generative models.
中文摘要:视觉语言模型易受隐蔽后门攻击影响导致恶意图像描述,而提出的语义奖励防御框架通过强化学习干扰恶意激活路径,在保持正常描述质量的同时将攻击成功率降至5.6%。
English Summary: Vision-Language Models are susceptible to stealthy backdoor attacks that manipulate image captions, but the proposed Semantic Reward Defense framework effectively counters these threats by using reinforcement learning to disrupt malicious activations while maintaining caption quality.
Authors:Yipeng Du, Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Xiang Li, Jian Yang, Zhenheng Yang, Ying Tai
Abstract:
Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown potential in static images, its application to video's temporal complexities, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether inherent capability can be unlocked and boost MLLMs' motion perception and enable distinct visual signatures tailored to decouple object and camera motion cues. In this study, we introduce MotionSight, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into valuable data assets, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, Θ(40K) video clips and Θ(87K) QAs. Experiments show MotionSight achieves state-of-the-art open-source performance and competitiveness with commercial models. In particular, for fine-grained motion understanding we present a novel zero-shot technique and a large-scale, high-quality dataset. All the code and annotations will be publicly available.
中文: 本研究提出了MotionSight,一种利用视觉提示增强视频细粒度运动理解的零样本方法,并发布了大规模数据集MotionVid-QA,在该领域实现了最先进的性能。
English: This study introduces MotionSight, a novel zero-shot method using visual prompts to enhance fine-grained motion understanding in videos, and presents MotionVid-QA, a large-scale dataset for this purpose, achieving state-of-the-art performance.
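The two visual prompts can be approximated with a few lines of array code: dim everything outside an object box (spotlight) and blend consecutive frames so motion smears (blur). The operations and parameters below are rough illustrations of the idea, not the paper's implementation.

```python
import numpy as np

def spotlight(frame, box, dim=0.3):
    """Object-centric spotlight: keep the box bright, dim the rest."""
    x0, y0, x1, y1 = box
    out = frame.astype(np.float32) * dim
    out[y0:y1, x0:x1] = frame[y0:y1, x0:x1]
    return out.astype(frame.dtype)

def motion_blur(frames):
    """Average consecutive frames so moving regions smear into a motion cue."""
    return np.mean(np.stack(frames).astype(np.float32), axis=0).astype(np.uint8)

frames = [np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8) for _ in range(4)]
prompted = spotlight(motion_blur(frames), box=(16, 16, 48, 48))
print(prompted.shape)  # (64, 64, 3)
```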
Authors:Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Abstract:
Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation data, such as the Switchboard corpus. We will publicly release our models and training code.
Chinese: 提出的思维链策略通过将对话训练与多模态语言模型预训练对齐,有效提升了端到端口语对话系统的性能,在有限数据上实现了显著改进并保持计算高效性。
English: The proposed chain-of-thought strategy enhances end-to-end spoken dialogue systems by aligning conversational training with multimodal language model pre-training, achieving significant performance gains with computational efficiency on limited data.
Authors:Yue Cui, Liuyi Yao, Shuchang Tao, Weijie Shi, Yaliang Li, Bolin Ding, Xiaofang Zhou
Abstract:
Large language models (LLMs) have significantly advanced natural language processing, particularly through the integration of external tools and APIs. However, their effectiveness is frequently hampered by parameter mis-filling during tool calling. In this paper, we propose the Hierarchical Tool Error Checklist (HiTEC) framework to systematically diagnose and mitigate tool-calling errors without relying on extensive real-world interactions. HiTEC introduces a two-tiered approach: a global error checklist that identifies common, cross-tool issues, and a local error checklist that targets tool-specific and contextual failures. Building on this structure, we propose two deployments: HiTEC-In Context Learning (HiTEC-ICL) and HiTEC-Kahneman-Tversky Optimization (HiTEC-KTO). HiTEC-ICL embeds the global checklist in the initial prompts and leverages a two-round conversational interaction to dynamically refine parameter handling, while HiTEC-KTO generates high-quality negative examples to drive fine-tuning via preference-based optimization. Extensive experiments across five public datasets demonstrate that our framework significantly improves parameter-filling accuracy and tool-calling success rates compared to baseline methods.
中文:HiTEC框架通过分层错误检查表和双重部署策略,系统性地提升了大语言模型调用工具的准确性,在参数填充精度上显著优于现有方法。
English: The HiTEC framework systematically enhances LLM tool-calling accuracy through hierarchical error checklists and dual deployment strategies, significantly outperforming existing methods in parameter-filling precision.
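The two-round HiTEC-ICL deployment lends itself to a short sketch: embed the global checklist in the first prompt, then have the model audit its own draft call in a second round. The checklist items, prompt wording, and function names below are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of a HiTEC-ICL-style deployment: a global error checklist in
# the first prompt, then a second round in which the model audits its draft
# tool call. All names and checklist items here are illustrative assumptions.

GLOBAL_CHECKLIST = [
    "Are all required parameters filled?",
    "Do parameter types match the tool schema?",
    "Are units and formats (dates, currencies, IDs) consistent with the query?",
]

def first_round_prompt(tool_schema: str, user_query: str) -> str:
    checklist = "\n".join(f"- {item}" for item in GLOBAL_CHECKLIST)
    return (
        "You are a tool-calling assistant. Before emitting a call, verify:\n"
        f"{checklist}\n\n"
        f"Tool schema:\n{tool_schema}\n\n"
        f"User query:\n{user_query}"
    )

def second_round_prompt(draft_call: str, local_checklist: list[str]) -> str:
    # The local checklist targets tool-specific and contextual failure modes.
    checks = "\n".join(f"- {item}" for item in local_checklist)
    return (
        f"Here is your draft tool call:\n{draft_call}\n\n"
        f"Re-check it against these tool-specific items:\n{checks}\n"
        "Output the corrected call only."
    )
```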
Authors:Xinquan Wang, Fenghao Zhu, Zhaohui Yang, Chongwen Huang, Xiaoming Chen, Zhaoyang Zhang, Sami Muhaidat, Mérouane Debbah
Abstract:
Large artificial intelligence (AI) models offer revolutionary potential for future wireless systems, promising unprecedented capabilities in network optimization and performance. However, current paradigms largely overlook crucial physical interactions. This oversight means they primarily rely on offline datasets, leading to difficulties in handling real-time wireless dynamics and non-stationary environments. Furthermore, these models often lack the capability for active environmental probing. This paper proposes a fundamental paradigm shift towards wireless embodied large AI (WELAI), moving from passive observation to active embodiment. We first identify key challenges faced by existing models, then we explore the design principles and system structure of WELAI. Besides, we outline prospective applications in next-generation wireless. Finally, through an illustrative case study, we demonstrate the effectiveness of WELAI and point out promising research directions for realizing adaptive, robust, and autonomous wireless systems.
中文摘要:本文提出无线具身大人工智能(WELAI)新范式,通过从被动观测转向主动环境交互,解决现有AI模型在实时无线动态适应中的不足,并阐述了其设计架构与应用前景。
English Summary: This paper introduces Wireless Embodied Large AI (WELAI) as a transformative paradigm that shifts from passive data processing to active environmental interaction, addressing limitations of current AI models in real-time wireless adaptability through novel design principles and demonstrated applications.
Authors:Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen
Abstract:
Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be viewed as methods that perform contrastive learning based on the positive and negative samples derived from the base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). This paradigm further explains why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon MI estimator and propose Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the late-stage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks. We will release the model and code upon acceptance.
中文: 本研究将RLHF和DPO重新阐释为基于互信息最大化的对比学习方法,并提出互信息优化(MIO)方法,该方法避免了性能下降并在推理任务中表现更优。
English: This study reinterprets both RLHF and DPO as mutual information maximization methods using contrastive learning, and proposes Mutual Information Optimization (MIO) which outperforms them by avoiding performance decline and excelling in reasoning tasks.
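For readers unfamiliar with the two bounds being contrasted, the standard forms from the MINE and f-GAN literature are reproduced below; this is the conventional notation, not necessarily the paper's exact parameterization.

```latex
% Donsker-Varadhan (MINE) lower bound, under which the abstract interprets
% RLHF/DPO, versus the Jensen-Shannon estimator that MIO substitutes for it
% (standard forms; sp denotes softplus):
\begin{align}
  I(X;Y) &\ge \sup_{T}\; \mathbb{E}_{p(x,y)}\big[T(x,y)\big]
            - \log \mathbb{E}_{p(x)p(y)}\big[e^{T(x,y)}\big]
            \quad \text{(DV/MINE)} \\
  \hat{I}_{\mathrm{JS}} &= \mathbb{E}_{p(x,y)}\big[-\mathrm{sp}(-T(x,y))\big]
            - \mathbb{E}_{p(x)p(y)}\big[\mathrm{sp}(T(x,y))\big],
            \qquad \mathrm{sp}(z) = \log\!\big(1 + e^{z}\big)
\end{align}
```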
Authors:Xiaowei Chi, Kuangzhi Ge, Jiaming Liu, Siyuan Zhou, Peidong Jia, Zichen He, Yuzhen Liu, Tingguang Li, Lei Han, Sirui Han, Shanghang Zhang, Yike Guo
Abstract:
Video Generation Models (VGMs) have become powerful backbones for Vision-Language-Action (VLA) models, leveraging large-scale pretraining for robust dynamics modeling. However, current methods underutilize their distribution modeling capabilities for predicting future states. Two challenges hinder progress: integrating generative processes into feature learning is both technically and conceptually underdeveloped, and naive frame-by-frame video diffusion is computationally inefficient for real-time robotics. To address these, we propose Manipulate in Dream (MinD), a dual-system world model for real-time, risk-aware planning. MinD uses two asynchronous diffusion processes: a low-frequency visual generator (LoDiff) that predicts future scenes and a high-frequency diffusion policy (HiDiff) that outputs actions. Our key insight is that robotic policies do not require fully denoised frames but can rely on low-resolution latents generated in a single denoising step. To connect early predictions to actions, we introduce DiffMatcher, a video-action alignment module with a novel co-training strategy that synchronizes the two diffusion models. MinD achieves a 63% success rate on RL-Bench, 60% on real-world Franka tasks, and operates at 11.3 FPS, demonstrating the efficiency of single-step latent features for control signals. Furthermore, MinD identifies 74% of potential task failures in advance, providing real-time safety signals for monitoring and intervention. This work establishes a new paradigm for efficient and reliable robotic manipulation using generative world models.
中文摘要:提出的MinD模型通过双异步扩散过程,利用单步生成的低分辨率潜在特征实现实时机器人规划,在保持11.3 FPS运行速度的同时,既取得高任务成功率又能提前预警74%的潜在故障。
English Summary: The proposed MinD model employs dual asynchronous diffusion processes to enable real-time robotic planning by efficiently generating low-resolution latent features for action prediction, achieving high success rates and failure anticipation while operating at 11.3 FPS.
Authors:Chen Zhu, Kang Liang, Jianrong Bao, Zhouxiang Zhao, Zhaohui Yang, Zhaoyang Zhang, Mohammad Shikh-Bahaei
Abstract:
The advent of 6G networks demands unprecedented levels of intelligence, adaptability, and efficiency to address challenges such as ultra-high-speed data transmission, ultra-low latency, and massive connectivity in dynamic environments. Traditional wireless image transmission frameworks, reliant on static configurations and isolated source-channel coding, struggle to balance computational efficiency, robustness, and quality under fluctuating channel conditions. To bridge this gap, this paper proposes an AI-native deep joint source-channel coding (JSCC) framework tailored for resource-constrained 6G networks. Our approach integrates key information extraction and adaptive background synthesis to enable intelligent, semantic-aware transmission. Leveraging AI-driven tools, namely Mediapipe for human pose detection and Rembg for background removal, the model dynamically isolates foreground features and matches backgrounds from a pre-trained library, reducing data payloads while preserving visual fidelity. Experimental results demonstrate significant improvements in peak signal-to-noise ratio (PSNR) compared with the traditional JSCC method, especially under low-SNR conditions. This approach offers a practical solution for multimedia services in resource-constrained mobile communications.
中文: 本文提出了一种面向6G网络的AI原生深度联合信源信道编码框架,通过智能提取关键特征和自适应合成背景来提升图像传输效率,在资源受限环境下实现了更优的性能表现。
English: This paper introduces an AI-native deep joint source-channel coding framework for 6G networks, which enhances image transmission efficiency by intelligently extracting key features and synthesizing adaptive backgrounds, achieving superior performance in resource-constrained environments.
Authors:Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei
Abstract:
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
中文: 本文提出强化预训练(RPT)作为一种可扩展范式,通过将下一词预测重构为基于强化学习的推理任务,显著提升了语言建模准确性,并为后续微调提供了坚实基础。
English: This paper introduces Reinforcement Pre-Training (RPT), a scalable paradigm that reframes next-token prediction as a reasoning task using reinforcement learning, significantly improving language modeling accuracy and providing a strong foundation for further fine-tuning.
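The "verifiable reward" in RPT's reframing is conceptually simple: the prediction either matches the corpus's next token or it does not. A minimal sketch follows, assuming a completion format that ends with an ANSWER: marker; the marker and parsing are illustrative assumptions.

```python
# Minimal sketch of a verifiable next-token reward in the spirit of RPT: the
# model emits a reasoning trace ending in a final prediction, and the reward
# is 1 only if that prediction matches the ground-truth next token from the
# corpus. The "ANSWER:" format is an assumption, not the paper's protocol.

def extract_prediction(completion: str) -> str:
    marker = "ANSWER:"
    if marker not in completion:
        return ""
    return completion.rsplit(marker, 1)[-1].strip()

def next_token_reward(completion: str, gold_next_token: str) -> float:
    """Binary reward, checkable without any human annotation."""
    return 1.0 if extract_prediction(completion) == gold_next_token else 0.0
```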
Authors:Qiujie Dong, Jiepeng Wang, Rui Xu, Cheng Lin, Yuan Liu, Shiqing Xin, Zichun Zhong, Xin Li, Changhe Tu, Taku Komura, Leif Kobbelt, Scott Schaefer, Wenping Wang
Abstract:
Cross fields play a critical role in various geometry processing tasks, especially for quad mesh generation. Existing methods for cross field generation often struggle to balance computational efficiency with generation quality, relying on slow per-shape optimization. We introduce CrossGen, a novel framework that supports both feed-forward prediction and latent generative modeling of cross fields for quad meshing by unifying geometry and cross field representations within a joint latent space. Our method enables extremely fast computation of high-quality cross fields of general input shapes, typically within one second without per-shape optimization. Our method assumes a point-sampled surface, also called a point-cloud surface, as input, so we can accommodate various surface representations by a straightforward point sampling process. Using an auto-encoder network architecture, we encode input point-cloud surfaces into a sparse voxel grid with fine-grained latent spaces, which are decoded into both SDF-based surface geometry and cross fields. We also contribute a dataset of models with both high-quality signed distance field (SDF) representations and their corresponding cross fields, and use it to train our network. Once trained, the network is capable of computing a cross field of an input surface in a feed-forward manner, ensuring high geometric fidelity, noise resilience, and rapid inference. Furthermore, leveraging the same unified latent representation, we incorporate a diffusion model for computing cross fields of new shapes generated from partial input, such as sketches. To demonstrate its practical applications, we validate CrossGen on the quad mesh generation task for a large variety of surface shapes. Experimental results...
中文: CrossGen提出了一种高效框架,通过将几何和交叉场表示统一在联合潜在空间中,为四边形网格生成高质量交叉场,无需逐形状优化即可实现快速计算。
English: CrossGen introduces a fast and efficient framework that generates high-quality cross fields for quad meshing by unifying geometry and cross field representations in a joint latent space, enabling rapid computation without per-shape optimization.
Authors:Hao Wang, Chengkai Hou, Xianglong Li, Yankai Fu, Chenxuan Li, Ning Chen, Gaole Dai, Jiaming Liu, Tiejun Huang, Shanghang Zhang
Abstract:
Learning to control high-speed objects in the real world remains a challenging frontier in robotics. Table tennis serves as an ideal testbed for this problem, demanding both rapid interception of fast-moving balls and precise adjustment of their trajectories. This task presents two fundamental challenges: it requires a high-precision vision system capable of accurately predicting ball trajectories, and it necessitates intelligent strategic planning to ensure precise ball placement to target regions. The dynamic nature of table tennis, coupled with its real-time response requirements, makes it particularly well-suited for advancing robotic control capabilities in fast-paced, precision-critical domains. In this paper, we present SpikePingpong, a novel system that integrates spike-based vision with imitation learning for high-precision robotic table tennis. Our approach introduces two key components that directly address the aforementioned challenges: SONIC, a spike camera-based module that achieves millimeter-level precision in ball-racket contact prediction by compensating for real-world uncertainties such as air resistance and friction; and IMPACT, a strategic planning module that enables accurate ball placement to targeted table regions. The system harnesses a 20 kHz spike camera for high-temporal-resolution ball tracking, combined with efficient neural network models for real-time trajectory correction and stroke planning. Experimental results demonstrate that SpikePingpong achieves a remarkable 91% success rate for the 30 cm accuracy target area and 71% for the more challenging 20 cm accuracy task, surpassing previous state-of-the-art approaches by 38% and 37% respectively. These significant performance improvements enable the robust implementation of sophisticated tactical gameplay strategies, providing a new research perspective for robotic control in high-speed dynamic tasks.
中文:SpikePingpong提出了一种结合脉冲视觉与模仿学习的机器人乒乓球系统,通过SONIC和IMPACT模块实现了毫米级精度的球拍接触预测和精准落点控制,实验结果表明其性能显著超越现有最佳方法。
English: SpikePingpong introduces a robotic table tennis system combining spike-based vision and imitation learning, achieving superior precision in ball tracking and strategic placement through its SONIC and IMPACT modules, with experimental results showing significant performance improvements over existing methods.
Authors:Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei
Abstract:
Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42$\times$ end-to-end speedup under decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.
Chinese: ReSA通过结合块稀疏注意力与周期性密集校正,解决了稀疏解码中的KV缓存错位问题,在长序列生成中实现了近乎无损的质量和高达2.42倍的加速效果。
English: ReSA addresses KV cache misalignment in sparse decoding by combining block-sparse attention with periodic dense rectification, achieving near-lossless generation quality and up to 2.42× speedup for long sequences.
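The rectification schedule described in the abstract reduces to a simple decoding loop: sparse steps punctuated by a periodic dense KV refresh. Below is a sketch under an assumed model interface; prefill, sparse_step, and dense_refresh are hypothetical names, not ReSA's API.

```python
# Sketch of ReSA-style decoding as the abstract describes it: block-sparse
# attention for most steps, with a dense forward pass at fixed intervals to
# rebuild the KV cache and bound accumulated approximation error. The model
# interface here is hypothetical.

def resa_decode(model, prompt_ids, max_new_tokens: int, rectify_every: int = 128):
    kv_cache = model.prefill(prompt_ids)              # dense prefill
    tokens = list(prompt_ids)
    for step in range(max_new_tokens):
        if step > 0 and step % rectify_every == 0:
            # Periodic dense rectification: recompute the cache with full
            # attention so sparse approximation errors stop compounding.
            kv_cache = model.dense_refresh(tokens)
        next_id, kv_cache = model.sparse_step(tokens[-1], kv_cache)
        tokens.append(next_id)
    return tokens
```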
Authors:Fan Shi, Haiyang Yu, Bin Li, Xiangyang Xue
Abstract:
Humans can decompose Chinese characters into compositional components and recombine them to recognize unseen characters. This reflects two cognitive principles: Compositionality, the idea that complex concepts are built on simpler parts; and Learning-to-learn, the ability to learn strategies for decomposing and recombining components to form new concepts. These principles provide inductive biases that support efficient generalization. They are critical to Chinese character recognition (CCR) in solving the zero-shot problem, which results from the common long-tail distribution of Chinese character datasets. Existing methods have made substantial progress in modeling compositionality via predefined radical or stroke decomposition. However, they often ignore the learning-to-learn capability, limiting their ability to generalize beyond human-defined schemes. Inspired by these principles, we propose a deep latent variable model that learns Compositional Latent components of Chinese characters (CoLa) without relying on human-defined decomposition schemes. Recognition and matching can be performed by comparing compositional latent components in the latent space, enabling zero-shot character recognition. The experiments illustrate that CoLa outperforms previous methods in both character and radical zero-shot CCR. Visualization indicates that the learned components can reflect the structure of characters in an interpretable way. Moreover, despite being trained on historical documents, CoLa can analyze components of oracle bone characters, highlighting its cross-dataset generalization ability.
Chinese: 人类通过将汉字分解为组合部件并重新组合来识别未见过的汉字,这体现了组合性与学会学习的认知原则,而CoLa模型借鉴这些原则,无需预定义分解方案即可实现卓越的零样本识别能力。
English: Humans recognize unseen Chinese characters by decomposing them into compositional components and recombining them, guided by cognitive principles of compositionality and learning-to-learn, which the proposed CoLa model emulates to achieve superior zero-shot recognition without predefined decomposition schemes.
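Zero-shot recognition by comparing compositional latent components admits a compact sketch. The matching rule below (mean of best-pair cosine similarities) is an assumption chosen for illustration; the paper's latent space and scoring may differ.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of zero-shot character matching in a CoLa-style latent space:
# score a query character against each candidate by aligning their learned
# component embeddings. The scoring rule is an illustrative assumption.

def match_score(query_comps: torch.Tensor, cand_comps: torch.Tensor) -> float:
    """query_comps: [m, d], cand_comps: [n, d] component embeddings."""
    sim = F.normalize(query_comps, dim=-1) @ F.normalize(cand_comps, dim=-1).T
    return sim.max(dim=1).values.mean().item()  # best match per query component

def classify(query_comps: torch.Tensor, candidates: dict) -> str:
    """candidates: character label -> [n, d] component embeddings."""
    return max(candidates, key=lambda c: match_score(query_comps, candidates[c]))
```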
Authors:Liang Yue, Yihong Tang, Kehai Chen, Jie Liu, Min Zhang
Abstract:
Instruction fine-tuning is crucial in NLP tasks, enhancing pretrained models' instruction-following capabilities and task-specific performance. However, obtaining high-quality fine-tuning data for large models is challenging due to data collection difficulties and high production costs. To address this, we propose MASTER, a novel data augmentation method that enriches original data through interactions among multiple agents with varying cognitive levels. We simulate three pedagogically grounded teaching scenarios, leveraging multi-agent conversations to generate high-quality teacher-student interaction data. Utilizing MASTER, we construct BOOST-QA, a fine-tuning dataset augmented from existing datasets like Orca-Math-200k, ProcQA, and OpenHermes2.5. Experiments show that models fine-tuned with BOOST-QA perform excellently across multiple benchmarks, demonstrating strong multitask generalization. Notably, MASTER significantly improves models' reasoning abilities in complex tasks, providing valuable insights for future research.
中文: 提出的MASTER方法通过多智能体交互生成高质量数据来增强指令微调,使得模型在多项基准测试中表现卓越,并显著提升了复杂任务中的推理能力。
English: The proposed MASTER method enhances instruction fine-tuning by using multi-agent interactions to generate high-quality data, leading to models with improved performance and reasoning abilities across diverse benchmarks.
Authors:Yifan Duan, Yihong Tang, Kehai Chen, Liqiang Nie, Min Zhang
Abstract:
High-quality prompts are crucial for eliciting outstanding performance from large language models (LLMs) on complex tasks. Existing research has explored model-driven strategies for prompt optimization. However, these methods often suffer from high computational overhead or require strong optimization capabilities from the model itself, which limits their broad applicability. To address these challenges, we propose ORPP (Optimized Role-Playing Prompt), a framework that enhances model performance by optimizing and generating role-playing prompts. The core idea of ORPP is to confine the prompt search space to role-playing scenarios, thereby fully activating the model's intrinsic capabilities through carefully crafted, high-quality role-playing prompts. Specifically, ORPP first performs iterative optimization on a small subset of training samples to generate high-quality role-playing prompts. Then, leveraging the model's few-shot learning capability, it transfers the optimization experience to efficiently generate suitable prompts for the remaining samples. Our experimental results show that ORPP not only matches but in most cases surpasses existing mainstream prompt optimization methods in terms of performance. Notably, ORPP demonstrates superior "plug-and-play" capability. In most cases, it can be integrated with various other prompt methods and further enhance their effectiveness.
中文: 提出的ORPP框架通过在小样本集上迭代优化生成高质量角色扮演提示,并利用少样本学习能力迁移优化经验,有效提升大语言模型性能,其表现超越主流提示优化方法且具备卓越的即插即用兼容性。
English: The proposed ORPP framework enhances large language model performance by generating optimized role-playing prompts through iterative training on small sample sets and leveraging few-shot learning for broader application, outperforming existing methods with superior plug-and-play compatibility.
Authors:Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, Pheng-Ann Heng
Abstract:
Generalized policy and execution efficiency constitute the two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches, inspired by Kahneman's theory, have been proposed to leverage a VLM-based System 2 model handling high-level reasoning and a separate System 1 action model ensuring real-time control. However, existing designs maintain both systems as separate models, limiting System 1 from fully leveraging the rich pretrained knowledge from the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This innovative paradigm not only enables high-frequency execution in System 1 but also facilitates coordination between the reasoning and execution components within a single foundation model of System 2. Given their fundamentally distinct roles within FiS-VLA, we design the two systems to incorporate heterogeneous modality inputs alongside asynchronous operating frequencies, enabling both fast and precise manipulation. To enable coordination between the two systems, a dual-aware co-training strategy is proposed that equips System 1 with action generation capabilities while preserving System 2's contextual reasoning representation. For evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with action chunk set to eight. Project web page: fast-in-slow.github.io.
中文摘要:提出的Fast-in-Slow(FiS)模型通过将高层推理与实时动作执行整合到统一的视觉语言动作框架中,解决了机器人操作中的关键难题,实现了卓越性能和高速控制能力。
English Summary: The proposed Fast-in-Slow (FiS) model addresses robotic manipulation challenges by integrating high-level reasoning and real-time action execution within a unified vision-language-action framework, achieving superior performance and high-frequency control.
Authors:Yihong Tang, Kehai Chen, Muyun Yang, Zhengyu Niu, Jing Li, Tiejun Zhao, Min Zhang
Abstract:
The advancement of Large Language Models (LLMs) has spurred significant interest in Role-Playing Agents (RPAs) for applications such as emotional companionship and virtual interaction. However, recent RPAs are often built on explicit dialogue data, lacking deep, human-like internal thought processes, resulting in superficial knowledge and style expression. While Large Reasoning Models (LRMs) can be employed to simulate character thought, their direct application is hindered by attention diversion (i.e., RPAs forget their role) and style drift (i.e., overly formal and rigid reasoning rather than character-consistent reasoning). To address these challenges, this paper introduces a novel Role-Aware Reasoning (RAR) method, which consists of two important stages: Role Identity Activation (RIA) and Reasoning Style Optimization (RSO). RIA explicitly guides the model with character profiles during reasoning to counteract attention diversion, and then RSO aligns reasoning style with the character and scene via LRM distillation to mitigate style drift. Extensive experiments demonstrate that the proposed RAR significantly enhances the performance of RPAs by effectively addressing attention diversion and style drift.
中文摘要:本文提出角色感知推理方法,通过角色身份激活和推理风格优化解决注意力分散和风格漂移问题,显著提升了角色扮演智能体的表现。
English Summary: This paper introduces a Role-Aware Reasoning (RAR) method to enhance Role-Playing Agents by addressing attention diversion and style drift through Role Identity Activation and Reasoning Style Optimization, significantly improving their performance.
Authors:Ke Niu, Zhuofan Chen, Haiyang Yu, Yuwen Chen, Teng Fu, Mengyang Zhao, Bin Li, Xiangyang Xue
Abstract:
Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing. Orthographic projection reasoning underpins the entire CAD workflow, encompassing design, manufacturing, and simulation. However, prevailing deep-learning approaches employ standard 3D reconstruction pipelines as an alternative, which often introduce imprecise dimensions and limit the parametric editability required for CAD workflows. Recently, some researchers have adopted vision-language models (VLMs), particularly via supervised fine-tuning (SFT), to tackle CAD-related challenges. SFT shows promise but often devolves into pattern memorization, yielding poor out-of-distribution performance on complex reasoning tasks. To address these gaps, we introduce CReFT-CAD, a two-stage fine-tuning paradigm that first employs a curriculum-driven reinforcement learning stage with difficulty-aware rewards to build reasoning ability steadily, and then applies supervised post-tuning to hone instruction following and semantic extraction. Complementing this, we release TriView2CAD, the first large-scale, open-source benchmark for orthographic projection reasoning, comprising 200,000 synthetic and 3,000 real-world orthographic projections with precise dimension annotations and six interoperable data modalities. We benchmark leading VLMs on orthographic projection reasoning and demonstrate that CReFT-CAD substantially improves reasoning accuracy and out-of-distribution generalizability in real-world scenarios, offering valuable insights for advancing CAD reasoning research.
中文: CReFT-CAD提出了一种两阶段微调范式,结合课程驱动的强化学习和监督后调优,以提升CAD工作流中的推理精度和分布外泛化能力,并辅以TriView2CAD基准进行正投影评估。
English: CReFT-CAD introduces a two-stage fine-tuning paradigm combining curriculum-driven reinforcement learning and supervised post-tuning to enhance reasoning accuracy and out-of-distribution generalizability in CAD workflows, supported by the TriView2CAD benchmark for orthographic projection evaluation.
Authors:Yansong Qu, Shaohui Dai, Xinyang Li, Yuze Wang, You Shen, Liujuan Cao, Rongrong Ji
Abstract:
Reconstructing 3D objects from a single image remains challenging, especially under real-world occlusions. While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they typically assume fully visible inputs and fail when parts of the object are occluded, resulting in degraded 3D reconstruction quality. We propose DeOcc-1-to-3, an end-to-end framework for occlusion-aware multi-view generation that synthesizes six structurally consistent novel views directly from a single occluded image, enabling reliable 3D reconstruction without prior inpainting or manual annotations. Our self-supervised training pipeline leverages occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency. Without modifying the original architecture, we fully fine-tune the view synthesis model to jointly learn completion and multi-view generation. Additionally, we introduce the first benchmark for occlusion-aware reconstruction, covering diverse occlusion levels, object categories, and masking patterns, providing a standardized protocol for future evaluation.
中文: DeOcc-1-to-3是一个端到端框架,能从单张遮挡图像生成六个结构一致的新视角,通过自监督训练实现可靠的3D重建,无需预先修复或人工标注。
English: DeOcc-1-to-3 is an end-to-end framework that generates six structurally consistent novel views from a single occluded image, enabling reliable 3D reconstruction through self-supervised training without requiring prior inpainting or manual annotations.
Authors:Kangcong Li, Peng Ye, Chongjun Tu, Lin Zhang, Chunfeng Song, Jiamin Wu, Tao Yang, Qihao Zheng, Tao Chen
Abstract:
While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by transient neural activations causing information decay and unstructured feed-forward network (FFN) weights leading to semantic fragmentation. Inspired by the brain's working memory and cortical modularity, we propose PaceLLM, featuring two innovations: (1) a Persistent Activity (PA) Mechanism that mimics prefrontal cortex (PFC) neurons' persistent firing by introducing an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, addressing contextual decay; and (2) Cortical Expert (CE) Clustering that emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules, establishing cross-token dependencies and mitigating fragmentation. Extensive evaluations show that PaceLLM achieves 6% improvement on LongBench's Multi-document QA and 12.5-17.5% performance gains on Infinite-Bench tasks, while extending measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. This work pioneers brain-inspired LLM optimization and is complementary to other works. Besides, it can be generalized to any model and enhance their long-context performance and interpretability without structural overhauls.
中文摘要:PaceLLM通过模拟大脑工作记忆和皮层模块化,提出持续性活动机制和皮层专家聚类两项创新,有效解决大语言模型的长上下文信息衰减与语义碎片化问题,在多项评测中性能显著提升,并将可测量上下文扩展至20万词元。
English Summary: PaceLLM introduces brain-inspired innovations—a Persistent Activity Mechanism and Cortical Expert Clustering—to overcome long-context limitations in LLMs, achieving significant performance gains and extending context length to 200K tokens while enhancing interpretability.
Authors:Jonas R. Naujoks, Aleksander Krasowski, Moritz Weckbecker, Galip Ümit Yolcu, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, René P. Klausen
Abstract:
Physics-informed neural networks (PINNs) offer a powerful approach to solving partial differential equations (PDEs), which are ubiquitous in the quantitative sciences. Applied to both forward and inverse problems across various scientific domains, PINNs have recently emerged as a valuable tool in the field of scientific machine learning. A key aspect of their training is that the data -- spatio-temporal points sampled from the PDE's input domain -- are readily available. Influence functions, a tool from the field of explainable AI (XAI), approximate the effect of individual training points on the model, enhancing interpretability. In the present work, we explore the application of influence function-based sampling approaches for the training data. Our results indicate that such targeted resampling based on data attribution methods has the potential to enhance prediction accuracy in physics-informed neural networks, demonstrating a practical application of an XAI method in PINN training.
Chinese: 基于影响函数的采样方法通过针对关键训练数据,能够提高物理信息神经网络的预测精度,展示了可解释人工智能在PINN训练中的实际应用。
English: Influence function-based sampling can enhance prediction accuracy in physics-informed neural networks by targeting key training data, showcasing a practical XAI application in PINN training.
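Influence-guided resampling of collocation points can be sketched independently of any particular influence estimator. In the sketch below, influence_scores stands in for whatever attribution method is used; the proportional sampling scheme is an assumption.

```python
import numpy as np

# Minimal sketch of attribution-guided resampling for PINN training points:
# given influence scores for candidate collocation points, draw the next
# training set with probability proportional to positive influence. The
# scoring itself (an influence-function estimate) is assumed to be given.

def resample_collocation(candidates: np.ndarray,
                         influence_scores: np.ndarray,
                         n_points: int,
                         seed: int = 0) -> np.ndarray:
    """candidates: [N, d] spatio-temporal points; influence_scores: [N]."""
    rng = np.random.default_rng(seed)
    weights = np.clip(influence_scores, 0.0, None)
    if weights.sum() == 0:
        probs = np.full(len(weights), 1.0 / len(weights))  # fall back to uniform
    else:
        probs = weights / weights.sum()
    idx = rng.choice(len(candidates), size=n_points, replace=False, p=probs)
    return candidates[idx]
```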
Authors:Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, Fei Wu
Abstract:
This paper introduces MadaKV, a modality-adaptive key-value (KV) cache eviction strategy designed to enhance the efficiency of multimodal large language models (MLLMs) in long-context inference. In multimodal scenarios, attention heads exhibit varying preferences for different modalities, resulting in significant disparities in modality importance across attention heads. Traditional KV cache eviction methods, which are tailored for unimodal settings, fail to capture modality-specific information, thereby yielding suboptimal performance. MadaKV addresses these challenges through two key components: modality preference adaptation and hierarchical compression compensation. By dynamically sensing modality information within attention heads and adaptively retaining critical tokens, MadaKV achieves substantial reductions in KV cache memory footprint and model inference decoding latency (1.3 to 1.5 times improvement) while maintaining high accuracy across various multimodal long-context tasks. Extensive experiments on representative MLLMs and the MileBench benchmark demonstrate the effectiveness of MadaKV compared to existing KV cache eviction methods.
中文: MadaKV是一种模态自适应的KV缓存淘汰策略,通过动态感知注意力头中的模态信息并自适应保留关键令牌,在保持精度的同时显著减少了KV缓存内存占用和模型推理解码延迟。
English: MadaKV is a modality-adaptive KV cache eviction strategy that enhances multimodal large language models' efficiency by dynamically adjusting to modality preferences and employing hierarchical compression, significantly reducing memory usage and decoding latency while maintaining accuracy.
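The modality preference adaptation component can be illustrated with a small budgeting function: measure each head's attention mass per modality and split its token budget accordingly. The proportional split is an assumption; the paper's hierarchical compression compensation is omitted here.

```python
import torch

# Hedged sketch of modality-adaptive KV budgeting: each attention head keeps
# more cached tokens from the modality it attends to most. The proportional
# split rule is an illustrative assumption, not the paper's exact mechanism.

def modality_budgets(attn_mass: torch.Tensor, modality_ids: torch.Tensor, budget: int):
    """attn_mass: [num_heads, seq_len] average attention per key position.
    modality_ids: [seq_len], 0 = text tokens, 1 = vision tokens.
    Returns per-head (text_budget, vision_budget) pairs."""
    budgets = []
    for head in attn_mass:
        vision_share = head[modality_ids == 1].sum() / head.sum().clamp_min(1e-8)
        vision_budget = int(round(budget * vision_share.item()))
        budgets.append((budget - vision_budget, vision_budget))
    return budgets
```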
Authors:Tobias Labarta, Nhi Hoang, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin, Leander Weber
Abstract:
As machine learning systems increasingly inform critical decisions, the need for human-understandable explanations grows. Current evaluations of Explainable AI (XAI) often prioritize technical fidelity over cognitive accessibility, which critically affects users, in particular those with visual impairments. We propose CUE, a model for Cognitive Understanding of Explanations, linking explanation properties to cognitive sub-processes: legibility (perception), readability (comprehension), and interpretability (interpretation). In a study (N=455) testing heatmaps with varying colormaps (BWR, Cividis, Coolwarm), we found comparable task performance but lower confidence/effort for visually impaired users. Contrary to expectations, these gaps were not mitigated, and were sometimes worsened, by accessibility-focused colormaps like Cividis. These results challenge assumptions about perceptual optimization and support the need for adaptive XAI interfaces. They also validate CUE by demonstrating that altering explanation legibility affects understandability. We contribute: (1) a formalized cognitive model for explanation understanding, (2) an integrated definition of human-centered explanation properties, and (3) empirical evidence motivating accessible, user-tailored XAI.
中文摘要:本研究提出CUE模型,通过将解释特性与认知子过程(可读性、可理解性和可解释性)联系起来评估可解释人工智能,发现针对可访问性的色彩映射并不总能改善视障用户的理解,强调了自适应XAI界面的必要性。
English Summary: The study introduces the CUE model to evaluate explainable AI (XAI) by connecting explanation properties to cognitive processes, revealing that accessibility-focused color maps do not always improve understanding for visually impaired users and emphasizing the need for adaptive XAI interfaces.
Authors:Weicong Qin, Yi Xu, Weijie Yu, Teng Shi, Chenglei Shen, Ming He, Jianping Fan, Xiao Zhang, Jun Xu
Abstract:
Personalized search systems in e-commerce platforms increasingly involve user interactions with AI assistants, where users consult about products, usage scenarios, and more. Leveraging consultation to personalize search services is trending. Existing methods typically rely on semantic similarity to align historical consultations with current queries due to the absence of 'value' labels, but we observe that semantic similarity alone often fails to capture the true value of consultation for personalization. To address this, we propose a consultation value assessment framework that evaluates historical consultations from three novel perspectives: (1) Scenario Scope Value, (2) Posterior Action Value, and (3) Time Decay Value. Based on this, we introduce VAPS, a value-aware personalized search model that selectively incorporates high-value consultations through a consultation-user action interaction module and an explicit objective that aligns consultations with user actions. Experiments on both public and commercial datasets show that VAPS consistently outperforms baselines in both retrieval and ranking tasks.
Chinese: 本文提出了一种名为VAPS的价值感知个性化搜索模型,通过从场景范围、后续行动和时间衰减三个维度评估历史咨询的价值,有选择地整合高价值咨询,在检索和排序任务中均优于现有方法。
English: This paper introduces a value-aware personalized search model called VAPS, which evaluates historical consultations from three perspectives—scenario scope, posterior action, and time decay—to selectively incorporate high-value consultations, outperforming existing methods in retrieval and ranking tasks.
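The three value perspectives combine naturally into a single consultation score. The sketch below is illustrative only: the exponential decay form and the weights are assumptions, since the abstract names the perspectives but not a formula.

```python
import math

# Illustrative combination of the three consultation-value perspectives the
# abstract names: scenario scope, posterior action, and time decay. The
# exponential decay and the weights are assumptions, not the paper's formula.

def consultation_value(scenario_scope: float,
                       posterior_action: float,
                       age_days: float,
                       decay_rate: float = 0.05,
                       weights=(0.4, 0.4, 0.2)) -> float:
    """Score one historical consultation for personalization value."""
    time_decay = math.exp(-decay_rate * age_days)  # older consultations fade
    w_scope, w_action, w_time = weights
    return w_scope * scenario_scope + w_action * posterior_action + w_time * time_decay

# Only high-value consultations would then feed the search model, e.g.:
# kept = [c for c in history if consultation_value(*c) > threshold]
```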
Authors:Tianyu Zhan, Shengyu Zhang, Zheqi Lv, Jieming Zhu, Jiwei Li, Fan Wu, Fei Wu
Abstract:
With the rapid development of recommendation models and device computing power, device-based recommendation has become an important research area due to its better real-time performance and privacy protection. Previously, Transformer-based sequential recommendation models have been widely applied in this field because they outperform Recurrent Neural Network (RNN)-based recommendation models in terms of performance. However, as the length of interaction sequences increases, Transformer-based models introduce significantly more space and computational overhead compared to RNN-based models, posing challenges for device-based recommendation. To balance real-time performance and high performance on devices, we propose the Device-Cloud Collaborative Correction Framework for On-Device Recommendation (CoCorrRec). CoCorrRec uses a self-correction network (SCN) to correct parameters with extremely low time cost. By updating model parameters during testing based on the input token, it achieves performance comparable to current optimal but more complex Transformer-based models. Furthermore, to prevent SCN from overfitting, we design a global correction network (GCN) that processes hidden states uploaded from devices and provides a global correction solution. Extensive experiments on multiple datasets show that CoCorrRec outperforms existing Transformer-based and RNN-based device recommendation models in terms of performance, with fewer parameters and lower FLOPs, thereby achieving a balance between real-time performance and high efficiency.
Chinese: CoCorrRec是一种设备-云协同框架,通过自校正和全局校正网络以低计算成本实现高性能,在设备端推荐中平衡了实时性与准确性。
English: CoCorrRec is a device-cloud collaborative framework that uses self-correction and global correction networks to achieve high performance with low computational overhead, balancing real-time efficiency and accuracy for on-device recommendation.
Authors:Chendi Ge, Xin Wang, Zeyang Zhang, Hong Chen, Jiapei Fan, Longtao Huang, Hui Xue, Wenwu Zhu
Abstract:
Continual multimodal instruction tuning is crucial for adapting Multimodal Large Language Models (MLLMs) to evolving tasks. However, most existing methods adopt a fixed architecture, struggling with adapting to new tasks due to static model capacity. We propose to evolve the architecture under parameter budgets for dynamic task adaptation, which remains unexplored and imposes two challenges: 1) task architecture conflict, where different tasks require varying layer-wise adaptations, and 2) modality imbalance, where different tasks rely unevenly on modalities, leading to unbalanced updates. To address these challenges, we propose a novel Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method, which automatically evolves MLLM's architecture with controlled parameter budgets to continually adapt to new tasks while retaining previously learned knowledge. Specifically, we propose a dynamic layer-wise expert allocator, which automatically allocates LoRA experts across layers to resolve architecture conflicts, and routes instructions layer-wisely to facilitate knowledge sharing among experts. Then, we propose a gradient-based inter-modal continual curriculum, which adjusts the update ratio of each module in MLLM based on the difficulty of each modality within the task to alleviate the modality imbalance problem. Extensive experiments show that D-MoLE significantly outperforms state-of-the-art baselines, achieving a 15% average improvement over the best baseline. To the best of our knowledge, this is the first study of continual learning for MLLMs from an architectural perspective.
Chinese: 本研究提出D-MoLE方法,通过在参数预算下动态演化多模态大语言模型架构,有效解决任务架构冲突和模态不平衡问题,在持续多模态指令调优中实现15%的性能提升。
English: This study introduces D-MoLE, a method that dynamically evolves MLLM architectures under parameter constraints to resolve task conflicts and modality imbalance, achieving a 15% performance gain in continual multimodal instruction tuning.
Authors:Emílio Dolgener Cantú, Rolf Klemens Wittmann, Oliver Abdeen, Patrick Wagner, Wojciech Samek, Moritz Baier, Sebastian Lapuschkin
Abstract:
Quality management in semiconductor manufacturing often relies on template matching with known golden standards. For Indium-Phosphide (InP) multi-project wafer manufacturing, low production scale and high design variability lead to such golden standards being typically unavailable. Defect detection, in turn, is manual and labor-intensive. This work addresses this challenge by proposing a methodology to generate a synthetic golden standard using Deep Neural Networks, trained to simulate photo-realistic InP wafer images from CAD data. We evaluate various training objectives and assess the quality of the simulated images on both synthetic data and InP wafer photographs. Our deep-learning-based method outperforms a baseline decision-tree-based approach, enabling the use of a 'simulated golden die' from CAD plans in any user-defined region of a wafer for more efficient defect detection. We apply our method to a template matching procedure, to demonstrate its practical utility in surface defect detection.
中文: 本研究提出了一种深度学习方法,能够从CAD数据生成合成黄金标准,用于磷化铟晶圆制造中的自动缺陷检测,有效解决了物理模板缺失的问题,并超越了传统方法的性能。
English: This study introduces a deep learning approach to create synthetic golden standards from CAD data for automated defect detection in InP wafer manufacturing, overcoming the absence of physical templates and outperforming traditional methods.
Authors:Xin Wang, Zeyang Zhang, Linxin Xiao, Haibo Chen, Chendi Ge, Wenwu Zhu
Abstract:
Multi-modal graphs, which integrate diverse multi-modal features and relations, are ubiquitous in real-world applications. However, existing multi-modal graph learning methods are typically trained from scratch for specific graph data and tasks, failing to generalize across various multi-modal graph data and tasks. To bridge this gap, we explore the potential of Multi-modal Graph Large Language Models (MG-LLM) to unify and generalize across diverse multi-modal graph data and tasks. We propose a unified framework of multi-modal graph data, task, and model, discovering the inherent multi-granularity and multi-scale characteristics in multi-modal graphs. Specifically, we present five key desired characteristics for MG-LLM: 1) unified space for multi-modal structures and attributes, 2) capability of handling diverse multi-modal graph tasks, 3) multi-modal graph in-context learning, 4) multi-modal graph interaction with natural language, and 5) multi-modal graph reasoning. We then elaborate on the key challenges, review related works, and highlight promising future research directions towards realizing these ambitious characteristics. Finally, we summarize existing multi-modal graph datasets pertinent for model training. We believe this paper can contribute to the ongoing advancement of the research towards MG-LLM for generalization across multi-modal graph data and tasks.
中文摘要:本文提出多模态图大语言模型(MG-LLM)作为统一框架,旨在解决现有方法无法泛化于不同多模态图数据与任务的局限性,并阐述了五个关键特性及未来研究方向。
English Summary: This paper introduces Multi-modal Graph Large Language Models (MG-LLM) as a unified framework to overcome the limitations of existing methods that lack generalization across diverse multi-modal graph data and tasks, outlining five key characteristics and future research directions.
Authors:Yichuan Wang, Shu Liu, Zhifei Li, Yongji Wu, Ziming Mao, Yilong Zhao, Xiao Yan, Zhiying Xu, Yang Zhou, Ion Stoica, Sewon Min, Matei Zaharia, Joseph E. Gonzalez
Abstract:
Embedding-based search is widely used in applications such as recommendation and retrieval-augmented generation (RAG). Recently, there is a growing demand to support these capabilities over personal data stored locally on devices. However, maintaining the necessary data structure associated with the embedding-based search is often infeasible due to its high storage overhead. For example, indexing 100 GB of raw data requires 150 to 700 GB of storage, making local deployment impractical. Reducing this overhead while maintaining search quality and latency becomes a critical challenge. In this paper, we present LEANN, a storage-efficient approximate nearest neighbor (ANN) search index optimized for resource-constrained personal devices. LEANN combines a compact graph-based structure with an efficient on-the-fly recomputation strategy to enable fast and accurate retrieval with minimal storage overhead. Our evaluation shows that LEANN reduces index size to under 5% of the original raw data, achieving up to 50 times smaller storage than standard indexes, while maintaining 90% top-3 recall in under 2 seconds on real-world question answering benchmarks.
Chinese: LEANN提出了一种存储高效的近似最近邻搜索索引,将存储降至原始数据的5%以下,同时在个人设备上保持高检索精度和速度。
English: LEANN introduces a storage-efficient approximate nearest neighbor search index that reduces storage to under 5% of raw data while maintaining high retrieval accuracy and speed on personal devices.
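The core trade LEANN makes, keeping the graph but recomputing embeddings for visited nodes, fits in a short best-first search sketch. The graph layout, embed function, and stopping rule below are assumptions for illustration, not the system's actual implementation.

```python
import heapq

# Sketch of graph-based ANN search with on-the-fly embedding recomputation,
# the storage trade the abstract describes: neighbor lists are kept, stored
# vectors are not, and embed() is re-run only for nodes actually visited.

def greedy_search(graph, embed, raw_items, query_vec, entry, k, dist):
    """graph: node -> neighbor ids; raw_items: node -> raw text/bytes."""
    visited = {entry}
    d0 = dist(embed(raw_items[entry]), query_vec)  # recompute, don't load
    frontier = [(d0, entry)]
    best = [(d0, entry)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= k and d > max(best)[0]:
            break  # no closer candidates remain
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                nd = dist(embed(raw_items[nb]), query_vec)  # on-the-fly
                heapq.heappush(frontier, (nd, nb))
                best.append((nd, nb))
        best = heapq.nsmallest(k, best)
    return sorted(best)
```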
Authors:Wei Chow, Yuan Gao, Linfeng Li, Xian Wang, Qi Xu, Hang Song, Lingdong Kong, Ran Zhou, Yi Zeng, Yidong Cai, Botian Jiang, Shilin Xu, Jiajun Zhang, Minghui Qiu, Xiangtai Li, Tianshu Yang, Siliang Tang, Juncheng Li
Abstract:
Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information, as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify a key limitation of existing models: they focus solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.
中文: 本文提出了首个多语言交错多条件语义检索数据集MERIT,并开发了Coral微调框架,通过嵌入重构和对比学习解决现有模型忽略细粒度条件元素的问题,实现了45.9%的性能提升。
English: This paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, and proposes Coral, a fine-tuning framework that addresses existing models' limitations by integrating embedding reconstruction and contrastive learning to achieve significant performance improvements.
Authors:Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen
Abstract:
While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Moreover, 3-6\% of attention heads can be skipped entirely. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. 2) An offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09$\times$, 2.38$\times$, and 1.67$\times$ theoretical FLOP reduction, and actual inference speedups of 1.76$\times$, 1.85$\times$, and 1.58$\times$, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
中文:Sparse-vDiT通过利用视频扩散变换器中发现的注意力稀疏模式,在保持多个模型视觉质量的同时,显著提升了推理速度并降低了计算复杂度。
English: Sparse-vDiT accelerates video diffusion transformers by leveraging identified sparsity patterns in attention mechanisms, achieving significant speedups and FLOP reductions while maintaining visual quality across multiple models.
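The three recurring sparsity patterns are easy to picture as boolean attention masks. Below is a sketch with illustrative band and stripe widths; real deployments use block-sparse kernels rather than dense masks, and the specific strides are assumptions.

```python
import torch

# Illustrative boolean masks for the three attention sparsity patterns the
# abstract identifies in vDiT attention maps. Widths and strides here are
# assumptions; the paper's kernels operate block-sparsely, not on dense masks.

def diagonal_mask(n: int, width: int = 64) -> torch.Tensor:
    """Band around the main diagonal: locally adjacent tokens attend."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() < width

def multi_diagonal_mask(n: int, period: int = 256, width: int = 16) -> torch.Tensor:
    """Parallel diagonals, e.g. the same spatial position across frames."""
    idx = torch.arange(n)
    offset = (idx[:, None] - idx[None, :]) % period
    return torch.minimum(offset, period - offset) < width

def vertical_stripe_mask(n: int, stripe: int = 64) -> torch.Tensor:
    """Every query attends to a small set of global key positions."""
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :stripe] = True
    return mask
```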
Authors:Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, Zhou Zhao, Wei Xue
Abstract:
While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. Like professionals in the creative industries, such generation requires sophisticated reasoning about items such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics and excels in out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Project.github.io.
中文: ThinkSound提出了一种思维链推理框架,将视频到音频生成分解为拟音生成、对象精修和语言编辑三个阶段,通过结构化视听推理实现了最先进的生成效果。
English: ThinkSound introduces a Chain-of-Thought reasoning framework that decomposes video-to-audio generation into three interactive stages (foley generation, object-centric refinement, and language-guided editing), achieving state-of-the-art performance through structured audio-visual reasoning.
Authors:Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie
Abstract:
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. (ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.
中文: 本文提出UniTime通用视频时序定位模型,利用生成式多模态大语言模型,能够基于自然语言查询精准定位任意时长和类型视频中的时间片段,在多个基准测试中超越现有方法,并显著提升长视频问答任务的准确性。
English: This paper introduces UniTime, a universal video temporal grounding model that uses generative Multi-modal Large Language Models to accurately locate moments in videos of any length or genre based on natural language queries, outperforming existing methods across multiple benchmarks and enhancing video question-answering tasks.
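Contribution (i), interleaving timestamp tokens with video tokens, can be sketched directly. The "<t=...>" timestamp token format below is an illustrative assumption, not the model's actual vocabulary.

```python
# Sketch of interleaving timestamp tokens with video tokens so the model can
# ground its answers to precise times. The token format is an assumption.

def interleave_timestamps(frame_tokens, fps: float, stride: int = 1):
    """frame_tokens: list of per-frame token lists. Returns one flat sequence
    in which every `stride`-th frame is preceded by a timestamp token."""
    sequence = []
    for i, tokens in enumerate(frame_tokens):
        if i % stride == 0:
            sequence.append(f"<t={i / fps:.2f}s>")
        sequence.extend(tokens)
    return sequence

# Example: interleave_timestamps([["v0a", "v0b"], ["v1a"]], fps=2.0)
# -> ['<t=0.00s>', 'v0a', 'v0b', '<t=0.50s>', 'v1a']
```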
Authors:Jiayi He, Jiangyan Yi, Jianhua Tao, Siding Zeng, Hao Gu
Abstract:
With the development of audio deepfake techniques, attacks using partially deepfake audio are beginning to rise. Compared to fully deepfake audio, such attacks are much harder for detectors to identify because the manipulation is confined to small, concealed regions, resulting in higher security risks. Although some studies have been launched, no comprehensive review systematically covers the current state and development trends of this problem. Thus, in this survey, we are the first to provide a systematic introduction to manipulated region localization for partially deepfake audio, covering the fundamentals, the branches of existing methods, current limitations, and potential trends, offering revealing insights into this area.
Chinese: 本综述首次系统性地概述了部分深度伪造音频的定位任务,涵盖基础原理、现有方法分支、当前局限与潜在趋势,为应对这种隐蔽性更强、危害更大的音频安全问题提供了重要参考。
English: This survey provides the first systematic overview of partially deepfake audio localization, covering fundamentals, existing methods, limitations, and future trends to address the growing security threats from these harder-to-detect manipulations.
Authors:Mayank Bumb, Anshul Vemulapalli, Sri Harsha Vardhan Prasad Jella, Anish Gupta, An La, Ryan A. Rossi, Hongjie Chen, Franck Dernoncourt, Nesreen K. Ahmed, Yu Wang
Abstract:
Recent advances in Large Language Models (LLMs) have demonstrated new possibilities for accurate and efficient time series analysis, but prior work often required heavy fine-tuning and/or ignored inter-series correlations. In this work, we explore simple and flexible prompt-based strategies that enable LLMs to perform time series forecasting without extensive retraining or the use of a complex external architecture. Through the exploration of specialized prompting methods that leverage time series decomposition, patch-based tokenization, and similarity-based neighbor augmentation, we find that it is possible to enhance LLM forecasting quality while maintaining simplicity and requiring minimal preprocessing of data. To this end, we propose our own method, PatchInstruct, which enables LLMs to make precise and effective predictions.
中文: 本研究提出了基于提示的简单策略,包括名为PatchInstruct的新方法,通过利用时间序列分解、分块标记化和相似性邻居增强技术,使大型语言模型无需大量重训练或复杂外部架构即可实现精确的时间序列预测。
English: This study introduces simple prompt-based strategies, including a novel method called PatchInstruct, that enable large language models to perform accurate time series forecasting without extensive retraining or complex external architectures by leveraging decomposition, tokenization, and neighbor augmentation techniques.
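A rough sketch of how two of the listed ingredients (series decomposition and patch-based tokenization) might be composed into a forecasting prompt follows. The prompt wording, window size, and patch length are assumptions for illustration, not the paper's actual PatchInstruct templates.

```python
# Illustrative sketch: decompose a series into trend/residual with a moving
# average, chunk it into patches, and render a plain-text forecasting prompt.
def moving_average(xs, w=3):
    padded = [xs[0]] * (w - 1) + xs
    return [sum(padded[i:i + w]) / w for i in range(len(xs))]

def build_prompt(series, patch_len=4, horizon=3):
    trend = moving_average(series)
    resid = [round(x - t, 2) for x, t in zip(series, trend)]
    patches = [series[i:i + patch_len] for i in range(0, len(series), patch_len)]
    lines = [
        "You are a time series forecaster.",
        f"Trend: {[round(t, 2) for t in trend]}",
        f"Residual: {resid}",
        "Patches: " + " | ".join(str(p) for p in patches),
        f"Predict the next {horizon} values as a comma-separated list.",
    ]
    return "\n".join(lines)

print(build_prompt([10, 12, 13, 15, 14, 16, 18, 17]))
```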
Authors:Yitong Zhou, Mingyue Cheng, Qingyang Mao, Yucong Luo, Qi Liu, Yupeng Li, Xiaohan Zhang, Deguang Liu, Xin Li, Enhong Chen
Abstract:
Chemical tables encode complex experimental knowledge through symbolic expressions, structured variables, and embedded molecular graphics. Existing benchmarks largely overlook this multimodal and domain-specific complexity, limiting the ability of multimodal large language models to support scientific understanding in chemistry. In this work, we introduce ChemTable, a large-scale benchmark of real-world chemical tables curated from the experimental sections of literature. ChemTable includes expert-annotated cell polygons, logical layouts, and domain-specific labels covering reagents, catalysts, yields, and graphical components, and supports two core tasks: (1) Table Recognition, covering structure parsing and content extraction; and (2) Table Understanding, encompassing both descriptive and reasoning-oriented question answering grounded in table structure and domain semantics. We evaluated a range of representative multimodal models, including both open-source and closed-source models, on ChemTable and reported a series of findings with practical and conceptual insights. Although models show reasonable performance on basic layout parsing, they exhibit substantial limitations on both descriptive and inferential QA tasks compared to human performance, and we observe significant performance gaps between open-source and closed-source models across multiple dimensions. These results underscore the challenges of chemistry-aware table understanding and position ChemTable as a rigorous and realistic benchmark for advancing scientific reasoning.
中文: 化学表格具有多模态复杂性,现有基准难以覆盖,因此开发了ChemTable这一大规模基准,包含专家标注,用于评估模型在表格识别与理解上的表现,结果显示尤其在推理任务上模型与人类性能存在显著差距。
English: Chemical tables present multimodal complexity that existing benchmarks miss, so ChemTable was created as a large-scale benchmark with expert annotations to evaluate models on table recognition and understanding, revealing significant gaps in performance especially for reasoning tasks compared to humans.
Authors:Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Jieping Ye
Abstract:
Recent studies on end-to-end speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Whereas current methods mainly utilize a 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz. Experimental results on Spoken Question Answering benchmarks demonstrate that DrVoice establishes new state-of-the-art (SOTA) performance among similarly sized speech foundation models with a relatively small amount of data.
Chinese: 近期端到端语音生成研究利用大语言模型开发了DrVoice,该并行语音-文本模型采用双分辨率机制,将输入频率降至5Hz,在少量数据下于口语问答基准上取得了最先进的性能。
English: Recent advancements in end-to-end speech generation using large language models (LLMs) have led to DrVoice, a parallel speech-text model with dual-resolution representations that reduces input frequency to 5Hz and achieves state-of-the-art performance on benchmarks with minimal data.
Authors:Tong Xiao, Xin Xu, Zhenya Huang, Hongyu Gao, Quan Liu, Qi Liu, Enhong Chen
Abstract:
Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain in order to enhance the reasoning abilities of MLLMs. However, these works largely overlook the enhancement of multimodal perception capabilities in MLLMs, which serve as a core prerequisite and foundational component of complex multimodal reasoning. Through McNemar's test, we find that existing RLVR methods fail to effectively enhance the multimodal perception capabilities of MLLMs, thereby limiting further improvement in multimodal reasoning. To address this limitation, we propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive visual content accurately, thereby effectively incentivizing both their multimodal perception and reasoning capabilities. Specifically, we first collect textual visual annotations from the CoT trajectories of multimodal problems, which serve as visual references for reward assignment. During RLVR training, we employ a judging LLM to assess the consistency between the visual annotations and the responses generated by the MLLM, and assign the visual perception reward based on these consistency judgments. Extensive experiments on several multimodal reasoning benchmarks demonstrate the effectiveness of our Perception-R1, which achieves state-of-the-art performance on most benchmarks using only 1,442 training data.
中文: 提升多模态大语言模型的多模态推理能力需同时增强感知与推理,Perception-R1通过引入基于文本标注和一致性判断的视觉感知奖励机制,以少量数据显著提升了模型性能。
English: Enhancing multimodal reasoning in MLLMs requires improving both perception and reasoning, which is addressed by Perception-R1 through a novel visual perception reward that leverages textual annotations and consistency judgments to boost performance with minimal data.
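The reward shape the abstract describes, a judging model scoring consistency between reference visual annotations and the MLLM's response, added on top of the usual verifiable answer reward, can be sketched as below. The judge here is a trivial stub, and the weight and scoring protocol are assumptions, not the paper's configuration.

```python
# Minimal sketch of a perception-aware reward: answer correctness plus a
# consistency score between annotated visual facts and the model's response.
def judge_consistency(annotation: str, response: str) -> float:
    """Stand-in for a judge LLM: 1.0 if the annotated visual fact is mentioned."""
    return 1.0 if annotation.lower() in response.lower() else 0.0

def perception_reward(annotations, response, answer_correct, w_percept=0.5):
    percept = sum(judge_consistency(a, response) for a in annotations) / len(annotations)
    return float(answer_correct) + w_percept * percept  # assumed additive mix

resp = "The chart shows two red bars; the left bar is taller, so A is larger."
print(perception_reward(["two red bars", "left bar is taller"], resp, answer_correct=True))
# 1.5 -> correct answer plus full perception credit
```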
Authors:Haoyuan Li, Yanpeng Zhou, Yufei Gao, Tao Tang, Jianhua Han, Yujie Yuan, Dave Zhenyu Chen, Jiawang Bian, Hang Xu, Xiaodan Liang
Abstract:
Remarkable progress in 2D Vision-Language Models (VLMs) has spurred interest in extending them to 3D settings for tasks like 3D Question Answering, Dense Captioning, and Visual Grounding. Unlike 2D VLMs that typically process images through an image encoder, 3D scenes, with their intricate spatial structures, allow for diverse model architectures. Based on their encoder design, this paper categorizes recent 3D VLMs into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited comparatively lower performance compared with the latest 3D object-centric and 2D image-based approaches. To understand this gap, we conduct an in-depth analysis, revealing that 3D scene-centric VLMs show limited reliance on the 3D scene encoder, and the pre-train stage appears less effective than in 2D VLMs. Furthermore, we observe that data scaling benefits are less pronounced on larger datasets. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions, thereby diminishing the effective utilization of the 3D encoder. To address these limitations and encourage genuine 3D scene understanding, we introduce a novel 3D Relevance Discrimination QA dataset designed to disrupt shortcut learning and improve 3D understanding. Our findings highlight the need for advanced evaluation and improved strategies for better 3D understanding in 3D VLMs.
Chinese: 本文分析了3D场景中心视觉语言模型的性能差距,发现其对3D编码器依赖有限且易过度拟合语言线索,并提出了新数据集以促进真正的3D场景理解。
English: This paper analyzes the performance gap in 3D scene-centric Vision-Language Models (VLMs), identifying their limited reliance on 3D encoders and tendency to overfit linguistic cues, and introduces a novel dataset to enhance genuine 3D scene understanding.
Authors:Jiayu Liu, Zhenya Huang, Wei Dai, Cheng Cheng, Jinze Wu, Jing Sha, Song Li, Qi Liu, Shijin Wang, Enhong Chen
Abstract:
Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose CogMath, which comprehensively assesses LLMs' mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into 3 stages: problem comprehension, problem solving, and solution summarization. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an "Inquiry-Judge-Reference" multi-agent system to generate inquiries that assess LLMs' mastery from this dimension. An LLM is considered to truly master a problem only when excelling in all inquiries from the 9 dimensions. By applying CogMath on three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30%-40%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
中文: 本文提出CogMath评估框架,通过三个认知阶段和九个细粒度维度全面评估大语言模型的数学能力,发现现有模型能力被高估30%-40%,并精确定位了其在不同维度的优势与不足。
English: The paper introduces CogMath, a novel evaluation framework that assesses large language models' mathematical abilities through three cognitive stages and nine fine-grained dimensions, revealing that current models' capabilities are significantly overestimated while pinpointing their specific strengths and weaknesses.
Authors:Junhao Yu, Yan Zhuang, YuXuan Sun, Weibo Gao, Qi Liu, Mingyue Cheng, Zhenya Huang, Enhong Chen
Abstract:
Accurately assessing internal human states is key to understanding preferences, offering personalized services, and identifying challenges in real-world applications. Originating from psychometrics, adaptive testing has become the mainstream method for human measurement and has now been widely applied in education, healthcare, sports, and sociology. It customizes assessments by selecting the fewest test questions necessary. However, current adaptive testing methods face several challenges. The mechanized nature of most algorithms leads to guessing behavior and difficulties with open-ended questions. Additionally, subjective assessments suffer from noisy response data and coarse-grained test outputs, further limiting their effectiveness. To move closer to an ideal adaptive testing process, we propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. This is the first application of LLMs in adaptive testing. TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions. Experiments on psychological, educational, and lifestyle assessments show our approach achieves more accurate results with 20% fewer questions than state-of-the-art baselines, and testers preferred it for speed, smoothness, and other dimensions.
中文: TestAgent作为首个应用大语言模型的自适应测试代理,通过互动参与提升测试效果,在减少20%题量的同时获得更精确结果,解决了机械化算法和主观评估的局限性。
English: TestAgent, a novel LLM-powered agent, enhances adaptive testing by enabling interactive engagement, achieving more accurate results with 20% fewer questions while addressing issues like guessing and subjective assessments.
Authors:Maxime Gonthier, Dante D. Sanchez-Gallegos, Haochen Pan, Bogdan Nicolae, Sicheng Zhou, Hai Duc Nguyen, Valerie Hayot-Sasson, J. Gregory Pauloski, Jesus Carretero, Kyle Chard, Ian Foster
Abstract:
The exponential growth of data necessitates distributed storage models, such as peer-to-peer systems and data federations. While distributed storage can reduce costs and increase reliability, the heterogeneity in storage capacity, I/O performance, and failure rates of storage resources makes their efficient use a challenge. Further, node failures are common and can lead to data unavailability and even data loss. Erasure coding is a common resiliency strategy implemented in storage systems to mitigate failures by striping data across storage locations. However, erasure coding is computationally expensive and existing systems do not consider the heterogeneous resources and their varied capacity and performance when placing data chunks. We tackle the challenges of using erasure coding with distributed and heterogeneous nodes, aiming to store as much data as possible, minimize encoding and decoding time, and meet user-defined reliability requirements for each data item. We propose two new dynamic scheduling algorithms, D-Rex LB and D-Rex SC, that adaptively choose erasure coding parameters and map chunks to heterogeneous nodes. D-Rex SC achieves robust performance for both storage utilization and throughput, at a higher computational cost, while D-Rex LB is faster but with slightly less competitive performance. In addition, we propose two greedy algorithms, GreedyMinStorage and GreedyLeastUsed, that optimize for storage utilization and load balancing, respectively. Our experimental evaluation shows that our dynamic schedulers store, on average, 45% more data items without significantly degrading I/O throughput compared to state-of-the-art algorithms, while GreedyLeastUsed is able to store 21% more data items while also increasing throughput.
中文: 本研究针对分布式异构存储系统中纠删码应用的挑战,提出了动态调度算法,相比现有方法显著提升了数据存储容量并保持了I/O吞吐性能。
English: This study addresses the challenges of using erasure coding in distributed heterogeneous storage systems by proposing dynamic scheduling algorithms that significantly improve data storage capacity and maintain I/O throughput compared to existing methods.
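To make the placement problem concrete, here is a toy sketch in the spirit of GreedyLeastUsed (not the paper's implementation): each data item is split into k+m coded chunks that are placed on the least-loaded nodes with remaining capacity. The node fields and the fixed (k, m) choice are assumptions; D-Rex additionally adapts these parameters per item.

```python
# Toy least-used chunk placement for (k, m) erasure coding over heterogeneous nodes.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    capacity: float   # total bytes available
    used: float = 0.0

def place_chunks(nodes, item_size, k=4, m=2):
    chunk = item_size / k  # each of the k+m coded chunks holds size/k bytes
    eligible = [n for n in nodes if n.capacity - n.used >= chunk]
    if len(eligible) < k + m:
        return None  # cannot meet the reliability requirement for this item
    chosen = sorted(eligible, key=lambda n: n.used)[:k + m]
    for n in chosen:
        n.used += chunk
    return [n.name for n in chosen]

nodes = [Node(f"n{i}", capacity=100.0) for i in range(8)]
print(place_chunks(nodes, item_size=40.0))  # 6 least-used nodes receive 10.0 each
```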
Authors:Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Vivek Gupta, Dinesh Manocha
Abstract:
Flowcharts are a critical tool for visualizing decision-making processes. However, their non-linear structure and complex visual-textual relationships make it challenging to interpret them using LLMs, as vision-language models frequently hallucinate nonexistent connections and decision paths when analyzing these diagrams. This leads to compromised reliability for automated flowchart processing in critical domains such as logistics, health, and engineering. We introduce the task of Fine-grained Flowchart Attribution, which traces the specific flowchart components that ground an LLM's response about the flowchart. Flowchart Attribution ensures the verifiability of LLM predictions and improves explainability by linking generated responses to the flowchart's structure. We propose FlowPathAgent, a neurosymbolic agent that performs fine-grained post hoc attribution through graph-based reasoning. It first segments the flowchart, converts it into a structured symbolic graph, and then employs an agentic approach to interact dynamically with the graph and generate attribution paths. Additionally, we present FlowExplainBench, a novel benchmark for evaluating flowchart attributions across diverse styles, domains, and question types. Experimental results show that FlowPathAgent mitigates visual hallucinations in LLM answers over flowchart QA, outperforming strong baselines by 10-14% on our proposed FlowExplainBench dataset.
中文摘要:FlowPathAgent通过基于图的推理进行细粒度归因,有效减少大语言模型在解读流程图时的视觉幻觉,提升了关键领域应用的可靠性。
English Summary: FlowPathAgent, a neurosymbolic agent using graph-based reasoning, effectively mitigates visual hallucinations in LLM interpretations of flowcharts by providing fine-grained attribution, improving reliability in critical applications.
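A minimal sketch of the symbolic stage: once the flowchart has been parsed into a directed graph, attributing an answer amounts to recovering the node path consistent with the stated decisions. The node names, the decision map, and the path-walking rule are all illustrative assumptions, not FlowPathAgent's agentic procedure.

```python
# Walk a parsed flowchart graph from start to end, following recorded decisions
# at branch nodes, to produce an attribution path for a given answer.
import networkx as nx

flow = nx.DiGraph()
flow.add_edges_from([
    ("start", "check_stock"),
    ("check_stock", "ship"),        # branch: item in stock
    ("check_stock", "backorder"),   # branch: item out of stock
    ("ship", "end"), ("backorder", "end"),
])
decisions = {"check_stock": "ship"}  # the answer assumed the item ships

def attribution_path(g, decisions, src="start", dst="end"):
    path = [src]
    while path[-1] != dst:
        succ = list(g.successors(path[-1]))
        path.append(decisions.get(path[-1], succ[0]) if len(succ) > 1 else succ[0])
    return path

print(attribution_path(flow, decisions))  # ['start', 'check_stock', 'ship', 'end']
```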
Authors:Gaozheng Pei, Ke Ma, Dongpeng Zhang, Chengzhi Sun, Qianqian Xu, Qingming Huang
Abstract:
Owing to the powerful image generation capabilities of diffusion models, methods that craft adversarial examples through diffusion-based image editing are rapidly gaining popularity. However, because they rely on the discriminative capability of the diffusion model, these methods often struggle to generalize beyond conventional image classification tasks, such as Deepfake detection. Moreover, traditional strategies for enhancing adversarial example transferability are challenging to adapt to these methods. To address these challenges, we propose a unified framework that seamlessly incorporates traditional transferability enhancement strategies into diffusion model-based adversarial example generation via image editing, enabling their application across a wider range of downstream tasks. Our method won first place in the "1st Adversarial Attacks on Deepfake Detectors: A Challenge in the Era of AI-Generated Media" competition at ACM MM25, which validates the effectiveness of our approach.
中文摘要:该统一框架将传统迁移性增强策略融入基于扩散模型的对抗样本生成,有效解决了在Deepfake检测等任务中的泛化难题,并在国际竞赛中荣获冠军验证了其卓越效能。
English Summary: The proposed unified framework integrates traditional transferability enhancement strategies into diffusion model-based adversarial example generation, overcoming limitations in generalization across tasks like Deepfake detection and achieving top competition results.
Authors:Langming Liu, Wanyu Wang, Chi Zhang, Bo Li, Hongzhi Yin, Xuetao Wei, Wenbo Su, Bo Zheng, Xiangyu Zhao
Abstract:
Online advertising in recommendation platforms has gained significant attention, with a predominant focus on channel recommendation and budget allocation strategies. However, current offline reinforcement learning (RL) methods face substantial challenges when applied to sparse advertising scenarios, primarily due to severe overestimation, distributional shifts, and overlooking budget constraints. To address these issues, we propose MTORL, a novel multi-task offline RL model that targets two key objectives. First, we establish a Markov Decision Process (MDP) framework specific to the nuances of advertising. Then, we develop a causal state encoder to capture dynamic user interests and temporal dependencies, facilitating offline RL through conditional sequence modeling. Causal attention mechanisms are introduced to enhance user sequence representations by identifying correlations among causal states. We employ multi-task learning to decode actions and rewards, simultaneously addressing channel recommendation and budget allocation. Notably, our framework includes an automated system for integrating these tasks into online advertising. Extensive experiments on offline and online environments demonstrate MTORL's superiority over state-of-the-art methods.
Chinese: 摘要提出MTORL,一种多任务离线强化学习模型,通过因果状态编码器和多任务学习解决稀疏在线广告中的高估和分布偏移问题,同时处理渠道推荐和预算分配,实验证明其性能优于现有方法。
English: The abstract introduces MTORL, a multi-task offline reinforcement learning model designed to overcome challenges like overestimation and distribution shifts in sparse online advertising by using a causal state encoder and multi-task learning for channel recommendation and budget allocation, showing superior performance in experiments.
Authors:Hanlin Dong, Arian Prabowo, Hao Xue, Flora D. Salim
Abstract:
Air quality prediction is a challenging forecasting task due to its spatio-temporal complexity and its inherent dynamics and uncertainty. Most current models handle these two challenges by applying Graph Neural Networks or known physics principles, and quantify stochasticity through probabilistic networks like diffusion models. Nevertheless, finding the right balance between certainty and uncertainty remains an open question. Therefore, we propose Double-Diffusion, a novel diffusion probabilistic model that harnesses the power of known physics to guide air quality forecasting with stochasticity. To the best of our knowledge, while precedents exist for using conditional diffusion models to predict air pollution, this is the first attempt to use physics as a conditional generative approach for air quality prediction. Along with a sampling strategy adopted from image restoration and a new denoiser architecture, Double-Diffusion ranks first in most evaluation scenarios across two real-life datasets compared with other probabilistic models. It also cuts inference time by 30% to 50% while achieving a 3-12% improvement in Continuous Ranked Probability Score (CRPS).
中文: 提出的双扩散模型将物理原理与概率网络相结合,以提升空气质量预测的准确性,在多项评估指标中均表现出卓越的性能和效率。
English: The proposed Double-Diffusion model integrates physics principles with probabilistic networks to enhance air quality forecasting, achieving superior performance and efficiency across multiple evaluation metrics.
Authors:Fumiya Uchiyama, Rintaro Yanagi, Shohei Taniguchi, Shota Takashiro, Masahiro Suzuki, Hirokatsu Kataoka, Yusuke Iwasawa, Yutaka Matsuo
Abstract:
Contrastive learning has the capacity to model multimodal probability distributions by embedding and aligning visual representations with semantics from captions. This approach enables the estimation of relational semantic similarity; however, it remains unclear whether it can also represent absolute semantic informativeness. In this work, we introduce a semantic informativeness metric for an image calculated from text samples via a contrastive learning model; similarly, the informativeness of a text is calculated from image samples. We propose a redefinition of the concept of Information Gain, a concept previously explored in natural language processing, extending its application to the domains of vision and language. Our metric quantifies how conditioning on an image distorts the distribution of associated texts, and vice versa for text conditioning on image distributions. In empirical results with OpenCLIP, we observe that images with the lowest Information Gain scores often correspond to placeholder icons such as "image not found." Furthermore, we propose to measure a norm-based metric of the embedding to estimate the Information Gain, following the theoretical results for Skip-Gram with Negative Sampling (SGNS) word embeddings. Information Gain can be measured using either CLIP or SigLIP, and the results demonstrate a strong correlation with a coefficient of determination ranging from 0.98 to 1.00. After obtaining the mean and the covariance of the sample embedding, the computational cost of this method is independent of the sample size, and it is compatible with publicly available, open-weight models.
中文摘要:本研究通过对比学习提出语义信息度量方法,量化图像与文本相互影响分布的程度,重新定义了视觉-语言领域的信息增益概念,并证明其可通过嵌入范数高效计算且具有高度相关性。
English Summary: This study introduces a semantic informativeness metric using contrastive learning to quantify how images and texts distort each other's distributions, redefining Information Gain for vision-language domains and demonstrating its efficient computation with high correlation to embedding norms.
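A numerical sketch of the norm-based estimator described above: once the mean and covariance of the text-side embeddings are computed, an image's Information Gain is approximated by a (Mahalanobis-style) norm of its centered embedding, at a cost independent of the sample size. Random vectors stand in for CLIP embeddings here, and the exact scaling the paper derives may differ.

```python
# Estimate Information Gain from embedding norms under the text-distribution
# metric; mu and cov are computed once, so per-image cost is O(d^2).
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(1000, 64))  # stand-in for text embeddings
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

mu = text_emb.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(text_emb, rowvar=False))

def info_gain_estimate(vec):
    """Squared norm of the centered embedding, weighted by the inverse covariance."""
    d = vec - mu
    return float(d @ cov_inv @ d)

informative = rng.normal(size=64)              # generic image, far from the text mean
placeholder = mu + 0.01 * rng.normal(size=64)  # e.g. an "image not found" icon
print(info_gain_estimate(informative) > info_gain_estimate(placeholder))  # True
```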
Authors:Peibo Li, Shuang Ao, Hao Xue, Yang Song, Maarten de Rijke, Johan Barthélemy, Tomasz Bednarz, Flora D. Salim
Abstract:
Large language models (LLMs) have been adopted for next point-of-interest (POI) recommendation tasks. Typical LLM-based recommenders fall into two categories: prompt-based and supervised fine-tuning (SFT)-based models. Prompt-based models generally offer greater output flexibility but deliver lower accuracy, whereas SFT-based models achieve higher performance yet face a fundamental mismatch: next POI recommendation data does not naturally suit supervised fine-tuning. In SFT, the model is trained to reproduce the exact ground truth, but each training example provides only a single target POI, so there is no ground truth for producing a top-k list.
To address this, we propose Refine-POI, a reinforcement fine-tuning framework for next POI recommendation. We introduce recommendation-driven rewards that enable LLMs to learn to generate top-k recommendation lists using only one ground-truth POI per example. Experiments on real-world datasets demonstrate that Refine-POI achieves state-of-the-art top-k recommendation performance.
中文摘要:Refine-POI是一个强化微调框架,通过引入推荐驱动的奖励机制,使大语言模型能够仅凭单个真实POI数据生成精确的top-k兴趣点推荐列表,实现了最先进的推荐性能。
English Summary: Refine-POI is a reinforcement fine-tuning framework that enables large language models to generate accurate top-k point-of-interest recommendations using only single ground-truth POIs per training example, achieving state-of-the-art performance.
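A sketch of a recommendation-driven reward of the kind the abstract describes: the model emits a ranked top-k list, only one ground-truth POI exists per example, and the reward depends on where (and whether) that POI appears. Reciprocal rank is an assumed choice here; the paper's exact reward design may differ.

```python
# Rank-aware reward for reinforcement fine-tuning with a single ground-truth POI.
def topk_reward(predicted: list, ground_truth: str, k: int = 10) -> float:
    ranked = predicted[:k]
    if ground_truth not in ranked:
        return 0.0
    rank = ranked.index(ground_truth) + 1
    return 1.0 / rank  # full credit at rank 1, graded credit lower down

print(topk_reward(["cafe_42", "park_7", "gym_3"], "park_7"))  # 0.5
print(topk_reward(["cafe_42", "gym_3"], "park_7"))            # 0.0
```

Unlike the exact-match objective of SFT, this reward gives partial credit to any list containing the target, which is what lets a single ground-truth POI supervise an entire top-k ranking.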
Authors:Junhao Shi, Zhaoye Fei, Siyin Wang, Qipeng Guo, Jingjing Gong, Xipeng Qiu
Abstract:
Large Vision-Language Models (LVLMs) show promise for embodied planning tasks but struggle with complex scenarios involving unfamiliar environments and multi-step goals. Current approaches rely on environment-agnostic imitation learning that disconnects instructions from environmental contexts, causing models to struggle with context-sensitive instructions and rely on supplementary cues rather than visual reasoning during long-horizon interactions. In this work, we propose World-Aware Planning Narrative Enhancement (WAP), a framework that infuses LVLMs with comprehensive environmental understanding through four cognitive capabilities (visual appearance modeling, spatial reasoning, functional abstraction, and syntactic grounding) while developing and evaluating models using only raw visual observations through curriculum learning. Evaluations on the EB-ALFRED benchmark demonstrate substantial improvements, with Qwen2.5-VL achieving a 60.7-point absolute improvement in task success rate, particularly in commonsense reasoning (+60.0) and long-horizon planning (+70.0). Notably, our enhanced open-source models outperform proprietary systems like GPT-4o and Claude-3.5-Sonnet by a large margin.
中文: 大型视觉语言模型在复杂具身规划任务中表现不佳,但提出的世界感知规划框架通过全面环境理解和课程学习显著提升了模型性能,在基准测试中取得大幅进步并超越专有系统。
English: Large Vision-Language Models struggle with complex embodied planning tasks due to their environment-agnostic approach, but the proposed World-Aware Planning framework significantly enhances their performance through comprehensive environmental understanding and curriculum learning, achieving substantial improvements on benchmarks and surpassing proprietary systems.
Authors:Changxi Chi, Jun Xia, Yufei Huang, Jingbo Zhou, Siyuan Li, Yunfan Liu, Chang Yu, Stan Z. Li
Abstract:
Estimating single-cell responses across various perturbations facilitates the identification of key genes and enhances drug screening, significantly boosting experimental efficiency. However, single-cell sequencing is a destructive process, making it impossible to capture the same cell's phenotype before and after perturbation. Consequently, data collected under perturbed and unperturbed conditions are inherently unpaired. Existing methods either attempt to forcibly pair unpaired data using random sampling, or neglect the inherent relationship between unperturbed and perturbed cells during modeling. In this work, we propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions, effectively addressing the challenge of unpaired data; this framework can also be interpreted as a form of data augmentation. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way, and further incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles. Moreover, gene expression under the same perturbation often varies significantly across cells, frequently exhibiting a bimodal distribution that reflects intrinsic heterogeneity. In summary, we propose Unlasting, a dual conditional diffusion framework that overcomes the problem of unpaired single-cell perturbation data and strengthens the model's insight into perturbations under GRN guidance, with a dedicated mask model designed to improve generation quality by predicting silent genes. In addition, we introduce a biologically grounded evaluation metric that better reflects the inherent heterogeneity in single-cell responses.
中文摘要:本研究提出的Unlasting框架采用基于基因调控网络的双条件扩散模型解决单细胞扰动数据不配对问题,通过掩蔽机制预测沉默基因,并引入更符合生物学特性的评估指标来捕捉细胞异质性。
English Summary: The proposed Unlasting framework employs dual conditional diffusion models guided by gene regulatory networks to address unpaired single-cell perturbation data, incorporating a masking mechanism to predict silent genes and introducing a biologically meaningful evaluation metric for cellular heterogeneity.
Authors:Baixiang Huang, Zhen Tan, Haoran Wang, Zijie Liu, Dawei Li, Ali Payani, Huan Liu, Tianlong Chen, Kai Shu
Abstract:
Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent's global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through comprehensive evaluations on agents based on frontier LLMs, BehaviorBench shows the effectiveness of Behavior Editing across different models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.
中文:行为编辑作为一种模型编辑方法被提出,通过基于心理学道德理论的多层级基准BehaviorBench,能够动态引导基于大语言模型的智能体行为,在特定场景和全局道德层面对不同模型均展现出有效调控能力。
English: Behavior Editing is introduced as a model editing approach to dynamically steer LLM-based agents' ethical conduct, utilizing the BehaviorBench benchmark to demonstrate its effectiveness in both local scenario adjustments and global moral alignment shifts across diverse models.
Authors:Mimo Shirasaka, Yuya Ikeda, Tatsuya Matsushima, Yutaka Matsuo, Yusuke Iwasawa
Abstract:
The ability to update information acquired through various means online during task execution is crucial for a general-purpose service robot. This information includes geometric and semantic data. While SLAM handles geometric updates on 2D maps or 3D point clouds, online updates of semantic information remain unexplored. We attribute the challenge to the design of an online scene graph representation that is both useful and scalable. Building on previous work on offline scene graph representations, we study online graph representations of semantic information. We introduce SPARK: Spatial Perception and Robot Knowledge Integration. This framework extracts semantic information from environment-embedded cues and updates the scene graph accordingly, which is then used for subsequent task planning. We demonstrate that graph representations of spatial relationships enhance the robot system's ability to perform tasks in dynamic environments and adapt to unconventional spatial cues, like gestures.
中文摘要:SPARK框架通过在线场景图使服务机器人能动态更新语义信息,利用手势等空间线索增强任务规划能力,提升在动态环境中的适应性。
English Summary: SPARK is a framework that enables service robots to dynamically update semantic information through online scene graphs, enhancing task planning and adaptability in dynamic environments using spatial cues like gestures.
Authors:Shulun Chen, Wei Shao, Flora D. Salim, Hao Xue
Abstract:
Supporting decision-making has long been a central vision in the field of spatio-temporal intelligence. While prior work has improved the timeliness and accuracy of spatio-temporal forecasting, converting these forecasts into actionable strategies remains a key challenge. A main limitation is the decoupling of the prediction and downstream decision phases, which can significantly degrade downstream efficiency. For example, in emergency response, the priority is successful resource allocation and intervention, not just incident prediction. To this end, we propose ASTER, an Adaptive Spatio-Temporal Early Decision model that reforms the forecasting paradigm from event anticipation to actionable decision support. This framework ensures that information is directly used for decision-making, thereby maximizing overall effectiveness. Specifically, ASTER introduces a new Resource-aware Spatio-Temporal interaction module (RaST) that adaptively captures long- and short-term dependencies under dynamic resource conditions, producing context-aware spatio-temporal representations. To directly generate actionable decisions, we further design a Preference-oriented decision agent (Poda) based on multi-objective reinforcement learning, which transforms predictive signals into resource-efficient intervention strategies by deriving optimal actions under specific preferences and dynamic constraints. Experimental results on four benchmark datasets demonstrate the state-of-the-art performance of ASTER in improving both early prediction accuracy and resource allocation outcomes across six downstream metrics.
中文摘要:ASTER模型通过资源感知的时空交互模块和偏好导向的决策代理,将预测范式从事件预警转变为可操作的决策支持,显著提升了预测精度与资源配置效果。
English Summary: The ASTER model transforms spatio-temporal forecasting into actionable decision support by integrating resource-aware predictions with preference-driven strategies, enhancing both prediction accuracy and resource allocation efficiency.
Authors:Siran Dai, Qianqian Xu, Peisong Wen, Yang Liu, Qingming Huang
Abstract:
Image-based cell profiling aims to create informative representations of cell images. This technique is critical in drug discovery and has greatly advanced with recent improvements in computer vision. Inspired by recent developments in non-contrastive Self-Supervised Learning (SSL), this paper provides an initial exploration into training a generalizable feature extractor for cell images using such methods. However, there are two major challenges: 1) There is a large difference between the distributions of cell images and natural images, causing the view-generation process in existing SSL methods to fail; and 2) Unlike typical scenarios where each representation is based on a single image, cell profiling often involves multiple input images, making it difficult to effectively combine all available information. To overcome these challenges, we propose SSLProfiler, a non-contrastive SSL framework specifically designed for cell profiling. We introduce specialized data augmentation and representation post-processing methods tailored to cell images, which effectively address the issues mentioned above and result in a robust feature extractor. With these improvements, SSLProfiler won the Cell Line Transferability challenge at CVPR 2025.
Chinese: 本文提出SSLProfiler,一种专为细胞图像设计的非对比自监督学习框架,通过定制化的数据增强和表征后处理方法有效解决了现有技术对细胞图像的适配难题,并荣获CVPR 2025细胞系可迁移性挑战赛冠军。
English: This paper introduces SSLProfiler, a non-contrastive self-supervised learning framework that overcomes challenges in adapting SSL to cell images by using specialized data augmentation and representation post-processing, winning the CVPR 2025 Cell Line Transferability challenge.
Authors:Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Abstract:
In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B
Chinese: 本研究证明,通过将监督微调与精心校准的强化学习相结合,特别是保持采样温度使调整后的熵约为0.3,能显著提升推理模型性能,AceReason-Nemotron-1.1 7B模型在数学和编程基准测试中取得的顶尖成果即印证了此方法的有效性。
English: This study demonstrates that combining supervised fine-tuning with carefully calibrated reinforcement learning, particularly by maintaining a sampling temperature that adjusts entropy to around 0.3, significantly enhances reasoning model performance, as evidenced by the state-of-the-art results of the AceReason-Nemotron-1.1 7B model on math and coding benchmarks.
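The tuning rule quoted above, choosing the sampling temperature so that the temperature-adjusted entropy sits near 0.3, reduces to a one-dimensional search, since softmax entropy increases monotonically with temperature. The sketch below solves it by bisection on synthetic logits; in practice one would average the entropy over many generation steps, and the interval bounds here are assumptions.

```python
# Find the sampling temperature T whose softmax entropy matches a target value.
import numpy as np

def entropy_at_temperature(logits, T):
    z = logits / T
    z -= z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def solve_temperature(logits, target=0.3, lo=0.05, hi=5.0, iters=50):
    for _ in range(iters):            # entropy is monotone in T, so bisect
        mid = 0.5 * (lo + hi)
        if entropy_at_temperature(logits, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

logits = np.array([4.0, 2.5, 1.0, 0.2, -1.0])   # synthetic per-step logits
T = solve_temperature(logits)
print(round(T, 3), round(entropy_at_temperature(logits, T), 3))  # entropy ~= 0.3
```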
Authors:Zijie Lin, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Fuli Feng, Tat-Seng Chua
Abstract:
Large Language Models (LLMs) have shown strong potential for recommendation by framing item prediction as a token-by-token language generation task. However, existing methods treat all item tokens equally, simply pursuing likelihood maximization during both optimization and decoding. This overlooks crucial token-level differences in decisiveness: many tokens contribute little to item discrimination yet can dominate optimization or decoding. To quantify token decisiveness, we propose a novel perspective that models item generation as a decision process, measuring token decisiveness by the Information Gain (IG) each token provides in reducing uncertainty about the generated item. Our empirical analysis reveals that most tokens have low IG but often correspond to high logits, disproportionately influencing training loss and decoding, which may impair model performance. Building on these insights, we introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding. Specifically, IGD downweights low-IG tokens during tuning and rebalances decoding to emphasize tokens with high IG. In this way, IGD moves beyond pure likelihood maximization, effectively prioritizing high-decisiveness tokens. Extensive experiments on four benchmark datasets with two LLM backbones demonstrate that IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.
中文摘要:本研究提出基于信息增益的决策感知令牌处理(IGD)策略,通过在模型优化和解码阶段重点处理高决策性令牌,相比传统的似然最大化方法显著提升了推荐系统的准确性。
English Summary: The study introduces an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that prioritizes high-decisiveness tokens during model tuning and decoding, significantly improving recommendation accuracy over conventional likelihood-maximization approaches.
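To make token decisiveness concrete, here is a toy sketch: with a closed candidate item set, a token's Information Gain is the drop in entropy over the items still consistent with the decoded prefix, and low-IG tokens then receive smaller loss weights. The item strings and the flooring rule are illustrative assumptions, not the paper's estimator.

```python
# Per-token Information Gain over a toy item catalog, plus a loss-weighting rule.
import math

ITEMS = ["red shirt", "red shoes", "blue shirt", "blue jeans"]

def entropy_over(items):
    if not items:
        return 0.0
    p = 1.0 / len(items)
    return -len(items) * p * math.log(p)  # uniform entropy = log(|items|)

def token_ig(prefix: str, token: str) -> float:
    before = [i for i in ITEMS if i.startswith(prefix)]
    after = [i for i in before if i.startswith(prefix + token)]
    return entropy_over(before) - entropy_over(after)

def token_weight(ig, floor=0.2):
    return max(floor, ig)  # downweight low-IG tokens rather than dropping them

for prefix, tok in [("", "red"), ("red ", "shirt")]:
    ig = token_ig(prefix, tok)
    print(f"{tok!r}: IG={ig:.3f}, loss weight={token_weight(ig):.2f}")
# 'red' halves the candidate set; 'shirt' then pins down the exact item.
```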
Authors:Boris Ivanovic, Cristiano Saltori, Yurong You, Yan Wang, Wenjie Luo, Marco Pavone
Abstract:
Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.
中文: 自回归Transformer因其可扩展性和泛化潜力,正越来越多地被用作机器人和自动驾驶车辆的端到端策略架构,我们提出的基于三平面的多摄像头标记化方法可减少多达72%的标记,在保持精度的同时将推理速度提升50%。
English: Autoregressive Transformers are gaining traction as end-to-end policies for robots and autonomous vehicles due to their scalability and generalization potential, with our triplane-based multi-camera tokenization method reducing tokens by up to 72% and speeding up inference by 50% while maintaining accuracy.
Authors:Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Abstract:
Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). Besides, a custom Triton kernel, FlashFourierAttention, is designed to optimize memory via streamlined read-write operations, enabling efficient deployment without performance compromise.
Chinese: FourierAttention是一种无需训练的框架,通过利用Transformer头维度的异质性角色,将长上下文不敏感的维度投影到傅里叶基上,从而压缩大型语言模型的KV缓存,在LongBench和NIAH基准测试中实现了最佳的长上下文准确性,并通过定制的Triton内核优化内存。
English: FourierAttention is a training-free framework that compresses the KV cache in large language models by leveraging the heterogeneous roles of transformer head dimensions and projecting long-context-insensitive dimensions onto Fourier bases, achieving superior long-context accuracy on benchmarks like LongBench and NIAH while optimizing memory with a custom Triton kernel.
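The compression idea lends itself to a linear-algebra sketch: for head dimensions deemed long-context-insensitive, the length-T trajectory of each dimension is projected onto a few orthonormal Fourier basis vectors, so only fixed-length spectral coefficients need to be cached. Which dimensions to compress and how many coefficients to keep are assumptions here, not FourierAttention's actual selection rule.

```python
# Project per-dimension KV trajectories onto a truncated Fourier basis and
# reconstruct them approximately from the fixed-size coefficient cache.
import numpy as np

def fourier_basis(T, n_coeff):
    t = np.arange(T)
    cols = [np.ones(T) / np.sqrt(T)]            # DC component
    k = 1
    while len(cols) < n_coeff:
        cols.append(np.cos(2 * np.pi * k * t / T) * np.sqrt(2.0 / T))
        if len(cols) < n_coeff:
            cols.append(np.sin(2 * np.pi * k * t / T) * np.sqrt(2.0 / T))
        k += 1
    return np.stack(cols, axis=1)               # (T, n_coeff), orthonormal columns

T, d, n_coeff = 512, 8, 16
kv = np.cumsum(np.random.default_rng(0).normal(size=(T, d)), axis=0)  # smooth-ish
B = fourier_basis(T, n_coeff)
coeffs = B.T @ kv                                # fixed-size cache: (n_coeff, d)
kv_hat = B @ coeffs                              # approximate reconstruction
err = np.linalg.norm(kv - kv_hat) / np.linalg.norm(kv)
print(f"cache reduced {T} -> {n_coeff} per dimension, relative error {err:.3f}")
```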
Authors:David Acuna, Ximing Lu, Jaehun Jung, Hyunwoo Kim, Amlan Kar, Sanja Fidler, Yejin Choi
Abstract:
Recent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning -- akin to the success observed in language models -- via distillation and reinforcement learning. But what about the non-reasoning models already trained and deployed across the internet? Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces -- without any additional training or supervision? In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer pairs into the model's output stream. We show that framing reasoning as a search process -- where subquestions act as latent decisions within a broader inference trajectory -- helps the model "connect the dots" between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements. Notably, our approach yields a 2% overall improvement on MMMU-PRO, including a significant 9% gain in Liberal Arts.
中文摘要:本文提出一种受蒙特卡洛树搜索启发的算法,通过注入子问题-子答案对使非推理视觉语言模型能够进行长链推理,在多个基准测试中取得稳定提升,其中人文科目显著提高9%。
English Summary: This paper introduces a Monte Carlo Tree Search-inspired algorithm that enables non-reasoning vision-language models to perform long-form reasoning by injecting subquestion-subanswer pairs, achieving consistent improvements across benchmarks including a 9% gain in Liberal Arts.
Authors:Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, Yankai Lin
Abstract:
Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still lags behind. Specifically, our preliminary experiments reveal that certain distracting patterns can misdirect the model's attention during inference, and removing these patterns substantially improves reasoning accuracy and generation quality. We attribute this phenomenon to spurious correlations in the training data, which obstruct the model's capacity to infer authentic causal instruction-response relationships. This phenomenon may induce redundant reasoning processes, potentially resulting in significant inference overhead and, more critically, the generation of erroneous or suboptimal responses. To mitigate this, we introduce a two-stage framework called Learning to Focus (LeaF) leveraging intervention-based inference to disentangle confounding factors. In the first stage, LeaF employs gradient-based comparisons with an advanced teacher to automatically identify confounding tokens based on causal relationships in the training corpus. Then, in the second stage, it prunes these tokens during distillation to enact intervention, aligning the student's attention with the teacher's focus distribution on truly critical context tokens. Experimental results demonstrate that LeaF not only achieves absolute improvements on various mathematical reasoning and code generation benchmarks but also effectively suppresses attention to confounding tokens during inference, yielding a more interpretable and reliable reasoning model.
中文: 提出的“学习聚焦”框架通过基于干预的推理识别并剪除混淆标记,有效缓解了大语言模型中的注意力分散问题,从而在多项基准测试中提升了推理准确性和生成质量。
English: The proposed Learning to Focus (LeaF) framework mitigates distracting patterns in large language models by identifying and pruning confounding tokens through intervention-based inference, thereby improving reasoning accuracy and generation quality across benchmarks.
Authors:Jae Sung Park, Zixian Ma, Linjie Li, Chenhao Zheng, Cheng-Yu Hsieh, Ximing Lu, Khyathi Chandu, Quan Kong, Norimasa Kobori, Ali Farhadi, Yejin Choi, Ranjay Krishna
Abstract:
Reasoning over visual relationships (spatial, functional, interactional, social, etc.) is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generation remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset built by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Results show that our ROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.
Chinese: 本研究推出了ROBIN模型,通过密集标注的关系进行指令微调,能够生成高质量场景图,在视觉关系理解任务中取得领先性能,且训练规模更小但效果显著。
English: The study introduces ROBIN, a multimodal language model fine-tuned with densely annotated relationships to generate high-quality scene graphs, achieving state-of-the-art performance in visual reasoning tasks despite its smaller training scale.
Authors:Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, Dakai An, Lunxi Cao, Qiyang Cao, Wanxi Deng, Feilei Du, Yiliang Gu, Jiahe Li, Xiang Li, Mingjie Liu, Yijia Luo, Zihe Liu, Yadao Wang, Pei Wang, Tianyuan Wu, Yanan Wu, Yuheng Zhao, Shuaibing Zhao, Jin Yang, Siran Yang, Yingshui Tan, Huimin Yi, Yuchi Xu, Yujin Yuan, Xingyao Zhang, Lin Qu, Wenbo Su, Wei Wang, Jiamang Wang, Bo Zheng
Abstract:
We introduce ROLL, an efficient, scalable, and user-friendly library designed for Reinforcement Learning Optimization for Large-scale Learning. ROLL caters to three primary user groups: tech pioneers aiming for cost-effective, fault-tolerant large-scale training, developers requiring flexible control over training workflows, and researchers seeking agile experimentation. ROLL is built upon several key modules to serve these user groups effectively. First, a single-controller architecture combined with an abstraction of the parallel worker simplifies the development of the training pipeline. Second, the parallel strategy and data transfer modules enable efficient and scalable training. Third, the rollout scheduler offers fine-grained management of each sample's lifecycle during the rollout stage. Fourth, the environment worker and reward worker support rapid and flexible experimentation with agentic RL algorithms and reward designs. Finally, AutoDeviceMapping allows users to assign resources to different models flexibly across various stages.
中文: ROLL是一个高效、可扩展且用户友好的强化学习优化库,专为大规模学习设计,通过单控制器架构、并行策略模块和灵活的资源分配,满足技术先锋、开发者和研究人员的多样化需求。
English: ROLL is an efficient, scalable, and user-friendly reinforcement learning library designed for large-scale optimization, featuring a single-controller architecture, parallel strategy modules, and flexible resource management to serve tech pioneers, developers, and researchers.
Authors:Yihang Wang, Yuying Qiu, Peng Chen, Yang Shu, Zhongwen Rao, Lujia Pan, Bin Yang, Chenjuan Guo
Abstract:
Existing works on general time series forecasting build foundation models with heavy model parameters through large-scale multi-source pre-training. These models achieve superior generalization ability across various datasets at the cost of significant computational burdens and limitations in resource-constrained scenarios. This paper introduces LightGTS, a lightweight general time series forecasting model designed from the perspective of consistent periodical modeling. To handle diverse scales and intrinsic periods in multi-source pre-training, we introduce Periodical Tokenization, which extracts consistent periodic patterns across different datasets with varying scales. To better utilize the periodicity in the decoding process, we further introduce Periodical Parallel Decoding, which leverages historical tokens to improve forecasting. Based on the two techniques above which fully leverage the inductive bias of periods inherent in time series, LightGTS uses a lightweight model to achieve outstanding performance on general time series forecasting. It achieves state-of-the-art forecasting performance on 9 real-world benchmarks in both zero-shot and full-shot settings with much better efficiency compared with existing time series foundation models.
中文:LightGTS是一种轻量级通用时间序列预测模型,通过周期性标记化和周期性并行解码技术,利用时间序列固有的周期性偏差,在多个基准测试中以高效方式实现了顶尖性能。
English: LightGTS is a lightweight general time series forecasting model that leverages consistent periodical modeling through Periodical Tokenization and Periodical Parallel Decoding to achieve state-of-the-art performance with high efficiency across multiple benchmarks.
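The idea behind period-consistent tokenization can be sketched briefly: estimate the dominant period of a series (here via the FFT) and cut the series into period-length tokens, so that one token spans one full cycle regardless of the dataset's sampling scale. The period-detection rule and reshaping below are assumptions for illustration, not LightGTS's exact tokenizer.

```python
# Estimate the dominant period with the FFT, then reshape the series so each
# token covers exactly one cycle.
import numpy as np

def dominant_period(x):
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x))
    spectrum[0] = 0.0                      # ignore the DC component
    k = int(spectrum.argmax())             # strongest frequency bin
    return max(1, round(len(x) / k))

def periodical_tokenize(x):
    p = dominant_period(x)
    n = (len(x) // p) * p                  # drop the incomplete trailing cycle
    return x[:n].reshape(-1, p), p         # (n_tokens, period)

t = np.arange(240)
series = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.default_rng(1).normal(size=240)
tokens, p = periodical_tokenize(series)
print(p, tokens.shape)                     # 24 (10, 24): one token per daily cycle
```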
Authors:Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Abstract:
Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In this work, we introduce the notion of a reasoning graph, extracted by clustering hidden-state representations at each reasoning step, and systematically analyze three key graph-theoretic properties: cyclicity, diameter, and small-world index, across multiple tasks (GSM8K, MATH500, AIME 2024). Our findings reveal that distilled reasoning models (e.g., DeepSeek-R1-Distill-Qwen-32B) exhibit significantly more recurrent cycles (about 5 per sample), substantially larger graph diameters, and pronounced small-world characteristics (about 6x) compared to their base counterparts. Notably, these structural advantages grow with task difficulty and model capacity, with cycle detection peaking at the 14B scale and exploration diameter maximized in the 32B variant, correlating positively with accuracy. Furthermore, we show that supervised fine-tuning on an improved dataset systematically expands reasoning graph diameters in tandem with performance gains, offering concrete guidelines for dataset design aimed at boosting reasoning capabilities. By bridging theoretical insights into reasoning graph structures with practical recommendations for data construction, our work advances both the interpretability and the efficacy of large reasoning models.
Chinese: 本研究引入推理图分析大规模推理模型,发现蒸馏模型具有更多循环、更大直径和更强的小世界特性,这些结构与性能提升相关,并为优化数据集设计提供了指导。
English: This study introduces reasoning graphs to analyze large-scale reasoning models, revealing that distilled models exhibit more cycles, larger diameters, and stronger small-world properties, which correlate with improved performance and offer guidelines for enhancing dataset design.
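A rough sketch of how such a reasoning graph could be extracted and measured, assuming k-means clustering of per-step hidden states; the cluster count, the use of sklearn/networkx, and the exact metric definitions are assumptions and may differ from the paper's procedure.

```python
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

def reasoning_graph(hidden_states: np.ndarray, n_clusters: int = 8) -> nx.DiGraph:
    """Cluster per-step hidden states and link consecutive steps' clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(hidden_states)
    g = nx.DiGraph()
    g.add_edges_from(zip(labels[:-1], labels[1:]))
    return g

def graph_properties(g: nx.DiGraph) -> dict:
    ug = g.to_undirected()
    comp = ug.subgraph(max(nx.connected_components(ug), key=len))
    return {
        "n_cycles": len(list(nx.simple_cycles(g))),  # recurrent revisits of states
        "diameter": nx.diameter(comp),               # breadth of exploration
        "clustering": nx.average_clustering(comp),   # small-world ingredient
    }

if __name__ == "__main__":
    steps = np.random.randn(40, 64)  # 40 reasoning steps, 64-dim hidden states
    print(graph_properties(reasoning_graph(steps)))
```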

Authors:Yifan Wang, Peishan Yang, Zhen Xu, Jiaming Sun, Zhanhua Zhang, Yong Chen, Hujun Bao, Sida Peng, Xiaowei Zhou
Abstract:
This paper addresses the challenge of reconstructing dynamic 3D scenes with complex motions. Some recent works define 3D Gaussian primitives in the canonical space and use deformation fields to map canonical primitives to observation spaces, achieving real-time dynamic view synthesis. However, these methods often struggle to handle scenes with complex motions due to the difficulty of optimizing deformation fields. To overcome this problem, we propose FreeTimeGS, a novel 4D representation that allows Gaussian primitives to appear at arbitrary times and locations. In contrast to canonical Gaussian primitives, our representation possesses strong flexibility, thus improving the ability to model dynamic 3D scenes. In addition, we endow each Gaussian primitive with a motion function, allowing it to move to neighboring regions over time, which reduces temporal redundancy. Experimental results on several datasets show that the rendering quality of our method outperforms recent methods by a large margin. Project page: https://zju3dv.github.io/freetimegs/.
中文: 本文提出FreeTimeGS,一种创新的4D表示方法,允许高斯基元在任意时间和位置出现,增强了复杂运动动态3D场景的建模能力并减少时间冗余,在渲染质量上大幅超越现有方法。
English: This paper introduces FreeTimeGS, a novel 4D representation that enables Gaussian primitives to appear at any time and location, enhancing the modeling of dynamic 3D scenes with complex motions and reducing temporal redundancy, significantly outperforming recent methods in rendering quality.
Authors:Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng
Abstract:
Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.
中文摘要:本文提出ReVisual-R1模型,通过分阶段训练和优化初始化策略突破多模态推理瓶颈,在多项基准测试中实现最优性能。
English Summary: This paper introduces ReVisual-R1, a novel MLLM that overcomes multimodal reasoning limitations through staged training and optimized initialization, achieving state-of-the-art performance on multiple benchmarks.
Authors:Jun-Peng Jiang, Yu Xia, Hai-Long Sun, Shiyin Lu, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Abstract:
Tabular reasoning involves multi-step information extraction and logical inference over tabular data. While recent advances have leveraged large language models (LLMs) for reasoning over structured tables, such high-quality textual representations are often unavailable in real-world settings, where tables typically appear as images. In this paper, we tackle the task of tabular reasoning from table images, leveraging privileged structured information available during training to enhance multimodal large language models (MLLMs). The key challenges lie in the complexity of accurately aligning structured information with visual representations, and in effectively transferring structured reasoning skills to MLLMs despite the input modality gap. To address these, we introduce TabUlar Reasoning with Bridged infOrmation (Turbo), a new framework for multimodal tabular reasoning with privileged structured tables. Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1, contributing to high-quality modality-bridged data. On this basis, Turbo repeatedly generates and selects the advantageous reasoning paths, further enhancing the model's tabular reasoning ability. Experimental results demonstrate that, with limited (9k) data, Turbo achieves state-of-the-art performance (+7.2% vs. previous SOTA) across multiple datasets.
中文: 本文提出Turbo框架,通过利用训练期间的特权结构化表格来弥合模态差距并提升推理准确性,在有限数据下实现了多模态表格推理的最优性能。
English: This paper introduces Turbo, a framework that enhances multimodal tabular reasoning by leveraging privileged structured tables during training to bridge the modality gap and improve reasoning accuracy, achieving state-of-the-art results with limited data.
Authors:Xiwei Xu, Hans Weytjens, Dawen Zhang, Qinghua Lu, Ingo Weber, Liming Zhu
Abstract:
Recent studies show that 60% of LLM-based compound systems in enterprise environments leverage some form of retrieval-augmented generation (RAG), which enhances the relevance and accuracy of LLM (or other genAI) outputs by retrieving relevant information from external data sources. LLMOps involves the practices and techniques for managing the lifecycle and operations of LLM compound systems in production environments. It supports enhancing LLM systems through continuous operations and feedback evaluation. RAGOps extends LLMOps by incorporating a strong focus on data management to address the continuous changes in external data sources. This necessitates automated methods for evaluating and testing data operations, enhancing retrieval relevance and generation quality. In this paper, we (1) characterize the generic architecture of RAG applications based on the 4+1 model view for describing software architectures, (2) outline the lifecycle of RAG systems, which integrates the management lifecycles of both the LLM and the data, (3) define the key design considerations of RAGOps across different stages of the RAG lifecycle and quality trade-off analyses, (4) highlight the overarching research challenges around RAGOps, and (5) present two use cases of RAG applications and the corresponding RAGOps considerations.
中文摘要:近期研究表明60%的企业级LLM系统采用检索增强生成技术提升输出质量,本文提出RAGOps作为LLMOps的延伸框架,重点解决数据生命周期管理问题,并系统阐述了RAG应用的架构设计、生命周期整合及运维实践。
English Summary: Recent enterprise studies show that 60% of LLM systems use retrieval-augmented generation (RAG) to improve output quality, while this paper introduces RAGOps as an extension of LLMOps focusing on data lifecycle management and presents architectural analysis, lifecycle frameworks, and operational considerations for RAG systems.
Authors:Long Tan Le, Senura Hansaja Wanasekara, Zerun Niu, Yansong Shi, Nguyen H. Tran, Phuong Vo, Walid Saad, Dusit Niyato, Zhu Han, Choong Seon Hong, H. Vincent Poor
Abstract:
6G wireless systems are expected to support massive volumes of data with ultra-low latency. However, conventional bit-level transmission strategies cannot support the efficiency and adaptability required by modern, data-intensive applications. The concept of semantic communication (SemCom) addresses this limitation by focusing on transmitting task-relevant semantic information instead of raw data. While recent efforts incorporating deep learning and large-scale AI models have improved SemCom's performance, existing systems remain vulnerable to both semantic-level and transmission-level noise because they often rely on domain-specific architectures that hinder generalizability. In this paper, a novel and generalized semantic communication framework called WaSeCom is proposed to systematically address uncertainty and enhance robustness. In particular, Wasserstein distributionally robust optimization is employed to provide resilience against semantic misinterpretation and channel perturbations. A rigorous theoretical analysis is performed to establish the robust generalization guarantees of the proposed framework. Experimental results on image and text transmission demonstrate that WaSeCom achieves improved robustness under noise and adversarial perturbations. These results highlight its effectiveness in preserving semantic fidelity across varying wireless conditions.
中文:提出的WaSeCom框架采用Wasserstein鲁棒优化技术,有效提升语义通信在噪声和对抗干扰下的稳健性,确保在不同无线环境中保持可靠的语义保真度。
English: The proposed WaSeCom framework utilizes Wasserstein robust optimization to enhance semantic communication's resilience against noise and adversarial attacks, ensuring reliable performance across diverse wireless environments.
Authors:Ziye Jia, Can Cui, Chao Dong, Qihui Wu, Zhuang Ling, Dusit Niyato, Zhu Han
Abstract:
With the rapid growth of computation demands, aerial multi-access edge computing (MEC), mainly based on unmanned aerial vehicles (UAVs) and high-altitude platforms (HAPs), plays a significant role in future network scenarios. In detail, UAVs can be flexibly deployed, while HAPs are characterized by large capacity and stability. Hence, in this paper, we provide a hierarchical model composed of an HAP and multiple UAVs to provide aerial MEC services. Moreover, considering the errors of channel state information from unpredictable environmental conditions, we formulate the problem to minimize the total energy cost with a chance constraint, which is a mixed-integer nonlinear problem with uncertain parameters that is intractable to solve. To tackle this issue, we optimize the UAV deployment via the weighted K-means algorithm. Then, the chance constraint is reformulated via distributionally robust optimization (DRO). Furthermore, based on the conditional value-at-risk mechanism, we transform the DRO problem into a mixed-integer second-order cone program, which is further decomposed into two subproblems via primal decomposition. Moreover, to alleviate the complexity of the binary subproblem, we design a binary whale optimization algorithm. Finally, we conduct extensive simulations to verify the effectiveness and robustness of the proposed schemes by comparing them with baseline mechanisms.
中文: 针对空中多接入边缘计算,本文提出了由高空平台和多无人机组成的层次模型,通过加权K均值算法和二进制鲸鱼优化算法等方法,在信道不确定条件下有效降低了系统能耗。
English: Aerial multi-access edge computing using UAVs and HAPs is modeled hierarchically to minimize energy costs under channel uncertainty, with optimization methods including weighted K-means and a binary whale algorithm proving effective in simulations.
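A minimal sketch of the weighted K-means deployment step, where each UAV position is pulled toward the demand-weighted centroid of its assigned ground users; the demand weights and the plain Lloyd-style update are illustrative assumptions.

```python
import numpy as np

def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Weighted K-means: centroids are demand-weighted means of assigned users.

    points: (n, 2) ground-user positions; weights: (n,) task demands (assumed).
    Returns (k, 2) UAV deployment positions and each user's assignment.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each user to the nearest UAV.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Move each UAV to the demand-weighted centroid of its users.
        for j in range(k):
            m = assign == j
            if m.any():
                w = weights[m][:, None]
                centers[j] = (w * points[m]).sum(axis=0) / w.sum()
    return centers, assign

if __name__ == "__main__":
    users = np.random.rand(200, 2) * 1000.0   # user positions in meters
    demand = np.random.rand(200) + 0.1        # offloaded-task weights
    uav_pos, assign = weighted_kmeans(users, demand, k=4)
    print(uav_pos)
```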
Authors:Wenyan Cong, Yiqing Liang, Yancheng Zhang, Ziyi Yang, Yan Wang, Boris Ivanovic, Marco Pavone, Chen Chen, Zhangyang Wang, Zhiwen Fan
Abstract:
Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged, directly predicting dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. Since late 2023, the field has exploded with diverse variants, but systematic evaluation is lacking. In this work, we present the first comprehensive benchmark for 3D GFMs, covering five core tasks (sparse-view depth estimation, video depth estimation, 3D reconstruction, multi-view pose estimation, and novel view synthesis) and spanning both standard and challenging out-of-distribution datasets. Our standardized toolkit automates dataset handling, evaluation protocols, and metric computation to ensure fair, reproducible comparisons. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains, and derive key insights to guide future model scaling and optimization. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
Chinese: 本文首次提出了针对三维几何基础模型的全面基准,通过标准化流程评估了16种前沿模型在五项核心空间智能任务中的表现,揭示了其优势与不足。
English: This paper introduces the first comprehensive benchmark for 3D geometric foundation models (GFMs), evaluating 16 state-of-the-art models across five core spatial intelligence tasks using standardized protocols to reveal their capabilities and limitations.
Authors:Farong Wen, Yijin Guo, Junying Wang, Jiaohao Xiao, Yingjie Zhou, Ye Shen, Qi Jia, Chunyi Li, Zicheng Zhang
Abstract:
The rapid development of Multimodal Large Language Models (MLLMs) has led to a wide range of MLLM applications, and a number of benchmark datasets have sprung up in order to assess MLLM abilities. However, full-coverage Q&A testing on large-scale data is resource-intensive and time-consuming. To address this issue, we propose the MLLM Interview (MITV) strategy, which aims to quickly obtain MLLM performance metrics by quizzing fewer questions. First, we constructed the interview dataset, which was built on an existing MLLM assessment dataset, by adding difficulty labels based on the performance of some typical MLLMs on this dataset. Second, we propose an MLLM Interview strategy, which obtains an initial picture of the large model's performance by quizzing a small number of topics and then continuously tries to test the model's limits. Through extensive experiments, the results show that the MITV strategy proposed in this paper performs well on MLLM benchmark datasets, and it is able to assess model capabilities faster through a small number of questions and answers.
中文: 本文提出的MLLM Interview (MITV)策略通过构建带难度标注的访谈数据集和渐进式测试方法,能用少量问题快速评估多模态大语言模型的性能极限。
English: The MLLM Interview (MITV) strategy is proposed to efficiently evaluate Multimodal Large Language Models by testing fewer questions with difficulty labels, enabling faster performance assessment while maintaining accuracy.
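A toy sketch of the interview idea: probe a few questions per difficulty level and climb until accuracy drops, locating the model's limit with far fewer questions than full coverage. The pass threshold, probe size, and grading callable are all assumptions.

```python
def interview(grade, questions_by_level, start_level=1, max_level=5):
    """Adaptive interview: quiz a few items per difficulty level and climb
    until accuracy falls below an (assumed) 0.5 pass threshold.

    grade(q) -> bool is a callable judging one question (hypothetical API).
    Returns the highest difficulty level the model reliably handles.
    """
    passed = 0
    for level in range(start_level, max_level + 1):
        probes = questions_by_level[level][:5]          # small probe set
        acc = sum(grade(q) for q in probes) / len(probes)
        if acc < 0.5:                                   # found the limit
            break
        passed = level
    return passed

if __name__ == "__main__":
    import random
    # Toy model: succeeds on levels 1-3, mostly fails above.
    grade = lambda q: q["level"] <= 3 or random.random() < 0.2
    bank = {l: [{"level": l}] * 5 for l in range(1, 6)}
    print(interview(grade, bank))  # typically prints 3
```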
Authors:Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Abstract:
High labeling cost for in-context learning (ICL) demonstrations motivates using large language models (LLMs) for synthesis to reduce overhead. However, existing synthesis methods are mainly task-specific or rely on pre-existing demonstrations. This paper therefore focuses on synthesizing demonstrations from scratch for arbitrary tasks. A major challenge in synthesizing from scratch is ensuring consistency with the target task, as the lack of labeling guidance could lead to synthesis bias. We first propose a consistency metric called V-Score, which has higher performance and lower computation cost compared with metrics based on grams or embedding vectors. Furthermore, we introduce V-Synthesis, which leverages V-Score for proportional sampling to ensure both high consistency and diversity of synthesized demonstrations. Experimental results demonstrate that V-Synthesis yields an average performance improvement of 2.0% compared to existing synthesis methods, confirming the effectiveness of V-Synthesis.
Chinese: 本文提出V-Synthesis方法,通过新型V-Score指标从零生成上下文学习示例,在降低标注成本的同时确保高一致性和多样性,实验表明其性能比现有方法平均提升2.0%。
English: This paper introduces V-Synthesis, a method that uses the novel V-Score metric to synthesize in-context learning demonstrations from scratch, ensuring high consistency and diversity while reducing labeling costs and outperforming existing methods by 2.0%.
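A minimal sketch of V-Score-proportional sampling, assuming the consistency scores are already computed; the temperature knob and sampling without replacement are illustrative choices rather than the paper's exact procedure.

```python
import numpy as np

def proportional_sample(candidates, v_scores, n, temperature=1.0, seed=0):
    """Sample n demonstrations with probability proportional to their V-Score,
    favoring high-consistency items while preserving diversity.

    v_scores: consistency scores in [0, 1]; how the score itself is computed
    is omitted here and assumed given.
    """
    rng = np.random.default_rng(seed)
    s = np.asarray(v_scores, dtype=float) ** (1.0 / temperature)
    p = s / s.sum()
    idx = rng.choice(len(candidates), size=n, replace=False, p=p)
    return [candidates[i] for i in idx]

if __name__ == "__main__":
    demos = [f"demo_{i}" for i in range(10)]
    scores = np.linspace(0.1, 0.9, 10)
    print(proportional_sample(demos, scores, n=3))
```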
Authors:Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Abstract:
In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.
中文摘要:学习上下文斜率(LCS)作为一种新指标,通过建模学习增益与上下文相关性之间的关系来可靠量化上下文学习效果,其通过提升可靠性、改进失败归因和降低数据依赖性,克服了基于性能评估方法的局限性。
English Summary: The Learning-to-Context Slope (LCS) is introduced as a novel metric to reliably quantify in-context learning effectiveness by modeling the relationship between learning gain and contextual relevance, overcoming limitations of performance-based evaluations through improved reliability, better failure attribution, and reduced data dependency.
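Under this reading, the slope itself reduces to a least-squares fit of learning gain against contextual relevance; a sketch follows, with the caveat that how the paper actually measures gain and relevance is not reproduced here.

```python
import numpy as np

def lcs(learning_gains, contextual_relevances):
    """Learning-to-Context Slope: least-squares slope of learning gain
    (loss drop from adding demonstrations) against demonstration-input
    relevance. A steeper positive slope suggests ICL helps this model/task.
    """
    x = np.asarray(contextual_relevances, dtype=float)
    y = np.asarray(learning_gains, dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

if __name__ == "__main__":
    relevance = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
    gain = np.array([0.02, 0.10, 0.22, 0.33, 0.41])  # loss decrease per example
    print(round(lcs(gain, relevance), 3))  # ~0.5: ICL likely effective here
```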
Authors:Leander Melroy Maben, Gayathri Ganesh Lakshmy, Srijith Radhakrishnan, Siddhant Arora, Shinji Watanabe
Abstract:
Despite advances in language and speech technologies, no open-source system enables full speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning. We introduce AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant capable of completing complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. AURA combines open-weight ASR, TTS, and LLMs in a cascaded pipeline and supports tools such as calendar booking, contact lookup, web search, and email. Its modular design allows easy integration of new tools using natural language prompts and action classes. On VoiceBench, AURA scores 92.75% on OpenBookQA (outperforming all open-weight systems and nearing GPT-4o) and 4.39 on AlpacaEval, competitive with other open-weight systems. Human evaluation shows 90% task success on complex, multi-turn speech tasks.
中文: AURA是首个开源的语音原生助手,支持多轮对话和工具集成,在基准测试和人工评估中表现出色。
English: AURA is the first open-source, speech-native assistant that enables multi-turn dialogue with integrated tool use and agentic reasoning, achieving high performance on benchmarks and human evaluations.
Authors:Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Arpit Sahni, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
Abstract:
Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures have increased the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, featuring superior results compared to prior multistage frameworks. On ImageNet-1k 512px, DFM achieves a 35.2% improvement in FDD scores over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to finetuning of large models, such as FLUX, DFM shows faster convergence to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.
中文: 可分解流匹配(DFM)是一种简单有效的框架,在用户定义的多尺度表示的每个层级上独立应用流匹配,通过架构简洁性和最小修改显著提升了图像和视频的视觉质量。
English: Decomposable Flow Matching (DFM) is a simple yet effective framework that independently applies Flow Matching at each level of a multi-scale representation, improving visual quality for images and videos with architectural simplicity and minimal modifications.
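A sketch of the core recipe, assuming a Laplacian pyramid decomposition and standard linear-path Flow Matching applied independently per level; the `model(xt, t, level)` conditioning signature is hypothetical.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x, levels=3):
    """Decompose an image batch (B, C, H, W) into Laplacian pyramid bands."""
    bands, cur = [], x
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, scale_factor=2, mode="bilinear",
                           align_corners=False)
        bands.append(cur - up)        # high-frequency residual at this scale
        cur = down
    bands.append(cur)                 # coarsest low-frequency band
    return bands

def dfm_loss(model, x1, levels=3):
    """Flow Matching per pyramid level: sample noise and a time t per band,
    then regress the straight-path velocity (band - noise)."""
    loss = 0.0
    for lvl, band in enumerate(laplacian_pyramid(x1, levels)):
        x0 = torch.randn_like(band)
        t = torch.rand(band.shape[0], 1, 1, 1, device=band.device)
        xt = (1 - t) * x0 + t * band  # linear interpolation path
        v_target = band - x0          # constant velocity along that path
        loss = loss + F.mse_loss(model(xt, t, lvl), v_target)
    return loss / levels

if __name__ == "__main__":
    dummy = lambda xt, t, lvl: torch.zeros_like(xt)  # stand-in velocity net
    print(dfm_loss(dummy, torch.randn(2, 3, 32, 32)).item())
```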
Authors:Yubo Huang, Weiqiang Wang, Sirui Zhao, Tong Xu, Lin Liu, Enhong Chen
Abstract:
Recent years have witnessed remarkable advances in audio-driven talking head generation. However, existing approaches predominantly focus on single-character scenarios. While some methods can create separate conversation videos between two individuals, the critical challenge of generating unified conversation videos with multiple physically co-present characters sharing the same spatial environment remains largely unaddressed. This setting presents two key challenges: audio-to-character correspondence control and the lack of suitable datasets featuring multi-character talking videos within the same scene. To address these challenges, we introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose (1) A novel framework incorporating a fine-grained Embedding Router that binds `who' and `speak what' together to address the audio-to-character correspondence control. (2) Two methods for implementing a 3D-mask embedding router that enables frame-wise, fine-grained control of individual characters, with distinct loss functions based on observed geometric priors and a mask refinement strategy to enhance the accuracy and temporal smoothness of the predicted masks. (3) The first dataset, to the best of our knowledge, specifically constructed for multi-talking-character video generation, and accompanied by an open-source data processing pipeline, and (4) A benchmark for the dual-talking-characters video generation, with extensive experiments demonstrating superior performance over multiple state-of-the-art methods.
Chinese Summary: 近年来音频驱动说话人生成技术进展显著,但现有方法主要针对单角色场景,多角色同场景对话视频生成面临音频-角色对应控制及数据集缺乏两大挑战,为此研究团队提出了基于MM-DiT的Bind-Your-Avatar模型及配套数据集解决方案。
English Summary: Recent advances in audio-driven talking head generation have focused on single-character scenarios, with the challenge of creating unified conversation videos featuring multiple co-present characters in the same environment remaining largely unaddressed due to audio-character correspondence control and dataset limitations.
Authors:Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Peter Wonka
Abstract:
We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
Chinese Summary: 本文提出了一种创新的融合4D生成框架,通过统一层内的时空注意力机制并结合高斯粒子增强的3D重建算法,在视频质量和重建能力两方面均实现了最先进的性能突破。
English Summary: This paper introduces a novel fused 4D generation framework that integrates spatial and temporal attention within unified layers and enhances 3D reconstruction with Gaussian particles, achieving state-of-the-art performance in both video quality and reconstruction accuracy.
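The sparse attention pattern can be expressed as a boolean mask over token metadata; a sketch assuming each token carries a (view, time) pair, so same-frame attention is subsumed by the view/time checks.

```python
import numpy as np

def sparse_4d_mask(views, times):
    """Boolean attention mask for 4D tokens: token i may attend to token j
    iff they share a frame (same view AND time), the same timestamp, or
    the same viewpoint. The token metadata layout is an assumption.
    """
    v = np.asarray(views)[:, None]
    t = np.asarray(times)[:, None]
    same_view = v == v.T
    same_time = t == t.T
    return same_view | same_time      # same frame is implied by either

if __name__ == "__main__":
    # 2 views x 3 timestamps = 6 tokens, one token per (view, time) cell.
    views = [0, 0, 0, 1, 1, 1]
    times = [0, 1, 2, 0, 1, 2]
    mask = sparse_4d_mask(views, times)
    print(mask.sum(), "of", mask.size, "entries attend")  # 24 of 36
```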
Authors:Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, Chuan Guo
Abstract:
We present DuetGen, a novel framework for generating interactive two-person dances from music. The key challenge of this task lies in the inherent complexities of two-person dance interactions, where the partners need to synchronize both with each other and with the music. Inspired by the recent advances in motion synthesis, we propose a two-stage solution: encoding two-person motions into discrete tokens and then generating these tokens from music. To effectively capture intricate interactions, we represent both dancers' motions as a unified whole to learn the necessary motion tokens, and adopt a coarse-to-fine learning strategy in both the stages. Our first stage utilizes a VQ-VAE that hierarchically separates high-level semantic features at a coarse temporal resolution from low-level details at a finer resolution, producing two discrete token sequences at different abstraction levels. Subsequently, in the second stage, two generative masked transformers learn to map music signals to these dance tokens: the first producing high-level semantic tokens, and the second, conditioned on music and these semantic tokens, producing the low-level tokens. We train both transformers to learn to predict randomly masked tokens within the sequence, enabling them to iteratively generate motion tokens by filling an empty token sequence during inference. Through the hierarchical masked modeling and dedicated interaction representation, DuetGen achieves the generation of synchronized and interactive two-person dances across various genres. Extensive experiments and user studies on a benchmark duet dance dataset demonstrate state-of-the-art performance of DuetGen in motion realism, music-dance alignment, and partner coordination.
Chinese: DuetGen是一种新颖的框架,通过分层运动标记化和掩码变换器的两阶段方法,从音乐生成同步的双人舞蹈,在动作真实性、音乐对齐和舞伴协调方面实现了最先进的性能。
English: DuetGen is a novel framework that generates synchronized two-person dances from music using a two-stage process involving hierarchical motion tokenization and masked transformers, achieving state-of-the-art performance in realism, music alignment, and partner coordination.
Authors:Kangqi Chen, Andreas Kosmas Kakolyris, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park, Mohammad Sadrosadati, Onur Mutlu
Abstract:
Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. To overcome this issue, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes a significant bottleneck in inference pipelines. In this stage, a user query is mapped to an embedding vector and an Approximate Nearest Neighbor Search (ANNS) algorithm searches for similar vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS by performing computations inside storage. However, existing works that leverage ISP for ANNS (i) employ algorithms that are not tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications, limiting performance and hindering their adoption. We propose REIS, the first ISP system tailored for RAG that addresses these limitations with three key mechanisms. First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored data placement technique that distributes embeddings across the planes of the storage system and employs a lightweight Flash Translation Layer. Third, REIS leverages an ANNS engine that uses the existing computational resources inside the storage system. Compared to a server-grade system, REIS improves the performance (energy efficiency) of retrieval by an average of 13x (55x).
中文: 大型语言模型的知识受限于训练数据,检索增强生成通过外部知识库弥补这一不足,但其检索阶段因数据移动成为瓶颈;REIS系统针对性地优化存储内处理,通过高效数据布局和轻量级算法,大幅提升了检索性能和能效。
English: Large Language Models (LLMs) are limited by static training data, but Retrieval-Augmented Generation (RAG) enhances them with external knowledge, though its retrieval stage faces bottlenecks due to data movement; the proposed REIS system overcomes these by optimizing in-storage processing for efficient retrieval and ANNS, boosting performance and energy efficiency significantly.
Authors:Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Abstract:
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
中文: 提出的Ordered CommonGen基准评估大语言模型遵循概念顺序指令和组合泛化的能力,发现尽管模型能理解指令,但仍存在顺序偏见且仅实现约75%的顺序覆盖率,表明其能力亟需提升。
English: The proposed Ordered CommonGen benchmark evaluates LLMs' ability to follow concept order instructions and generalize compositionally, revealing that despite understanding instructions, models exhibit order biases and achieve only 75% ordered coverage, indicating significant room for improvement.
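Ordered coverage admits a simple greedy reading: scan the output left to right and count how many concepts are matched in the prescribed order. A sketch of that metric follows; the benchmark's exact matching rules (e.g., handling of morphological variants) are not reproduced.

```python
def ordered_coverage(sentence: str, concepts: list[str]) -> float:
    """Fraction of concepts that appear in the sentence in the specified
    order, using greedy left-to-right matching."""
    words = sentence.lower().split()
    pos, hits = 0, 0
    for concept in concepts:
        try:
            pos = words.index(concept.lower(), pos) + 1
            hits += 1
        except ValueError:
            continue
    return hits / len(concepts)

if __name__ == "__main__":
    print(ordered_coverage("the dog chased a ball in the park",
                           ["dog", "ball", "park"]))   # 1.0
    print(ordered_coverage("a ball hit the dog in the park",
                           ["dog", "ball", "park"]))   # ~0.67 (ball out of order)
```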
Authors:Xiangxiang Cui, Shu Yang, Tianjin Huang, Wanyu Lin, Lijie Hu, Di Wang
Abstract:
Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.
中文摘要:本研究探讨大型语言模型在错误信息被纠正时如何表达遗憾,提出新方法识别遗憾相关神经元并分析其激活模式,以提升模型可靠性。
English Summary: This study explores how large language models express regret when their misinformation is corrected, proposing new methods to identify regret-related neurons and analyze their activation patterns to enhance model reliability.
Authors:Eyal German, Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici
Abstract:
Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner's consent. Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging. Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. However, existing methods often lack stealth, making them relatively easy to detect and remove. In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words. Our method aims to enhance an LLM's memorization capabilities on the watermarked text without altering the semantic integrity of the text. As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection. We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B). Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios. The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method's effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training.
中文:LexiMark提出了一种新颖的水印技术,通过替换高熵词汇的同义词来增强大语言模型对水印文本的记忆能力,同时保持语义完整性,使其具有隐蔽性且难以检测和移除,经多个模型和训练场景验证,AUROC分数显著提升。
English: LexiMark introduces a novel watermarking technique using synonym substitutions for high-entropy words to enhance LLM memorization of watermarked text while preserving semantic integrity, making it stealthy and resistant to detection and removal, as validated by improved AUROC scores across multiple models and training settings.
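A sketch of the idea: rank words by surprisal (here, a simple unigram estimate) and swap in synonyms for the top-scoring ones. Both the entropy estimate and the `synonyms` lookup are stand-in assumptions for the paper's actual components.

```python
import math
from collections import Counter

def watermark(text: str, synonyms: dict[str, str], top_k: int = 3) -> str:
    """Substitute synonyms for the top-k highest-surprisal words.
    Surprisal is -log p(word) under the document's own unigram counts."""
    words = text.split()
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    surprisal = {w: -math.log(counts[w.lower()] / total) for w in words}
    # Rank candidate words that actually have a synonym available.
    ranked = sorted((w for w in set(words) if w.lower() in synonyms),
                    key=lambda w: surprisal[w], reverse=True)[:top_k]
    chosen = {w.lower() for w in ranked}
    return " ".join(synonyms[w.lower()] if w.lower() in chosen else w
                    for w in words)

if __name__ == "__main__":
    syn = {"rapid": "swift", "vital": "crucial", "shows": "demonstrates"}
    print(watermark("the rapid model shows vital gains the model shows", syn))
```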
Authors:Xinglong Mao, Shifeng Liu, Sirui Zhao, Tong Xu, Hanchao Wang, Baozhi Jia, Enhong Chen
Abstract:
Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, offering valuable insights for psychological assessment and criminal investigations. Despite significant progress in automatic ME recognition (MER), existing methods still struggle to simultaneously capture localized muscle activations and global facial dependencies, both essential for decoding subtle emotional cues. To address this challenge, we propose MERba, a hierarchical multi-receptive field architecture specially designed for MER, which incorporates a series of Local-Global Feature Integration stages. Within each stage, detailed intra-window motion patterns are captured using MERba Local Extractors, which integrate MambaVision Mixers with a tailored asymmetric multi-scanning strategy to enhance local spatial sensitivity. These localized features are then aggregated through lightweight self-attention layers that explicitly model inter-window relationships, enabling effective global context construction. Furthermore, to mitigate the challenge of high inter-class similarity among negative MEs, we introduce a Dual-Granularity Classification Module that decomposes the recognition task into a coarse-to-fine paradigm. Extensive experiments on three benchmark datasets demonstrate that MERba consistently outperforms existing methods, with ablation studies confirming the effectiveness of each proposed component.
中文:MERba是一种用于微表情识别的分层架构,通过多感受野处理和双粒度分类模块整合局部与全局面部特征,在基准数据集上实现了优越性能。
English: MERba is a hierarchical architecture for micro-expression recognition that integrates local and global facial features through multi-receptive field processing and a dual-granularity classification module, achieving superior performance on benchmark datasets.
Authors:Omri Haller, Yair Meidan, Dudu Mimran, Yuval Elovici, Asaf Shabtai
Abstract:
Following recent advancements in large language models (LLMs), LLM-based chatbots have transformed customer support by automating interactions and providing consistent, scalable service. While LLM-based conversational recommender systems (CRSs) have attracted attention for their ability to enhance the quality of recommendations, limited research has addressed the implicit integration of recommendations within customer support interactions. In this work, we introduce ImpReSS, an implicit recommender system designed for customer support conversations. ImpReSS operates alongside existing support chatbots, where users report issues and chatbots provide solutions. Based on a customer support conversation, ImpReSS identifies opportunities to recommend relevant solution product categories (SPCs) that help resolve the issue or prevent its recurrence, thereby also supporting business growth. Unlike traditional CRSs, ImpReSS functions entirely implicitly and does not rely on any assumption of a user's purchasing intent. Our empirical evaluation of ImpReSS's ability to recommend relevant SPCs that can help address issues raised in support conversations shows promising results, including an MRR@1 (and recall@3) of 0.72 (0.89) for general problem solving, 0.82 (0.83) for information security support, and 0.85 (0.67) for cybersecurity troubleshooting. To support future research, our data and code will be shared upon request.
中文: ImpReSS是一种隐式推荐系统,可在客户支持对话中自动推荐相关解决方案产品类别,无需预设用户购买意图即可提升问题解决效率并促进业务增长,在多个支持领域均展现出优异性能。
English: ImpReSS is an implicit recommender system that integrates with customer support chatbots to suggest relevant solution product categories during conversations, enhancing issue resolution and business growth without assuming user purchase intent, and demonstrates strong performance across various support domains.
Authors:Shahaf David, Yair Meidan, Ido Hersko, Daniel Varnovitzky, Dudu Mimran, Yuval Elovici, Asaf Shabtai
Abstract:
Despite significant advancements in conversational AI, large language model (LLM)-powered chatbots often struggle with personalizing their responses according to individual user characteristics, such as technical expertise, learning style, and communication preferences. This lack of personalization is particularly problematic in specialized knowledge-intensive domains like IT/cybersecurity (ITSec), where user knowledge levels vary widely. Existing approaches for chatbot personalization primarily rely on static user categories or explicit self-reported information, limiting their adaptability to an evolving perception of the user's proficiency, obtained in the course of ongoing interactions. In this paper, we propose ProfiLLM, a novel framework for implicit and dynamic user profiling through chatbot interactions. This framework consists of a taxonomy that can be adapted for use in diverse domains and an LLM-based method for user profiling in terms of the taxonomy. To demonstrate ProfiLLM's effectiveness, we apply it in the ITSec domain where troubleshooting interactions are used to infer chatbot users' technical proficiency. Specifically, we developed ProfiLLM[ITSec], an ITSec-adapted variant of ProfiLLM, and evaluated its performance on 1,760 human-like chatbot conversations from 263 synthetic users. Results show that ProfiLLM[ITSec] rapidly and accurately infers ITSec profiles, reducing the gap between actual and predicted scores by up to 55-65% after a single prompt, followed by minor fluctuations and further refinement. In addition to evaluating our new implicit and dynamic profiling framework, we also propose an LLM-based persona simulation methodology, a structured taxonomy for ITSec proficiency, our codebase, and a dataset of chatbot interactions to support future research.
Chinese: 尽管对话式人工智能取得了显著进展,但基于大语言模型的聊天机器人往往难以根据用户的技术专长、学习风格等特征进行个性化响应,为此提出的ProfiLLM框架通过交互动态推断用户画像,在IT安全领域应用中有效缩小了预测与实际能力的差距。
English: Despite advancements in conversational AI, chatbots often fail to personalize responses for individual user traits, particularly in specialized fields like IT security, leading to the development of ProfiLLM, a framework that dynamically infers user proficiency through interactions and significantly improves accuracy in predicting user profiles.
Authors:Qiming Ge, Shuhao Xing, Songyang Gao, Yunhua Zhou, Yicheng Zou, Songyang Zhang, Zhi Chen, Hang Yan, Qi Zhang, Qipeng Guo, Kai Chen
Abstract:
Scaling law builds the relationship between training computation and validation loss, enabling researchers to effectively predict the loss trend of models across different levels of computation. However, a gap still remains between validation loss and the model's downstream capabilities, making it non-trivial to apply scaling laws to direct performance prediction for downstream tasks. The loss typically represents a cumulative penalty for predicted tokens, which are implicitly considered to have equal importance. Nevertheless, our studies have shown evidence that when considering different training data distributions, we cannot directly model the relationship between downstream capability and computation or token loss. To bridge the gap between validation loss and downstream task capabilities, in this work, we introduce Capability Salience Vector, which decomposes the overall loss and assigns different importance weights to tokens to assess a specific meta-capability, aligning the validation loss with downstream task performance in terms of the model's capabilities. Experiments on various popular benchmarks demonstrate that our proposed Capability Salience Vector could significantly improve the predictability of language model performance on downstream tasks.
中文摘要:本文提出的能力显著向量通过分解总体损失并为不同标记分配重要性权重,有效弥合了验证损失与下游任务能力之间的差距,显著提升了语言模型性能的可预测性。
English Summary: The Capability Salience Vector is introduced to bridge the gap between validation loss and downstream task performance by decomposing overall loss and assigning importance weights to tokens, significantly improving language model performance predictability.
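One way to read the Capability Salience Vector is as a weight vector over per-token losses, fit so the reweighted loss tracks a downstream score; the ridge least-squares fit and nonnegativity clipping below are assumptions standing in for the paper's method.

```python
import numpy as np

def capability_loss(token_losses: np.ndarray, salience: np.ndarray) -> float:
    """Salience-weighted validation loss: tokens diagnostic of a
    meta-capability count more (assumes salience has nonzero sum)."""
    w = salience / salience.sum()
    return float((w * token_losses).sum())

def fit_salience(loss_matrix: np.ndarray, downstream: np.ndarray) -> np.ndarray:
    """Fit nonnegative salience weights: loss_matrix is (models, tokens),
    downstream is (models,) task scores. Ridge fit clipped at zero."""
    a = loss_matrix.T @ loss_matrix + 1e-3 * np.eye(loss_matrix.shape[1])
    w = np.linalg.solve(a, loss_matrix.T @ downstream)
    return np.clip(w, 0.0, None)

if __name__ == "__main__":
    losses = np.abs(np.random.randn(6, 10))   # 6 checkpoints, 10 token bins
    task_error = losses[:, :3].mean(axis=1)   # capability tied to 3 bins
    w = fit_salience(losses, task_error)
    print(capability_loss(losses[0], w))      # weighted loss tracks the task
```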
Authors:Ismail Emir Yuksel, Akash Sood, Ataberk Olgun, Oğuzhan Canpolat, Haocong Luo, F. Nisa Bostancı, Mohammad Sadrosadati, A. Giray Yağlıkçı, Onur Mutlu
Abstract:
Processing-using-DRAM (PuD) is a promising paradigm for alleviating the data movement bottleneck using DRAM's massive internal parallelism and bandwidth to execute very wide operations. Performing a PuD operation involves activating multiple DRAM rows in quick succession or simultaneously, i.e., multiple-row activation. Multiple-row activation is fundamentally different from conventional memory access patterns that activate one DRAM row at a time. However, repeatedly activating even one DRAM row (e.g., RowHammer) can induce bitflips in unaccessed DRAM rows because modern DRAM is subject to read disturbance. Unfortunately, no prior work investigates the effects of multiple-row activation on DRAM read disturbance.
In this paper, we present the first characterization study of read disturbance effects of multiple-row activation-based PuD (which we call PuDHammer) using 316 real DDR4 DRAM chips from four major DRAM manufacturers. Our detailed characterization shows that 1) PuDHammer significantly exacerbates the read disturbance vulnerability, causing up to 158.58x reduction in the minimum hammer count required to induce the first bitflip (HC_first), compared to RowHammer, 2) PuDHammer is affected by various operational conditions and parameters, 3) combining RowHammer with PuDHammer is more effective than using RowHammer alone to induce read disturbance error, e.g., doing so reduces HC_first by 1.66x on average, and 4) PuDHammer bypasses an in-DRAM RowHammer mitigation mechanism (Target Row Refresh) and induces more bitflips than RowHammer.
To develop future robust PuD-enabled systems in the presence of PuDHammer, we 1) develop three countermeasures and 2) adapt and evaluate the state-of-the-art RowHammer mitigation standardized by industry, called Per Row Activation Counting (PRAC). We show that the adapted PRAC incurs large performance overheads (48.26%, on average).
Chinese: PuDHammer作为一种利用DRAM处理的技术,通过大幅降低引发位翻转所需的锤击次数并绕过现有RowHammer防护机制,显著加剧了读取干扰漏洞,亟需开发新的应对策略以保障系统稳定性。
English: PuDHammer, a processing-using-DRAM technique, significantly worsens read disturbance vulnerability by reducing the hammer count needed to cause bitflips and bypassing existing RowHammer mitigations, requiring new countermeasures to ensure system robustness.
Authors:Yucheng Li, Surin Ahn, Huiqiang Jiang, Amir H. Abdi, Yuqing Yang, Lili Qiu
Abstract:
Large language models (LLMs) have achieved widespread adoption across numerous applications. However, many LLMs are vulnerable to malicious attacks even after safety alignment. These attacks typically bypass LLMs' safety guardrails by wrapping the original malicious instructions inside adversarial jailbreaks prompts. Previous research has proposed methods such as adversarial training and prompt rephrasing to mitigate these safety vulnerabilities, but these methods often reduce the utility of LLMs or lead to significant computational overhead and online latency. In this paper, we propose SecurityLingua, an effective and efficient approach to defend LLMs against jailbreak attacks via security-oriented prompt compression. Specifically, we train a prompt compressor designed to discern the "true intention" of the input prompt, with a particular focus on detecting the malicious intentions of adversarial prompts. Then, in addition to the original prompt, the intention is passed via the system prompt to the target LLM to help it identify the true intention of the request. SecurityLingua ensures a consistent user experience by leaving the original input prompt intact while revealing the user's potentially malicious intention and stimulating the built-in safety guardrails of the LLM. Moreover, thanks to prompt compression, SecurityLingua incurs only a negligible overhead and extra token cost compared to all existing defense methods, making it an especially practical solution for LLM defense. Experimental results demonstrate that SecurityLingua can effectively defend against malicious attacks and maintain utility of the LLM with negligible compute and latency overhead. Our code is available at https://aka.ms/SecurityLingua.
中文: SecurityLingua通过压缩提示词检测恶意意图来防御大语言模型的越狱攻击,在保持实用性的同时仅产生可忽略的开销。
English: SecurityLingua defends LLMs against jailbreak attacks by compressing prompts to detect malicious intent, maintaining utility with minimal overhead.
Authors:Melina Soysal, Konstantina Koliogeorgi, Can Firtina, Nika Mansouri Ghiasi, Rakesh Nadig, Haiyu Mao, Geraldo F. Oliveira, Yu Liang, Klea Zambaku, Mohammad Sadrosadati, Onur Mutlu
Abstract:
Raw signal genome analysis (RSGA) has emerged as a promising approach to enable real-time genome analysis by directly analyzing raw electrical signals. However, rapid advancements in sequencing technologies make it increasingly difficult for software-based RSGA to match the throughput of raw signal generation. This paper demonstrates that while hardware acceleration techniques can significantly accelerate RSGA, the high volume of genomic data shifts the performance and energy bottleneck from computation to I/O data movement. As sequencing throughput increases, I/O overhead becomes the main contributor to both runtime and energy consumption. Therefore, there is a need to design a high-performance, energy-efficient system for RSGA that can both alleviate the data movement bottleneck and provide large acceleration capabilities. We propose MARS, a storage-centric system that leverages the heterogeneous resources within modern storage systems (e.g., storage-internal DRAM, storage controller, flash chips) alongside their large storage capacity to tackle both data movement and computational overheads of RSGA in an area-efficient and low-cost manner. MARS accelerates RSGA through a novel hardware/software co-design approach. First, MARS modifies the RSGA pipeline via two filtering mechanisms and a quantization scheme, reducing hardware demands and optimizing for in-storage execution. Second, MARS accelerates the RSGA steps directly within the storage by leveraging both Processing-Near-Memory and Processing-Using-Memory paradigms. Third, MARS orchestrates the execution of all steps to fully exploit in-storage parallelism and minimize data movement. Our evaluation shows that MARS outperforms basecalling-based software and hardware-accelerated state-of-the-art read mapping pipelines by 93x and 40x, on average across different datasets, while reducing their energy consumption by 427x and 72x.
Chinese: MARS是一种以存储为中心的系统,通过利用近存储处理和软硬件协同设计解决原始信号基因组分析中的I/O瓶颈,相比现有方法实现了显著的加速和能效提升。
English: MARS is a storage-centric system that addresses the I/O bottleneck in raw signal genome analysis by leveraging in-storage processing and hardware/software co-design, achieving significant speedup and energy efficiency improvements over existing methods.
Authors:Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe
Abstract:
Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on text-only data, which presents challenges to adapt them to speech modality with limited speech-to-speech data. To address the training difficulty, we propose scheduled interleaved speech-text training in this study. We use interleaved speech-text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves the translation performances, especially for languages with limited training data.
中文: 本研究提出了一种预定交错语音-文本训练方法,通过在训练中逐步减少文本比例来解决语音翻译中的模态适应问题,尤其提升了低资源语言的翻译性能。
English: This study proposes a scheduled interleaved speech-text training method to address modality adaptation challenges in speech-to-speech translation by gradually reducing text ratios during training, which improves performance especially for low-resource languages.
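A minimal sketch of the schedule and word-level interleaving, assuming word-aligned speech units and text tokens are given; the linear decay shape is an assumption.

```python
import random

def interleave(speech_units, text_tokens, text_ratio):
    """Word-level interleaving: for each aligned word, keep its text tokens
    with probability text_ratio, otherwise use its speech units."""
    out = []
    for speech, text in zip(speech_units, text_tokens):
        out.extend(text if random.random() < text_ratio else speech)
    return out

def text_ratio_schedule(step, total_steps, start=0.9, end=0.0):
    """Linearly decay the text ratio so training shifts from mostly text
    to pure speech units."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

if __name__ == "__main__":
    speech = [["s1", "s2"], ["s3"], ["s4", "s5"]]   # speech units per word
    text = [["hello"], ["brave"], ["world"]]        # text tokens per word
    for step in (0, 500, 1000):
        r = text_ratio_schedule(step, 1000)
        print(step, round(r, 2), interleave(speech, text, r))
```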
Authors:Hanbing Liu, Lang Cao, Yuanyi Ren, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang
Abstract:
Large language models have demonstrated impressive reasoning capabilities, yet they often suffer from inefficiencies due to unnecessarily verbose or redundant outputs. While many works have explored reinforcement learning (RL) to enhance reasoning abilities, most primarily focus on improving accuracy, with limited attention to reasoning efficiency. Some existing approaches introduce direct length-based rewards to encourage brevity, but this often leads to noticeable drops in accuracy. In this paper, we propose Bingo, an RL framework that advances length-based reward design to boost efficient reasoning. Bingo incorporates two key mechanisms: a significance-aware length reward, which gradually guides the model to reduce only insignificant tokens, and a dynamic length reward, which initially encourages elaborate reasoning for hard questions but decays over time to improve overall efficiency. Experiments across multiple reasoning benchmarks show that Bingo improves both accuracy and efficiency. It outperforms the vanilla reward and several other length-based reward baselines in RL, achieving a favorable trade-off between accuracy and efficiency. These results underscore the potential of training LLMs explicitly for efficient reasoning.
Chinese: Bingo是一个强化学习框架,通过引入显著性感知和动态长度奖励机制,在减少不必要标记的同时保持或提高准确性,从而提升大型语言模型的推理效率。
English: Bingo is an RL framework that enhances reasoning efficiency in large language models by using significance-aware and dynamic length rewards to reduce unnecessary tokens while maintaining or improving accuracy.
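The abstract does not give Bingo's exact reward formulas, so the following Python sketch is only a schematic of the two mechanisms it names; `alpha`, `beta`, the 1000-token scale, and the token counts are invented placeholders.

```python
def length_aware_reward(correct, num_tokens, num_insignificant_tokens,
                        is_hard, step, total_steps, alpha=0.1, beta=0.05):
    """Schematic of Bingo-style length rewards.

    Significance-aware term: penalize only tokens judged insignificant,
    so necessary reasoning steps are not discouraged.
    Dynamic term: reward elaborate reasoning on hard questions early in
    training, with the bonus decaying toward zero for efficiency later.
    """
    reward = 1.0 if correct else 0.0
    # penalize only the insignificant fraction of the output
    reward -= alpha * num_insignificant_tokens / max(num_tokens, 1)
    # decaying bonus for long reasoning on hard questions
    decay = 1.0 - min(step / max(total_steps, 1), 1.0)
    if is_hard and correct:
        reward += beta * decay * min(num_tokens / 1000.0, 1.0)
    return reward
```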
Authors:Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Zhaokai Wang, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, Rongrong Ji
Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher spatial intelligence, MLLMs must integrate multiple spatial capabilities, even for simple, everyday tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150 hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, covering various evaluation settings such as point cloud input and multiple-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins. Through our careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that weak counting capability greatly limits the compositional spatial capabilities of existing MLLMs.
Chinese: 针对现有基准难以全面评估多模态大模型空间智能的问题,SpaCE-10基准通过定义10项原子空间能力和8项组合能力,构建了包含5,000余对问答数据的评估体系,发现现有模型在计数能力等关键方面仍远逊于人类表现。
English: To address the limitations of existing benchmarks in evaluating multimodal large language models' spatial intelligence, the SpaCE-10 benchmark introduces 10 atomic and 8 compositional spatial capabilities with over 5,000 QA pairs, revealing that current models significantly trail human performance, particularly in counting skills.
Authors:Liangliang You, Junchi Yao, Shu Yang, Guimin Hu, Lijie Hu, Di Wang
Abstract:
While multimodal large language models excel at various tasks, they still suffer from hallucinations, which limit their reliability and scalability for broader domain applications. To address this issue, recent research mainly focuses on objective hallucination. However, for sequential images, besides objective hallucination, there is also behavioral hallucination, which is less studied. This work aims to fill this gap. We first reveal that behavioral hallucinations mainly arise from two key factors: prior-driven bias and the snowball effect. Based on these observations, we introduce SHE (Sequence Hallucination Eradication), a lightweight, two-stage framework that (1) detects hallucinations via a visual-textual alignment check using our proposed adaptive temporal window and (2) mitigates them via orthogonal projection onto the joint embedding space. We also propose a new metric (BEACH) to quantify behavioral hallucination severity. Empirical results on standard benchmarks demonstrate that SHE reduces behavioral hallucination by over 10% on BEACH while maintaining descriptive accuracy.
Chinese: 本文提出SHE轻量级框架,通过检测和缓解由先验驱动偏差及雪球效应引发的序列图像行为幻觉,在BEACH新指标上降低超过10%,同时保持描述准确性。
English: This paper introduces SHE, a lightweight framework that detects and mitigates behavioral hallucinations in sequential images caused by prior-driven bias and the snowball effect, reducing them by over 10% on the new BEACH metric while preserving accuracy.
Authors:Wenrui Zhou, Shu Yang, Qingsong Yang, Zikun Guo, Lijie Hu, Di Wang
Abstract:
As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first dedicated benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE is the first to bring linguistic perspectives on sycophancy into the visual domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. In addition, we explore key-frame selection as an interpretable, training-free mitigation strategy, which reveals potential paths for reducing sycophantic bias by strengthening visual grounding.
Chinese: 视频大语言模型常出现迎合用户输入而忽视视觉证据的附和倾向,为此我们创建了VISE基准来评估该行为,并提出通过关键帧选择这一无需训练的方法来加强视觉基础以降低附和偏差。
English: Video-LLMs often exhibit sycophancy by prioritizing user input over visual evidence, so we developed the VISE benchmark to evaluate and mitigate this behavior through key-frame selection to enhance visual grounding.
Authors:Huanyi Xie, Lijie Hu, Lu Yu, Tianhao Huang, Longfei Li, Meng Li, Jun Zhou, Huan Wang, Di Wang
Abstract:
In the realm of Text-attributed Graphs (TAGs), traditional graph neural networks (GNNs) often fall short due to the complex textual information associated with each node. Recent methods have improved node representations by leveraging large language models (LLMs) to enhance node text features, but these approaches typically require extensive annotations or fine-tuning across all nodes, which is both time-consuming and costly. To overcome these challenges, we introduce GAGA, an efficient framework for TAG representation learning. GAGA reduces annotation time and cost by focusing on annotating only representative nodes and edges. It constructs an annotation graph that captures the topological relationships among these annotations. Furthermore, GAGA employs a two-level alignment module to effectively integrate the annotation graph with the TAG, aligning their underlying structures. Experiments show that GAGA achieves classification accuracies on par with or surpassing state-of-the-art methods while requiring only 1% of the data to be annotated, demonstrating its high efficiency.
Chinese: GAGA是一种高效的文本属性图学习框架,通过构建标注图并进行结构对齐,仅需1%的标注数据即可达到或超越最先进方法的分类准确率。
English: GAGA is an efficient framework for Text-attributed Graphs that achieves state-of-the-art classification accuracy with only 1% annotated data by constructing an annotation graph and employing structural alignment.
Authors:Xiao Wang, Mengjue Tan, Qiao Jin, Guangzhi Xiong, Yu Hu, Aidong Zhang, Zhiyong Lu, Minjia Zhang
Abstract:
Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce the first end-to-end framework that facilitates the design and evaluation of citation generation with LLMs for medical tasks. Meanwhile, we introduce a novel multi-pass retrieval-citation method that generates high-quality citations. Our evaluation highlights the challenges and opportunities of citation generation for medical tasks, while identifying important design choices that have a significant impact on the final citation quality. Our proposed method achieves superior citation precision and recall improvements compared to strong baseline methods, and we show that evaluation results correlate well with annotation results from professional experts.
Chinese: 本文提出了首个用于设计和评估医疗任务中基于大语言模型的引文生成的端到端框架,采用新颖的检索-引文方法,显著提高了引文的精确度和召回率,并与专家评估结果高度一致。
English: This paper introduces the first end-to-end framework for designing and evaluating LLM-based citation generation in medical tasks, featuring a novel retrieval-citation method that significantly improves precision and recall while aligning with expert annotations.
Authors:Lijie Hu, Songning Lai, Yuan Hua, Shu Yang, Jingfeng Zhang, Di Wang
Abstract:
Transparency is a paramount concern in the medical field, prompting researchers to delve into the realm of explainable AI (XAI). Among these XAI methods, Concept Bottleneck Models (CBMs), which have drawn much attention recently, aim to restrict the model's latent space to human-understandable high-level concepts by generating a conceptual layer for extracting conceptual features. However, existing methods rely solely on concept features to determine the model's predictions, overlooking the intrinsic feature embeddings within medical images. To address this utility gap between the original models and concept-based models, we propose the Vision Concept Transformer (VCT). Furthermore, despite their benefits, CBMs have been found to negatively impact model performance and fail to provide stable explanations when faced with input perturbations, which limits their application in the medical field. To address this faithfulness issue, this paper further proposes the Stable Vision Concept Transformer (SVCT) based on VCT, which leverages the vision transformer (ViT) as its backbone and incorporates a conceptual layer. SVCT employs conceptual features to enhance decision-making capabilities by fusing them with image features and ensures model faithfulness through the integration of Denoised Diffusion Smoothing. Comprehensive experiments on four medical datasets demonstrate that our VCT and SVCT maintain accuracy while remaining interpretable compared to baselines. Furthermore, even when subjected to perturbations, our SVCT model consistently provides faithful explanations, thus meeting the needs of the medical field.
Chinese: 本文提出视觉概念转换器(VCT)及其增强版稳定视觉概念转换器(SVCT),通过融合概念特征与图像嵌入特征解决现有概念瓶颈模型的效用缺陷,并利用去噪扩散平滑技术确保模型解释的稳定性,在医疗数据实验中同时实现了准确性与可解释性。
English: This paper introduces the Vision Concept Transformer (VCT) and its enhanced version, Stable Vision Concept Transformer (SVCT), which address limitations in existing Concept Bottleneck Models by integrating conceptual features with image embeddings and ensuring stable explanations through denoised diffusion smoothing, achieving both accuracy and interpretability in medical applications.
Authors:Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, Zhi-Qin John Xu
Abstract:
Loss spikes emerge commonly during training across neural networks of varying architectures and scales when using the Adam optimizer. In this work, we investigate the underlying mechanism responsible for Adam spikes. While previous explanations attribute these phenomena to the lower-loss-as-sharper characteristics of the loss landscape, our analysis reveals that Adam's adaptive preconditioners themselves can trigger spikes. Specifically, we identify a critical regime where squared gradients become substantially smaller than the second-order moment estimates, causing the latter to undergo a $\beta_2$-exponential decay and to respond sluggishly to current gradient information. This mechanism can push the maximum eigenvalue of the preconditioned Hessian beyond the classical stability threshold $2/\eta$ for a sustained period, inducing instability. This instability further leads to an alignment between the gradient and the maximum eigendirection, and a loss spike occurs precisely when the gradient-directional curvature exceeds $2/\eta$. We verify this mechanism through extensive experiments on fully connected networks, convolutional networks, and Transformer architectures.
Chinese: Adam优化器训练中的损失尖峰源于其自适应预条件子在平方梯度显著小于二阶矩估计时引发的不稳定性,导致预条件Hessian矩阵的最大特征值突破经典稳定阈值而引发的梯度与最大特征方向对齐现象。
English: Loss spikes during Adam optimization are caused by the optimizer's adaptive preconditioners triggering instability when squared gradients fall significantly below second-moment estimates, pushing the preconditioned Hessian's maximum eigenvalue beyond stability thresholds.
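The $\beta_2$-exponential decay regime is easy to reproduce numerically. In the toy Python snippet below (all constants are illustrative), the second-moment estimate $v$ was built up by earlier large gradients; once squared gradients drop far below it, $v$ decays only as $\beta_2^k$, and the magnitude of Adam's preconditioned gradient keeps growing for thousands of steps.

```python
import math

beta2, eps = 0.999, 1e-8
v = 1.0    # second-moment estimate left over from earlier large gradients
g = 1e-3   # current gradients are suddenly small: g^2 = 1e-6 << v

for k in range(5001):
    v = beta2 * v + (1 - beta2) * g * g   # v decays roughly as beta2**k
    if k % 1000 == 0:
        precond = g / (math.sqrt(v) + eps)  # preconditioned gradient size
        print(f"step {k:5d}  v={v:.3e}  |g|/sqrt(v)={precond:.4f}")
```

With $\beta_2 = 0.999$, the estimate needs on the order of $1/(1-\beta_2) \approx 1000$ steps to adapt, which is the sustained window during which the preconditioned curvature can stay above the $2/\eta$ threshold.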
Authors:Zhiyuan Ma, Jiayu Liu, Xianzhen Luo, Zhenya Huang, Qingfu Zhu, Wanxiang Che
Abstract:
Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic "Error -> Reflection -> Correction" learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by finetuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM's 9.1%.
Chinese: 该摘要提出Tool-MVR模型,通过多智能体验证构建高质量工具指令数据集和探索式反思学习强化错误修正能力,显著提升大语言模型的工具调用可靠性与反思效能,在基准测试中表现卓越并降低资源消耗。
English: This abstract introduces Tool-MVR, a tool-augmented large language model that enhances AI agents' problem-solving by improving tool planning through high-quality dataset validation and boosting error correction via dynamic reflection learning, achieving state-of-the-art performance and efficiency.
Authors:Zhizheng Wang, Chi-Ping Day, Chih-Hsuan Wei, Qiao Jin, Robert Leaman, Yifan Yang, Shubo Tian, Aodong Qiu, Yin Fang, Qingqing Zhu, Xinghua Lu, Zhiyong Lu
Abstract:
Gene set analysis (GSA) is a foundational approach for interpreting genomic data of diseases by linking genes to biological processes. However, conventional GSA methods overlook clinical context of the analyses, often generating long lists of enriched pathways with redundant, nonspecific, or irrelevant results. Interpreting these requires extensive, ad-hoc manual effort, reducing both reliability and reproducibility. To address this limitation, we introduce cGSA, a novel AI-driven framework that enhances GSA by incorporating context-aware pathway prioritization. cGSA integrates gene cluster detection, enrichment analysis, and large language models to identify pathways that are not only statistically significant but also biologically meaningful. Benchmarking on 102 manually curated gene sets across 19 diseases and ten disease-related biological mechanisms shows that cGSA outperforms baseline methods by over 30%, with expert validation confirming its increased precision and interpretability. Two independent case studies in melanoma and breast cancer further demonstrate its potential to uncover context-specific insights and support targeted hypothesis generation.
Chinese: cGSA框架通过整合人工智能和情境感知的优先级排序,克服了传统基因集分析的局限,在多种疾病中显著提升了通路相关性和可解释性。
English: The cGSA framework overcomes limitations of traditional gene set analysis by integrating AI and context-aware prioritization, significantly improving pathway relevance and interpretability across multiple diseases.
Authors:Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
Abstract:
Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
Chinese: DenseDPO通过从损坏的真实视频生成对齐的视频对来消除运动偏差,并利用片段级偏好提供更密集的训练信号,从而在仅需三分之一标注数据的情况下显著提升文本到视频模型的运动生成质量,同时还能借助视觉语言模型实现自动标注。
English: DenseDPO enhances Direct Preference Optimization for text-to-video models by generating aligned video pairs from corrupted ground truth clips to eliminate motion bias and using segment-level preferences for denser training signals, achieving superior motion generation with fewer labels and even enabling automatic annotation via Vision Language Models.
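Because the two videos in each DenseDPO pair are temporally aligned, segment-level labeling reduces to slicing both on a common grid and querying a judge per segment. The Python sketch below is a generic reading of that protocol, not the paper's implementation; `prefer_fn` stands in for a human annotator or a VLM judge, and the fixed-length segmentation is an assumption.

```python
def build_segment_pairs(video_a, video_b, segment_len, prefer_fn):
    """Split two temporally aligned videos into segments and label each.

    video_a, video_b: aligned lists of frames of equal length.
    prefer_fn(seg_a, seg_b) -> 'a', 'b', or None (tie / skip).
    Returns (chosen, rejected) segment pairs for preference training.
    """
    pairs = []
    for start in range(0, min(len(video_a), len(video_b)), segment_len):
        seg_a = video_a[start:start + segment_len]
        seg_b = video_b[start:start + segment_len]
        choice = prefer_fn(seg_a, seg_b)
        if choice == 'a':
            pairs.append((seg_a, seg_b))
        elif choice == 'b':
            pairs.append((seg_b, seg_a))
    return pairs
```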
Authors:Parth Atulbhai Gandhi, Akansha Shukla, David Tayouri, Beni Ifland, Yuval Elovici, Rami Puzis, Asaf Shabtai
Abstract:
Evaluating the security of multi-agent systems (MASs) powered by large language models (LLMs) is challenging, primarily because of the systems' complex internal dynamics and the evolving nature of LLM vulnerabilities. Traditional attack graph (AG) methods often lack the specific capabilities to model attacks on LLMs. This paper introduces AI-agent application Threat assessment with Attack Graphs (ATAG), a novel framework designed to systematically analyze the security risks associated with AI-agent applications. ATAG extends the MulVAL logic-based AG generation tool with custom facts and interaction rules to accurately represent AI-agent topologies, vulnerabilities, and attack scenarios. As part of this research, we also created the LLM vulnerability database (LVD) to initiate the process of standardizing LLM vulnerabilities documentation. To demonstrate ATAG's efficacy, we applied it to two multi-agent applications. Our case studies demonstrated the framework's ability to model and generate AGs for sophisticated, multi-step attack scenarios exploiting vulnerabilities such as prompt injection, excessive agency, sensitive information disclosure, and insecure output handling across interconnected agents. ATAG is an important step toward a robust methodology and toolset to help understand, visualize, and prioritize complex attack paths in multi-agent AI systems (MAASs). It facilitates proactive identification and mitigation of AI-agent threats in multi-agent applications.
Chinese: 本文提出ATAG框架,通过扩展传统攻击图来系统评估多智能体AI系统的安全风险,能够对智能体拓扑和LLM漏洞进行建模,多步骤攻击案例研究验证了其有效性。
English: This paper introduces ATAG, a novel framework that extends traditional attack graphs to systematically assess security risks in multi-agent AI systems by modeling agent topologies and LLM vulnerabilities, as demonstrated through case studies of multi-step attacks.
Authors:Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
Abstract:
Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to predict others' future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions, including offline evaluation of strategic reasoning by next-action prediction accuracy and online evaluation of decision-making by normalized episode return. Extensive experiments of fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best models attaining 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses on multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.
Chinese: 作者提出了VS-Bench这一多模态基准,从离线的下一步动作预测和在线的归一化回报两个维度评估视觉语言模型在多智能体环境中的战略推理与决策能力,发现当前模型与最优表现差距显著,最佳模型仅取得47.8%的预测准确率和24.3%的归一化回报。
English: The authors introduce VS-Bench, a multimodal benchmark that evaluates Vision Language Models' strategic reasoning (via next-action prediction) and decision-making (via normalized episode return) in multi-agent environments, revealing a significant gap to optimal performance: the best models attain only 47.8% prediction accuracy and 24.3% normalized return.
Authors:Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Mo Guang, Kaiwen Long, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang
Abstract:
Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human experiments, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.
Chinese: 作者提出了VS-Bench这一多模态基准,用于评估视觉语言模型在多智能体环境中的战略能力,发现当前模型虽在感知方面表现优异,但在战略推理和决策制定方面仍存在显著差距。
English: The authors introduce VS-Bench, a multimodal benchmark for evaluating Vision Language Models' strategic abilities in multi-agent environments, revealing that while current models excel in perception, they significantly lag in strategic reasoning and decision-making.
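Both VS-Bench abstracts report decision-making as a normalized episode return but do not spell out the normalization. A common convention, sketched below in Python purely as an assumption, rescales returns so that a random-policy baseline maps to 0 and optimal play maps to 1.

```python
def normalized_return(episode_return, random_return, optimal_return):
    """Scale an episode return to [0, 1] relative to reference policies."""
    span = optimal_return - random_return
    if span == 0:
        return 0.0
    value = (episode_return - random_return) / span
    return max(0.0, min(1.0, value))  # clamp below-random / above-optimal returns
```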
Authors:Haruki Sakajo, Yusuke Ide, Justin Vasselli, Yusuke Sakai, Yingtao Tian, Hidetaka Kamigaito, Taro Watanabe
Abstract:
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
Chinese: 本研究提出了一种基于词典的跨语言词汇迁移方法,通过逐步移除BPE分词器中的目标子词来迭代估计其嵌入向量,实验证明该方法在低资源语言上优于现有方法。
English: This study introduces a straightforward yet effective dictionary-based method for cross-lingual vocabulary transfer, which outperforms existing approaches for low-resource languages by iteratively estimating target subword embeddings through progressive removal in BPE tokenizers.
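The BPE fallback property can be emulated with a toy greedy tokenizer: remove the target subword, re-segment it using the remaining vocabulary, and estimate its embedding from the shorter pieces it falls back to. The Python sketch below uses a single averaging step, whereas the paper describes an iterative, progressive-removal procedure; `fallback_segments` is a greedy longest-match stand-in, not a real BPE merge-based tokenizer.

```python
def fallback_segments(word, vocab):
    """Greedy longest-match segmentation of `word` over `vocab`;
    single characters are always accepted as a last resort."""
    segments, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                segments.append(word[i:j])
                i = j
                break
    return segments

def estimate_embedding(subword, vocab, embeddings):
    """Estimate `subword`'s vector from the pieces it falls back to
    once it is removed from the vocabulary (mean of piece vectors)."""
    pieces = fallback_segments(subword, vocab - {subword})
    vecs = [embeddings[p] for p in pieces if p in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
```

For example, with "un" and "believ" in the vocabulary (passed as a set), removing "unbeliev" makes it fall back to ["un", "believ"], whose embeddings seed the new token's vector.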
Authors:Yihao Liu, Shuocheng Li, Lang Cao, Yuhang Xie, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang
Abstract:
Large language models are increasingly used for complex reasoning tasks where high-quality offline data such as expert-annotated solutions and distilled reasoning traces are often available. However, in environments with sparse rewards, reinforcement learning struggles to sample successful trajectories, leading to inefficient learning. At the same time, these offline trajectories that represent correct reasoning paths are not utilized by standard on-policy reinforcement learning methods. We introduce SuperRL, a unified training framework that adaptively alternates between RL and SFT. Whenever every rollout for a given instance receives zero reward, indicating the absence of a learning signal, SuperRL falls back to SFT on the curated offline data. Extensive experiments across diverse reasoning benchmarks show that SuperRL surpasses vanilla RL by delivering higher sample efficiency, stronger generalization, and improved robustness under sparse rewards.
Chinese: SuperRL作为一种创新框架,在稀疏奖励环境下通过动态切换强化学习与监督微调来利用离线数据,从而在推理任务中实现比传统强化学习更优的样本效率与泛化能力。
English: SuperRL is a novel framework that dynamically switches between reinforcement learning and supervised fine-tuning to enhance reasoning tasks by leveraging offline data when rewards are sparse, outperforming standard RL in efficiency and robustness.
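The alternation rule is simple to state: if any rollout for an instance earns a nonzero reward, take an RL step; if all rollouts score zero, fall back to an SFT step on the curated offline trajectory. The Python skeleton below is a schematic under that reading; `rollout_fn`, `reward_fn`, `rl_update`, and `sft_update` are placeholders, not the paper's API.

```python
def superrl_step(model, batch, rollout_fn, reward_fn, rl_update, sft_update):
    """One SuperRL-style step over a batch of (prompt, offline_trajectory).

    rollout_fn(model, prompt) -> sampled trajectories
    reward_fn(prompt, traj)   -> scalar reward (sparse, often zero)
    rl_update / sft_update    -> the two optimizer paths (e.g., a policy
                                 gradient step vs. cross-entropy on the
                                 offline trajectory)
    """
    for prompt, offline_trajectory in batch:
        rollouts = rollout_fn(model, prompt)
        rewards = [reward_fn(prompt, traj) for traj in rollouts]
        if max(rewards) > 0:
            rl_update(model, prompt, rollouts, rewards)    # signal exists: RL
        else:
            sft_update(model, prompt, offline_trajectory)  # no signal: SFT
    return model
```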
Authors:Wenshuo Dong, Qingsong Yang, Shu Yang, Lijie Hu, Meng Ding, Wanyu Lin, Tianhang Zheng, Di Wang
Abstract:
Large Language Models (LLMs) trained on massive data capture rich information embedded in the training data. However, this also introduces the risk of privacy leakage, particularly involving personally identifiable information (PII). Although previous studies have shown that this risk can be mitigated through methods such as privacy neurons, they all assume that both the (sensitive) training data and user queries are in English. We show that they cannot defend against the privacy leakage in cross-lingual contexts: even if the training data is exclusively in one language, these (private) models may still reveal private information when queried in another language. In this work, we first investigate the information flow of cross-lingual privacy leakage to give a better understanding. We find that LLMs process private information in the middle layers, where representations are largely shared across languages. The risk of leakage peaks when converted to a language-specific space in later layers. Based on this, we identify privacy-universal neurons and language-specific privacy neurons. Privacy-universal neurons influence privacy leakage across all languages, while language-specific privacy neurons are only related to specific languages. By deactivating these neurons, the cross-lingual privacy leakage risk is reduced by 23.3%-31.6%.
Chinese: 大型语言模型存在跨语言隐私泄露风险,即用不同语言查询时可能泄露私人信息,但通过停用已识别的隐私神经元,可将该风险降低23.3%-31.6%。
English: Large Language Models risk cross-lingual privacy leakage, where private information can be exposed when queried in different languages, but deactivating identified privacy neurons reduces this risk by 23.3%-31.6%.
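Deactivating identified neurons is commonly implemented by zeroing the corresponding hidden units at inference time. Here is a minimal PyTorch sketch using a forward hook on a toy linear layer; the layer and neuron indices are illustrative, and the paper's neuron-identification procedure is not shown.

```python
import torch
from torch import nn

def deactivate_neurons(module, neuron_ids):
    """Zero out the given output units of `module` on every forward pass."""
    def hook(_module, _inputs, output):
        output = output.clone()         # avoid in-place edits on the graph
        output[..., neuron_ids] = 0.0   # silence the selected neurons
        return output                   # a returned value replaces the output
    return module.register_forward_hook(hook)

# toy usage: silence two units of a hidden layer
layer = nn.Linear(16, 16)
handle = deactivate_neurons(layer, [3, 7])
out = layer(torch.randn(2, 16))
assert torch.all(out[:, [3, 7]] == 0)
handle.remove()  # reactivate the neurons
```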
Authors:William Andrew Simon, Leonid Yavits, Konstantina Koliogeorgi, Yann Falevoz, Yoshihiro Shibuya, Dominique Lavenier, Irem Boybat, Klea Zambaku, Berkan Şahin, Mohammad Sadrosadati, Onur Mutlu, Abu Sebastian, Rayan Chikhi, The BioPIM Consortium, Can Alkan
Abstract:
Low-cost, high-throughput DNA and RNA sequencing (HTS) data is the main workhorse of the life sciences. Genome sequencing is now becoming a part of Predictive, Preventive, Personalized, and Participatory (termed 'P4') medicine. All genomic data are currently processed in energy-hungry computer clusters and centers, necessitating data transfer, consuming substantial energy, and wasting valuable time. Therefore, there is a need for fast, energy-efficient, and cost-efficient technologies that enable genomics research without requiring data centers and cloud platforms. We recently started the BioPIM Project to leverage the emerging processing-in-memory (PIM) technologies to enable energy and cost-efficient analysis of bioinformatics workloads. The BioPIM Project focuses on co-designing algorithms and data structures commonly used in genomics with several PIM architectures for the highest cost, energy, and time savings benefit.
Chinese: 高通量测序是P4医学的关键,但依赖高能耗数据中心,因此BioPIM项目致力于开发内存处理技术以实现高效基因组分析。
English: High-throughput sequencing is vital for P4 medicine but relies on energy-intensive data centers, prompting the BioPIM Project to develop processing-in-memory technologies for efficient genomic analysis.
Authors:Kejia Chen, Celina Dettmering, Florian Pachler, Zhuo Liu, Yue Zhang, Tailai Cheng, Jonas Dirr, Zhenshan Bing, Alois Knoll, Rüdiger Daub
Abstract:
Industrial assembly of deformable linear objects (DLOs) such as cables offers great potential for many industries. However, DLOs pose several challenges for robot-based automation due to the inherent complexity of deformation and, consequently, the difficulties in anticipating the behavior of DLOs in dynamic situations. Although existing studies have addressed isolated subproblems like shape tracking, grasping, and shape control, there has been limited exploration of integrated workflows that combine these individual processes. To address this gap, we propose an object-centric perception and planning framework to achieve a comprehensive DLO assembly process throughout the industrial value chain. The framework utilizes visual and tactile information to track the DLO's shape as well as contact state across different stages, which facilitates effective planning of robot actions. Our approach encompasses robot-based bin picking of DLOs from cluttered environments, followed by a coordinated handover to two additional robots that mount the DLOs onto designated fixtures. Real-world experiments employing a setup with multiple robots demonstrate the effectiveness of the approach and its relevance to industrial scenarios.
Chinese: 该摘要提出了一种以物体为中心的感知与规划框架,通过融合视觉和触觉信息实现柔性线状物体的全流程机器人装配,利用多机器人系统解决了抓取、跟踪与控制等集成难题,并在实际工业场景中验证了有效性。
English: The abstract introduces an object-centric perception and planning framework that integrates visual and tactile data to enable comprehensive robotic assembly of deformable linear objects, addressing challenges in tracking, grasping, and control through a multi-robot system validated in real-world industrial experiments.
Authors:Jiahao Lin, Weixuan Peng, Bojia Zi, Yifeng Gao, Xianbiao Qi, Xingjun Ma, Yu-Gang Jiang
Abstract:
Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial both for automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video- or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high-quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/.
Chinese: BrokenVideos数据集通过提供3,254个带有像素级视觉伪影标注的AI生成视频,填补了该领域缺乏专门基准的空白,显著提升了模型在检测和定位伪影方面的性能。
English: The BrokenVideos dataset addresses the lack of a comprehensive benchmark for artifact localization in AI-generated videos by providing 3,254 videos with pixel-level annotations of visual corruptions, significantly improving model performance in detecting and localizing these artifacts.
Authors:Zhihao Yuan, Shuyi Jiang, Chun-Mei Feng, Yaolun Zhang, Shuguang Cui, Zhen Li, Na Zhao
Abstract:
Currently, utilizing large language models to understand the 3D world is becoming popular. Yet existing 3D-aware LLMs act as black boxes: they output bounding boxes or textual answers without revealing how those decisions are made, and they still rely on pre-trained 3D detectors to supply object proposals. We introduce Scene-R1, a video-grounded framework that learns to reason about 3D scenes without any point-wise 3D instance supervision by pairing reinforcement-learning-driven reasoning with a two-stage grounding pipeline. In the temporal grounding stage, we explicitly reason about the video and select the video snippets most relevant to an open-ended query. In the subsequent image grounding stage, we analyze the image and predict the 2D bounding box. After that, we track the object using SAM2 to produce pixel-accurate masks in RGB frames, and project them back into 3D, thereby eliminating the need for 3D detector-based proposals while capturing fine geometry and material cues. Scene-R1 can also adapt to the 3D visual question answering task to answer free-form questions directly from video. Our training pipeline only needs task-level 2D boxes or textual labels without dense 3D point-wise labels. Scene-R1 surpasses existing open-vocabulary baselines on multiple datasets, while delivering transparent, step-by-step rationales. These results show that reinforcement-learning-based reasoning combined with RGB-D video alone offers a practical, annotation-efficient route to trustworthy 3D scene understanding.
Chinese: Scene-R1是一种基于视频的框架,通过强化学习和两阶段定位流程,无需3D实例监督即可实现透明的三维场景理解,在超越现有方法的同时提供可解释的推理过程。
English: Scene-R1 is a novel video-grounded framework that uses reinforcement learning and a two-stage grounding pipeline to enable transparent 3D scene understanding without 3D instance supervision, outperforming existing methods while providing interpretable rationales.
Authors:Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang
Abstract:
In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related tasks, such as code understanding and generation. A critical step in constructing code benchmarks is the design of prompts. However, as existing code benchmarks typically rely on a single prompt template per task, they are prone to the issue of prompt sensitivity, where minor prompt variations could result in substantial performance variations, leading to unreliable evaluations of model capabilities.
While previous studies have explored prompt sensitivity, their experimental designs and findings are limited to traditional natural language processing (NLP) tasks. In this paper, we present an empirical study to investigate prompt sensitivity in code benchmarks. We first propose a general framework that modifies prompt templates in a manner that preserves both their semantics and their structure as much as possible. Based on the framework, we conduct extensive experiments across eight code benchmark tasks on 10 representative open-source LLMs, with each task featuring 100 semantically similar prompt templates. We then analyze the evaluation results using various statistical metrics, focusing on both absolute and relative model performance. Our findings suggest that even slight prompt variations can lead to significant shifts in performance. Additionally, we observe that such variations can introduce inconsistencies in the performance rankings across different models. These insights highlight the need for considering prompt sensitivity when designing future code benchmarks, to ensure more reliable and accurate evaluation of LLM capabilities.
Chinese: 本研究表明代码基准对提示模板的微小变化高度敏感,会导致模型性能显著波动和排名不一致,强调未来设计需考虑提示敏感性以确保评估可靠性。
English: This study reveals that code benchmarks are highly sensitive to minor prompt variations, causing significant performance fluctuations and inconsistent model rankings, necessitating careful prompt design for reliable LLM evaluations.
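The measurement itself reduces to evaluating one model over many semantically similar templates and summarizing the spread of its scores. A minimal Python harness along those lines (the function names and summary statistics are assumptions, not the paper's framework):

```python
import statistics

def prompt_sensitivity(model_fn, templates, dataset):
    """Accuracy spread of one model across prompt templates.

    model_fn(prompt) -> predicted answer
    templates: format strings sharing the same named slots
    dataset:   list of (fields_dict, expected_answer) pairs
    """
    accuracies = []
    for template in templates:
        correct = sum(model_fn(template.format(**fields)) == expected
                      for fields, expected in dataset)
        accuracies.append(correct / len(dataset))
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),  # needs >= 2 templates
        "range": max(accuracies) - min(accuracies),
    }
```

A large "range" or "stdev" under semantically equivalent templates is exactly the prompt sensitivity the study warns about; repeating the harness over several models also exposes the ranking inconsistencies it reports.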
Authors:Hyunji Lee, Seunghyun Yoon, Yunjae Won, Hanseok Oh, Geewook Kim, Trung Bui, Franck Dernoncourt, Elias Stengel-Eskin, Mohit Bansal, Minjoon Seo
Abstract:
Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. In such cases, we expect the model to generate responses by grounding its response in the provided external context. However, prior work has shown that simply appending context at inference time does not ensure grounded generation. To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context. Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. This improved grounding comes without degradation in general downstream performance. Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context.
Chinese: 本文提出CINGS后训练监督方法,通过将相关上下文预置到响应前并仅计算响应标记的损失,增强大语言模型对外部信息的依赖,在文本和视觉语言领域均能有效减少幻觉并保持事实一致性,且不影响常规任务表现。
English: The paper introduces CINGS, a post-training method that enhances large language models' grounding in external context by training them with prepended relevant context while computing loss only on response tokens, leading to reduced hallucinations and improved factual consistency in both text and vision-language domains without compromising general performance.
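The core of CINGS is a label mask: context tokens are part of the input but contribute nothing to the loss. A minimal Python sketch using the -100 ignore-index convention of common training frameworks (an assumption; the paper's exact pipeline may differ):

```python
IGNORE_INDEX = -100  # label value skipped by cross-entropy in most trainers

def build_cings_example(context_ids, response_ids):
    """Prepend retrieved context to the response, supervising only the
    response: context positions receive IGNORE_INDEX labels."""
    input_ids = list(context_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(context_ids) + list(response_ids)
    return input_ids, labels

# toy usage with made-up token ids
inp, lab = build_cings_example([101, 7, 8, 9], [42, 43, 44])
assert lab == [-100, -100, -100, -100, 42, 43, 44]
```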
Authors:Yifeng Gao, Yifan Ding, Hongyu Su, Juncheng Li, Yunhan Zhao, Lin Luo, Zixing Chen, Li Wang, Xin Wang, Yixu Wang, Xingjun Ma, Yu-Gang Jiang
Abstract:
As AI-generated video becomes increasingly pervasive across media platforms, the ability to reliably distinguish synthetic content from authentic footage has become both urgent and essential. Existing approaches have primarily treated this challenge as a binary classification task, offering limited insight into where or why a model identifies a video as AI-generated. However, the core challenge extends beyond simply detecting subtle artifacts; it requires providing fine-grained, persuasive evidence that can convince auditors and end-users alike. To address this critical gap, we introduce DAVID-X, the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. Leveraging these rich annotations, we present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning, including defect categorization, temporal-spatial localization, and natural language explanations. This approach fundamentally transforms AI-generated video detection from an opaque black-box decision into a transparent and verifiable diagnostic process. We demonstrate that a general-purpose backbone, fine-tuned on our compact dataset and enhanced with chain-of-thought distillation, achieves strong generalization across a variety of generators and generation modes. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
Chinese: 本文提出首个配备缺陷标注的AI生成视频数据集DAVID-X及其视频语言模型DAVID-XR1,通过视觉推理链实现可解释的检测,将视频真伪鉴别转变为透明可验证的诊断流程。
English: This paper introduces DAVID-X, the first dataset with detailed defect annotations for AI-generated videos, and DAVID-XR1, a model that provides transparent, interpretable detection through visual reasoning and explanations, transforming video authentication into a verifiable diagnostic process.
Authors:Yueru Luo, Changqing Zhou, Yiming Yang, Erlong Li, Chao Zheng, Shuqi Mei, Shuguang Cui, Zhen Li
Abstract:
Accurate road topology reasoning is critical for autonomous driving, enabling effective navigation and adherence to traffic regulations. Central to this task are lane perception and topology reasoning. However, existing methods typically focus on either lane detection or Lane-to-Lane (L2L) topology reasoning, often neglecting Lane-to-Traffic-element (L2T) relationships or failing to optimize these tasks jointly. Furthermore, most approaches either overlook relational modeling or apply it in a limited scope, despite the inherent spatial relationships among road elements. We argue that relational modeling is beneficial for both perception and reasoning, as humans naturally leverage contextual relationships for road element recognition and their connectivity inference. To this end, we introduce relational modeling into both perception and reasoning, jointly enhancing structural understanding. Specifically, we propose: 1) a relation-aware lane detector, where our geometry-biased self-attention and curve cross-attention refine lane representations by capturing relational dependencies; 2) relation-enhanced topology heads, including a geometry-enhanced L2L head and a cross-view L2T head, boosting reasoning with relational cues; and 3) a contrastive learning strategy with InfoNCE loss to regularize relationship embeddings. Extensive experiments on OpenLane-V2 demonstrate that our approach significantly improves both detection and topology reasoning metrics, achieving +3.1 in DET$_l$, +5.3 in TOP$_{ll}$, +4.9 in TOP$_{lt}$, and an overall +4.4 in OLS, setting a new state-of-the-art. Code will be released.
Chinese: 本研究提出一种关系建模方法,通过关系感知的车道检测器和拓扑推理头联合优化车道感知与拓扑关系,在OpenLane-V2基准测试中各项指标显著提升,创造了最新性能记录。
English: This study introduces a relational modeling approach that jointly enhances lane detection and topology reasoning for autonomous driving, achieving state-of-the-art performance on the OpenLane-V2 benchmark through relation-aware components and contrastive learning.
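The InfoNCE regularizer named in the abstract is a standard contrastive objective and can be written compactly in PyTorch. This generic sketch treats each matched pair of relationship embeddings as a positive and all other in-batch embeddings as negatives; how the paper constructs its batches is an assumption here.

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE over a batch: the i-th query's positive is the i-th key;
    the remaining keys in the batch act as negatives."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / temperature                    # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device)  # diagonal positives
    return F.cross_entropy(logits, targets)

# toy usage on random paired embeddings
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```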
Authors:Jiaming Zhang, Xin Wang, Xingjun Ma, Lingyu Qiu, Yu-Gang Jiang, Jitao Sang
Abstract:
Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capabilities in understanding relationships between visual and textual data through joint embedding spaces. Despite their effectiveness, these models remain vulnerable to adversarial attacks, particularly in the image modality, posing significant security concerns. Building upon our previous work on Adversarial Prompt Tuning (AdvPT), which introduced learnable text prompts to enhance adversarial robustness in VLMs without extensive parameter training, we present a significant extension by introducing the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning). Our key innovations include: (1) extending AdvPT from text-only to multi-modal prompting across both text and visual modalities, (2) expanding from single-layer to multi-layer prompt architectures, and (3) proposing a novel architecture-level redesign through our Neural Augmentor approach, which implements feature purification to directly address the distortions introduced by adversarial attacks in feature space. Our NAP-Tuning approach incorporates token refiners that learn to reconstruct purified features through residual connections, allowing for modality-specific and layer-specific feature correction. Comprehensive experiments demonstrate that NAP-Tuning significantly outperforms existing methods across various datasets and attack types. Notably, our approach shows significant improvements over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 33.5% on ViT-B16 and 33.0% on ViT-B32 architectures while maintaining competitive clean accuracy.
Chinese: 视觉语言模型存在对抗攻击的安全隐患,而提出的神经增强多模态对抗提示调优框架通过跨模态提示扩展和特征净化机制,在保持精度的同时将抗攻击能力提升了超过33%,显著增强了模型鲁棒性。
English: Vision-Language Models face security vulnerabilities from adversarial attacks, but the proposed Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning) significantly enhances robustness by extending prompt tuning across modalities and implementing feature purification, achieving over 33% improvement against attacks while maintaining accuracy.
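A token refiner with a residual connection can be sketched generically: a small MLP predicts a correction to (possibly adversarially distorted) token features, and the residual path means the module only needs to model the distortion rather than the whole feature. The PyTorch sketch below is a generic reading of the abstract, not NAP-Tuning's actual architecture; the hidden width and normalization are assumptions.

```python
import torch
from torch import nn

class TokenRefiner(nn.Module):
    """Toy feature-purification block with a residual connection."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, tokens):
        # purified features = input + predicted correction
        return tokens + self.mlp(tokens)

# toy usage: refine a sequence of ViT-sized tokens
refined = TokenRefiner(dim=768)(torch.randn(1, 197, 768))
```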
Authors:Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang
Abstract:
We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 960 long videos (with an average duration of 1.6 hours), along with 8,243 human-labeled multi-step question-answering pairs and 25,106 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 19 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
Chinese: VRBench推出了首个长叙事视频基准,包含960个长视频和大量人工标注数据,通过人机协作框架和多阶段评估,全面评测大模型的多步骤时序推理能力。
English: VRBench introduces the first long narrative video benchmark with 960 lengthy videos and extensive human-labeled data to evaluate large models' multi-step temporal reasoning through a comprehensive human-AI framework and multi-phase assessment.
Authors:Xiao Yu, Haoxuan Chen, Feifei Niu, Xing Hu, Jacky Wai Keung, Xin Xia
Abstract:
With the rapid development of large language models (LLMs), distributed training and inference frameworks like DeepSpeed have become essential for scaling model training and inference across multiple GPUs or nodes. However, the increasing complexity of these frameworks brings non-trivial software bugs, which may degrade training performance, cause unexpected failures, and result in significant resource waste. Understanding framework bugs' characteristics is fundamental for quality assurance, allowing the design of more effective debugging and repair methods. Thus, our paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI. We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies. Additionally, the distributed nature of these frameworks introduces unique bug root causes, such as allocation strategy error and distributed communication error. Diagnosing and fixing complex bugs remains challenging due to factors like the disconnect between symptoms and root causes, high bug reproduction costs, and low-level or cross-component interactions. Interestingly, we observe that 48% of bug fixes require minimal code changes (<=10 LOC) and follow simple strategies such as conditional logic optimization, parameter handling enhancement, or version compatibility handling, indicating potential for automation. Based on these insights, we offer several implications for improving the reliability of both distributed training and inference frameworks and their dependent LLM projects, while also identifying opportunities to leverage LLM-based tools for automated debugging and repair.
Chinese: 本研究首次对DeepSpeed等分布式训练框架中的308个错误进行大规模实证分析,发现近半数修复仅需极少代码改动,揭示了分布式系统特有挑战及自动化潜力。
English: This study conducts the first large-scale empirical analysis of 308 bugs in distributed training frameworks like DeepSpeed, revealing that nearly half of the fixes require minimal code changes and identifying unique distributed-system challenges and automation opportunities.
Authors:Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma, Yu-Gang Jiang
Abstract:
Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by determined adversaries. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical and concerning safety weaknesses.
Chinese: GenBreak框架通过微调红队大语言模型来生成能有效规避安全过滤并产生有害图像的对抗性提示,揭示了商业文生图模型的关键安全缺陷。
English: The GenBreak framework fine-tunes a red-team LLM to craft adversarial prompts that effectively bypass safety filters and generate toxic images, exposing critical vulnerabilities in commercial text-to-image models.
Authors:Jiachen Hu, Rui Ai, Han Zhong, Xiaoyu Chen, Liwei Wang, Zhaoran Wang, Zhuoran Yang
Abstract:
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments. It requires transferring knowledge from environments where empirical data is more readily available. Against these backdrops, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when requiring knowledge transfer? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an $\epsilon$-optimal policy with a tight sample complexity of $O(1/\epsilon^2)$.
Chinese: 本文提出一种样本高效算法,解决了在线强化学习中的信息不对称和知识迁移问题,能以O(1/ε²)的样本复杂度实现ε最优策略。
English: This paper introduces a sample-efficient algorithm that addresses information asymmetry and knowledge transfer challenges in online reinforcement learning, achieving an ε-optimal policy with O(1/ε²) sample complexity.
Authors:Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang
Abstract:
Video understanding has been considered one critical step towards world modeling, an important long-term problem in AI research. Recently, multi-modal foundation models have shown such potential via large-scale pretraining. However, these models simply align encoders of different modalities via contrastive learning, while lacking deeper multi-modal interactions, which are critical for understanding complex target movements across diversified video scenes. To fill this gap, we propose a unified Super Encoding Network (SEN) for video understanding, which builds up such distinct interactions through recursive association of multi-modal encoders in the foundation models. Specifically, we creatively treat those well-trained encoders as "super neurons" in our SEN. Via designing a Recursive Association (RA) block, we progressively fuse multi-modalities with the input video, based on knowledge integrating, distributing, and prompting of super neurons in a recursive manner. In this way, our SEN can effectively encode deeper multi-modal interactions, for prompting various video understanding tasks in downstream. Extensive experiments show that our SEN can remarkably boost the four most representative video tasks, including tracking, recognition, chatting, and editing. For example, for pixel-level tracking, the average Jaccard index improves by 2.7% and temporal coherence (TC) drops by 8.8% compared to the popular CaDeX++ approach. For one-shot video editing, textual alignment improves by 6.4% and frame consistency increases by 4.1% compared to the popular Tune-A-Video approach.
Chinese: 提出的超级编码网络(SEN)通过将多模态编码器视为“超级神经元”进行递归关联,实现了更深层次的多模态交互,显著提升了跟踪、识别、对话和编辑等视频理解任务的性能。
English: The proposed Super Encoding Network (SEN) enhances video understanding by recursively associating multi-modal encoders as "super neurons" to enable deeper interactions, significantly improving performance across tracking, recognition, chatting, and editing tasks.
Authors:Emmanouil Zaranis, António Farinhas, Saul Santos, Beatriz Canaverde, Miguel Moura Ramos, Aditya K Surikuchi, André Viveiros, Baohao Liao, Elena Bueno-Benito, Nithin Sivakumaran, Pavlo Vasylenko, Shoubin Yu, Sonal Sannigrahi, Wafaa Mohammed, Ben Peters, Danae Sánchez Villegas, Elias Stengel-Eskin, Giuseppe Attanasio, Jaehong Yoon, Stella Frank, Alessandro Suglia, Chrysoula Zerva, Desmond Elliott, Mariella Dimiccoli, Mohit Bansal, Oswald Lanz, Raffaella Bernardi, Raquel Fernández, Sandro Pezzelle, Vlad Niculae, André F. T. Martins
Abstract:
Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, ``needle-in-a-haystack'' details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF$^2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF$^2$ includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs -- one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information -- an ability current VLMs lack.
中文摘要:MF$^2$基准通过二元声明验证评估视觉语言模型对完整电影中核心叙事元素的理解与记忆能力,揭示了当前模型与人类在关键信息处理上存在显著差距。
English Summary: The MF$^2$ benchmark evaluates vision-language models' ability to comprehend and recall key narrative elements from full-length movies through binary claim verification, revealing significant performance gaps between current models and human capabilities.
Authors:Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, Yali Wang
Abstract:
The recent advance in video understanding has been driven by multimodal large language models (MLLMs). However, while these MLLMs are good at analyzing short videos, they suffer from difficulties in understanding videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents for retrieving extra contextual knowledge from a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots, i.e., to answer the user question from a long video, it is critical to deeply understand its relevant shots as a human would. Without such insight, these agents often mistakenly retrieve redundant or even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Different from previous works, our VideoChat-A1 can think deeply with long videos via a distinct chain-of-shot reasoning paradigm. More specifically, it can progressively select the shots relevant to the user question and look into these shots via coarse-to-fine partition. By multi-modal reasoning along the shot chain, VideoChat-A1 can effectively mimic the step-by-step human thinking process, allowing it to interactively discover preferable temporal context for thoughtful understanding of long videos. Extensive experiments show that our VideoChat-A1 achieves state-of-the-art performance on the mainstream long video QA benchmarks, e.g., it achieves 77.0 on VideoMME and 70.1 on EgoSchema, outperforming its strong baselines (e.g., Intern2.5VL-8B and InternVideo2.5-8B) by up to 10.8\% and 6.2\%. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy, but with only 7\% of the input frames and 12\% of the inference time on average.
中文: VideoChat-A1采用创新的链式镜头推理机制,通过逐步筛选和深入分析相关视频片段,在长视频理解任务中实现了最优性能,同时大幅降低了计算资源需求。
English: VideoChat-A1 introduces a novel agent paradigm with chain-of-shot reasoning that progressively selects and analyzes relevant video shots, achieving state-of-the-art performance in long video understanding while significantly reducing computational costs.
Authors:Ron Eliav, Arie Cattan, Eran Hirsch, Shahaf Bassan, Elias Stengel-Eskin, Mohit Bansal, Ido Dagan
Abstract:
A common approach to hallucination detection casts it as a natural language inference (NLI) task, often using LLMs to classify whether the generated text is entailed by corresponding reference texts. Since entailment classification is a complex reasoning task, one would expect that LLMs could benefit from generating an explicit reasoning process, as in CoT reasoning or the explicit ``thinking'' of recent reasoning models. In this work, we propose that guiding such models to perform a systematic and comprehensive reasoning process -- one that both decomposes the text into smaller facts and also finds evidence in the source for each fact -- allows models to execute much finer-grained and more accurate entailment decisions, leading to increased performance. To that end, we define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection. Following this reasoning framework, we introduce an analysis scheme, consisting of several metrics that measure the quality of the intermediate reasoning steps, which provides additional empirical evidence for the improved quality of our guided reasoning scheme.
中文: 本研究提出了一种系统化推理方法,将生成文本分解为子主张并逐项对照源材料验证,通过更细粒度的蕴涵判断显著提升了幻觉检测性能。
English: This study proposes a systematic reasoning approach that decomposes generated text into sub-claims and verifies each against source material, significantly improving hallucination detection through finer-grained entailment decisions.
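To make the guided three-step process concrete, here is a minimal sketch in Python, assuming a generic `llm(prompt) -> str` completion function; the prompt wording and the `detect_hallucination` helper are illustrative, not the paper's exact implementation:

```python
from typing import Callable, List

def detect_hallucination(llm: Callable[[str], str],
                         source: str, generated: str) -> bool:
    # Step (i): claim decomposition into atomic sub-claims.
    raw = llm(f"List each atomic factual claim in the text, one per line:\n{generated}")
    claims: List[str] = [c.strip() for c in raw.splitlines() if c.strip()]
    # Step (ii): sub-claim attribution and entailment classification.
    supported = []
    for claim in claims:
        verdict = llm(f"Source:\n{source}\n\nClaim: {claim}\n"
                      "Quote the supporting evidence, then answer "
                      "ENTAILED or NOT_ENTAILED.")
        supported.append("NOT_ENTAILED" not in verdict)
    # Step (iii): aggregated classification -- hallucinated if any claim fails.
    return not all(supported)
```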
Authors:Alexander Huang-Menders, Xinhang Liu, Andy Xu, Yuyao Zhang, Chi-Keung Tang, Yu-Wing Tai
Abstract:
SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars from a single photo or textual prompt. While diffusion-based methods have made progress in general 3D object generation, they continue to struggle with precise control over human identity, body shape, and animation readiness. In contrast, SmartAvatar leverages the commonsense reasoning capabilities of large vision-language models (VLMs) in combination with off-the-shelf parametric human generators to deliver high-quality, customizable avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars, evaluates facial similarity, anatomical plausibility, and prompt alignment, and iteratively adjusts generation parameters for convergence. This interactive, AI-guided refinement process promotes fine-grained control over both facial and body features, enabling users to iteratively refine their avatars via natural-language conversations. Unlike diffusion models that rely on static pre-trained datasets and offer limited flexibility, SmartAvatar brings users into the modeling loop and ensures continuous improvement through an LLM-driven procedural generation and verification system. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance, making them suitable for downstream animation and interactive applications. Quantitative benchmarks and user studies demonstrate that SmartAvatar outperforms recent text- and image-driven avatar generation systems in terms of reconstructed mesh quality, identity fidelity, attribute accuracy, and animation readiness, making it a versatile tool for realistic, customizable avatar creation on consumer-grade hardware.
中文: SmartAvatar是一个基于视觉语言模型的智能框架,能通过单张照片或文本提示生成骨骼绑定、动画就绪的3D虚拟人像,并借助自主验证循环实现自然语言交互式优化。
English: SmartAvatar is an AI-driven framework that creates fully rigged, animation-ready 3D human avatars from a single image or text prompt by leveraging vision-language models and an autonomous verification loop for iterative refinement through natural language.
Authors:Muling Wu, Qi Qian, Wenhao Liu, Xiaohua Wang, Zisu Huang, Di Liang, LI Miao, Shihan Dou, Changze Lv, Zhenghua Wang, Zhibo Xu, Lina Chen, Tianlong Li, Xiaoqing Zheng, Xuanjing Huang
Abstract:
Large Language Models (LLMs) have achieved remarkable performance across various reasoning tasks, yet post-training is constrained by inefficient sample utilization and inflexible processing of samples by difficulty. To address these limitations, we propose Customized Curriculum Learning (CCL), a novel framework with two key innovations. First, we introduce a model-adaptive difficulty definition that customizes curriculum datasets based on each model's individual capabilities rather than relying on predefined difficulty metrics. Second, we develop "Guided Prompting," which dynamically reduces sample difficulty through strategic hints, enabling effective utilization of challenging samples that would otherwise degrade performance. Comprehensive experiments on supervised fine-tuning and reinforcement learning demonstrate that CCL significantly outperforms uniform training approaches across five mathematical reasoning benchmarks, confirming its effectiveness in enhancing sample utilization and model performance across both paradigms.
中文摘要:提出的定制课程学习(CCL)框架通过模型自适应难度定义和引导式提示两大创新,有效解决了大语言模型训练中的样本利用问题,在数学推理任务中显著优于传统均匀训练方法。
English Summary: The proposed Customized Curriculum Learning (CCL) framework addresses limitations in LLM training by introducing model-adaptive difficulty customization and guided prompting, significantly outperforming uniform training across mathematical reasoning benchmarks.
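A minimal sketch of the two CCL ideas, assuming a hypothetical `model_solve` callable that attempts a problem and reports success; the difficulty threshold and hint format are illustrative choices, not the paper's:

```python
from typing import Callable, Dict, List

def build_curriculum(model_solve: Callable[[str], bool],
                     problems: List[Dict], n_trials: int = 4) -> List[Dict]:
    for p in problems:
        # Model-adaptive difficulty: this model's own empirical failure rate.
        fails = sum(not model_solve(p["question"]) for _ in range(n_trials))
        p["difficulty"] = fails / n_trials
    curriculum = sorted(problems, key=lambda p: p["difficulty"])  # easy -> hard
    for p in curriculum:
        # Guided Prompting: soften the hardest samples with a strategic hint
        # instead of discarding them.
        if p["difficulty"] > 0.75 and p.get("hint"):
            p["question"] = f"Hint: {p['hint']}\n{p['question']}"
    return curriculum
```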
Authors:Eran Hirsch, Aviv Slobodkin, David Wan, Elias Stengel-Eskin, Mohit Bansal, Ido Dagan
Abstract:
Grounded text generation models often produce content that deviates from their source material, requiring user verification to ensure accuracy. Existing attribution methods associate entire sentences with source documents, which can be overwhelming for users seeking to fact-check specific claims. In contrast, existing sub-sentence attribution methods may be more precise but fail to align with users' interests. In light of these limitations, we introduce Localized Attribution Queries (LAQuer), a new task that localizes selected spans of generated output to their corresponding source spans, allowing fine-grained and user-directed attribution. We compare two approaches for the LAQuer task, including prompting large language models (LLMs) and leveraging LLM internal representations. We then explore a modeling framework that extends existing attributed text generation methods to LAQuer. We evaluate this framework across two grounded text generation tasks: Multi-document Summarization (MDS) and Long-form Question Answering (LFQA). Our findings show that LAQuer methods significantly reduce the length of the attributed text. Our contributions include: (1) proposing the LAQuer task to enhance attribution usability, (2) suggesting a modeling framework and benchmarking multiple baselines, and (3) proposing a new evaluation setting to promote future research on localized attribution in content-grounded generation.
中文摘要:本研究提出局部归因查询(LAQuer)新任务,通过将生成文本的特定片段精准关联至源材料,在多文档摘要和长式问答任务中显著缩短归因文本长度并提升实用性。
English Summary: The study introduces Localized Attribution Queries (LAQuer), a task enabling fine-grained attribution of specific text spans to source materials, and demonstrates its effectiveness in reducing attributed text length while improving usability across summarization and question answering tasks.
Authors:Mert Kiray, Paul Uhlenbruck, Nassir Navab, Benjamin Busam
Abstract:
Visual effects (VFX) are key to immersion in modern films, games, and AR/VR. Creating 3D effects requires specialized expertise and training in 3D animation software and can be time-consuming. Generative solutions typically rely on computationally intensive methods such as diffusion models, which can be slow at 4D inference. We reformulate 3D animation as a field prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts (e.g., "make the vase glow orange, then explode") and instantly updates the color, opacity, and positions of 3D Gaussians in real time. This design avoids overheads such as mesh extraction and manual or physics-based simulations, and allows both novice and expert users to animate volumetric scenes with minimal effort on a consumer device, even in a web browser. Experimental results show that simple textual instructions suffice to generate compelling time-varying VFX, reducing the manual effort typically required for rigging or advanced modeling. We thus present a fast and accessible pathway to language-driven 3D content creation that can pave the way to further democratizing VFX.
中文: 本研究提出了一种文本驱动框架,利用大型语言和视觉语言模型即时生成3D视觉效果,通过解析指令操控3D高斯属性,无需复杂模拟或专业技能即可实现实时动画。
English: This study introduces a text-driven framework that uses large language and vision-language models to instantly generate 3D visual effects by interpreting prompts and manipulating 3D Gaussians, enabling real-time animation without complex simulations or specialized expertise.
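The core mechanism, a generated function acting per-frame on Gaussian attributes, can be sketched as follows; `glow_then_explode` is a hand-written stand-in for the kind of function the LLM would emit for the quoted prompt:

```python
import numpy as np

def glow_then_explode(xyz: np.ndarray, rgb: np.ndarray, t: float):
    """Field acting on Gaussian centers xyz (N,3) and colors rgb (N,3), t in [0,1]."""
    if t < 0.5:                                 # phase 1: blend colors toward orange
        rgb = rgb + (t / 0.5) * (np.array([1.0, 0.5, 0.0]) - rgb)
    else:                                       # phase 2: push centers radially outward
        center = xyz.mean(axis=0)
        xyz = xyz + (t - 0.5) * 4.0 * (xyz - center)
    return xyz, rgb

xyz = np.random.randn(1000, 3) * 0.1            # toy Gaussian centers
rgb = np.full((1000, 3), 0.6)
for t in np.linspace(0.0, 1.0, 30):             # evaluate the field once per frame
    frame_xyz, frame_rgb = glow_then_explode(xyz, rgb, t)
```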
Authors:Mengdi Liu, Xiaoxue Cheng, Zhangyang Gao, Hong Chang, Cheng Tan, Shiguang Shan, Xilin Chen
Abstract:
Designing protein sequences that fold into a target 3D structure, known as protein inverse folding, is a fundamental challenge in protein engineering. While recent deep learning methods have achieved impressive performance by recovering native sequences, they often overlook the one-to-many nature of the problem: multiple diverse sequences can fold into the same structure. This motivates the need for a generative model capable of designing diverse sequences while preserving structural consistency. To address this trade-off, we introduce ProtInvTree, the first reward-guided tree-search framework for protein inverse folding. ProtInvTree reformulates sequence generation as a deliberate, step-wise decision-making process, enabling the exploration of multiple design paths and exploitation of promising candidates through self-evaluation, lookahead, and backtracking. We propose a two-stage focus-and-grounding action mechanism that decouples position selection and residue generation. To efficiently evaluate intermediate states, we introduce a jumpy denoising strategy that avoids full rollouts. Built upon pretrained protein language models, ProtInvTree supports flexible test-time scaling by expanding the search depth and breadth without retraining. Empirically, ProtInvTree outperforms state-of-the-art baselines across multiple benchmarks, generating structurally consistent yet diverse sequences, including those far from the native ground truth.
中文: ProtInvTree提出了首个奖励引导的树搜索框架,用于蛋白质逆向折叠,通过逐步决策和无需重新训练的灵活扩展,能够生成结构一致且多样化的序列。
English: ProtInvTree introduces a reward-guided tree-search framework for protein inverse folding, enabling the generation of structurally consistent and diverse sequences through step-wise decision-making and flexible scaling without retraining.
Authors:Junwen Huang, Jizhong Liang, Jiaqi Hu, Martin Sundermeyer, Peter KT Yu, Nassir Navab, Benjamin Busam
Abstract:
We introduce XYZ-IBD, a bin-picking dataset for 6D pose estimation that captures real-world industrial complexity, including challenging object geometries, reflective materials, severe occlusions, and dense clutter. The dataset reflects authentic robotic manipulation scenarios with millimeter-accurate annotations. Unlike existing datasets that primarily focus on household objects, which approach saturation, XYZ-IBD represents unsolved, realistic industrial conditions. The dataset features 15 texture-less, metallic, and mostly symmetrical objects of varying shapes and sizes. These objects are heavily occluded and randomly arranged in bins with high density, replicating the challenges of real-world bin-picking. XYZ-IBD was collected using two high-precision industrial cameras and one commercially available camera, providing RGB, grayscale, and depth images. It contains 75 multi-view real-world scenes, along with a large-scale synthetic dataset rendered under simulated bin-picking conditions. We employ a meticulous annotation pipeline that includes anti-reflection spray, multi-view depth fusion, and semi-automatic annotation, achieving the millimeter-level pose labeling accuracy required for industrial manipulation. Quantification in simulated environments confirms the reliability of the ground-truth annotations. We benchmark state-of-the-art methods on 2D detection, 6D pose estimation, and depth estimation tasks on our dataset, revealing significant performance degradation in our setups compared to current academic household benchmarks. By capturing the complexity of real-world bin-picking scenarios, XYZ-IBD introduces more realistic and challenging problems for future research. The dataset and benchmark are publicly available at https://xyz-ibd.github.io/XYZ-IBD/.
中文:XYZ-IBD是一个针对真实工业场景的抓取数据集,通过毫米级精确标注呈现了复杂物体和遮挡环境,相比现有家居数据集更具工业挑战性。
English: XYZ-IBD is a bin-picking dataset that captures real-world industrial complexity with challenging objects and millimeter-accurate annotations, providing a more realistic benchmark than existing household-focused datasets.
Authors:Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu
Abstract:
Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models struggle to construct stable frame-to-frame attention, leading to increased latency and degraded synthesis quality, thereby limiting their feasibility for real-time applications. To address these limitations, we introduce a novel dynamic chunk-wise autoregressive synthesis framework, termed DCAR, designed to enhance both efficiency and intelligibility robustness in AR speech generation. DCAR introduces a chunk-to-frame attention mechanism through training with multi-token prediction, enabling dynamic chunk prediction in variable speech contexts using a lightweight module trained on-policy. DCAR dynamically adjusts the token prediction span, significantly reducing the sequence length dependency while maintaining high synthesis quality. Comprehensive empirical evaluations demonstrate that DCAR substantially outperforms traditional next-token prediction models, achieving up to a 72.27% intelligibility improvement and a 2.61x inference speedup simultaneously on the test set. Furthermore, we conduct a comprehensive analysis to support DCAR as a versatile foundation for next-generation speech synthesis systems.
中文总结:提出的DCAR框架通过动态分块预测和块到帧注意力机制,显著改进了自回归语音合成,在测试集上实现了72.27%的可懂度提升和2.61倍的推理加速,有效克服了传统逐词预测模型的局限性。
English Summary: The proposed DCAR framework enhances autoregressive speech synthesis by introducing dynamic chunk-wise prediction and chunk-to-frame attention, significantly improving intelligibility by 72.27% and inference speed by 2.61x while overcoming limitations of traditional next-token prediction models.
Authors:Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Xie Chen, Kai Yu
Abstract:
Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method for compressing temporal redundancy through supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, combining two key innovations, ScheDFR and Melt-and-Cool, for adapting inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR ($\approx$ 600 bps), the reconstruction WER of CodecSlime is reduced by up to 46% relative to conventional FFR baselines with the same model architecture and similar bitrates, while other metrics are also competitive. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/.
中文:CodecSlime首次在神经语音编解码器中引入动态帧率方法压缩时间冗余,在保持架构不变的同时显著降低重建误差,并实现质量与比特率的灵活权衡。
English: CodecSlime introduces a dynamic frame rate method to compress temporal redundancy in neural speech codecs, significantly reducing reconstruction errors and enabling flexible quality-bitrate trade-offs without architectural changes.
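The underlying idea of spending fewer tokens on steady-state segments can be sketched by merging adjacent encoder frames whose similarity exceeds a threshold; this is a simplification for illustration, not the paper's ScheDFR scheduling algorithm:

```python
import numpy as np

def merge_steady_frames(frames: np.ndarray, threshold: float = 0.98) -> np.ndarray:
    """frames: (T, D) encoder outputs -> (T', D) merged tokens, T' <= T."""
    groups, current = [], [frames[0]]
    for f in frames[1:]:
        prev = current[-1]
        cos = float(f @ prev) / (np.linalg.norm(f) * np.linalg.norm(prev) + 1e-8)
        if cos > threshold:
            current.append(f)                    # steady-state: extend current group
        else:
            groups.append(np.mean(current, axis=0))
            current = [f]
    groups.append(np.mean(current, axis=0))
    return np.stack(groups)                      # long vowels/silences use fewer tokens
```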
Authors:Ankit Shah, Rita Singh, Bhiksha Raj, Alexander Hauptmann
Abstract:
The escalating rates of gun-related violence and mass shootings represent a significant threat to public safety. Timely and accurate information for law enforcement agencies is crucial in mitigating these incidents. Current commercial gunshot detection systems, while effective, often come with prohibitive costs. This research explores a cost-effective alternative by leveraging acoustic analysis of gunshot recordings, potentially obtainable from ubiquitous devices like cell phones, to not only detect gunshots but also classify the type of firearm used. This paper details a study on deciphering gun type hierarchies using a curated dataset of 3459 recordings. We investigate the fundamental acoustic characteristics of gunshots, including muzzle blasts and shockwaves, which vary based on firearm type, ammunition, and shooting direction. We propose and evaluate machine learning frameworks, including Support Vector Machines (SVMs) as a baseline and a more advanced Convolutional Neural Network (CNN) architecture for joint gunshot detection and gun type classification. Results indicate that our deep learning approach achieves a mean average precision (mAP) of 0.58 on clean labeled data, outperforming the SVM baseline (mAP 0.39). Challenges related to data quality, environmental noise, and the generalization capabilities when using noisy web-sourced data (mAP 0.35) are also discussed. The long-term vision is to develop a highly accurate, real-time system deployable on common recording devices, significantly reducing detection costs and providing critical intelligence to first responders.
Chinese: 本研究开发了一种经济高效的声学分析系统,利用机器学习从录音中检测枪声并分类枪支类型,通过深度学习提升了准确率,同时针对环境噪声等挑战进行优化,旨在未来实现在普通设备上的实时部署。
English: This research develops a cost-effective acoustic analysis system using machine learning to detect gunshots and classify firearm types from recordings, achieving improved accuracy with deep learning while addressing challenges like environmental noise for future real-time deployment on common devices.
Authors:Rui An, Yifeng Zhang, Ziran Liang, Wenqi Fan, Yuxuan Liang, Xuequn Shang, Qing Li
Abstract:
Training urban spatio-temporal foundation models that generalize well across diverse regions and cities is critical for deploying urban services in unseen or data-scarce regions. Recent studies have typically focused on fusing cross-domain spatio-temporal data to train unified Transformer-based models. However, these models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment. Inspired by the efficiency of Mamba, a state space model with linear time complexity, we explore its potential for efficient urban spatio-temporal prediction. However, directly applying Mamba as a spatio-temporal backbone leads to negative transfer and severe performance degradation. This is primarily due to spatio-temporal heterogeneity and the recursive mechanism of Mamba's hidden state updates, which limit cross-domain generalization. To overcome these challenges, we propose Damba-ST, a novel domain-adaptive Mamba-based model for efficient urban spatio-temporal prediction. Damba-ST retains Mamba's linear complexity advantage while significantly enhancing its adaptability to heterogeneous domains. Specifically, we introduce two core innovations: (1) a domain-adaptive state space model that partitions the latent representation space into a shared subspace for learning cross-domain commonalities and independent, domain-specific subspaces for capturing intra-domain discriminative features; (2) three distinct Domain Adapters, which serve as domain-aware proxies to bridge disparate domain distributions and facilitate the alignment of cross-domain commonalities. Extensive experiments demonstrate the generalization and efficiency of Damba-ST. It achieves state-of-the-art performance on prediction tasks and demonstrates strong zero-shot generalization, enabling seamless deployment in new urban environments without extensive retraining or fine-tuning.
中文: 针对城市时空基础模型在跨域泛化和计算效率上的挑战,提出的Damba-ST模型通过域自适应状态空间划分和域适配器设计,在保持线性复杂度的同时实现了零样本跨城市高效预测。
English: Urban spatio-temporal foundation models face scalability and generalization challenges due to high computational complexity and domain heterogeneity, which the proposed Damba-ST model addresses with domain-adaptive state space partitioning and domain adapters to achieve efficient, zero-shot generalization across cities.
Authors:Rita Singh, Bhiksha Raj
Abstract:
Voice is increasingly being used as a biometric entity in many applications. These range from speaker identification and verification systems to human profiling technologies that attempt to estimate myriad aspects of the speaker's persona from their voice. However, for an entity to be a true biometric identifier, it must be unique. This paper establishes a first framework for calculating the uniqueness of human voice objectively. The approach in this paper is based on statistical considerations that take into account a set of measurable characteristics of the voice signal that bear a causal relationship to the vocal production process, but are not inter-dependent or derivable from each other. Depending on how we quantize these variables, we show that the chances of two people having the same voice in a world populated by 10 billion people range from one in a few thousand, to one in a septillion or less. The paper also discusses the implications of these calculations on the choices made in voice processing applications.
中文: 本文首次建立了客观评估人声独特性作为生物识别标识的框架,研究表明在全球百亿人口中两人拥有相同声音的概率可从数千分之一低至十万亿亿分之一,并探讨了该结论对语音处理应用的影响。
English: This paper introduces a groundbreaking framework for objectively assessing the uniqueness of human voice as a biometric identifier, demonstrating that the probability of two individuals sharing identical vocal characteristics can range from one in thousands to one in a septillion among a global population of 10 billion.
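The quoted range follows from how finely the independent voice characteristics are quantized. A toy calculation, assuming k independent features uniformly quantized into q levels each (illustrative numbers, not the paper's feature set):

```python
# With k independent voice features, each quantized into q uniform levels, two
# specific people share a voice with probability q**-k. The quantization
# granularity alone moves the answer across the paper's reported range.
def collision_probability(k_features: int, q_levels: int) -> float:
    return q_levels ** -k_features

# Coarse quantization: 6 features x 4 levels -> about one in a few thousand.
print(1 / collision_probability(6, 4))       # 4096.0
# Fine quantization: 6 features x 10_000 levels -> about one in a septillion.
print(1 / collision_probability(6, 10_000))  # 1e+24
```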
Authors:Zhuang Chen, Yaru Cao, Guanqun Bi, Jincenzi Wu, Jinfeng Zhou, Xiyao Xiao, Si Chen, Hongning Wang, Minlie Huang
Abstract:
Emotional support conversation (ESC) helps reduce people's psychological stress and provide emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus whose quality can even surpass that of crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.
中文摘要:SocialSim是一种创新框架,通过整合社会互动中的自我表露与社会意识来优化情感支持对话,构建出质量超越众包数据的大规模合成语料库。
English Summary: SocialSim is a novel framework that enhances emotional support conversations by integrating social disclosure and social awareness, creating a high-quality synthetic corpus that outperforms crowdsourced data in effectiveness.
Authors:Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Zhixiang Wang, Yuwei Wu, Tong He, Jiangmiao Pang, Yu Qiao, Yunde Jia, Kaipeng Zhang
Abstract:
Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning ``world'' in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We then use a subset to train an interactive video world exploration model, named YUME (meaning ``dream'' in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. The project page is https://lixsp11.github.io/sekai-project/.
Chinese: Sekai 数据集通过提供超过5000小时的带标注第一人称全球视频,解决了现有视频生成数据集的局限性,为训练如YUME等交互式世界探索模型奠定了基础。
English: The Sekai dataset addresses limitations in current video generation datasets by providing over 5,000 hours of annotated first-person global footage, enabling the training of interactive world exploration models like YUME.
Authors:Zekai Ye, Qiming Li, Xiaocheng Feng, Libo Qin, Yichong Huang, Baohang Li, Kui Jiang, Yang Xiang, Zhirui Zhang, Yunfei Lu, Duyu Tang, Dandan Tu, Bing Qin
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to address this rely on pretraining or fine-tuning, which are resource-intensive. In this paper, inspired by observed disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel, nearly training-free method that aligns attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to facilitate cross-lingual visual perception capability alignment. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.
Chinese: 大型视觉语言模型存在多语言物体幻觉问题,而提出的CLAIM方法通过对齐跨语言注意力模式,无需大量重新训练即可有效缓解该问题,在多个基准测试中取得了显著改进。
English: Large Vision-Language Models suffer from multilingual object hallucination, but the proposed CLAIM method effectively mitigates this issue by aligning cross-lingual attention patterns without requiring extensive retraining, achieving significant improvements across multiple benchmarks.
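The intervention step can be sketched as a forward hook that adds a precomputed language shift vector to the output of a selected attention head; the module path, head index, and shift estimation below are illustrative assumptions, not CLAIM's exact procedure:

```python
import torch

def make_head_shift_hook(head_idx: int, n_heads: int, shift: torch.Tensor):
    """Add a language shift vector to one attention head's output in place."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (B, T, D)
        d_head = hidden.shape[-1] // n_heads
        hidden[..., head_idx * d_head:(head_idx + 1) * d_head] += shift
    return hook

# Usage sketch (module path and numbers are placeholders for a real LVLM):
# shift = (en_head_acts - tgt_head_acts).mean(dim=0)   # estimated offline
# layer = model.language_model.model.layers[15].self_attn
# handle = layer.register_forward_hook(make_head_shift_hook(7, 32, shift))
```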
Authors:Massa Baali, Shuo Han, Syed Abdul Hannan, Purusottam Samal, Karanveer Singh, Soham Deshmukh, Rita Singh, Bhiksha Raj
Abstract:
Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning. This allows for the creation of detailed captions based on speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition.
Chinese Summary: 本文提出的CoLMbo说话人语言模型通过集成说话人编码器和提示调节机制,能够生成包含方言、年龄等特征的详细说话人描述,突破了传统说话人识别系统仅能分类的局限,实现了零样本场景下的自适应分析。
English Summary: This paper introduces CoLMbo, a Speaker Language Model that overcomes the limitations of traditional speaker recognition systems by generating detailed speaker descriptions and adapting to new characteristics through prompt-based conditioning, significantly advancing the field.
Authors:Xinyuan Wang, Dongjie Wang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Sixun Dong, Kunpeng Liu, Yanjie Fu
Abstract:
Reasoning is a key component of language understanding in Large Language Models. While Chain-of-Thought prompting enhances performance via explicit intermediate steps, it suffers from substantial token overhead and a fixed reasoning trajectory, preventing step-wise refinement. Recent advances in latent reasoning address these limitations by refining internal reasoning processes directly in the model's latent space, without producing explicit outputs. However, a key challenge remains: how to effectively update reasoning embeddings during post-training to guide the model toward more accurate solutions. To overcome this challenge, we propose a lightweight post-training framework that refines latent reasoning trajectories using two novel strategies: 1) Contrastive reasoning feedback, which compares reasoning embeddings against strong and weak baselines to infer effective update directions via embedding enhancement; 2) Residual embedding refinement, which stabilizes updates by progressively integrating current and historical gradients, enabling fast yet controlled convergence. Extensive experiments and case studies on five reasoning benchmarks demonstrate the effectiveness of the proposed framework; notably, it yields a 5\% accuracy gain on MathQA without additional training.
中文摘要:本文提出了一种轻量级后训练框架,通过对比推理反馈和残差嵌入优化来改进大语言模型的潜在推理能力,无需额外训练即可显著提升准确性。
English Summary: This paper introduces a lightweight post-training framework that enhances latent reasoning in Large Language Models through contrastive feedback and residual embedding refinement, achieving notable accuracy improvements without extra training.
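A minimal sketch of the two update strategies, with illustrative names and hyperparameters (the actual framework operates on model-internal reasoning embeddings and its own update schedule):

```python
import torch

def refine_latent(z: torch.Tensor, strong: torch.Tensor, weak: torch.Tensor,
                  history: torch.Tensor, lr: float = 0.1, beta: float = 0.7):
    direction = strong - weak                        # contrastive reasoning feedback
    update = beta * direction + (1 - beta) * history # residual embedding refinement
    return z + lr * update, update                   # new embedding, new history

z, history = torch.zeros(768), torch.zeros(768)
strong, weak = torch.randn(768), torch.randn(768)    # strong/weak baseline embeddings
for _ in range(5):                                   # a few controlled refinement steps
    z, history = refine_latent(z, strong, weak, history)
```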
Authors:Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Min Zhang
Abstract:
Long document understanding has become increasingly crucial in natural language processing, with retrieval-based methods emerging as a promising solution to address the context length limitations of large language models (LLMs). However, existing approaches either treat documents as flat sequences or employ arbitrary chunking strategies, failing to capture the inherent discourse structure that guides human comprehension. We present DISRetrieval, a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding. Our approach introduces three key innovations: (1) a discourse-aware document organization framework that utilizes rhetorical structure theory (RST) to create sentence-level hierarchical representations, preserving both semantic relationships and natural document flow; (2) an LLM-enhanced node representation technique that combines discourse structure with adaptive summarization to enrich tree nodes with contextual information; and (3) a hierarchical evidence retrieval mechanism that effectively selects relevant content while maintaining discourse coherence. Through comprehensive experiments on QASPER and QuALITY datasets, DISRetrieval demonstrates substantial improvements over existing methods in both token-level retrieval metrics and downstream question answering tasks. Our ablation studies confirm that incorporating discourse structure significantly enhances retrieval effectiveness across different document lengths and query types, validating the importance of linguistically-informed document representation in long-text understanding. Our code and datasets are publicly available at github/DreamH1gh/DISRetrieval to facilitate future research.
中文摘要:本研究提出了一种基于修辞结构理论的语篇感知分层框架,通过将语篇树转化为增强型表征并采用结构引导检索方法,显著提升了长文档问答系统的性能,在多个数据集上均取得了稳定改进。
English Summary: This study introduces a discourse-aware hierarchical framework that uses rhetorical structure theory to improve long document question answering by converting discourse trees into enhanced representations and employing structure-guided retrieval, demonstrating consistent performance gains across multiple datasets.
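The structure-guided retrieval step can be sketched as a beam-limited descent over the discourse tree, assuming the LLM-enhanced node embeddings are precomputed; tree construction and scoring details are simplified here:

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Node:
    embedding: np.ndarray                 # LLM-enhanced node representation
    sentence: str = ""                    # non-empty only at leaves
    children: List["Node"] = field(default_factory=list)

def retrieve(query_emb: np.ndarray, root: Node,
             beam: int = 2, budget: int = 8) -> List[str]:
    frontier, evidence = [root], []
    while frontier and len(evidence) < budget:
        node = frontier.pop(0)
        if not node.children:             # leaf: keep the sentence as evidence
            evidence.append(node.sentence)
            continue
        scored = sorted(node.children,
                        key=lambda c: -float(query_emb @ c.embedding))
        frontier.extend(scored[:beam])    # expand only best-matching subtrees,
    return evidence                       # preserving discourse coherence
```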
Authors:Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Min Zhang
Abstract:
Long document question answering systems typically process texts as flat sequences or use arbitrary segmentation, failing to capture discourse structures that guide human comprehension. We present a discourse-aware hierarchical framework that leverages rhetorical structure theory (RST) to enhance long document question answering. Our approach converts discourse trees into sentence-level representations and employs LLM-enhanced node representations to bridge structural and semantic information. The framework involves three key innovations: specialized discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval. Comprehensive experiments on QASPER, QuALITY, and NarrativeQA demonstrate consistent improvements over existing approaches. Ablation studies confirm that incorporating discourse structure significantly enhances question answering across diverse document types.
中文摘要:本研究提出了一种基于修辞结构理论的语篇感知分层框架,通过将语篇树转化为增强型表征并采用结构引导检索方法,显著提升了长文档问答系统的性能,在多个数据集上均取得了稳定改进。
English Summary: This study introduces a discourse-aware hierarchical framework that uses rhetorical structure theory to improve long document question answering by converting discourse trees into enhanced representations and employing structure-guided retrieval, demonstrating consistent performance gains across multiple datasets.
Authors:Yunqing Liu, Wenqi Fan, Xiaoyong Wei, Qing Li
Abstract:
Proteins are central to biological systems, participating as building blocks across all forms of life. Despite advancements in understanding protein functions through protein sequence analysis, there remains potential for further exploration in integrating protein structural information. We argue that the structural information of proteins is not only limited to their 3D information but also encompasses information from amino acid molecules (local information) to protein-protein structure similarity (global information). To address this, we propose \textbf{GLProtein}, the first framework in protein pre-training that incorporates both global structural similarity and local amino acid details to enhance prediction accuracy and functional insights. GLProtein innovatively combines protein-masked modelling with triplet structure similarity scoring, protein 3D distance encoding and substructure-based amino acid molecule encoding. Experimental results demonstrate that GLProtein outperforms previous methods in several bioinformatics tasks, including predicting protein-protein interaction, contact prediction, and so on.
中文摘要:GLProtein是首个结合全局结构相似性与局部氨基酸细节的蛋白质预训练框架,通过整合蛋白质三维距离编码和分子编码,在多个生物信息学任务中表现优于现有方法。
English Summary: GLProtein is a novel protein pre-training framework that integrates both global structural similarity and local amino acid details to improve prediction accuracy in bioinformatics tasks, outperforming previous methods.
Authors:Xinxin Li, Huiyao Chen, Chengjun Liu, Jing Li, Meishan Zhang, Jun Yu, Min Zhang
Abstract:
Semantic role labeling (SRL) is a crucial task of natural language processing (NLP). Although generative decoder-based large language models (LLMs) have achieved remarkable success across various NLP tasks, they still lag behind state-of-the-art encoder-decoder (BERT-like) models in SRL. In this work, we seek to bridge this gap by equipping LLMs for SRL with two mechanisms: (a) retrieval-augmented generation and (b) self-correction. The first mechanism enables LLMs to leverage external linguistic knowledge such as predicate and argument structure descriptions, while the second allows LLMs to identify and correct inconsistent SRL outputs. We conduct extensive experiments on three widely-used benchmarks of SRL (CPB1.0, CoNLL-2009, and CoNLL-2012). Results demonstrate that our method achieves state-of-the-art performance in both Chinese and English, marking the first successful application of LLMs to surpass encoder-decoder approaches in SRL.
Chinese: 本研究通过引入检索增强生成与自我修正机制,弥补了生成式大语言模型在语义角色标注任务中与编码器-解码器模型的性能差距,在多个基准测试中实现了最先进的成果。
English: This study bridges the performance gap in semantic role labeling between generative large language models and encoder-decoder models by introducing retrieval-augmented generation and self-correction mechanisms, achieving state-of-the-art results across multiple benchmarks.
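A minimal sketch of the two mechanisms, assuming a generic `llm` completion function and a `frame_bank` of predicate descriptions (e.g., PropBank-style rolesets); the prompts are illustrative, not the paper's:

```python
from typing import Callable, Dict

def label_roles(llm: Callable[[str], str], sentence: str,
                predicate: str, frame_bank: Dict[str, str]) -> str:
    # Retrieval-augmented generation: pull the predicate's argument structure.
    frame = frame_bank.get(predicate, "No frame description available.")
    draft = llm(f"Frame for '{predicate}': {frame}\n"
                f"Sentence: {sentence}\n"
                "Label each argument span with its semantic role.")
    # Self-correction: check the draft against the frame and fix inconsistencies
    # (e.g., duplicated core arguments or roles absent from the frame).
    return llm(f"Frame: {frame}\nSentence: {sentence}\nDraft labels:\n{draft}\n"
               "Identify any role assignments inconsistent with the frame "
               "and output a corrected labeling.")
```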
Authors:Yuxin Wen, Yangsibo Huang, Tom Goldstein, Ravi Kumar, Badih Ghazi, Chiyuan Zhang
Abstract:
Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. At the end, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.
中文: 本研究探索视觉语言模型中的跨模态记忆特性,发现知识虽能在模态间迁移,但在多种场景下源模态与目标模态间仍存在显著性能差距,并提出一种基线方法来应对这一挑战。
English: This study investigates cross-modality memorization in vision-language models, revealing that while knowledge transfers between modalities, a significant performance gap persists across various scenarios, and proposes a baseline method to address this challenge.
Authors:Daogao Liu, Edith Cohen, Badih Ghazi, Peter Kairouz, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi, Adam Sealfon, Da Yu, Chiyuan Zhang
Abstract:
We introduce $Urania$, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, $Urania$ provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.
中文: Urania是一种新颖框架,通过私有聚类和创新关键词提取方法,在严格差分隐私保护下生成LLM聊天机器人交互的洞察,有效平衡了数据效用与隐私保护。
English: Urania is a novel framework that generates insights from LLM chatbot interactions with rigorous differential privacy guarantees, effectively balancing data utility with privacy preservation through private clustering and innovative keyword extraction methods.
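The histogram-based summarization step can be sketched with a Laplace mechanism plus thresholded release; the epsilon and threshold values are illustrative, and a real pipeline would need end-to-end privacy accounting as in the paper:

```python
import numpy as np
from collections import Counter

def dp_keyword_histogram(user_keywords: list[set[str]], epsilon: float = 1.0,
                         threshold: float = 10.0) -> dict[str, float]:
    # One contribution per user per keyword, so each count has sensitivity 1.
    counts = Counter(kw for kws in user_keywords for kw in kws)
    rng = np.random.default_rng()
    released = {}
    for kw, c in counts.items():
        noisy = c + rng.laplace(scale=1.0 / epsilon)  # Laplace mechanism
        if noisy >= threshold:                        # crude partition selection:
            released[kw] = noisy                      # suppress rare keywords
    return released
```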
Authors:Pengfei He, Zhenwei Dai, Xianfeng Tang, Yue Xing, Hui Liu, Jingying Zeng, Qiankun Peng, Shrivats Agrawal, Samarth Varshney, Suhang Wang, Jiliang Tang, Qi He
Abstract:
Large Language Model-based Multi-Agent Systems (LLM-MAS) have demonstrated strong capabilities in solving complex tasks but remain vulnerable when agents receive unreliable messages. This vulnerability stems from a fundamental gap: LLM agents treat all incoming messages equally without evaluating their trustworthiness. While some existing studies address trustworthiness, they focus on a single type of harmfulness rather than analyzing it holistically from multiple trustworthiness perspectives. In this work, we propose Attention Trust Score (A-Trust), a lightweight, attention-based method for evaluating message trustworthiness. Inspired by the human communication literature [1], and through systematically analyzing attention behaviors across six orthogonal trust dimensions, we find that certain attention heads in the LLM specialize in detecting specific types of violations. Leveraging these insights, A-Trust directly infers trustworthiness from internal attention patterns without requiring external prompts or verifiers. Building upon A-Trust, we develop a principled and efficient trust management system (TMS) for LLM-MAS, enabling both message-level and agent-level trust assessment. Experiments across diverse multi-agent settings and tasks demonstrate that applying our TMS significantly enhances robustness against malicious inputs.
中文: 针对LLM-MAS易受不可靠消息影响的问题,我们提出基于注意力机制的轻量级A-Trust方法,通过分析六个信任维度的内部注意力模式来评估消息可信度,有效提升了系统对抗恶意输入的鲁棒性。
English: LLM-MAS are vulnerable to unreliable messages, so we propose A-Trust, a lightweight attention-based method that evaluates message trustworthiness by analyzing internal attention patterns across six trust dimensions, enhancing system robustness against malicious inputs.
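A sketch of how a trust signal might be read off attention weights, assuming a precomputed list of specialized (layer, head) pairs; the head selection, aggregation, and calibration are assumptions, not the paper's exact scoring rule:

```python
import torch

def attention_trust_score(attn: torch.Tensor, msg_slice: slice,
                          trust_heads: list[tuple[int, int]]) -> float:
    """attn: (layers, heads, T, T) attention weights from one forward pass."""
    per_head = [attn[layer, head, -1, msg_slice].sum().item()  # mass each
                for layer, head in trust_heads]                # specialized head
    return sum(per_head) / len(per_head)   # places on the message span, averaged

# Usage sketch: attn comes from a forward pass with attentions exposed, stacked
# per layer; the TMS would compare this raw score against a calibrated threshold
# to down-weight suspect messages and their senders.
```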
Authors:Yixuan Hou, Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Abstract:
Thanks to the steady progress of large language models (LLMs), speech encoding algorithms and vocoder structure, recent advancements have enabled generating speech response directly from a user instruction. However, benchmarking the generated speech quality has been a neglected but critical issue, considering the shift from the pursuit of semantic accuracy to vivid and spontaneous speech flow. Previous evaluation focused on the speech-understanding ability, lacking a quantification of acoustic quality. In this paper, we propose Speech cOnversational Voice Assistant Benchmark (SOVA-Bench), providing a comprehension comparison of the general knowledge, speech recognition and understanding, along with both semantic and acoustic generative ability between available speech LLMs. To the best of our knowledge, SOVA-Bench is one of the most systematic evaluation frameworks for speech LLMs, inspiring the direction of voice interaction systems.
Chinese: 随着语音生成技术的进步,SOVA-Bench应运而生,它作为首个系统评估语音大语言模型的框架,全面检验其语义理解与声学生成能力,推动语音交互系统的发展。
English: Recent advances in speech generation from user instructions highlight the need for comprehensive benchmarks, leading to the introduction of SOVA-Bench, a systematic framework evaluating both semantic and acoustic capabilities of speech large language models.
Authors:Yangfan Ye, Xiaocheng Feng, Zekun Yuan, Xiachong Feng, Libo Qin, Lei Huang, Weitao Ma, Yichong Huang, Zhirui Zhang, Yunfei Lu, Xiaohui Yan, Duyu Tang, Dandan Tu, Bing Qin
Abstract:
Current large language models (LLMs) often exhibit imbalanced multilingual capabilities due to their English-centric training corpora. To address this, existing fine-tuning approaches operating at the data level (e.g., through data augmentation or distillation) typically introduce implicit cross-lingual alignment, overlooking the potential for more profound, latent-level cross-lingual interactions. In this work, we propose CC-Tuning, a novel multilingual fine-tuning paradigm that explicitly establishes a cross-lingual connection mechanism at the latent level. During training, CC-Tuning fuses the feed-forward activations from both English and non-English inputs, enabling the model to benefit from both linguistic resources. This process is facilitated by a trainable Decision Maker that identifies beneficial activations. Furthermore, during inference, a Transform Matrix is utilized to simulate the cross-lingual connection under a monolingual setting through representation transformation. Our experiments on six benchmarks covering 22 languages show that CC-Tuning outperforms vanilla SFT and offers a strong latent-level alternative to data-level augmentation methods. Further analysis also highlights the practicality of CC-Tuning and the potential of latent-level cross-lingual interactions in advancing the multilingual performance of LLMs.
中文:CC-Tuning提出了一种新颖的多语言微调方法,通过在潜在层面融合英语和非英语输入的激活值来增强跨语言交互,其性能优于标准方法,并展示了更深层次语言整合的潜力。
English: CC-Tuning introduces a novel multilingual fine-tuning method that enhances cross-lingual interactions at the latent level by fusing activations from English and non-English inputs, outperforming standard approaches and demonstrating the potential of deeper linguistic integration.
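The latent-level fusion can be sketched as a learned gate over parallel feed-forward activations; the gate stands in for the trainable Decision Maker, and the inference-time Transform Matrix is omitted:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, 1)  # Decision Maker stand-in

    def forward(self, h_en: torch.Tensor, h_tgt: torch.Tensor) -> torch.Tensor:
        # h_en, h_tgt: (B, T, D) FFN activations from the two parallel passes.
        g = torch.sigmoid(self.gate(torch.cat([h_en, h_tgt], dim=-1)))  # (B, T, 1)
        return g * h_en + (1 - g) * h_tgt      # keep the beneficial activations

fusion = GatedFusion(d_model=768)
out = fusion(torch.randn(2, 16, 768), torch.randn(2, 16, 768))
```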
Authors:Dongyue Wu, Zilin Guo, Jialong Zuo, Nong Sang, Changxin Gao
Abstract:
The ever-growing size of training datasets enhances the generalization capability of modern machine learning models but also incurs exorbitant computational costs. Existing data pruning approaches aim to accelerate training by removing those less important samples. However, they often rely on gradients or proxy models, leading to prohibitive additional costs of gradient back-propagation and proxy model training. In this paper, we propose Partial Forward Blocking (PFB), a novel framework for lossless training acceleration. The efficiency of PFB stems from its unique adaptive pruning pipeline: sample importance is assessed based on features extracted from the shallow layers of the target model. Less important samples are then pruned, allowing only the retained ones to proceed with the subsequent forward pass and loss back-propagation. This mechanism significantly reduces the computational overhead of deep-layer forward passes and back-propagation for pruned samples, while also eliminating the need for auxiliary backward computations and proxy model training. Moreover, PFB introduces probability density as an indicator of sample importance. Combined with an adaptive distribution estimation module, our method dynamically prioritizes relatively rare samples, aligning with the constantly evolving training state. Extensive experiments demonstrate the significant superiority of PFB in performance and speed. On ImageNet, PFB achieves a 0.5% accuracy improvement and 33% training time reduction with 40% data pruned.
中文: 提出的部分前向阻断(PFB)框架通过基于浅层特征和概率密度自适应剪除次要样本,无需梯度或代理模型即可加速机器学习训练,在提升精度的同时显著降低计算开销。
English: The proposed Partial Forward Blocking (PFB) framework accelerates machine learning training by adaptively pruning less important samples based on shallow-layer features and probability density, eliminating the need for gradients or proxy models while improving accuracy and reducing computational costs.
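A sketch of one PFB training step in PyTorch; the density estimate here (distance to the batch mean) is a crude stand-in for the paper's adaptive distribution estimation module, and the shallow layers are re-run on retained samples to keep the sketch simple:

```python
import torch

def pfb_step(shallow: torch.nn.Module, deep: torch.nn.Module,
             criterion, optimizer, x: torch.Tensor, y: torch.Tensor,
             keep: float = 0.6) -> float:
    with torch.no_grad():                          # cheap shallow pass for scoring
        feats = shallow(x).flatten(1)
        # Crude rarity proxy: samples far from the batch mean are "rarer".
        dist = torch.cdist(feats, feats.mean(0, keepdim=True)).squeeze(1)
    idx = torch.topk(dist, max(1, int(keep * len(x)))).indices  # keep rare samples
    optimizer.zero_grad()
    loss = criterion(deep(shallow(x[idx])), y[idx])  # deep pass only on kept samples
    loss.backward()                                  # pruned samples skip backprop
    optimizer.step()
    return loss.item()
```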
Authors:Hongyan An, Kuan Zhu, Xin He, Haiyun Guo, Chaoyang Zhao, Ming Tang, Jinqiao Wang
Abstract:
Pedestrian attribute recognition (PAR) is a fundamental perception task in intelligent transportation and security. To tackle this fine-grained task, most existing methods focus on extracting regional features to enrich attribute information. However, a regional feature is typically used to predict a fixed set of pre-defined attributes in these methods, which limits the performance and practicality in two aspects: 1) Regional features may compromise fine-grained patterns unique to certain attributes in favor of capturing common characteristics shared across attributes. 2) Regional features cannot generalize to predict unseen attributes in the test time. In this paper, we propose the \textbf{F}ine-grained \textbf{O}ptimization with semanti\textbf{C} g\textbf{U}ided under\textbf{S}tanding (FOCUS) approach for PAR, which adaptively extracts fine-grained attribute-level features for each attribute individually, regardless of whether the attributes are seen or not during training. Specifically, we propose the Multi-Granularity Mix Tokens (MGMT) to capture latent features at varying levels of visual granularity, thereby enriching the diversity of the extracted information. Next, we introduce the Attribute-guided Visual Feature Extraction (AVFE) module, which leverages textual attributes as queries to retrieve their corresponding visual attribute features from the Mix Tokens using a cross-attention mechanism. To ensure that textual attributes focus on the appropriate Mix Tokens, we further incorporate a Region-Aware Contrastive Learning (RACL) method, encouraging attributes within the same region to share consistent attention maps. Extensive experiments on PA100K, PETA, and RAPv1 datasets demonstrate the effectiveness and strong generalization ability of our method.
中文: FOCUS方法通过多粒度混合令牌和跨模态注意力机制,自适应提取细粒度属性特征,克服了固定区域特征的局限性,实现了对未见过属性的泛化能力。
English: The FOCUS method enhances pedestrian attribute recognition by adaptively extracting fine-grained, attribute-level features through multi-granularity token mixing and cross-modal attention, overcoming limitations of fixed regional features and enabling generalization to unseen attributes.
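The Attribute-guided Visual Feature Extraction step can be sketched with a standard cross-attention layer where textual attribute embeddings query the Mix Tokens; shapes and the scoring head are illustrative:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

mix_tokens = torch.randn(4, 196, 512)     # (batch, multi-granularity tokens, dim)
attr_queries = torch.randn(4, 26, 512)    # (batch, attribute text embeddings, dim)

# Each textual attribute retrieves its own visual feature -- unseen attributes
# work too, since any attribute text can form a query.
attr_feats, attn_maps = attn(attr_queries, mix_tokens, mix_tokens)
logits = (attr_feats * attr_queries).sum(-1)   # per-attribute score via dot product
```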
Authors:Yifan Xue, Ruihuai Liang, Bo Yang, Xuelin Cao, Zhiwen Yu, Mérouane Debbah, Chau Yuen
Abstract:
With the rapid development of the low-altitude economy, air-ground integrated multi-access edge computing (MEC) systems are facing increasing demands for real-time and intelligent task scheduling. In such systems, task offloading and resource allocation encounter multiple challenges, including node heterogeneity, unstable communication links, and dynamic task variations. To address these issues, this paper constructs a three-layer heterogeneous MEC system architecture for low-altitude economic networks, encompassing aerial and ground users as well as edge servers. The system is systematically modeled from the perspectives of communication channels, computational costs, and constraint conditions, and the joint optimization problem of offloading decisions and resource allocation is uniformly abstracted into a graph-structured modeling task. On this basis, we propose a graph attention diffusion-based solution generator (GADSG). This method integrates the contextual awareness of graph attention networks with the solution distribution learning capability of diffusion models, enabling joint modeling and optimization of discrete offloading variables and continuous resource allocation variables within a high-dimensional latent space. We construct multiple simulation datasets with varying scales and topologies. Extensive experiments demonstrate that the proposed GADSG model significantly outperforms existing baseline methods in terms of optimization performance, robustness, and generalization across task structures, showing strong potential for efficient task scheduling in dynamic and complex low-altitude economic network environments.
中文摘要:本文提出基于图注意力扩散的解决方案生成器(GADSG),通过融合图注意力网络的上下文感知与扩散模型的解分布学习能力,有效解决了低空经济网络中异构空地MEC系统的任务卸载与资源分配联合优化问题,在多场景实验中展现出卓越性能。
English Summary: This paper introduces a graph attention diffusion-based solution generator (GADSG) that effectively addresses joint optimization of task offloading and resource allocation in heterogeneous air-ground MEC systems for low-altitude economy networks, demonstrating superior performance in dynamic environments through comprehensive experiments.
Authors:Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu
Abstract:
The rapid spread of fake news across multimedia platforms presents serious challenges to information credibility. In this paper, we propose a Debunk-and-Infer framework for Fake News Detection (DIFND) that leverages debunking knowledge to enhance both the performance and interpretability of fake news detection. DIFND integrates the generative strength of conditional diffusion models with the collaborative reasoning capabilities of multimodal large language models (MLLMs). Specifically, debunk diffusion is employed to generate refuting or authenticating evidence based on the multimodal content of news videos, enriching the evaluation process with diverse yet semantically aligned synthetic samples. To improve inference, we propose a chain-of-debunk strategy in which a multi-agent MLLM system produces logic-grounded, multimodal-aware reasoning and a final veracity judgment. By jointly modeling multimodal features, generative debunking cues, and reasoning-rich verification within a unified architecture, DIFND achieves notable improvements in detection accuracy. Extensive experiments on the FakeSV and FVC datasets show that DIFND not only outperforms existing approaches but also delivers trustworthy decisions.
Chinese: 提出的虚假新闻检测去伪推演框架(DIFND)融合生成扩散模型与多模态推理,通过生成合成证据和逻辑验证来提升检测准确性与可解释性。
English: The proposed Debunk-and-Infer framework for Fake News Detection (DIFND) integrates generative diffusion models with multimodal reasoning to enhance detection accuracy and interpretability by producing synthetic evidence and logic-grounded judgments.
Authors:Mengqi Wang, Tiantian Feng, Shrikanth Narayanan
Abstract:
Large language models (LLMs) have enabled a wide variety of real-world applications across domains. However, creating a high-performing application remains challenging, particularly for subjective tasks like emotion recognition. Inspired by the SLT 2024 GenSER Challenge, this study investigates approaches to improving conversational emotion recognition (CER) with LLMs. Specifically, we explore how to retrieve high-quality examples for in-context learning (ICL) to enhance CER. We propose various strategies based on random and augmented example retrieval and also analyze the impact of conversational context on CER accuracy. Experiments were conducted on three datasets: IEMOCAP, MELD, and EmoryNLP. The results show that augmented example retrieval consistently outperforms the other techniques under investigation across all datasets, highlighting the importance of retrieving coherent, targeted examples and enhancing them through paraphrasing.
中文摘要:本研究通过探索上下文学习中高质量示例的检索方法,改进了大型语言模型在对话情绪识别中的表现,实验表明增强示例检索在多个数据集上均优于其他技术。
English Summary: This study explores methods to improve conversational emotion recognition using large language models by investigating high-quality example retrieval in in-context learning, with experiments showing augmented example retrieval consistently outperforms other techniques across multiple datasets.
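A minimal sketch of the retrieval idea, assuming utterances are already embedded by some sentence encoder (the random stand-in embeddings and labels below are illustrative; augmented retrieval as in the paper would additionally paraphrase the retrieved examples):

```python
import numpy as np

def retrieve_icl_examples(query_emb, pool_embs, pool_labels, k=4):
    """Rank a labeled pool by cosine similarity to the query utterance
    embedding and keep the top-k as in-context demonstrations."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    top = np.argsort(-sims)[:k]
    return [(int(i), pool_labels[i], float(sims[i])) for i in top]

# Hypothetical usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
pool = rng.normal(size=(100, 384))
labels = rng.choice(["happy", "sad", "angry", "neutral"], size=100)
print(retrieve_icl_examples(rng.normal(size=384), pool, labels, k=4))
```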
Authors:Junwei Zhou, Xueting Li, Lu Qi, Ming-Hsuan Yang
Abstract:
Existing 4D synthesis methods primarily focus on object-level generation or dynamic scene synthesis with limited novel views, restricting their ability to generate multi-view consistent and immersive dynamic 4D scenes. To address these constraints, we propose a framework (dubbed CoCo4D) for generating detailed dynamic 4D scenes from text prompts, with the option to include images. Our method leverages the crucial observation that articulated motion typically characterizes foreground objects, whereas background alterations are less pronounced. Consequently, CoCo4D divides 4D scene synthesis into two responsibilities: modeling the dynamic foreground and creating the evolving background, both directed by a reference motion sequence. Given a text prompt and an optional reference image, CoCo4D first generates an initial motion sequence utilizing video diffusion models. This motion sequence then guides the synthesis of both the dynamic foreground object and the background using a novel progressive outpainting scheme. To ensure seamless integration of the moving foreground object within the dynamic background, CoCo4D optimizes a parametric trajectory for the foreground, resulting in realistic and coherent blending. Extensive experiments show that CoCo4D achieves comparable or superior performance in 4D scene generation compared to existing methods, demonstrating its effectiveness and efficiency. More results are presented on our website https://colezwhy.github.io/coco4d/.
中文: 现有4D合成方法难以生成多视角一致的动态场景,而CoCo4D通过运动序列引导的前景与背景分离建模,从文本或图像中创造出沉浸式4D场景。
English: Current 4D synthesis methods are limited in generating multi-view consistent dynamic scenes, but CoCo4D addresses this by separating foreground and background modeling guided by motion sequences to produce immersive 4D scenes from text or images.
Authors:Xiaoyuan Wang, Yizhou Zhao, Botao Ye, Xiaojun Shan, Weijie Lyu, Lu Qi, Kelvin C. K. Chan, Yinxiao Li, Ming-Hsuan Yang
Abstract:
We propose HoliGS, a novel deformable Gaussian splatting framework that addresses embodied view synthesis from long monocular RGB videos. Unlike prior 4D Gaussian splatting and dynamic NeRF pipelines, which struggle with training overhead in minute-long captures, our method leverages invertible Gaussian splatting deformation networks to reconstruct large-scale, dynamic environments accurately. Specifically, we decompose each scene into a static background plus time-varying objects, each represented by learned Gaussian primitives undergoing global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations via an invertible neural flow. By attaching Gaussians to a complete canonical foreground shape, this hierarchical warping strategy enables robust free-viewpoint novel-view rendering from various embodied camera trajectories (e.g., egocentric or third-person follow), which may involve substantial viewpoint changes and interactions between multiple actors. Our experiments demonstrate that HoliGS achieves superior reconstruction quality on challenging datasets while significantly reducing both training and rendering time compared to state-of-the-art monocular deformable NeRFs. These results highlight a practical and scalable solution for EVS in real-world scenarios. The source code will be released.
中文摘要:HoliGS是一种新颖的可变形高斯泼溅框架,能够从单目RGB视频高效重建大规模动态场景,相比现有技术在显著降低计算成本的同时实现了更优的重建质量。
English Summary: HoliGS is a deformable Gaussian splatting framework that efficiently reconstructs large-scale dynamic scenes from monocular videos, achieving superior quality with reduced computational costs compared to existing methods.
Authors:Xiangzhao Hao, Kuan Zhu, Hongyu Guo, Haiyun Guo, Ning Jiang, Quan Lu, Ming Tang, Jinqiao Wang
Abstract:
Using natural language to query visual information is a fundamental need in real-world applications. Text-Image Retrieval (TIR) retrieves a target image from a gallery based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object within a given image using an instance-level description. However, real-world applications often present more complex demands. Users typically query an instance-level description across a large gallery and expect to receive both the relevant image and the corresponding instance location. In such scenarios, TIR struggles with fine-grained descriptions and object-level localization, while REC is limited in its ability to efficiently search large galleries and lacks an effective ranking mechanism. In this paper, we introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both instance-level retrieval and localization based on fine-grained referring expressions. First, we propose a large-scale benchmark for REIR, named REIRCOCO, constructed by prompting advanced vision-language models to generate high-quality referring expressions for instances in the MSCOCO and RefCOCO datasets. Second, we present a baseline method, Contrastive Language-Instance Alignment with Relation Experts (CLARE), which employs a dual-stream architecture to address REIR in an end-to-end manner. Given a referring expression, the textual branch encodes it into a query embedding. The visual branch detects candidate objects and extracts their instance-level visual features. The most similar candidate to the query is selected for bounding box prediction. CLARE is first trained on object detection and REC datasets to establish initial grounding capabilities, then optimized via Contrastive Language-Instance Alignment (CLIA) for improved retrieval across images. We will release our code and benchmark publicly.
中文: 本文提出了名为“指代表达实例检索(REIR)”的新任务,通过细粒度指代表达同时实现实例级检索与定位,并构建了大规模基准数据集和基线方法来解决这一复杂实际需求。
English: The paper introduces a new task called Referring Expression Instance Retrieval (REIR), which enables both instance-level retrieval and localization using fine-grained referring expressions, and proposes a large-scale benchmark and a baseline method to address this complex real-world need.
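The CLIA objective can be pictured as a standard InfoNCE loss between expression embeddings and instance embeddings; the sketch below is a generic formulation under that assumption, not the exact CLARE loss.

```python
import torch
import torch.nn.functional as F

def clia_loss(query_emb, instance_embs, pos_idx, temperature=0.07):
    """Each referring-expression embedding is pulled toward its matching
    instance embedding and pushed away from all other candidate instances.
    Standard InfoNCE over cosine similarities."""
    q = F.normalize(query_emb, dim=-1)        # (B, D) text queries
    v = F.normalize(instance_embs, dim=-1)    # (N, D) detected instances
    logits = q @ v.t() / temperature          # (B, N) similarity scores
    return F.cross_entropy(logits, pos_idx)   # pos_idx: (B,) matching ids

loss = clia_loss(torch.randn(8, 256), torch.randn(32, 256),
                 torch.randint(0, 32, (8,)))
print(float(loss))
```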
Authors:Fan Yang, Yousong Zhu, Xin Li, Yufei Zhan, Hongyin Zhao, Shurong Zheng, Yaowei Wang, Ming Tang, Jinqiao Wang
Abstract:
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat "what to see" and "how to edit" separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder. This strategy aligns visual encoding, segmentation, and generation modules, effectively bridging segmentation-aware perception with fine-grained visual synthesis. Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.
中文: FOCUS作为一种统一的大型视觉语言模型,通过端到端框架将分割感知与可控的以对象为中心的生成相结合,在多模态理解、分割精度和图像生成任务中通过联合优化展现出卓越性能。
English: FOCUS is a unified Large Vision Language Model that integrates segmentation-aware perception and controllable object-centric generation in an end-to-end framework, achieving strong performance in multimodal understanding, segmentation accuracy, and image generation through joint optimization.
Authors:Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Weinan E, Siheng Chen
Abstract:
As AI capabilities advance toward and potentially beyond human-level performance, a natural transition emerges where AI-driven development becomes more efficient than human-centric approaches. A promising pathway toward this transition lies in AI-for-AI (AI4AI), which leverages AI techniques to automate and optimize the design, training, and deployment of AI systems themselves. While LLM-based agents have shown the potential to realize AI4AI, they are often unable to fully leverage the experience accumulated during the exploration of solutions in the reasoning process, leading to inefficiencies and suboptimal performance. To address this limitation, we propose ML-Master, a novel AI4AI agent that seamlessly integrates exploration and reasoning by employing a selectively scoped memory mechanism. This approach allows ML-Master to efficiently combine diverse insights from parallel solution trajectories with analytical reasoning, guiding further exploration without overwhelming the agent with excessive context. We evaluate ML-Master on the MLE-Bench, where it achieves a 29.3% average medal rate, significantly surpassing existing methods, particularly in medium-complexity tasks, while accomplishing this superior performance within a strict 12-hour time constraint, half the 24-hour limit used by previous baselines. These results demonstrate ML-Master's potential as a powerful tool for advancing AI4AI.
中文摘要:ML-Master是一种新型AI4AI智能体,通过选择性记忆机制无缝整合探索与推理,在MLE-Bench上以基准测试一半时间实现29.3%平均奖牌率,显著超越现有方法。
English Summary: ML-Master is a novel AI4AI agent that integrates exploration and reasoning through selective memory, achieving superior performance on MLE-Bench with a 29.3% average medal rate in half the time of previous methods.
Authors:Siyi Xie, Hanxin Zhu, Tianyu He, Xin Li, Zhibo Chen
Abstract:
Recent advancements in 4D generation have demonstrated its remarkable capability in synthesizing photorealistic renderings of dynamic 3D scenes. However, despite achieving impressive visual performance, almost all existing methods overlook the generation of spatial audio aligned with the corresponding 4D scenes, posing a significant limitation to truly immersive audiovisual experiences. To mitigate this issue, we propose Sonic4D, a novel framework that enables spatial audio generation for immersive exploration of 4D scenes. Specifically, our method is composed of three stages: 1) To capture both the dynamic visual content and raw auditory information from a monocular video, we first employ pre-trained expert models to generate the 4D scene and its corresponding monaural audio. 2) Subsequently, to transform the monaural audio into spatial audio, we localize and track the sound sources within the 4D scene, where their 3D spatial coordinates at different timestamps are estimated via a pixel-level visual grounding strategy. 3) Based on the estimated sound source locations, we further synthesize plausible spatial audio that varies across different viewpoints and timestamps using physics-based simulation. Extensive experiments have demonstrated that our proposed method generates realistic spatial audio consistent with the synthesized 4D scene in a training-free manner, significantly enhancing the immersive experience for users. Generated audio and video examples are available at https://x-drunker.github.io/Sonic4D-project-page.
中文摘要:Sonic4D是一种创新框架,通过视觉定位和基于物理的模拟生成与4D视觉场景同步的空间音频,无需训练即可显著提升沉浸式体验。
English Summary: Sonic4D is a novel framework that generates spatial audio synchronized with 4D visual scenes through visual grounding and physics-based simulation, significantly enhancing immersive experiences without requiring training.
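To make the third stage concrete, here is a heavily simplified, hypothetical version of the geometry-to-audio step: per-ear propagation delay and inverse-distance attenuation for one static source. The actual pipeline tracks moving sources over time and uses full physics-based acoustic simulation; this only illustrates how estimated 3D source positions translate into spatial audio.

```python
import numpy as np

def spatialize(mono, src_pos, listener_pos, ear_offset=0.09, sr=16000, c=343.0):
    """Apply per-ear propagation delay and 1/r attenuation from a static
    source position; a toy stand-in for full acoustic simulation."""
    left_ear = listener_pos + np.array([-ear_offset, 0.0, 0.0])
    right_ear = listener_pos + np.array([ear_offset, 0.0, 0.0])
    out = np.zeros((2, len(mono)))
    for ch, ear in enumerate((left_ear, right_ear)):
        r = np.linalg.norm(src_pos - ear)
        delay = int(round(r / c * sr))          # propagation delay in samples
        gain = 1.0 / max(r, 1e-3)               # inverse-distance attenuation
        out[ch, delay:] = gain * mono[: len(mono) - delay]
    return out

audio = spatialize(np.random.randn(16000), np.array([2.0, 0.0, 1.0]),
                   np.zeros(3))
print(audio.shape)  # (2, 16000): binaural output from a mono input
```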
Authors:Haoyou Deng, Zhiqiang Li, Feng Zhang, Qingbo Lu, Zisheng Cao, Yuanjie Shao, Shuhang Gu, Changxin Gao, Nong Sang
Abstract:
Overfitting to synthetic training pairs remains a critical challenge in image dehazing, leading to poor generalization capability to real-world scenarios. To address this issue, existing approaches utilize unpaired realistic data for training, employing CycleGAN or contrastive learning frameworks. Despite their progress, these methods often suffer from training instability, resulting in limited dehazing performance. In this paper, we propose a novel training strategy for unpaired image dehazing, termed Rehazy, to improve both dehazing performance and training stability. This strategy explores the consistency of the underlying clean images across hazy images and utilizes hazy-rehazy pairs for effective learning of real haze characteristics. To favorably construct hazy-rehazy pairs, we develop a physics-based rehazy generation pipeline, which is theoretically validated to reliably produce high-quality rehazy images. Additionally, leveraging the rehazy strategy, we introduce a dual-branch framework for dehazing network training, where a clean branch provides a basic dehazing capability in a synthetic manner, and a hazy branch enhances the generalization ability with hazy-rehazy pairs. Moreover, we design a new dehazing network within these branches to improve the efficiency, which progressively restores clean scenes from coarse to fine. Extensive experiments on four benchmarks demonstrate the superior performance of our approach, exceeding the previous state-of-the-art methods by 3.58 dB on the SOTS-Indoor dataset and by 1.85 dB on the SOTS-Outdoor dataset in PSNR. Our code will be publicly available.
Chinese: 本文提出了一种名为Rehazy的新型无配对图像去雾训练策略,通过利用雾霾-再雾霾图像对和双分支框架,显著提升了去雾性能和训练稳定性,在多个基准测试中取得了领先的成果。
English: This paper introduces Rehazy, a novel training strategy for unpaired image dehazing that enhances performance and stability by leveraging hazy-rehazy pairs and a dual-branch framework, achieving state-of-the-art results on benchmark datasets.
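The rehazy generation pipeline is described as physics-based; a minimal sketch under the standard atmospheric scattering model I = J·t + A·(1 − t), with illustrative beta and airlight values rather than the paper's exact procedure, looks like this:

```python
import numpy as np

def rehaze(image, depth, beta=1.0, airlight=0.9):
    """Re-apply haze via the atmospheric scattering model I = J*t + A*(1-t),
    with transmission t = exp(-beta * depth). Applied to an already-hazy
    image, it yields a hazy-rehazy pair sharing the same underlying clean
    scene -- the consistency that Rehazy exploits."""
    t = np.exp(-beta * depth)[..., None]     # (H, W, 1) transmission map
    return image * t + airlight * (1.0 - t)  # re-hazed observation

img = np.random.rand(64, 64, 3)              # stand-in for a hazy input
depth = np.random.rand(64, 64) * 5.0         # stand-in for estimated depth
pair = (img, rehaze(img, depth, beta=0.8))   # hazy-rehazy training pair
print(pair[1].shape)  # (64, 64, 3)
```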
Authors:Thanathai Lertpetchpun, Tiantian Feng, Dani Byrd, Shrikanth Narayanan
Abstract:
Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. Challenges include disagreement in labeling among annotators and imbalanced data distributions. This paper presents a reproducible framework that achieves superior (top 1) performance in the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge) - Task 2, evaluated on the MSP-Podcast dataset. Our system is designed to tackle the aforementioned challenges through multimodal learning, multi-task learning, and imbalanced data handling. Specifically, our best system is trained by adding text embeddings, predicting gender, and including "Other" (O) and "No Agreement" (X) samples in the training set. Our system's results secured both first and second places in the IS25-SER Challenge, and the top performance was achieved by a simple two-system ensemble.
中文摘要:本文提出一个可复现的框架,通过多模态学习、多任务学习及不平衡数据处理技术,在IS25-SER挑战赛中取得最佳性能,有效解决了标注不一致和数据分布不平衡的难题。
English summary: This paper introduces a reproducible framework that achieved top performance in the IS25-SER Challenge by addressing labeling inconsistencies and data imbalance through multimodal learning, multi-task learning, and strategic data handling techniques.
Authors:Anfeng Xu, Tiantian Feng, Shrikanth Narayanan
Abstract:
Automatic Speech Recognition systems have made significant progress with large-scale pre-trained models. However, most current systems focus solely on transcribing the speech without identifying speaker roles, a function that is critical for conversational AI. In this work, we investigate the use of serialized output training (SOT) for joint ASR and speaker role tagging. By augmenting Whisper with role-specific tokens and fine-tuning it with SOT, we enable the model to generate role-aware transcriptions in a single decoding pass. We compare the SOT approach against a previous self-supervised baseline method on two real-world conversational datasets. Our findings show that this approach achieves more than 10% reduction in multi-talker WER, demonstrating its feasibility as a unified model for speaker-role aware speech transcription.
中文摘要:本研究通过结合序列化输出训练与角色特定标记来增强Whisper模型,实现了单次解码即可同时完成语音识别和说话人角色标注,在多说话人场景下将词错误率降低了10%以上。
English Summary: This study enhances Whisper by integrating speaker role tokens with serialized output training, enabling simultaneous speech recognition and speaker role identification in a single decoding pass, which reduces multi-talker word error rate by over 10%.
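A toy sketch of how a serialized-output-training target with role tokens can be constructed; the token names are hypothetical, not the exact tokens used to augment Whisper.

```python
def build_sot_target(turns):
    """Prefix each conversational turn with a role-specific token so that a
    single decoding pass yields role-aware transcriptions."""
    role_tokens = {"doctor": "<|doctor|>", "patient": "<|patient|>"}
    return " ".join(f"{role_tokens[role]} {text}" for role, text in turns)

target = build_sot_target([
    ("doctor", "how are you feeling today"),
    ("patient", "much better thank you"),
])
print(target)
# <|doctor|> how are you feeling today <|patient|> much better thank you
```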
Authors:Xijun Wang, Xin Li, Bingchen Li, Zhibo Chen
Abstract:
Diffusion models have significantly advanced video super-resolution (VSR) by enhancing perceptual quality, largely through elaborately designed temporal modeling to ensure inter-frame consistency. However, existing methods usually suffer from limited temporal coherence and prohibitively high computational costs (e.g., typically requiring over 8 NVIDIA A100-80G GPUs), especially for long videos. In this work, we propose LiftVSR, an efficient VSR framework that leverages and elevates the image-wise diffusion prior from PixArt-α, achieving state-of-the-art results using only 4× RTX 4090 GPUs. To balance long-term consistency and efficiency, we introduce a hybrid temporal modeling mechanism that decomposes temporal learning into two complementary components: (i) Dynamic Temporal Attention (DTA) for fine-grained temporal modeling within short frame segments (i.e., low complexity), and (ii) Attention Memory Cache (AMC) for long-term temporal modeling across segments (i.e., consistency). Specifically, DTA identifies multiple token flows across frames within multi-head query and key tokens to warp inter-frame contexts in the value tokens. AMC adaptively aggregates historical segment information via a cache unit, ensuring long-term coherence with minimal overhead. To further stabilize the cache interaction during inference, we introduce an asymmetric sampling strategy that mitigates feature mismatches arising from different diffusion sampling steps. Extensive experiments on several typical VSR benchmarks have demonstrated that LiftVSR achieves impressive performance with significantly lower computational costs.
中文: LiftVSR是一种高效的视频超分辨率框架,通过动态时序注意力实现短时帧间建模和注意力记忆缓存确保长时一致性,以显著降低的计算成本取得了领先性能。
English: LiftVSR is an efficient video super-resolution framework that combines Dynamic Temporal Attention for short-term frame consistency and an Attention Memory Cache for long-term coherence, achieving state-of-the-art results with reduced computational costs.
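As a loose sketch of the AMC idea, the snippet below keeps a cache of past-segment features that current-segment tokens cross-attend to; the momentum update rule is an assumption made for illustration, not the paper's mechanism.

```python
import torch
import torch.nn as nn

class AttentionMemoryCache(nn.Module):
    """Toy AMC-style mechanism: a cache aggregates features of previously
    processed segments, and tokens of the current segment cross-attend to it
    for long-term consistency. The momentum update is illustrative."""

    def __init__(self, dim=128, heads=4, momentum=0.9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.momentum = momentum
        self.cache = None  # (B, M, dim) aggregated history

    def forward(self, seg_tokens):
        if self.cache is None:
            self.cache = seg_tokens.detach()
        out, _ = self.attn(seg_tokens, self.cache, self.cache)
        self.cache = (self.momentum * self.cache
                      + (1 - self.momentum) * seg_tokens.detach())
        return out

amc = AttentionMemoryCache()
for _ in range(3):                       # three consecutive segments
    y = amc(torch.randn(1, 64, 128))
print(y.shape)  # (1, 64, 128)
```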
Authors:Shang Qu, Ning Ding, Linhai Xie, Yifei Li, Zaoqu Liu, Kaiyan Zhang, Yibai Xiong, Yuxin Zuo, Zhangren Chen, Ermo Hua, Xingtai Lv, Youbang Sun, Yang Li, Dong Li, Fuchu He, Bowen Zhou
Abstract:
This paper introduces PROTEUS, a fully automated system that produces data-driven hypotheses from raw data files. We apply PROTEUS to clinical proteogenomics, a field where effective downstream data analysis and hypothesis proposal is crucial for producing novel discoveries. PROTEUS uses separate modules to simulate different stages of the scientific process, from open-ended data exploration to specific statistical analysis and hypothesis proposal. It formulates research directions, tools, and results in terms of relationships between biological entities, using unified graph structures to manage complex research processes. We applied PROTEUS to 10 clinical multiomics datasets from published research, arriving at 360 total hypotheses. Results were evaluated through external data validation and automatic open-ended scoring. Through exploratory and iterative research, the system can navigate high-throughput and heterogeneous multiomics data to arrive at hypotheses that balance reliability and novelty. In addition to accelerating multiomic analysis, PROTEUS represents a path towards tailoring general autonomous systems to specialized scientific domains to achieve open-ended hypothesis generation from data.
中文:PROTEUS是一个全自动系统,通过模块化模拟科研流程从临床蛋白质基因组学数据中生成数据驱动假设,基于10个多组学数据集提出360项假设,并通过外部验证实现可靠性与新发现性的平衡。
English: PROTEUS is an automated system that generates data-driven hypotheses from clinical proteogenomics data through modular simulation of scientific processes, producing 360 validated hypotheses from multiomics datasets while balancing reliability and novelty.
Authors:Zhaoliang Wan, Zetong Bi, Zida Zhou, Hao Ren, Yiming Zeng, Yihan Li, Lu Qi, Xu Yang, Ming-Hsuan Yang, Hui Cheng
Abstract:
This paper addresses the scarcity of low-cost but high-dexterity platforms for collecting real-world multi-fingered robot manipulation data towards generalist robot autonomy. To this end, we propose the RAPID Hand, a co-optimized hardware and software platform where the compact 20-DoF hand, robust whole-hand perception, and high-DoF teleoperation interface are jointly designed. Specifically, RAPID Hand adopts a compact and practical hand ontology and a hardware-level perception framework that stably integrates wrist-mounted vision, fingertip tactile sensing, and proprioception with sub-7 ms latency and spatial alignment. Collecting high-quality demonstrations on high-DoF hands is challenging, as existing teleoperation methods struggle with precision and stability on complex multi-fingered systems. We address this by co-optimizing hand design, perception integration, and the teleoperation interface through a universal actuation scheme, custom perception electronics, and two retargeting constraints. We evaluate the platform's hardware, perception, and teleoperation interface. Training a diffusion policy on collected data shows superior performance over prior works, validating the system's capability for reliable, high-quality data collection. The platform is constructed from low-cost and off-the-shelf components and will be made public to ensure reproducibility and ease of adoption.
中文摘要:本文提出RAPID Hand协同优化平台,通过集成紧凑型20自由度机械手、全手感知系统与高自由度遥操作界面,以低成本方案实现高质量多指操作数据采集,为通用机器人自主性研究提供可靠数据支持。
English Summary: This paper introduces the RAPID Hand, a co-designed hardware-software platform that integrates a compact 20-DoF hand, whole-hand perception, and teleoperation interface to enable cost-effective collection of high-quality multi-fingered manipulation data for advancing generalist robot autonomy.
Authors:Ngoc Bui, Menglin Yang, Runjin Chen, Leonardo Neves, Mingxuan Ju, Rex Ying, Neil Shah, Tong Zhao
Abstract:
Backward compatible representation learning enables updated models to integrate seamlessly with existing ones, avoiding the need to reprocess stored data. Despite recent advances, existing compatibility approaches in Euclidean space neglect the uncertainty in the old embedding model and force the new model to reconstruct outdated representations regardless of their quality, thereby hindering the learning process of the new model. In this paper, we propose to switch perspectives to hyperbolic geometry, where we treat time as a natural axis for capturing a model's confidence and evolution. By lifting embeddings into hyperbolic space and constraining updated embeddings to lie within the entailment cone of the old ones, we maintain generational consistency across models while accounting for uncertainties in the representations. To further enhance compatibility, we introduce a robust contrastive alignment loss that dynamically adjusts alignment weights based on the uncertainty of the old embeddings. Experiments validate the superiority of the proposed method in achieving compatibility, paving the way for more resilient and adaptable machine learning systems.
Chinese: 本文提出一种基于双曲几何的后向兼容表示学习方法,通过引入不确定性建模和动态对齐机制,在不重新处理存储数据的前提下提升了模型间的兼容性。
English: This paper introduces a hyperbolic geometry-based approach for backward compatible representation learning, which incorporates uncertainty modeling and dynamic alignment to enhance model compatibility without reprocessing stored data.
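The uncertainty-aware alignment term can be sketched as follows, where the exponential weighting is an assumption; the paper's full method additionally constrains new embeddings to the hyperbolic entailment cone of the old ones, which is omitted here.

```python
import torch
import torch.nn.functional as F

def robust_alignment_loss(new_emb, old_emb, old_uncertainty):
    """Uncertainty-weighted backward-compatibility loss: new embeddings are
    aligned to old ones, but pairs whose old embeddings are uncertain
    contribute less, so the new model is not forced to reproduce low-quality
    legacy representations. The weighting form is illustrative."""
    weights = torch.exp(-old_uncertainty)                  # (B,) confidence
    per_pair = 1 - F.cosine_similarity(new_emb, old_emb)   # (B,) misalignment
    return (weights * per_pair).sum() / weights.sum()

loss = robust_alignment_loss(torch.randn(16, 128), torch.randn(16, 128),
                             torch.rand(16))
print(float(loss))
```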
Authors:Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, Jianhua Tao
Abstract:
The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present RadialRouter, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2% and 5.8% in the Balance and Cost First scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.
中文: 本文提出RadialRouter框架,采用径向结构的轻量级Transformer来优化用户查询与大型语言模型特性的匹配,在多种场景下显著超越现有路由方法,展现出卓越的效能与适应性。
English: This paper introduces RadialRouter, a novel routing framework that uses a lightweight Transformer with a radial structure to better align user queries with LLM characteristics, significantly outperforming existing methods in efficiency and adaptability across various scenarios.
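A rough sketch of an objective combining the two terms named in the abstract; the weighting, temperature, and target construction are all assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def radial_router_objective(router_logits, target_dist, query_embs, pos_pairs,
                            alpha=0.5, temperature=0.1):
    """KL term: match the router's distribution over candidate LLMs to a
    target derived from observed LLM performance. Contrastive term: pull
    embeddings of paired similar queries together (query-query InfoNCE)."""
    kl = F.kl_div(F.log_softmax(router_logits, dim=-1), target_dist,
                  reduction="batchmean")
    q = F.normalize(query_embs, dim=-1)
    logits = q @ q.t() / temperature
    logits.fill_diagonal_(float("-inf"))          # exclude self-similarity
    contrastive = F.cross_entropy(logits, pos_pairs)
    return kl + alpha * contrastive

loss = radial_router_objective(
    torch.randn(8, 5),                      # scores over 5 candidate LLMs
    F.softmax(torch.randn(8, 5), dim=-1),   # performance-derived targets
    torch.randn(8, 64),
    (torch.arange(8) + 1) % 8,              # index of a paired similar query
)
print(float(loss))
```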
Authors:Helin Wang, Jiarui Hai, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhiali, Najim Dehak
Abstract:
Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agent (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and experienced audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside the datasets, we conduct comprehensive experiments using both autoregressive and non-autoregressive models on CapSpeech. Our results demonstrate high-fidelity and highly intelligible speech synthesis across a diverse range of speaking styles. To the best of our knowledge, CapSpeech is the largest available dataset offering comprehensive annotations for CapTTS-related tasks. The experiments and findings further provide valuable insights into the challenges of developing CapTTS systems.
Chinese: 近期生成式人工智能的进展显著改变了风格标注文本转语音合成领域,但由于缺乏标准化数据集和下游任务研究,实际应用仍面临挑战,为此我们推出了CapSpeech基准,包含大规模标注数据,实现了多样化风格的高保真语音合成。
English: Recent generative AI advancements have transformed style-captioned text-to-speech synthesis, yet real-world application challenges persist due to limited datasets and research, prompting the introduction of CapSpeech—a comprehensive benchmark with extensive annotated data that enables high-fidelity speech synthesis across diverse styles.
Authors:Heming Zhu, Guoxing Sun, Christian Theobalt, Marc Habermann
Abstract:
Learning an animatable and clothed human avatar model with vivid dynamics and photorealistic appearance from multi-view videos is an important foundational research problem in computer graphics and vision. Fueled by recent advances in implicit representations, the quality of animatable avatars has reached an unprecedented level by attaching the implicit representation to drivable human template meshes. However, these methods usually fail to preserve the highest level of detail, which is particularly apparent when the virtual camera is zoomed in and when rendering at 4K resolution and higher. We argue that this limitation stems from inaccurate surface tracking, specifically depth misalignment and surface drift between character geometry and the ground-truth surface, which forces the detailed appearance model to compensate for geometric errors. To address this, we propose a latent deformation model and supervise the 3D deformation of the animatable character using guidance from foundational 2D video point trackers, which offer improved robustness to shading and surface variations and are less prone to local minima than differentiable rendering. To mitigate drift over time and the lack of 3D awareness in 2D point trackers, we introduce a cascaded training strategy that generates consistent 3D point tracks by anchoring point tracks to the rendered avatar, which ultimately supervises our avatar at the vertex and texel level. To validate the effectiveness of our approach, we introduce a novel dataset comprising five multi-view video sequences, each over 10 minutes in duration, captured using 40 calibrated 6K-resolution cameras, featuring subjects dressed in clothing with challenging texture patterns and wrinkle deformations. Our approach demonstrates significantly improved performance in rendering quality and geometric accuracy over the prior state of the art.
中文: 本文提出了一种通过潜在变形模型并利用2D视频点追踪器进行监督的方法,结合级联训练策略,显著提升了可动画化人体化身的渲染质量和几何精度。
English: This paper introduces a method to enhance animatable human avatars by using a latent deformation model supervised by 2D video point trackers, improving rendering quality and geometric accuracy through a cascaded training strategy.
Authors:Sen Liang, Zhentao Yu, Zhengguang Zhou, Teng Hu, Hongmei Wang, Yi Chen, Qin Lin, Yuan Zhou, Xin Li, Qinglin Lu, Zhibo Chen
Abstract:
The emergence of Diffusion Transformers (DiT) has brought significant advancements to video generation, especially in text-to-video and image-to-video tasks. Although video generation is widely applied in various fields, most existing models are limited to single scenarios and cannot perform diverse video generation and editing through dynamic content manipulation. We propose OmniV2V, a video model capable of generating and editing videos across different scenarios based on various operations, including: object movement, object addition, mask-guided video edit, try-on, inpainting, outpainting, human animation, and controllable character video synthesis. We explore a unified dynamic content manipulation injection module, which effectively integrates the requirements of the above tasks. In addition, we design a visual-text instruction module based on LLaVA, enabling the model to effectively understand the correspondence between visual content and instructions. Furthermore, we build a comprehensive multi-task data processing system. Since there is data overlap among various tasks, this system can efficiently provide data augmentation. Using this system, we construct a multi-type, multi-scenario OmniV2V dataset and its corresponding OmniV2V-Test benchmark. Extensive experiments show that OmniV2V works as well as, and sometimes better than, the best existing open-source and commercial models for many video generation and editing tasks.
Chinese: OmniV2V是一种多功能视频生成与编辑模型,通过统一的动态内容操作模块和视觉-文本指令系统整合多种任务,在各类场景中展现出优于现有模型的性能。
English: OmniV2V is a versatile video generation and editing model that integrates multiple tasks through a unified dynamic content manipulation module and visual-text instruction system, demonstrating superior performance across various scenarios compared to existing models.
Authors:Tierui Gong, Chau Yuen, Chong Meng Samson See, Mérouane Debbah, Lajos Hanzo
Abstract:
Rydberg atomic quantum receivers (RAQRs) have emerged as a promising solution for evolving wireless receivers from the classical to the quantum domain. To further unleash their great potential in wireless communications, we propose a flexible architecture for Rydberg atomic quantum multiple-input multiple-output (RAQ-MIMO) receivers in the multi-user uplink. The corresponding signal model of the RAQ-MIMO system is then constructed by paving the way from quantum physics to classical wireless communications. Explicitly, we outline the associated operating principles and transmission flow. We also validate the linearity of our model and its feasible region. Based on our model, we derive closed-form asymptotic formulas for the ergodic achievable rate (EAR) of both the maximum-ratio combining (MRC) and zero-forcing (ZF) receivers operating in uncorrelated fading channels (UFC) and correlated fading channels (CFC), respectively. Furthermore, we theoretically characterize the EAR difference both between the UFC and CFC scenarios and between the MRC and ZF schemes. More particularly, we quantify the superiority of RAQ-MIMO receivers over classical massive MIMO (M-MIMO) receivers, specifying an increase of log2(Π) in the EAR per user, a Π-fold reduction of the users' transmit power, and a Π^(1/ν)-fold increase of the transmission distance, respectively, where Π = ReceiverGainRatio / ReceiverNoisePowerRatio of the single-sensor receivers and ν is the path-loss exponent. Our simulation results reveal that, compared to classical M-MIMO receivers, our RAQ-MIMO scheme can either realize ~12 bits/s/Hz/user (~8 bits/s/Hz/user) higher EAR, or ~10000-fold (~500-fold) lower transmit power, or alternatively ~100-fold (~21-fold) longer distance in free-space transmissions, at the standard quantum limit (photon shot limit).
中文: 本文提出了一种灵活的里德堡原子量子MIMO接收器架构,相比经典大规模MIMO系统,在无线通信中实现了更高的数据速率、更低的功耗和更远的传输距离,展现出卓越性能优势。
English: This paper introduces a flexible Rydberg atomic quantum MIMO receiver architecture that demonstrates superior performance over classical massive MIMO systems, achieving significantly higher data rates, lower power consumption, and extended transmission distances in wireless communications.
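The quoted gains follow from a simple SNR scaling argument; the derivation below is our reconstruction (assuming a post-combining SNR proportional to G·P·d^(−ν)/N), not the paper's closed-form analysis.

```latex
% Assume the post-combining SNR of a single link scales as
%   SNR ~ G * P * d^{-nu} / N,
% with receiver gain G, transmit power P, distance d, noise power N.
\[
  \Pi \,=\, \frac{G_{\mathrm{RAQ}}/G_{\mathrm{classical}}}
                 {N_{\mathrm{RAQ}}/N_{\mathrm{classical}}}
  \quad\Longrightarrow\quad
  \mathrm{SNR}_{\mathrm{RAQ}} \,=\, \Pi\,\mathrm{SNR}_{\mathrm{classical}}.
\]
% The factor Pi can be spent in three alternative ways:
\begin{align*}
  \text{rate:}     &\ \log_2(\Pi\,\mathrm{SNR}) = \log_2\Pi + \log_2(\mathrm{SNR})
                      && (+\log_2\Pi \text{ per user at high SNR})\\
  \text{power:}    &\ P' = P/\Pi
                      && (\text{SNR unchanged: } \Pi \cdot P'/P = 1)\\
  \text{distance:} &\ d' = \Pi^{1/\nu}\, d
                      && (\text{SNR unchanged: } \Pi\,(d'/d)^{-\nu} = 1)
\end{align*}
```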
Authors:Yaxiong Wang, Zhenqiang Zhang, Lechao Cheng, Zhun Zhong, Dan Guo, Meng Wang
Abstract:
Test-time adaptation (TTA) has witnessed important progress in recent years; prevailing methods typically first encode the image and the text and design strategies to model the association between them, while keeping the image encoder frozen due to the absence of explicit supervision in TTA scenarios. We identify a critical limitation in this paradigm: although test-time images often exhibit distribution shifts from the training data, existing methods persistently freeze the image encoder, overlooking its crucial role in bridging the distribution shift between training and test. To address this challenge, we propose SSAM (Self-Supervised Association Modeling), a new TTA framework that enables dynamic encoder refinement through dual-phase association learning. Our method operates via two synergistic components: 1) Soft Prototype Estimation (SPE), which estimates probabilistic category associations to guide feature space reorganization, and 2) Prototype-anchored Image Reconstruction (PIR), which enforces encoder stability through cluster-conditional image feature reconstruction. Comprehensive experiments across diverse baseline methods and benchmarks demonstrate that SSAM surpasses state-of-the-art TTA baselines by a clear margin while maintaining computational efficiency. The framework's architecture-agnostic design and minimal hyperparameter dependence further enhance its practical applicability.
中文摘要:本文提出的SSAM框架通过双阶段关联学习实现动态图像编码器优化,有效克服测试时适应的分布偏移问题,在保持计算效率的同时显著超越现有最优方法。
English Summary: The proposed SSAM framework enables dynamic image encoder refinement through dual-phase association learning to overcome distribution shifts in test-time adaptation, outperforming state-of-the-art methods while maintaining computational efficiency.
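A minimal sketch of SPE-style soft prototype estimation, assuming CLIP-like image and text features; the temperature and normalization choices are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def soft_prototypes(image_feats, text_feats, temperature=0.01):
    """Probabilistic image-to-category associations (from image/text
    similarities) weight the image features that form each category
    prototype, which can then guide feature-space reorganization."""
    img = F.normalize(image_feats, dim=-1)          # (N, D) test images
    txt = F.normalize(text_feats, dim=-1)           # (C, D) class prompts
    assoc = F.softmax(img @ txt.t() / temperature, dim=-1)   # (N, C)
    protos = assoc.t() @ img                         # (C, D) weighted means
    return F.normalize(protos, dim=-1), assoc

protos, assoc = soft_prototypes(torch.randn(256, 512), torch.randn(10, 512))
print(protos.shape, assoc.shape)  # (10, 512) (256, 10)
```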
Authors:Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe
Abstract:
Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but slows inference. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates the dynamic vocabulary into intermediate layers. By conditioning the encoder on the dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC reduces the RTF by 81% with only a 0.1-point degradation in word error rate on the LibriSpeech 960 test-clean set.
Chinese: DYNAC是一种自条件连接时序分类方法,将动态词汇集成到中间层,有效捕捉静态与动态标记间的依赖关系,在实现81%实时因子降低的同时仅带来0.1个点的词错误率微增。
English: DYNAC is a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers, effectively capturing dependencies between static and dynamic tokens while reducing real-time factor by 81% with minimal word error rate degradation.
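The self-conditioning idea can be sketched as an intermediate prediction over the static-plus-dynamic vocabulary that is embedded and added back into the encoder stream; the sizes and layer placement below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SelfConditionedLayer(nn.Module):
    """Toy self-conditioned CTC layer: an intermediate posterior over the
    static vocabulary plus expandable dynamic (bias-phrase) tokens is
    projected back into the encoder stream, letting later layers depend on
    earlier dynamic-token hypotheses despite CTC's frame-wise independence."""

    def __init__(self, dim=256, static_vocab=500, max_dynamic=50):
        super().__init__()
        vocab = static_vocab + max_dynamic
        self.to_vocab = nn.Linear(dim, vocab)    # intermediate CTC head
        self.back = nn.Linear(vocab, dim)        # condition on its posterior

    def forward(self, x):                         # x: (B, T, dim)
        post = self.to_vocab(x).softmax(dim=-1)   # intermediate posterior
        return x + self.back(post), post          # conditioned features

layer = SelfConditionedLayer()
y, post = layer(torch.randn(2, 100, 256))
print(y.shape, post.shape)  # (2, 100, 256) (2, 100, 550)
```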
Authors:Ruibo Fu, Xiaopeng Wang, Zhengqi Wen, Jianhua Tao, Yuankun Xie, Zhiyong Wang, Chunyu Qiang, Xuefei Liu, Cunhang Fan, Chenxing Li, Guanjun Li
Abstract:
Existing methods for deepfake audio detection have demonstrated some effectiveness. However, they still face challenges in generalizing to new forgery techniques and evolving attack patterns. This limitation arises mainly because the models rely heavily on the distribution of the training data and fail to learn a decision boundary that captures the essential characteristics of forgeries. Additionally, relying solely on a classification loss makes it difficult to capture the intrinsic differences between real and fake audio. In this paper, we propose RPRA-ADD, a robust audio deepfake detection framework driven by forgery-trace enhancement and built on integrated Reconstruction-Perception-Reinforcement-Attention networks. First, we propose a Global-Local Forgery Perception (GLFP) module to enhance the acoustic perception of forgery traces. To significantly reinforce the feature-space distribution differences between real and fake audio, we design the Multi-stage Dispersed Enhancement Loss (MDEL), which implements a dispersal strategy in multi-stage feature spaces. Furthermore, to enhance feature awareness of forgery traces, we introduce the Fake Trace Focused Attention (FTFA) mechanism, which adjusts attention weights dynamically according to the reconstruction discrepancy matrix. Visualization experiments demonstrate that FTFA not only improves attention to voice segments but also enhances generalization capability. Experimental results show that the proposed method achieves state-of-the-art performance on 4 benchmark datasets, including ASVspoof2019, ASVspoof2021, CodecFake, and FakeSound, achieving over 20% performance improvement. In addition, it outperforms existing methods in rigorous 3×3 cross-domain evaluations across Speech, Sound, and Singing, demonstrating strong generalization capability across diverse audio domains.
Chinese: 当前深度伪造音频检测方法因过度依赖训练数据分布和分类损失而泛化能力不足,为此提出的RPRA-ADD框架通过集成重构-感知-强化-注意力网络增强伪造痕迹检测,在多个基准数据集上实现超过20%的性能提升并展现卓越的跨领域泛化能力。
English: Current deepfake audio detection methods struggle with generalization due to over-reliance on training data distributions and classification losses, leading to the proposed RPRA-ADD framework that integrates reconstruction-perception-reinforcement-attention networks to enhance forgery trace detection and achieves state-of-the-art performance with over 20% improvement across multiple datasets.
Authors:Yifan Peng, Shakeel Muhammad, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe
Abstract:
The Open Whisper-style Speech Models (OWSM) project has developed a series of fully open speech foundation models using academic-scale resources, but their training data remains insufficient. This work enhances OWSM by integrating YODAS, a large-scale web-crawled dataset with a Creative Commons license. However, incorporating YODAS is nontrivial due to its wild nature, which introduces challenges such as incorrect language labels and audio-text misalignments. To address this, we develop a scalable data-cleaning pipeline using public toolkits, yielding a dataset with 166,000 hours of speech across 75 languages. Our new series of OWSM v4 models, trained on this curated dataset alongside existing OWSM data, significantly outperform previous versions on multilingual benchmarks. Our models even match or surpass frontier industrial models like Whisper and MMS in multiple scenarios. We will publicly release the cleaned YODAS data, pre-trained models, and all associated scripts via the ESPnet toolkit.
Chinese: OWSM项目通过整合并清理大规模YODAS数据集,开发出性能超越前代、媲美工业级模型的OWSM v4系列,所有清洗后的数据、模型及相关脚本将通过ESPnet工具包开源发布。
English: The OWSM project has enhanced its speech foundation models by integrating and cleaning the large-scale YODAS dataset, resulting in OWSM v4 models that outperform previous versions and rival industrial models like Whisper and MMS, with all resources to be publicly released.
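The cleaning pipeline's two main filters named in the abstract (language-label verification and audio-text alignment) can be sketched as follows; detect_lang and align_score are hypothetical stand-ins for public LID and alignment toolkits, not the project's actual API.

```python
def detect_lang(audio):
    """Stand-in for a language-identification toolkit (hypothetical)."""
    return audio.get("lang_pred", "en")

def align_score(audio, text):
    """Stand-in for a forced-alignment confidence score (hypothetical)."""
    return audio.get("align", 1.0)

def clean_corpus(utterances, expected_lang="en", min_align=0.6):
    """Keep an utterance only if the detected language matches its label and
    the audio-text alignment score clears a threshold; thresholds and
    helpers are illustrative."""
    return [u for u in utterances
            if detect_lang(u["audio"]) == expected_lang
            and align_score(u["audio"], u["text"]) >= min_align]

corpus = [
    {"audio": {"lang_pred": "en", "align": 0.9}, "text": "hello world"},
    {"audio": {"lang_pred": "de", "align": 0.9}, "text": "hallo welt"},  # wrong label
    {"audio": {"lang_pred": "en", "align": 0.2}, "text": "misaligned"},  # bad alignment
]
print(len(clean_corpus(corpus)))  # 1
```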
Authors:Hongzhou Rao, Yanjie Zhao, Xinyi Hou, Shenao Wang, Haoyu Wang
Abstract:
The rapid advancement of large language models (LLMs) has redefined artificial intelligence (AI), pushing the boundaries of AI research and enabling unbounded possibilities for both academia and the industry. However, LLM development faces increasingly complex challenges throughout its lifecycle, yet no existing research systematically explores these challenges and solutions from the perspective of software engineering (SE) approaches. To fill the gap, we systematically analyze research status throughout the LLM development lifecycle, divided into six phases: requirements engineering, dataset construction, model development and enhancement, testing and evaluation, deployment and operations, and maintenance and evolution. We then conclude by identifying the key challenges for each phase and presenting potential research directions to address these challenges. In general, we provide valuable insights from an SE perspective to facilitate future advances in LLM development.
中文: 大语言模型发展迅速但面临复杂挑战,本研究从软件工程角度系统分析其开发生命周期,识别关键问题并提出研究方向。
English: Large language models (LLMs) are advancing rapidly but face complex development challenges, so this study systematically analyzes their lifecycle from a software engineering perspective to identify key issues and propose research directions.
Authors:Jiahui Li, Geng Sun, Xiaoyu Sun, Fang Mei, Jingjing Wang, Xiangwang Hou, Daxin Tian, Victor C. M. Leung
Abstract:
Low-altitude wireless networks (LAWNs) have garnered significant attention in the forthcoming 6G networks. In LAWNs, satellites with wide coverage and unmanned aerial vehicles (UAVs) with flexible mobility can complement each other to form integrated satellite-UAV networks, providing ubiquitous and high-speed connectivity for low-altitude operations. However, the higher line-of-sight probability in low-altitude airspace increases transmission security concerns. In this work, we present a collaborative beamforming-based physical layer security scheme for LAWNs. We introduce the fundamental aspects of integrated satellite-UAV networks, physical layer security, UAV swarms, and collaborative beamforming for LAWN applications. Following this, we highlight several opportunities for collaborative UAV swarm secure applications enabled by satellite networks, including achieving physical layer security in scenarios involving data dissemination, data relay, eavesdropper collusion, and imperfect eavesdropper information. Next, we detail two case studies: a secure relay system and a two-way aerial secure communication framework specifically designed for LAWN environments. Simulation results demonstrate that these physical layer security schemes are effective and beneficial for secure low-altitude wireless communications. A short practicality analysis shows that the proposed method is applicable to LAWN scenarios. Finally, we discuss current challenges and future research directions for enhancing security in LAWNs.
中文: 本文针对低空无线网络提出了一种基于协作波束成形的物理层安全方案,通过案例研究和仿真验证了该方案在卫星-无人机集成系统中实现安全通信的有效性。
English: This paper proposes a collaborative beamforming-based physical layer security scheme for low-altitude wireless networks, demonstrating its effectiveness through case studies and simulations for secure communications in integrated satellite-UAV systems.
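To illustrate why collaborative beamforming helps physical-layer security, the sketch below computes the far-field array factor of a UAV swarm acting as a virtual antenna array: phases matched to the legitimate user yield coherent gain toward it and, on average, no gain toward an eavesdropper in another direction. Positions, wavelength, and swarm size are illustrative, not taken from the paper.

```python
import numpy as np

def array_gain(positions, phases, direction, wavelength=0.125):
    """Far-field array factor power (normalized by element count) toward a
    unit direction vector, for given element positions and phases."""
    k = 2 * np.pi / wavelength
    steer = np.exp(1j * (k * positions @ direction + phases))
    return np.abs(steer.sum()) ** 2 / len(positions)

rng = np.random.default_rng(1)
pos = rng.uniform(-1, 1, size=(16, 3))            # 16 UAVs, meters
user_dir = np.array([1.0, 0.0, 0.0])
eve_dir = np.array([0.0, 1.0, 0.0])
k = 2 * np.pi / 0.125
phases = -k * pos @ user_dir                      # conjugate-match the user
print(array_gain(pos, phases, user_dir))          # ~16: coherent combining
print(array_gain(pos, phases, eve_dir))           # ~1 on average: incoherent
```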
Authors:Geng Sun, Mingzhe Fan, Lei Zhang, Hongyang Pan, Jiahui Li, Chuang Zhang, Linyao Li, Changyuan Zhao, Chau Yuen
Abstract:
Wireless communication systems face significant challenges in meeting the increasing demands for higher data rates and more reliable connectivity in complex environments. Stacked intelligent metasurfaces (SIMs) have emerged as a promising technology for realizing wave-domain signal processing, with mobile SIMs offering superior communication performance compared to their fixed counterparts. In this paper, we investigate a novel unmanned aerial vehicle (UAV)-mounted SIMs (UAV-SIMs) assisted communication system within the low-altitude economy (LAE) networks paradigm, where UAVs function as both base stations that cache SIM-processed data and mobile platforms that flexibly deploy SIMs to enhance uplink communications from ground users. To maximize network capacity, we formulate a UAV-SIM-based joint optimization problem (USBJOP) that comprehensively addresses three critical aspects: the association between UAV-SIMs and users, the three-dimensional positioning of UAV-SIMs, and the phase shifts across multiple SIM layers. Due to the inherent non-convexity and NP-hardness of USBJOP, we decompose it into three sub-optimization problems, i.e., association between UAV-SIMs and users optimization problem (AUUOP), UAV location optimization problem (ULOP), and UAV-SIM phase shifts optimization problem (USPSOP), and solve them using an alternating optimization strategy. Specifically, we transform AUUOP and ULOP into convex forms solvable by the CVX tool, while addressing USPSOP through a generative artificial intelligence (GAI)-based hybrid optimization algorithm. Simulations demonstrate that our proposed approach significantly outperforms benchmark schemes, achieving approximately 1.5 times higher network capacity compared to suboptimal alternatives. Additionally, our proposed GAI method reduces the algorithm runtime by 10% while maintaining solution quality.
中文: 本文提出无人机搭载堆叠智能超表面系统,通过联合优化用户关联、定位和相位偏移,在低空网络中提升上行链路容量,实现1.5倍网络容量增益并降低运行时间。
English: This paper introduces a UAV-mounted stacked intelligent metasurface system that enhances uplink capacity in low-altitude networks through joint optimization of user association, positioning, and phase shifts, achieving 1.5x higher network capacity with reduced runtime.
Authors:Geng Sun, Mingzhe Fan, Lei Zhang, Hongyang Pan, Jiahui Li, Chuang Zhang, Linyao Li, Changyuan Zhao, Chau Yuen
Abstract:
Wireless communication systems face challenges in meeting the demand for higher data rates and reliable connectivity in complex environments. Stacked intelligent metasurfaces (SIMs) have emerged as a promising technology for advanced wave-domain signal processing, where mobile SIMs can outperform fixed counterparts. In this paper, we propose a novel unmanned aerial vehicle (UAV)-mounted SIM (UAV-SIM) assisted communication system within low-altitude economy (LAE) networks, where UAVs act as both cache-enabled base stations and mobile SIM carriers to enhance uplink transmissions. To maximize network capacity, we formulate a UAV-SIM-based joint optimization problem (USBJOP) that integrates user association, UAV-SIM three-dimensional positioning, and multi-layer SIM phase shift design. Due to the non-convexity and NP-hardness of USBJOP, we decompose it into three subproblems, which are the association between UAV-SIMs and users optimization problem (AUUOP), the UAV location optimization problem (ULOP), and the UAV-SIM phase shifts optimization problem (USPSOP). Then, we solve them through an alternating optimization strategy. Specifically, AUUOP and ULOP are transformed into convex forms solvable via the CVX tool, while USPSOP is addressed by a generative artificial intelligence (GAI)-based hybrid optimization algorithm. Simulation results show that the proposed approach achieves approximately 1.5 times higher network capacity compared with suboptimal schemes, effectively mitigates multi-user interference with increasing SIM layers and meta-atoms, and reduces runtime by 10% while maintaining solution quality, thereby demonstrating its practicality for real-world deployments.
中文: 本文提出无人机搭载堆叠智能超表面系统,通过联合优化用户关联、定位和相位偏移,在低空网络中提升上行链路容量,实现1.5倍网络容量增益并降低运行时间。
English: This paper introduces a UAV-mounted stacked intelligent metasurface system that enhances uplink capacity in low-altitude networks through joint optimization of user association, positioning, and phase shifts, achieving 1.5x higher network capacity with reduced runtime.
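Both versions of the paper solve USBJOP by block-wise alternating optimization; the generic skeleton below captures that control flow, with toy best-response solvers standing in for the CVX and GAI-based subproblem solvers.

```python
import numpy as np

def alternating_optimization(objective, block_solvers, x, rounds=5):
    """Cycle over variable blocks (user association, UAV positions, SIM
    phase shifts), solving each subproblem with the other blocks held
    fixed, until the rounds are exhausted or the objective converges."""
    for _ in range(rounds):
        for name, solve in block_solvers.items():
            x[name] = solve(x)          # update one block, others fixed
    return x, objective(x)

# Toy separable objective: maximize -(|assoc-1|^2 + |pos-2|^2 + |phase|^2),
# whose block-wise best responses are constants (stand-ins for AUUOP/ULOP/USPSOP).
objective = lambda x: -(np.sum((x["assoc"] - 1) ** 2)
                        + np.sum((x["pos"] - 2) ** 2)
                        + np.sum(x["phase"] ** 2))
solvers = {
    "assoc": lambda x: np.ones_like(x["assoc"]),
    "pos":   lambda x: np.full_like(x["pos"], 2.0),
    "phase": lambda x: np.zeros_like(x["phase"]),
}
x0 = {k: np.random.randn(4) for k in solvers}
x_opt, val = alternating_optimization(objective, solvers, x0)
print(val == 0.0)  # True: each block reached its best response
```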
Authors:Lingkai Meng, Yu Shao, Long Yuan, Longbin Lai, Peng Cheng, Xue Li, Wenyuan Yu, Wenjie Zhang, Xuemin Lin, Jingren Zhou
Abstract:
The rise of graph analytics platforms has led to the development of various benchmarks for evaluating and comparing platform performance. However, existing benchmarks often fall short of fully assessing performance due to limitations in core algorithm selection, data generation processes (and the corresponding synthetic datasets), as well as the neglect of API usability evaluation. To address these shortcomings, we propose a novel graph analytics benchmark. First, we select eight core algorithms by extensively reviewing both academic and industrial settings. Second, we design an efficient and flexible data generator and produce eight new synthetic datasets as the default datasets for our benchmark. Lastly, we introduce a multi-level large language model (LLM)-based framework for API usability evaluation, the first of its kind in graph analytics benchmarks. We conduct comprehensive experimental evaluations on existing platforms (GraphX, PowerGraph, Flash, Grape, Pregel+, Ligra and G-thinker). The experimental results demonstrate the superiority of our proposed benchmark.
Chinese: 本文提出了一种新型图分析基准测试,通过精选八种核心算法、利用高效生成器创建八种合成数据集,并首次引入基于大语言模型的API可用性评估框架,在多平台综合实验中证明了其优越性。
English: This paper introduces a novel graph analytics benchmark that addresses limitations in existing benchmarks by selecting eight core algorithms, creating eight synthetic datasets with an efficient generator, and implementing the first LLM-based API usability evaluation framework, demonstrating its superiority through comprehensive experiments on multiple platforms.
Authors:Maciej Besta, Shriram Chandran, Jakub Cudak, Patrick Iff, Marcin Copik, Robert Gerstenberger, Tomasz Szydlo, Jürgen Müller, Torsten Hoefler
Abstract:
Recent advances in graph databases (GDBs) have been driving interest in large-scale analytics, yet current systems fail to support higher-order (HO) interactions beyond first-order (one-hop) relations, which are crucial for tasks such as subgraph counting, polyadic modeling, and HO graph learning. We address this by introducing a new class of systems, higher-order graph databases (HO-GDBs), which use lifting and lowering paradigms to seamlessly extend traditional GDBs with HO support. We provide a theoretical analysis of OLTP and OLAP queries, ensuring correctness, scalability, and ACID compliance. We implement a lightweight, modular, and parallelizable HO-GDB prototype that offers native support for hypergraphs, node-tuples, subgraphs, and other HO structures under a unified API. The prototype scales to large HO OLTP & OLAP workloads and shows how HO improves analytical tasks, for example enhancing the accuracy of graph neural networks within a GDB by 44%. Our work ensures low latency and high query throughput, and generalizes to both ACID-compliant and eventually consistent systems.
中文: 本文提出高阶图数据库(HO-GDBs),通过提升和降低范式扩展传统系统以支持超越一阶关系的复杂交互,实现了一个轻量级、可扩展的原型,提供超图等统一API支持,并在分析任务中显著提升性能,如图神经网络准确率提高44%。
English: This paper introduces higher-order graph databases (HO-GDBs) that extend traditional systems to support complex interactions beyond first-order relations, featuring a scalable prototype with unified API support for hypergraphs and improved analytical performance, such as boosting graph neural network accuracy by 44%.
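To make the lifting and lowering idea above concrete, here is a minimal Python sketch that lowers a toy hypergraph to a first-order bipartite structure by materializing each hyperedge as an explicit node. This is a generic illustration of the paradigm, not the HO-GDB prototype's API; the class and method names are invented.

```python
# Toy illustration of the "lowering" idea behind HO-GDBs: a hypergraph
# is lowered to an ordinary (first-order) bipartite graph by turning
# every hyperedge into an explicit node. Generic sketch, not the
# paper's prototype or its API.
class Hypergraph:
    def __init__(self):
        self.nodes = set()
        self.hyperedges = {}  # edge id -> frozenset of member nodes

    def add_hyperedge(self, eid, members):
        self.nodes.update(members)
        self.hyperedges[eid] = frozenset(members)

    def lower(self):
        """Lower to a bipartite adjacency dict usable by any first-order GDB."""
        adj = {n: set() for n in self.nodes}
        for eid, members in self.hyperedges.items():
            adj[eid] = set(members)          # the hyperedge becomes a node
            for n in members:
                adj[n].add(eid)
        return adj

hg = Hypergraph()
hg.add_hyperedge("e1", {"a", "b", "c"})   # a polyadic (3-way) relation
hg.add_hyperedge("e2", {"b", "d"})
print(hg.lower())
```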
Authors:Jiachi Chen, Yiming Shen, Jiashuo Zhang, Zihao Li, John Grundy, Zhenzhe Shao, Yanlin Wang, Jiashui Wang, Ting Chen, Zibin Zheng
Abstract:
High-quality smart contract vulnerability datasets are critical for evaluating security tools and advancing smart contract security research. Two major limitations of current manual dataset construction are (1) labor-intensive and error-prone annotation processes limiting the scale, quality, and evolution of the dataset, and (2) the absence of standardized classification rules, which results in inconsistent vulnerability categories and labeling results across different datasets. To address these limitations, we present FORGE, the first automated approach for constructing smart contract vulnerability datasets. FORGE leverages an LLM-driven pipeline to extract high-quality vulnerabilities from real-world audit reports and classify them according to the CWE, the most widely recognized classification in software security. FORGE employs a divide-and-conquer strategy to extract structured and self-contained vulnerability information from these reports. Additionally, it uses a tree-of-thoughts technique to classify the vulnerability information into the hierarchical CWE classification. To evaluate FORGE's effectiveness, we run FORGE on 6,454 real-world audit reports and generate a dataset comprising 81,390 Solidity files and 27,497 vulnerability findings across 296 CWE categories. Manual assessment of the dataset demonstrates high extraction precision and classification consistency with human experts (precision of 95.6% and inter-rater agreement k-α of 0.87). We further validate the practicality of our dataset by benchmarking 13 existing security tools on our dataset. The results reveal significant limitations in current detection capabilities. Furthermore, by analyzing the severity-frequency distribution patterns through a unified CWE perspective in our dataset, we highlight the inconsistency between the current smart contract research focus and the priorities identified from real-world vulnerabilities...
中文: FORGE提出首个自动化方法,通过大语言模型驱动的流程从真实审计报告中构建智能合约漏洞数据集,以95.6%的精确度实现标准化CWE分类,并揭示了现有安全工具的重要局限性。
English: FORGE introduces an automated LLM-driven pipeline to construct high-quality smart contract vulnerability datasets from audit reports, achieving 95.6% precision and revealing limitations in current security tools through standardized CWE classification.
Authors:Hongzhou Rao, Yanjie Zhao, Wenjie Zhu, Ling Xiao, Meizhen Wang, Haoyu Wang
Abstract:
Concerns about benchmark leakage in large language models for code (Code LLMs) have raised issues of data contamination and inflated evaluation metrics. The diversity and inaccessibility of many training datasets make it difficult to prevent data leakage entirely, even with time lag strategies. Consequently, generating new datasets through code perturbation has become essential. However, existing methods often fail to produce complex and diverse variations, struggle with complex cross-file dependencies, and lack support for multiple programming languages, which limits their effectiveness in enhancing LLM evaluations for coding tasks. To fill this gap, we propose CodeMorph, an approach designed to support multiple programming languages while preserving cross-file dependencies to mitigate data leakage. CodeMorph consists of two main components that work together to enhance the perturbation process. The first component employs 26 semantic-preserving transformation methods to iteratively perturb code, generating diverse variations while ensuring that the modified code remains compilable. The second component introduces a genetic algorithm-based selection algorithm, PESO, to identify the most effective perturbation method for each iteration by targeting lower similarity scores between the perturbed and original code, thereby enhancing overall perturbation effectiveness. Experimental results demonstrate that after applying CodeMorph, the accuracy of the LLM on code completion tasks across five programming languages decreased by an average of 24.67%, with Python showing the most significant reduction at 45%. The similarity score of code optimized by PESO is, on average, 7.01% lower than that of randomly perturbed code, peaking at a reduction of 42.86%.
中文摘要:CodeMorph通过语义保持的代码变换和遗传算法解决代码大模型的基准泄露问题,在五种编程语言上使模型准确率平均下降24.67%,同时确保代码可编译性。
English Summary: CodeMorph addresses benchmark leakage in Code LLMs through semantic-preserving transformations and a genetic algorithm, reducing model accuracy by 24.67% on average across five programming languages while maintaining code compilability.
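As a flavor of what a semantic-preserving transformation looks like, here is a minimal Python-only sketch of one of the simplest: identifier renaming through the standard ast module. CodeMorph composes 26 such transforms across several languages and verifies compilability; the mapping and sample source below are toys.

```python
# One of the simplest semantic-preserving perturbations: rename local
# identifiers with the standard `ast` module, then check the result
# still compiles. A toy sketch; CodeMorph composes 26 such transforms
# across multiple languages.
import ast

class RenameVars(ast.NodeTransformer):
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)  # rename if mapped
        return node

src = "def area(w, h):\n    s = w * h\n    return s\n"
tree = RenameVars({"s": "v0"}).visit(ast.parse(src))
perturbed = ast.unparse(tree)              # requires Python 3.9+
print(perturbed)                           # behavior-identical, textually different
compile(perturbed, "<perturbed>", "exec")  # sanity check: still compiles
```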
Authors:Peter Belcak, Greg Heinrich, Jan Kautz, Pavlo Molchanov
Abstract:
Finetuning language models for a new domain inevitably leads to the deterioration of their general performance. The deterioration becomes more pronounced as the finetuning data grows more limited.
We introduce minifinetuning (MFT), a method for language model domain adaptation that considerably reduces the effects of overfitting-induced degeneralization in low-data settings and does so in the absence of any pre-training data for replay. MFT demonstrates 2-10x more favourable specialization-to-degeneralization ratios than standard finetuning across a wide range of models and domains and exhibits an intrinsic robustness to overfitting when data in the new domain is scarce, down to as few as 500 samples.
Employing corrective self-distillation that is individualized on the sample level, MFT outperforms parameter-efficient finetuning methods, demonstrates replay-like degeneralization mitigation properties, and is composable with either for a combined effect.
中文: 小样本微调(MFT)是一种新颖的领域自适应方法,能在低数据条件下有效缓解语言模型因过拟合导致的性能退化,其效果优于标准微调2-10倍,且仅需500个样本即展现稳健性能。
English: Minifinetuning (MFT) is a novel domain adaptation method that effectively mitigates overfitting-induced performance degradation in language models under low-data conditions, outperforming standard finetuning by 2-10x and demonstrating robustness with as few as 500 samples.
Authors:Yuxuan Jiang, Siyue Teng, Qiang Zhu, Chen Feng, Chengxi Zeng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull
Abstract:
This paper presents a general-purpose video super-resolution (VSR) method, dubbed VSR-HE, specifically designed to enhance the perceptual quality of compressed content. Targeting scenarios characterized by heavy compression, the method upscales low-resolution videos by a ratio of four, from 180p to 720p or from 270p to 1080p. VSR-HE adopts hierarchical encoding transformer blocks and has been carefully optimized to eliminate a wide range of compression artifacts commonly introduced by H.265/HEVC encoding across various quantization parameter (QP) levels. To ensure robustness and generalization, the model is trained and evaluated under diverse compression settings, allowing it to effectively restore fine-grained details and preserve visual fidelity. The proposed VSR-HE has been officially submitted to the ICME 2025 Grand Challenge on VSR for Video Conferencing (Team BVI-VSR), under both Track 1 (General-Purpose Real-World Video Content) and Track 2 (Talking Head Videos).
中文: 本文提出了一种通用视频超分辨率方法VSR-HE,通过采用分层编码变换器模块,能够将重度压缩视频放大四倍,在有效消除H.265/HEVC编码伪影的同时保持视觉保真度。
English: This paper introduces VSR-HE, a general-purpose video super-resolution method that uses hierarchical encoding transformer blocks to upscale heavily compressed videos by four times while effectively removing H.265/HEVC artifacts and preserving visual quality.
Authors:Chongjun Ouyang, Zhaolin Wang, Yuanwei Liu, Zhiguo Ding
Abstract:
Unlike conventional systems using a fixed-location antenna, the channel capacity of the pinching-antenna system (PASS) is determined by the activated positions of pinching antennas. This article characterizes the capacity region of multiuser PASS, where a single pinched waveguide is deployed to enable both uplink and downlink communications. The capacity region of the uplink channel is first characterized. (i) For the single-pinch case, closed-form expressions are derived for the optimal antenna activation position, along with the corresponding capacity region and the achievable data rate regions under time-division multiple access (TDMA) and frequency-division multiple access (FDMA). It is proven that the capacity region of PASS encompasses that of conventional fixed-antenna systems, and that the FDMA rate region contains the TDMA rate region. (ii) For the multiple-pinch case, inner and outer bounds on the capacity region are derived using an element-wise alternating antenna position optimization technique and the Cauchy-Schwarz inequality, respectively. The achievable FDMA rate region is also derived using the same optimization framework, while the TDMA rate region is obtained through an antenna position refinement approach. The analysis is then extended to the downlink PASS using the uplink-downlink duality framework. It is proven that the relationships among the downlink capacity and rate regions are consistent with those in the uplink case. Numerical results demonstrate that: (i) the derived bounds closely approximate the exact capacity region, (ii) PASS yields a significantly enlarged capacity region compared to conventional fixed-antenna systems, and (iii) in the multiple-pinch case, TDMA and FDMA are capable of approaching the channel capacity limit.
Chinese: 本文分析了多用户夹持天线系统的容量区域,证明其优于传统固定天线系统,并针对单次和多次夹持场景,推导了上下行通信中的最优天线配置方案。
English: This article characterizes the capacity region of multiuser pinching-antenna systems (PASS), proving it surpasses conventional fixed-antenna systems and deriving optimal configurations for both single and multiple pinch scenarios in uplink and downlink communications.
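For orientation, the two-user uplink TDMA and FDMA rate regions being compared have the standard textbook form below. These are generic expressions, not the paper's derived closed forms, with g_k(x) standing in for user k's effective channel gain at activated antenna position x.

```latex
% Textbook two-user uplink rate regions (generic forms, not the paper's
% exact closed-form results); g_k(x) is the position-dependent gain.
\begin{align}
  \text{TDMA:}\quad R_k &= \tau_k \log_2\!\left(1 + \frac{P_k\, g_k(x)}{N_0}\right),
    \qquad \tau_1 + \tau_2 = 1, \\
  \text{FDMA:}\quad R_k &= \alpha_k \log_2\!\left(1 + \frac{P_k\, g_k(x)}{\alpha_k N_0}\right),
    \qquad \alpha_1 + \alpha_2 = 1.
\end{align}
% Since \alpha \log_2(1 + s/\alpha) \ge \alpha \log_2(1 + s) for
% 0 < \alpha \le 1, every TDMA point is dominated by an FDMA point,
% consistent with the containment the paper proves.
```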
Authors:Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng
Abstract:
Short-utterance speaker verification presents significant challenges due to the limited information in brief speech segments, which can undermine accuracy and reliability. Recently, zero-shot text-to-speech (ZS-TTS) systems have made considerable progress in preserving speaker identity. In this study, we explore, for the first time, the use of ZS-TTS systems for test-time data augmentation for speaker verification. We evaluate three state-of-the-art pre-trained ZS-TTS systems, NaturalSpeech 3, CosyVoice, and MaskGCT, on the VoxCeleb 1 dataset. Our experimental results show that combining real and synthetic speech samples leads to 10%-16% relative equal error rate (EER) reductions across all durations, with particularly notable improvements for short utterances, all without retraining any existing systems. However, our analysis reveals that longer synthetic speech does not yield the same benefits as longer real speech in reducing EERs. These findings highlight the potential and challenges of using ZS-TTS for test-time speaker verification, offering insights for future research.
中文摘要:本研究首次探索使用零样本文本转语音系统进行测试时数据增强,显著提升了短语音说话人验证的准确性,在不重新训练系统的情况下实现了10%-16%的相对等错误率降低。
English Summary: This study demonstrates that using zero-shot text-to-speech systems for test-time data augmentation significantly improves speaker verification accuracy, particularly for short utterances, achieving 10%-16% relative EER reductions without system retraining.
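The relative EER reductions quoted above are computed from genuine (same-speaker) and impostor (different-speaker) score distributions; a minimal, generic EER routine, not the authors' evaluation code, looks like this:

```python
# Minimal equal error rate (EER) computation from verification scores.
# Generic utility for illustration, not the paper's evaluation code.
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """EER: the operating point where false-accept and false-reject rates meet."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))   # threshold where the two curves cross
    return (far[i] + frr[i]) / 2.0

rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 0.5, 1000)   # same-speaker trial scores
impostor = rng.normal(0.0, 0.5, 1000)  # different-speaker trial scores
print(f"EER = {compute_eer(genuine, impostor):.3f}")
```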
Authors:Cui Zhang, Maoxin Ji, Qiong Wu, Pingyi Fan, Qiang Fan
Abstract:
As Internet of Vehicles (IoV) technology continues to advance, edge computing has become an important tool for assisting vehicles in handling complex tasks. However, the process of offloading tasks to edge servers may expose vehicles to malicious external attacks, resulting in information loss or even tampering, thereby creating serious security vulnerabilities. Blockchain technology can maintain a shared ledger among servers. In the Raft consensus mechanism, as long as more than half of the nodes remain operational, the system will not collapse, effectively maintaining the system's robustness and security. To protect vehicle information, we propose a security framework that integrates the Raft consensus mechanism from blockchain technology with edge computing. To address the additional latency introduced by blockchain, we derive a theoretical formula for the system delay and propose a convex optimization solution that minimizes it, ensuring that the system meets the requirements for low latency and high reliability. Simulation results demonstrate that the optimized data extraction rate significantly reduces system delay, with relatively stable variations in latency. Moreover, the proposed optimization solution based on this model can provide valuable insights for enhancing security and efficiency in future network environments, such as 5G and next-generation smart city systems.
中文: 本研究提出了一种融合区块链Raft共识机制与边缘计算的安全框架,以保护车联网中的车辆数据,并通过凸优化方法最小化系统延迟,确保低延迟和高可靠性。
English: The study proposes a blockchain-based security framework integrating Raft consensus with edge computing to protect vehicle data in IoV systems, while employing convex optimization to minimize latency and ensure reliability.
Authors:Sicheng Zuo, Wenzhao Zheng, Xiaoyong Han, Longchao Yang, Yong Pan, Jiwen Lu
Abstract:
3D occupancy prediction is crucial for robust autonomous driving systems as it enables comprehensive perception of environmental structures and semantics. Most existing methods employ dense voxel-based scene representations, ignoring the sparsity of driving scenes and resulting in inefficiency. Recent works explore object-centric representations based on sparse Gaussians, but their ellipsoidal shape prior limits the modeling of diverse structures. In real-world driving scenes, objects exhibit rich geometries (e.g., cuboids, cylinders, and irregular shapes), necessitating excessive ellipsoidal Gaussians densely packed for accurate modeling, which leads to inefficient representations. To address this, we propose to use geometrically expressive superquadrics as scene primitives, enabling efficient representation of complex structures with fewer primitives through their inherent shape diversity. We develop a probabilistic superquadric mixture model, which interprets each superquadric as an occupancy probability distribution with a corresponding geometry prior, and calculates semantics through probabilistic mixture. Building on this, we present QuadricFormer, a superquadric-based model for efficient 3D occupancy prediction, and introduce a pruning-and-splitting module to further enhance modeling efficiency by concentrating superquadrics in occupied regions. Extensive experiments on the nuScenes dataset demonstrate that QuadricFormer achieves state-of-the-art performance while maintaining superior efficiency.
中文: 本文提出QuadricFormer模型,通过采用几何表达力强的超二次曲面作为场景基元,以更少的基元有效表示复杂结构,克服了密集体素和椭球高斯方法的局限性,在nuScenes数据集上实现了最优性能并保持了卓越效率。
English: This paper introduces QuadricFormer, a superquadric-based model for efficient 3D occupancy prediction that overcomes the limitations of dense voxel and ellipsoidal Gaussian methods by using geometrically expressive superquadrics to represent complex structures with fewer primitives, achieving state-of-the-art performance on the nuScenes dataset.
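The superquadric primitive rests on the classic inside-outside function; the sketch below reads a single primitive as a soft occupancy probability. It is illustrative only: the sigmoid temperature tau is an invented parameter, and QuadricFormer's probabilistic mixture model is considerably more involved.

```python
# Inside-outside test for one superquadric primitive and a soft
# occupancy readout. Illustrative: `tau` is a made-up temperature, and
# the paper's probabilistic superquadric mixture is more involved.
import numpy as np

def superquadric_F(p, scale, e1, e2):
    """Classic inside-outside function: F < 1 inside, F = 1 on the surface."""
    x = np.abs(p[..., 0] / scale[0])
    y = np.abs(p[..., 1] / scale[1])
    z = np.abs(p[..., 2] / scale[2])
    return (x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1) + z ** (2 / e1)

def occupancy_prob(p, scale, e1, e2, tau=0.1):
    """Soft occupancy: ~1 deep inside the primitive, ~0 far outside."""
    return 1.0 / (1.0 + np.exp(-(1.0 - superquadric_F(p, scale, e1, e2)) / tau))

pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.0, 0.0]])
# e1 = e2 = 0.1 gives a near-cuboid; e1 = e2 = 1.0 gives an ellipsoid.
print(occupancy_prob(pts, scale=(1.0, 1.0, 1.0), e1=0.1, e2=0.1))
```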
Authors:Kaifeng He, Mingwei Liu, Chong Wang, Zike Li, Yanlin Wang, Xin Peng, Zibin Zheng
Abstract:
Code generation with large language models (LLMs) is highly sensitive to token selection during decoding, particularly at decision points where uncertainty strongly affects program correctness. Conventional strategies such as greedy decoding treat all tokens uniformly and fail to capture the uncertainty characteristics unique to code, often resulting in suboptimal outputs. In this work, we conduct an empirical analysis and show that a large fraction of generation errors arises from token misranking at high-uncertainty positions, where the correct token is available but not prioritized. To address this, we introduce AdaDec, an adaptive decoding framework that employs a lookahead-based, uncertainty-aware pause-and-rerank mechanism. AdaDec automatically learns model-specific uncertainty thresholds and selectively invokes reranking when high uncertainty is detected, leveraging lookahead to refine token choice. Across HumanEval+, MBPP+, and DevEval benchmarks, AdaDec yields substantial improvements, achieving up to 20.9% absolute gains in Pass@1 accuracy compared with greedy decoding, while consistently outperforming prior adaptive decoding approaches such as AdapT. Furthermore, by applying reranking only when necessary, AdaDec reduces computational overhead and latency, enhancing efficiency alongside reliability. These findings underscore the value of uncertainty-guided decoding strategies in advancing the robustness and practicality of LLM-based code generation.
中文摘要:AdaDec自适应解码框架通过基于前瞻的不确定性感知暂停与重排机制,在检测到高不确定性时优化词汇选择,显著提升代码生成准确率最高达20.9%,同时有效降低计算负载。
English Summary: AdaDec, an adaptive decoding framework, significantly improves code generation by using uncertainty-aware pause-and-rerank mechanisms to correct token misranking at critical decision points, achieving up to 20.9% higher accuracy with reduced computational overhead.
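The pause-and-rerank idea can be sketched in a few lines. The fixed entropy threshold and toy lookahead scorer below are stand-ins: AdaDec learns model-specific thresholds and scores candidates with the LLM's own lookahead, so treat this as a conceptual sketch only.

```python
# Conceptual sketch of uncertainty-gated "pause-and-rerank" decoding.
# The entropy threshold and lookahead scorer are illustrative stand-ins;
# AdaDec learns model-specific thresholds and uses the LLM for lookahead.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def decode_step(next_token_probs, lookahead_score, threshold=1.0, top_k=4):
    """Greedy pick at low uncertainty; otherwise rerank top-k via lookahead."""
    if entropy(next_token_probs) < threshold:
        return int(np.argmax(next_token_probs))         # cheap path: greedy
    candidates = np.argsort(next_token_probs)[-top_k:]  # pause: shortlist top-k
    return int(max(candidates, key=lookahead_score))    # rerank by lookahead

# Toy distribution with two near-tied tokens -> high entropy -> rerank.
probs = np.array([0.05, 0.42, 0.41, 0.12])
picked = decode_step(probs, lookahead_score=lambda t: [0.1, 0.2, 0.9, 0.3][t])
print(picked)  # 2: reranking overturns the near-tied greedy choice
```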
Authors:Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, Hongsheng Li
Abstract:
We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation and the generation of diverse, region-specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed to be lightweight and efficient, while also demonstrating strong performance across a diverse range of region understanding tasks. It runs 1.2-2.4x faster and consumes less GPU memory than prior approaches, offering a practical solution for real-world applications. We believe that our effective approach will serve as a strong baseline for future research in region-level visual understanding.
中文: 感知万物模型(PAM)是一个高效框架,通过将大语言模型集成到SAM 2中,实现了同步对象分割与多样化语义输出,在保持轻量化的同时显著提升了区域级视觉理解任务的性能。
English: The Perceive Anything Model (PAM) is an efficient framework that enhances SAM 2 with Large Language Models to perform simultaneous object segmentation and generate diverse semantic outputs, achieving faster performance with reduced resource usage for comprehensive region-level visual understanding.
Authors:Yinglin Xie, Xinyi Hou, Yanjie Zhao, Shenao Wang, Kai Chen, Haoyu Wang
Abstract:
Vector database management systems (VDBMSs) play a crucial role in facilitating semantic similarity searches over high-dimensional embeddings from diverse data sources. While VDBMSs are widely used in applications such as recommendation, retrieval-augmented generation (RAG), and multimodal search, their reliability remains underexplored. Traditional database reliability models cannot be directly applied to VDBMSs because of fundamental differences in data representation, query mechanisms, and system architecture. To address this gap, we present the first large-scale empirical study of software defects in VDBMSs. We manually analyzed 1,671 bug-fix pull requests from 15 widely used open-source VDBMSs and developed a comprehensive taxonomy of bugs based on symptoms, root causes, and developer fix strategies. Our study identifies five categories of bug symptoms, with more than half manifesting as functional failures. We further reveal 31 recurring fault patterns and highlight failure modes unique to vector search systems. In addition, we summarize 12 common fix strategies, whose distribution underscores the critical importance of correct program logic. These findings provide actionable insights into VDBMS reliability challenges and offer guidance for building more robust future systems.
中文: 本研究首次对向量数据库管理系统(VDBMS)的软件缺陷进行大规模实证分析,通过人工检查15个开源项目的1,671份错误报告,识别出关键错误模式和修复策略,以解决这些系统的可靠性问题。
English: This study presents the first large-scale empirical analysis of software defects in vector database management systems (VDBMSs), identifying key bug patterns and fix strategies through manual examination of 1,671 bug reports from 15 open-source projects to address reliability gaps in these systems.
Authors:Wei Huang, Hanchen Wang, Dong Wen, Shaozhen Ma, Wenjie Zhang, Xuemin Lin
Abstract:
Graph Edit Distance (GED) is a fundamental graph similarity metric widely used in various applications. However, computing GED is an NP-hard problem. A recent state-of-the-art hybrid GED solver has shown promising performance by formulating GED as a bipartite graph matching problem and then leveraging a generative diffusion model to predict node matching between two graphs, from which both the GED and its corresponding edit path can be extracted using a traditional algorithm. However, such methods typically rely heavily on ground-truth supervision, and the ground-truth labels are often costly to obtain in real-world scenarios. In this paper, we propose GEDRanker, a novel unsupervised GAN-based framework for GED computation. Specifically, GEDRanker consists of a matching-based GED solver and introduces an interpretable preference-aware discriminator with an effective training strategy to guide the matching-based GED solver toward generating high-quality node matching without the need for ground-truth labels. Extensive experiments on benchmark datasets demonstrate that our GEDRanker enables the matching-based GED solver to achieve near-optimal solution quality without any ground-truth supervision.
Chinese: 图编辑距离(GED)是一个NP难问题,而提出的GEDRanker是一种基于GAN的无监督框架,通过偏好感知判别器引导节点匹配,无需真实标签即可有效计算GED。
English: Graph Edit Distance (GED) is an NP-hard problem, and the proposed GEDRanker is an unsupervised GAN-based framework that effectively computes GED without requiring ground-truth labels by using a preference-aware discriminator to guide node matching.
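For a sense of the underlying quantity, exact GED is only tractable on tiny graphs; the networkx baseline below computes it exactly, which is what learned solvers such as GEDRanker approximate at scale.

```python
# Exact GED on small graphs with networkx -- the exponential-cost
# baseline that learned solvers such as GEDRanker approximate on
# graphs too large for exact search.
import networkx as nx

g1 = nx.path_graph(4)    # 0-1-2-3
g2 = nx.cycle_graph(4)   # 0-1-2-3-0
print(nx.graph_edit_distance(g1, g2))  # 1.0: one edge insertion
```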
Authors:Zehao Wu, Yanjie Zhao, Haoyu Wang
Abstract:
As Large Language Models (LLMs) become integral software components in modern applications, unauthorized model derivations through fine-tuning, merging, and redistribution have emerged as critical software engineering challenges. Unlike traditional software, where clone detection and license compliance are well-established, the LLM ecosystem lacks effective mechanisms to detect model lineage and enforce licensing agreements. This gap is particularly problematic when open-source model creators, such as Meta with its LLaMA family, require derivative works to maintain naming conventions for attribution, yet no technical means exist to verify compliance.
To fill this gap, treating LLMs as software artifacts requiring provenance tracking, we present TensorGuard, a gradient-based fingerprinting framework for LLM similarity detection and family classification. Our approach extracts model-intrinsic behavioral signatures by analyzing gradient responses to random input perturbations across tensor layers, operating independently of training data, watermarks, or specific model formats. TensorGuard supports the widely-adopted safetensors format and constructs high-dimensional fingerprints through statistical analysis of gradient features. These fingerprints enable two complementary capabilities: direct pairwise similarity assessment between arbitrary models through distance computation, and systematic family classification of unknown models via the K-Means clustering algorithm with domain-informed centroid initialization using known base models. Experimental evaluation on 58 models comprising 8 base models and 50 derivatives across five model families (Llama, Qwen, Gemma, Phi, Mistral) demonstrates 94% classification accuracy under our centroid-initialized K-Means clustering.
中文摘要:TensorGuard是一种基于梯度的指纹框架,通过分析大语言模型的行为特征来检测未经授权的模型衍生,在不依赖训练数据或水印的情况下实现了94%的模型谱系分类准确率。
English Summary: TensorGuard is a gradient-based fingerprinting framework that detects unauthorized derivations of Large Language Models by analyzing their behavioral signatures, achieving 94% accuracy in classifying model lineages without relying on training data or watermarks.
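The gradient-fingerprinting idea can be sketched on toy models: probe each model with shared random inputs, summarize per-layer gradient statistics into a feature vector, and compare vectors by cosine similarity. The feature set and probing scheme below are simplifications for illustration, not TensorGuard's actual pipeline.

```python
# Sketch of a gradient-based model fingerprint: shared random probes,
# per-layer gradient statistics, cosine similarity between fingerprints.
# A simplified illustration, not TensorGuard's feature set or pipeline.
import torch
import torch.nn as nn

def fingerprint(model, n_probes=8, dim=16, seed=0):
    torch.manual_seed(seed)                # shared probes across models
    feats = []
    for _ in range(n_probes):
        model.zero_grad()
        x = torch.randn(4, dim)
        model(x).norm().backward()         # gradient response to the probe
        for p in model.parameters():
            g = p.grad
            feats += [g.mean().item(), g.std().item(), g.abs().max().item()]
    return torch.tensor(feats)

def similarity(fp_a, fp_b):
    return torch.nn.functional.cosine_similarity(fp_a, fp_b, dim=0).item()

base = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 8))
derivative = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 8))
derivative.load_state_dict(base.state_dict())
with torch.no_grad():                      # mimic light fine-tuning
    for p in derivative.parameters():
        p.add_(0.01 * torch.randn_like(p))
unrelated = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 8))

print(similarity(fingerprint(base), fingerprint(derivative)))  # close to 1
print(similarity(fingerprint(base), fingerprint(unrelated)))   # clearly lower
```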
Authors:Adrian Azzarelli, Ge Gao, Ho Man Kwan, Fan Zhang, Nantheera Anantrasirichai, Ollie Moolan-Feroze, David Bull
Abstract:
As research on neural volumetric video reconstruction and compression flourishes, there is a need for diverse and realistic datasets, which can be used to develop and validate reconstruction and compression models. However, existing volumetric video datasets lack diverse content in terms of both semantic and low-level features that are commonly present in real-world production pipelines. In this context, we propose a new dataset, ViVo, for VolumetrIc VideO reconstruction and compression. The dataset is faithful to real-world volumetric video production and is the first dataset to extend the definition of diversity to include both human-centric characteristics (skin, hair, etc.) and dynamic visual phenomena (transparent, reflective, liquid, etc.). Each video sequence in this database contains raw data including fourteen multi-view RGB and depth video pairs, synchronized at 30FPS with per-frame calibration and audio data, and their associated 2-D foreground masks and 3-D point clouds. To demonstrate the use of this database, we have benchmarked three state-of-the-art (SotA) 3-D reconstruction methods and two volumetric video compression algorithms. The obtained results evidence the challenging nature of the proposed dataset and the limitations of existing datasets for both volumetric video reconstruction and compression tasks, highlighting the need to develop more effective algorithms for these applications. The database and the associated results are available at https://vivo-bvicr.github.io/
中文: ViVo数据集通过融合以人为中心的特征和动态视觉现象,解决了现有体视频数据集多样性不足的问题,其提供的原始数据和基准测试揭示了当前重建与压缩方法的局限性。
English: The ViVo dataset addresses the lack of diversity in existing volumetric video datasets by incorporating both human-centric features and dynamic visual phenomena, providing raw data and benchmarks that reveal current reconstruction and compression methods' limitations.
Authors:Zixun Fang, Kai Zhu, Zhiheng Liu, Yu Liu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Abstract:
Panoramic video generation aims to synthesize 360-degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high-quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine-grained visual details simultaneously. With our proposed Pano-Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state-of-the-art performance and surpassing previous methods.
Chinese: 本文提出了一种新颖框架,利用预训练的视角视频模型,通过引入ViewPoint地图表示和全景-视角注意力机制,有效弥合模态差异,在生成动态且空间一致的360度全景视频方面实现了最先进的性能。
English: This paper introduces a novel framework that leverages pretrained perspective video models to generate high-quality panoramic videos by employing a ViewPoint map representation and a Pano-Perspective attention mechanism, effectively bridging the modality gap and achieving state-of-the-art results in dynamic and spatially consistent 360-degree video synthesis.
Authors:Yuqin Dai, Wanlu Zhu, Ronghui Li, Xiu Li, Zhenyu Zhang, Jun Li, Jian Yang
Abstract:
Music-driven dance generation has garnered significant attention due to its wide range of industrial applications, particularly in the creation of group choreography. During the group dance generation process, however, most existing methods still face three primary issues: multi-dancer collisions, single-dancer foot sliding and abrupt swapping in the generation of long group dance. In this paper, we propose TCDiff++, a music-driven end-to-end framework designed to generate harmonious group dance. Specifically, to mitigate multi-dancer collisions, we utilize a dancer positioning embedding to better maintain the relative positioning among dancers. Additionally, we incorporate a distance-consistency loss to ensure that inter-dancer distances remain within plausible ranges. To address the issue of single-dancer foot sliding, we introduce a swap mode embedding to indicate dancer swapping patterns and design a Footwork Adaptor to refine raw motion, thereby minimizing foot sliding. For long group dance generation, we present a long group diffusion sampling strategy that reduces abrupt position shifts by injecting positional information into the noisy input. Furthermore, we integrate a Sequence Decoder layer to enhance the model's ability to selectively process long sequences. Extensive experiments demonstrate that our TCDiff++ achieves state-of-the-art performance, particularly in long-duration scenarios, ensuring high-quality and coherent group dance generation.
中文: 本文提出TCDiff++框架,通过舞者定位嵌入、距离一致性损失和足部适配器等创新方法,有效解决了群体舞蹈生成中的碰撞、滑步和位置突变问题,在长序列生成中实现了最优性能。
English: This paper introduces TCDiff++, an end-to-end framework that addresses multi-dancer collisions, foot sliding, and abrupt swapping in music-driven group dance generation through specialized embeddings and sampling strategies, achieving state-of-the-art performance in long-duration scenarios.
Authors:Xianren Zhang, Hui Liu, Delvin Ce Zhang, Xianfeng Tang, Qi He, Dongwon Lee, Suhang Wang
Abstract:
Multimodal Large Language Models (MLLMs) trained on massive data may memorize sensitive personal information and photos, posing serious privacy risks. To mitigate this, MLLM unlearning methods have been proposed, which fine-tune MLLMs to "forget" sensitive information. However, it remains unclear whether the knowledge has been truly forgotten or just hidden in the model. Therefore, we propose to study a novel problem of MLLM unlearning attack, which aims to recover the unlearned knowledge of an unlearned MLLM. To achieve this goal, we propose the Stealthy Unlearning Attack (SUA) framework, which learns a universal noise pattern. When applied to input images, this noise can trigger the model to reveal unlearned content. While pixel-level perturbations may be visually subtle, they can be detected in the semantic embedding space, making such attacks vulnerable to potential defenses. To improve stealthiness, we introduce an embedding alignment loss that minimizes the difference between the perturbed and denoised image embeddings, ensuring the attack is semantically unnoticeable. Experimental results show that SUA can effectively recover unlearned information from MLLMs. Furthermore, the learned noise generalizes well: a single perturbation trained on a subset of samples can reveal forgotten content in unseen images. This indicates that knowledge reappearance is not an occasional failure, but a consistent behavior.
中文: 多模态大语言模型存在记忆敏感数据的风险,其遗忘方法可能仅隐藏而非真正删除信息,为此提出的隐蔽遗忘攻击框架通过通用噪声和嵌入对齐技术,能在保持语义隐蔽的同时有效恢复已遗忘内容。
English: Multimodal Large Language Models risk memorizing sensitive data, prompting unlearning methods that may only hide rather than erase information, leading to the proposed Stealthy Unlearning Attack framework which uses universal noise to recover unlearned content while maintaining semantic stealth through embedding alignment.
Authors:Yejing Wang, Shengyu Zhou, Jinyu Lu, Qidong Liu, Xinhang Li, Wenlin Zhang, Feng Li, Pengjie Wang, Jian Xu, Bo Zheng, Xiangyu Zhao
Abstract:
Generative recommendations (GR), which usually include item tokenizers and generative Large Language Models (LLMs), have demonstrated remarkable success across a wide range of scenarios. The majority of existing research efforts primarily concentrate on developing powerful item tokenizers or advancing LLM decoding strategies to attain superior performance. However, the critical fine-tuning step in GR frameworks, which is essential for adapting LLMs to recommendation data, remains largely unexplored. Current approaches predominantly rely on either the next-token prediction loss of supervised fine-tuning (SFT) or recommendation-specific direct preference optimization (DPO) strategies. Both methods ignore the exploration of possible positive unobserved samples, which is commonly referred to as the exposure bias problem. To mitigate this problem, this paper treats GR as a multi-step generation task and constructs a GFlowNets-based fine-tuning framework (GFlowGR). The proposed framework integrates collaborative knowledge from traditional recommender systems to create an adaptive trajectory sampler and a comprehensive reward model. Leveraging the diverse generation property of GFlowNets, along with sampling and heuristic weighting techniques, GFlowGR emerges as a promising approach to mitigate the exposure bias problem. Extensive empirical results on two real-world datasets and with two different GR backbones highlight the effectiveness and robustness of GFlowGR.
中文摘要:本文提出GFlowGR框架,通过整合协同知识并利用多样化生成特性,有效缓解生成式推荐中的曝光偏差问题。
English Summary: This paper introduces GFlowGR, a GFlowNets-based fine-tuning framework that addresses the exposure bias problem in generative recommendations by integrating collaborative knowledge and leveraging diverse generation properties.
Authors:Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang
Abstract:
Solutions for defending against deepfake speech fall into two categories: proactive watermarking models and passive conventional deepfake detectors. While both address common threats, their differences in training, optimization, and evaluation prevent a unified protocol for jointly evaluating them and selecting the best solution for a given case. This work proposes a framework to evaluate both model types in deepfake speech detection. To ensure fair comparison and minimize discrepancies, all models were trained and tested on common datasets, with performance evaluated using a shared metric. We also analyze their robustness against various adversarial attacks, showing that different models exhibit distinct vulnerabilities to different speech attribute distortions. Our training and evaluation code is available on GitHub.
中文摘要:本研究提出了一个统一框架,通过使用公共数据集和共享指标对主动水印模型与被动深度伪造检测器进行公平评估,同时分析了它们对不同语音属性篡改的独特脆弱性。
English Summary: This study introduces a unified framework to fairly evaluate both proactive watermarking and passive deepfake detection models by training and testing them on common datasets with shared metrics, while also analyzing their distinct vulnerabilities to adversarial attacks.
Authors:Bin Zhu, Hailong Yin, Jingjing Chen, Yu-Gang Jiang
Abstract:
Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Building on the insights of this evaluation, and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate reasoning models' ability to defend their beliefs under gaslighting negation prompts. Constructed by filtering and curating 1,025 challenging samples from existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings reveal fundamental limitations in the robustness of reasoning models, highlighting the gap between step-by-step reasoning and belief persistence.
中文: 最新研究表明,顶尖推理模型在遭遇误导性输入时准确率显著下降,这暴露了逐步推理能力与信念坚持之间的关键差距。
English: Recent research reveals that leading reasoning models experience significant accuracy drops when exposed to manipulative gaslighting prompts, exposing critical vulnerabilities in their belief persistence despite advanced reasoning capabilities.
Authors:Shafique Ahmed, Ryandhimas E. Zezario, Nasir Saleem, Amir Hussain, Hsin-Min Wang, Yu Tsao
Abstract:
Non-intrusive assessment of speech quality and intelligibility is essential when clean reference signals are unavailable. In this work, we propose a multimodal framework that integrates audio features and visual cues to predict PESQ and STOI scores. It employs a dual-branch architecture, where spectral features are extracted using STFT, and visual embeddings are obtained via a visual encoder. These features are then fused and processed by a CNN-BLSTM with attention, followed by multi-task learning to simultaneously predict PESQ and STOI. Evaluations on the LRS3-TED dataset, augmented with noise from the DEMAND corpus, show that our model outperforms the audio-only baseline. Under seen noise conditions, it improves LCC by 9.61% (0.8397->0.9205) for PESQ and 11.47% (0.7403->0.8253) for STOI. These results highlight the effectiveness of incorporating visual cues in enhancing the accuracy of non-intrusive speech assessment.
中文: 本研究提出一种融合音频和视觉线索的多模态框架,用于预测语音质量与清晰度指标,在噪声增强数据集上的实验表明,该模型通过显著提升相关系数,有效超越了纯音频基线方法的性能。
English: This study introduces a multimodal framework that combines audio and visual inputs to predict speech quality and intelligibility scores, demonstrating superior performance over audio-only methods through significant improvements in correlation metrics on noise-augmented data.
Authors:Jingze Ding, Zijian Zhou, Lipeng Zhu, Yuping Zhao, Bingli Jiao, Rui Zhang
Abstract:
This paper investigates energy efficiency maximization for movable antenna (MA)-aided multi-user uplink communication systems by considering the time delay and energy consumption incurred by practical antenna movement. We first examine the special case with a single user and propose an optimization algorithm based on the one-dimensional (1D) exhaustive search to maximize the user's energy efficiency. Moreover, we derive an upper bound on the energy efficiency and analyze the conditions required to achieve this performance bound under different numbers of channel paths. Then, for the general multi-user scenario, we propose an iterative algorithm to fairly maximize the minimum energy efficiency among all users. Simulation results demonstrate the effectiveness of the proposed scheme in improving energy efficiency compared to existing MA schemes that do not account for movement-related costs, as well as the conventional fixed-position antenna (FPA) scheme. In addition, the results show the robustness of the proposed scheme to imperfect channel state information (CSI) and provide valuable insights for practical system deployment.
中文: 本文针对可移动天线系统中的时延和能耗问题,提出了优化算法以最大化能量效率,相比现有方法展现出更优性能和鲁棒性。
English: This paper presents algorithms to maximize energy efficiency in movable antenna systems by addressing movement-related delays and energy costs, showing superior performance and robustness compared to existing methods.
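The single-user 1D exhaustive search is easy to mimic numerically. The two-path channel and linear movement-cost model below are invented purely for illustration and are not the paper's system model; only the search pattern, sweeping candidate positions and scoring rate over total power, reflects the proposed procedure.

```python
# Toy 1D exhaustive search for an energy-efficient antenna position.
# The two-path channel and linear movement-cost model are invented for
# illustration; only the sweep-and-score pattern mirrors the paper.
import numpy as np

positions = np.linspace(0.0, 2.0, 401)   # candidate positions (m)
x0 = 1.2                                  # current antenna position (m)
wavelength, P_tx, P_move_per_m, N0 = 0.1, 1.0, 0.5, 1e-2

# Made-up two-path channel: the gain oscillates with position because the
# second path's phase rotates as the antenna moves.
gain = np.abs(1.0 + 0.8 * np.exp(2j * np.pi * positions / wavelength)) ** 2
rate = np.log2(1.0 + P_tx * gain / N0)                      # spectral efficiency
ee = rate / (P_tx + P_move_per_m * np.abs(positions - x0))  # rate per unit power

best = np.argmax(ee)
print(f"best position: {positions[best]:.3f} m, toy EE: {ee[best]:.2f}")
```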
Authors:Lin Chen, Yunke Zhang, Jie Feng, Haoye Chai, Honglin Zhang, Bingbing Fan, Yibo Ma, Shiyuan Zhang, Nian Li, Tianhui Liu, Nicholas Sukiennik, Keyu Zhao, Yu Li, Ziyi Liu, Fengli Xu, Yong Li
Abstract:
Recent advances in large language models (LLMs) have enabled the development of AI agents that exhibit increasingly human-like behaviors, including planning, adaptation, and social dynamics across diverse, interactive, and open-ended scenarios. These behaviors are not solely the product of the internal architectures of the underlying models, but emerge from their integration into agentic systems operating within specific contexts, where environmental factors, social cues, and interaction feedbacks shape behavior over time. This evolution necessitates a new scientific perspective: AI Agent Behavioral Science. Rather than focusing only on internal mechanisms, this perspective emphasizes the systematic observation of behavior, design of interventions to test hypotheses, and theory-guided interpretation of how AI agents act, adapt, and interact over time. We systematize a growing body of research across individual agent, multi-agent, and human-agent interaction settings, and further demonstrate how this perspective informs responsible AI by treating fairness, safety, interpretability, accountability, and privacy as behavioral properties. By unifying recent findings and laying out future directions, we position AI Agent Behavioral Science as a necessary complement to traditional model-centric approaches, providing essential tools for understanding, evaluating, and governing the real-world behavior of increasingly autonomous AI systems.
Chinese: 人工智能代理在情境交互中展现出类人行为,催生了AI代理行为科学,该学科通过系统观察和解释代理行为,将负责任AI的要素视为行为属性进行研究。
English: The emergence of human-like behaviors in AI agents through contextual interactions has led to the development of AI Agent Behavioral Science, which focuses on observing and interpreting agent behavior systematically while addressing responsible AI concerns as behavioral properties.
Authors:Peijie Liu, Fengli Xu, Yong Li
Abstract:
The Chain-of-Thought (CoT) technique has proven effective in improving the performance of large language models (LLMs) on complex reasoning tasks. However, the performance gains are inconsistent across different tasks, and the underlying mechanism remains a long-standing research question. In this work, we make a preliminary observation that the monotonicity of token probability distributions may be correlated with the gains achieved through CoT reasoning. Leveraging this insight, we propose two indicators based on the token probability distribution to assess CoT effectiveness across different tasks. By combining instance-level indicators with a logistic regression model, we introduce Dynamic CoT, a method that dynamically selects between CoT and direct answering. Furthermore, we extend Dynamic CoT to closed-source models by transferring decision strategies learned from open-source models. Our indicators for assessing CoT effectiveness achieve an accuracy of 89.2%, and Dynamic CoT reduces token consumption by more than 35% while maintaining high accuracy. Overall, our work offers a novel perspective on the underlying mechanisms of CoT reasoning and provides a framework for its more efficient deployment.
中文: 本研究提出动态思维链方法,通过基于词元概率分布的指标动态选择推理方式,在保持高精度的同时显著降低计算消耗,并为理解思维链机制提供了新视角。
English: The study introduces Dynamic CoT, a method that uses token probability indicators to dynamically choose between Chain-of-Thought reasoning and direct answers, reducing token usage by over 35% while maintaining high accuracy and offering insights into CoT mechanisms.
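A toy version of the indicator-plus-logistic-regression gate is sketched below on synthetic data. The distribution features and labels are stand-ins for the paper's monotonicity-based indicators, so this only illustrates the overall shape of the method.

```python
# Toy CoT/direct-answer gate: summarize a next-token probability
# distribution into features, then let a logistic regression decide
# whether CoT is likely to help. Features and synthetic labels are
# stand-ins for the paper's monotonicity-based indicators.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def features(probs):
    p = np.sort(probs)[::-1]
    entropy = -(p * np.log(p + 1e-12)).sum()
    return [p[0], p[0] - p[1], entropy]   # top prob, top-2 margin, entropy

X, y = [], []
for _ in range(500):
    logits = rng.normal(0.0, rng.uniform(0.5, 3.0), size=50)
    probs = np.exp(logits) / np.exp(logits).sum()
    f = features(probs)
    X.append(f)
    y.append(int(f[2] > 2.5))   # synthetic label: "CoT helps" on flat distributions

gate = LogisticRegression().fit(X, y)
print("use CoT" if gate.predict([X[0]])[0] else "answer directly")
```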
Authors:Qingbin Zeng, Ruotong Zhao, Jinzhu Mao, Haoyang Li, Fengli Xu, Yong Li
Abstract:
Modeling urban crime is an important yet challenging task that requires understanding the subtle visual, social, and cultural cues embedded in urban environments. Previous work has mainly focused on rule-based agent-based modeling (ABM) and deep learning methods. ABMs offer interpretability of internal mechanisms but exhibit limited predictive accuracy. In contrast, deep learning methods are often effective in prediction but are less interpretable and require extensive training data. Moreover, both lines of work lack the cognitive flexibility to adapt to changing environments. Leveraging the capabilities of large language models (LLMs), we propose CrimeMind, a novel LLM-driven ABM framework for simulating urban crime within a multi-modal urban context. A key innovation of our design is the integration of the Routine Activity Theory (RAT) into the agentic workflow of CrimeMind, enabling it to process rich multi-modal urban features and reason about criminal behavior. However, RAT requires LLM agents to infer subtle cues in evaluating environmental safety as part of assessing guardianship, which can be challenging for LLMs. To address this, we collect a small-scale human-annotated dataset and align CrimeMind's perception with human judgment via a training-free textual gradient method. Experiments across four major U.S. cities demonstrate that CrimeMind outperforms both traditional ABMs and deep learning baselines in crime hotspot prediction and spatial distribution accuracy, achieving up to a 24% improvement over the strongest baseline. Furthermore, we conduct counterfactual simulations of external incidents and policy interventions and it successfully captures the expected changes in crime patterns, demonstrating its ability to reflect counterfactual scenarios. Overall, CrimeMind enables fine-grained modeling of individual behaviors and facilitates evaluation of real-world interventions.
中文:提出的CrimeMind框架将大语言模型与基于主体的建模及日常活动理论相结合,提升了城市犯罪模拟的预测精度和适应性,通过免训练方法使其感知与人类判断保持一致,优于现有方法。
English: The proposed CrimeMind framework integrates large language models with agent-based modeling and Routine Activity Theory to enhance urban crime simulation, achieving superior predictive accuracy and adaptability over existing methods while aligning with human judgment through a training-free approach.
Authors:Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, Yong Li
Abstract:
Multi-step reasoning has proven essential for enhancing the problem-solving capabilities of Large Language Models (LLMs) by decomposing complex tasks into intermediate steps, either explicitly or implicitly. Extending the reasoning chain at test time, through deeper thought processes or broader exploration, can further improve performance, but often incurs substantial costs due to the explosion in token usage. Yet many reasoning steps are relatively simple and can be handled by more efficient smaller-scale language models (SLMs). This motivates hybrid approaches that allocate subtasks across models of varying capacities. However, realizing such collaboration requires accurate task decomposition and difficulty-aware subtask allocation, which is challenging. To address this, we propose R2-Reasoner, a novel framework that enables collaborative reasoning across heterogeneous LLMs by dynamically routing sub-tasks based on estimated complexity. At the core of our framework is a Reinforced Model Router, composed of a task decomposer and a subtask allocator. The task decomposer segments complex input queries into logically ordered subtasks, while the subtask allocator assigns each subtask to the most appropriate model, ranging from lightweight SLMs to powerful LLMs, balancing accuracy and efficiency. To train this router, we introduce a staged pipeline that combines supervised fine-tuning on task-specific datasets with the Group Relative Policy Optimization algorithm, enabling self-supervised refinement through iterative reinforcement learning. Extensive experiments across four challenging benchmarks demonstrate that R2-Reasoner reduces API costs by 86.85% while maintaining or surpassing baseline accuracy. Our framework paves the way for more cost-effective and adaptive LLM reasoning. The code is open-source at https://anonymous.4open.science/r/R2_Reasoner.
中文: R2-Reasoner是一种创新框架,通过基于复杂度的动态子任务路由实现异构语言模型的协作推理,利用强化模型路由和分阶段训练,在保持高精度的同时显著降低成本。
English: R2-Reasoner is a novel framework that enables collaborative reasoning across heterogeneous language models by dynamically routing subtasks based on complexity, achieving significant cost reduction while maintaining high accuracy through reinforced model routing and staged training.
Authors:Keyu Zhao, Fengli Xu, Yong Li
Abstract:
Driven by advances in Large Language Models (LLMs), integrating them into recommendation tasks has gained interest due to their strong semantic understanding and prompt flexibility. Prior work encoded user-item interactions or metadata into prompts for recommendations. In parallel, LLM reasoning, boosted by test-time scaling and reinforcement learning, has excelled in fields like mathematics and code, where reasoning traces and correctness signals are clear, enabling high performance and interpretability. However, directly applying these reasoning methods to recommendation is ineffective because user feedback is implicit and lacks reasoning supervision. To address this, we propose R2Rec, a reasoning-enhanced recommendation framework that samples interaction chains from the user-item graph and converts them into structured interaction-of-thoughts via a progressive masked prompting strategy, with each thought representing stepwise reasoning grounded in interaction context. This allows LLMs to simulate step-by-step decision-making based on implicit patterns. We design a two-stage training pipeline: supervised fine-tuning teaches basic reasoning from high-quality traces, and reinforcement learning refines reasoning via reward signals, alleviating sparse explicit supervision. Experiments on three real-world datasets show R2Rec outperforms classical and LLM-based baselines with an average 10.48% improvement in HitRatio@1 and a 131.81% gain over the original LLM. Furthermore, the explicit reasoning chains enhance interpretability by revealing the decision process. Our code is available at: https://anonymous.4open.science/r/R2Rec-7C5D.
中文摘要:R2Rec框架通过将用户交互链转化为结构化思维过程,使大语言模型能够基于隐式交互模式进行逐步推理,结合监督微调与强化学习的双阶段训练方法,显著提升了推荐系统的性能与可解释性。
English Summary: The R2Rec framework enhances recommendation systems by enabling large language models to perform stepwise reasoning on user interaction chains, achieving significant performance improvements through a two-stage training process combining supervised fine-tuning and reinforcement learning.
Authors:Chengwu Liu, Ye Yuan, Yichun Yin, Yan Xu, Xin Xu, Zaoyu Chen, Yasheng Wang, Lifeng Shang, Qun Liu, Ming Zhang
Abstract:
Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that "the gold standard for supporting a mathematical claim is to provide a proof". We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework $Safe$ across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose $FormalStep$ as a benchmark for step correctness theorem proving with $30,809$ formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.
中文:提出的Safe框架通过将推理步骤转化为可验证的数学命题,利用Lean 4形式化验证来检测思维链中的幻觉问题,在多个模型和数据集上显著提升性能的同时提供可解释的证据。
English: The proposed Safe framework uses Lean 4 formal verification to detect hallucinations in Chain-of-Thought reasoning by converting each step into verifiable mathematical claims, significantly improving performance while providing interpretable evidence across multiple models and datasets.
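To give a flavor of step-level formalization, here is what trivially checkable arithmetic claims look like in Lean 4. These toy theorems are illustrative and are not statements drawn from FormalStep.

```lean
-- Toy flavor of step-level formal verification in Lean 4 (illustrative,
-- not actual FormalStep statements). If a chain-of-thought asserts
-- "12 * 12 = 144" and then "144 + 56 = 200", each claim becomes a
-- theorem the kernel must accept; a hallucinated step fails to check.
theorem step1 : 12 * 12 = 144 := rfl
theorem step2 : 144 + 56 = 200 := rfl
```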
Authors:Min Fu, Lipeng Zhu, Rui Zhang
Abstract:
Movable antenna (MA) has been recognized as a promising technology to improve communication performance in future wireless networks such as 6G. To unleash its potential, this paper proposes a novel architecture, namely extremely large-scale MA (XL-MA), which allows flexible antenna/subarray positioning over an extremely large spatial region for effectively enhancing near-field effects and spatial multiplexing performance. In particular, this paper studies an uplink XL-MA-enabled multiuser system, where single-antenna users distributed in a coverage area are served by a base station (BS) equipped with multiple movable subarrays. We begin by presenting a spatially non-stationary channel model to capture the near-field effects, including position-dependent large-scale channel gains and line-of-sight visibility. To evaluate system performance, we further derive a closed-form approximation of the expected weighted sum rate under maximum ratio combining (MRC), revealing that optimizing XL-MA placement enhances user channel power gain to increase desired signal power and reduces channel correlation to decrease multiuser interference. Building upon this, we formulate an antenna placement optimization problem to maximize the expected weighted sum rate, leveraging statistical channel conditions and user distribution. To efficiently solve this challenging non-linear binary optimization problem, we propose a polynomial-time successive replacement algorithm. Simulation results demonstrate that the proposed XL-MA placement strategy achieves near-optimal performance, significantly outperforming benchmark schemes based on conventional fixed-position antennas.
中文摘要:本文提出了一种极大规模可移动天线(XL-MA)架构,通过优化天线布局增强近场效应和空间复用性能,所设计的多项式时间逐次替换算法在6G无线网络中显著优于传统固定天线方案。
English Summary: This paper introduces an extremely large-scale movable antenna (XL-MA) architecture that enhances near-field effects and spatial multiplexing in 6G wireless networks, proposing an efficient algorithm for antenna placement optimization that significantly outperforms fixed-antenna systems.
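A successive replacement search of the kind described is easy to sketch; the following Python toy uses a made-up surrogate objective (signal power minus a correlation penalty) in place of the paper's closed-form weighted sum-rate approximation:

```python
import numpy as np

def successive_replacement(gain, n_select, n_iters=10):
    """Toy successive-replacement placement search (our surrogate objective;
    the paper optimizes a closed-form expected weighted sum rate instead).

    gain: (n_positions, n_users) array of statistical channel power gains.
    Returns indices of the selected candidate positions.
    """
    chosen = list(range(n_select))                 # arbitrary initial placement

    def objective(sel):
        g = gain[sel]                              # (n_select, n_users)
        signal = g.sum()                           # proxy for desired signal power
        corr = np.abs(np.corrcoef(g)).sum()        # proxy for multiuser interference
        return signal - 0.1 * corr

    for _ in range(n_iters):
        improved = False
        for i in range(n_select):                  # try replacing each chosen position
            for cand in range(gain.shape[0]):
                if cand in chosen:
                    continue
                trial = chosen.copy()
                trial[i] = cand
                if objective(trial) > objective(chosen):
                    chosen, improved = trial, True
        if not improved:                           # converged: no single swap helps
            break
    return chosen
```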
Authors:Mikhail Persiianov, Jiawei Chen, Petr Mokrov, Alexander Tyurin, Evgeny Burnaev, Alexander Korotin
Abstract:
Learning population dynamics involves recovering the underlying process that governs particle evolution, given evolutionary snapshots of samples at discrete time points. Recent methods frame this as an energy minimization problem in probability space and leverage the celebrated JKO scheme for efficient time discretization. In this work, we introduce $\texttt{iJKOnet}$, an approach that combines the JKO framework with inverse optimization techniques to learn population dynamics. Our method relies on a conventional $\textit{end-to-end}$ adversarial training procedure and does not require restrictive architectural choices, e.g., input-convex neural networks. We establish theoretical guarantees for our methodology and demonstrate improved performance over prior JKO-based methods.
中文: 本研究提出 $\texttt{iJKOnet}$ 方法,将逆优化技术与 JKO 框架相结合,通过端到端对抗训练学习群体动力学,在无需严格架构限制的情况下提供理论保证并实现优于现有方法的性能。
English: This study presents $\texttt{iJKOnet}$, an inverse optimization approach integrated with the JKO scheme to learn population dynamics through adversarial training, offering theoretical guarantees and superior performance without restrictive architectural requirements.
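For context, one JKO step, the time discretization that iJKOnet and prior JKO-based methods build on, takes the standard proximal form (E is the energy, tau the step size, W_2 the 2-Wasserstein distance):

```latex
\rho_{k+1} \;=\; \operatorname*{arg\,min}_{\rho}\; E(\rho) \;+\; \frac{1}{2\tau}\, W_2^2(\rho,\, \rho_k)
```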
Authors:Pengfei He, Yue Xing, Shen Dong, Juanhui Li, Zhenwei Dai, Xianfeng Tang, Hui Liu, Han Xu, Zhen Xiang, Charu C. Aggarwal, Hui Liu
Abstract:
This paper argues that a comprehensive vulnerability analysis is essential for building trustworthy Large Language Model-based Multi-Agent Systems (LLM-MAS). These systems, which consist of multiple LLM-powered agents working collaboratively, are increasingly deployed in high-stakes applications but face novel security threats due to their complex structures. While single-agent vulnerabilities are well-studied, LLM-MAS introduces unique attack surfaces through inter-agent communication, trust relationships, and tool integration that remain significantly underexplored. We present a systematic framework for vulnerability analysis of LLM-MAS that unifies diverse research. For each type of vulnerability, we define formal threat models grounded in practical attacker capabilities and illustrate them using real-world LLM-MAS applications. This formulation enables rigorous quantification of vulnerability across different architectures and provides a foundation for designing meaningful evaluation benchmarks. Our analysis reveals that LLM-MAS faces elevated risk due to compositional effects -- vulnerabilities in individual components can cascade through agent communication, creating threat models not present in single-agent systems. We conclude by identifying critical open challenges: (1) developing benchmarks specifically tailored to LLM-MAS vulnerability assessment, (2) considering new potential attacks specific to multi-agent architectures, and (3) implementing trust management systems that can enforce security in LLM-MAS. This research provides essential groundwork for future efforts to enhance LLM-MAS trustworthiness as these systems continue their expansion into critical applications.
中文: 本文强调对基于大语言模型的多智能体系统进行全面漏洞分析至关重要,提出了一个系统性框架来解决因智能体间通信和组合效应引发的独特安全威胁,并指出了未来研究的关键挑战。
English: This paper emphasizes the necessity of comprehensive vulnerability analysis for trustworthy Large Language Model-based Multi-Agent Systems (LLM-MAS), proposing a systematic framework to address unique security threats arising from inter-agent communication and compositional risks, while identifying key challenges for future research.
Authors:Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu
Abstract:
Long-form video understanding poses a significant challenge for video large language models (VideoLLMs) due to prohibitively high computational and memory demands. In this paper, we propose FlexSelect, a flexible and efficient token selection strategy for processing long videos. FlexSelect identifies and retains the most semantically relevant content by leveraging cross-modal attention patterns from a reference transformer layer. It comprises two key components: (1) a training-free token ranking pipeline that leverages faithful cross-modal attention weights to estimate each video token's importance, and (2) a rank-supervised lightweight selector that is trained to replicate these rankings and filter redundant tokens. This generic approach can be seamlessly integrated into various VideoLLM architectures, such as LLaVA-Video, InternVL and Qwen-VL, serving as a plug-and-play module to extend their temporal context length. Empirically, FlexSelect delivers strong gains across multiple long-video benchmarks including VideoMME, MLVU, LongVB, and LVBench. Moreover, it achieves significant speed-ups (for example, up to 9 times on a LLaVA-Video-7B model), highlighting FlexSelect's promise for efficient long-form video understanding. Project page available at: https://yunzhuzhang0918.github.io/flex_select
中文摘要:FlexSelect是一种高效的令牌选择策略,通过跨模态注意力识别并保留语义相关内容,显著提升长视频理解能力,在多种VideoLLM架构中实现计算加速和性能提升。
English Summary: FlexSelect is an efficient token selection strategy that enhances long-form video understanding by identifying and retaining semantically relevant content through cross-modal attention, enabling significant computational speed-ups and improved performance across various VideoLLM architectures.
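The training-free ranking stage can be sketched in a few lines; the function below is our simplification, assuming attention weights from one reference layer are already extracted:

```python
import torch

def rank_video_tokens(cross_attn, keep_ratio=0.25):
    """Training-free token ranking sketch (our simplification of FlexSelect).

    cross_attn: (n_text_tokens, n_video_tokens) cross-modal attention weights
    taken from one reference transformer layer of the VideoLLM.
    Returns indices of video tokens to keep, in temporal order.
    """
    importance = cross_attn.sum(dim=0)             # total text-to-video attention mass
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.topk(importance, k).indices       # most semantically relevant tokens
    return torch.sort(keep).values                 # preserve original ordering
```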
Authors:Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi
Abstract:
The Mixture of Experts (MoE) architecture has emerged as a prominent strategy for scaling large language models (LLMs), effectively leveraging sparse activation and facilitating task-specific personalization. However, current federated learning (FL) approaches are primarily designed for dense models, making them unable to directly exploit the sparsity inherent in MoE architectures. Treating MoE models as dense networks in federated scenarios results in excessive communication overhead and computational costs, undermining the potential for personalized knowledge sharing. To address these challenges, we propose FLEx (Federated LLMs with Personalized Experts), a novel federated learning framework explicitly tailored for MoE-based LLMs. FLEx efficiently personalizes by pruning the global MoE model to keep only one expert per client, and employs an adaptive gating mechanism to reintegrate these personalized experts into the pre-trained MoE layers, ensuring the original backbone architecture remains unchanged. These personalized experts are trained with local data and stored locally on each client, while the shared modules are aggregated globally. Extensive evaluations on diverse instruction-based datasets under non-IID conditions consistently demonstrate that FLEx outperforms existing federated baselines. Our code is available at https://anonymous.4open.science/r/FLEx-8F12.
中文: FLEx是一种专为基于MoE的大语言模型设计的联邦学习框架,通过剪枝为每个客户端仅保留一个专家,并利用自适应门控机制将个性化专家重新集成到预训练MoE层中,在非独立同分布数据上持续优于现有联邦基线。
English: FLEx is a federated learning framework tailored for MoE-based LLMs that personalizes by pruning the global model to keep one expert per client and reintegrating these experts through an adaptive gating mechanism, consistently outperforming federated baselines under non-IID conditions.
Authors:Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi
Abstract:
Federated instruction tuning of large language models (LLMs) is challenged by significant data heterogeneity across clients, demanding robust personalization. The Mixture of Experts (MoE) architecture, where experts can specialize in distinct data patterns, presents a natural architectural solution to this challenge. The inherent sparsity of the MoE architecture, achieved by selectively activating experts, poses a significant challenge to its integration with federated learning (FL). Conventional FL frameworks, designed for dense models, naively aggregate all expert parameters irrespective of their local activation patterns. This naive approach not only undermines MoE's dynamic sparsity but also risks corrupting the world knowledge within pretrained experts. To address this, we propose FLEx (Federated LLMs with Personalized Experts), a novel framework that leverages pretrained MoE-based LLMs for efficient personalization. By aggregating only the shared non-expert parameters, FLEx significantly reduces communication overhead and preserves the world knowledge stored within the frozen pretrained experts. For personalization, we introduce a novel expert grafting mechanism that leverages dynamic sparsity to construct a client-specific expert from selected components of pretrained experts, tailored to local data. This grafted expert is then fine-tuned locally alongside the gating mechanism. This joint training enables the model to learn when to leverage the shared knowledge from frozen experts and when to employ the personalized one. Evaluations on diverse, non-IID instruction tuning datasets show that FLEx consistently outperforms federated baselines on average, while demonstrating strong knowledge preservation on the knowledge-driven benchmark MMLU. Our code is available at https://anonymous.4open.science/r/FLEx-8F12.
中文: FLEx是一种新颖的联邦学习框架,通过仅聚合非专家参数并引入专家嫁接机制实现个性化模型适配,在保持世界知识的同时显著提升了异构数据下的性能表现。
English: FLEx is a novel federated learning framework that addresses data heterogeneity by aggregating only non-expert parameters and introducing expert grafting for personalized model adaptation, achieving superior performance while preserving world knowledge.
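The sharing rule common to both FLEx descriptions above, aggregate only non-expert parameters and keep personalized experts local, reduces to a filtered FedAvg; a minimal sketch with our own naming:

```python
def aggregate_non_expert(client_states, is_expert_param):
    """FedAvg-style sketch of FLEx's sharing rule (names are ours).

    client_states: list of per-client parameter dicts.
    is_expert_param: predicate marking expert parameters, which stay local.
    Only the shared non-expert parameters are averaged across clients.
    """
    global_state = {}
    for name in client_states[0]:
        if is_expert_param(name):
            continue                    # personalized experts never leave the client
        values = [state[name] for state in client_states]
        global_state[name] = sum(values) / len(values)
    return global_state
```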
Authors:Jie Ren, Zhenwei Dai, Xianfeng Tang, Yue Xing, Shenglai Zeng, Hui Liu, Jingying Zeng, Qiankun Peng, Samarth Varshney, Suhang Wang, Qi He, Charu C. Aggarwal, Hui Liu
Abstract:
Although Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks, growing concerns have emerged over the misuse of sensitive, copyrighted, or harmful data during training. To address these concerns, unlearning techniques have been developed to remove the influence of specific data without retraining from scratch. However, this paper reveals a critical vulnerability in fine-tuning-based unlearning: a malicious user can craft a manipulated forgetting request that stealthily degrades the model's utility for benign users. We demonstrate this risk through a red-teaming Stealthy Attack (SA), which is inspired by two key limitations of existing unlearning (the inability to constrain the scope of unlearning effect and the failure to distinguish benign tokens from unlearning signals). Prior work has shown that unlearned models tend to memorize forgetting data as unlearning signals, and respond with hallucinations or feigned ignorance when unlearning signals appear in the input. By subtly increasing the presence of common benign tokens in the forgetting data, SA enhances the connection between benign tokens and unlearning signals. As a result, when normal users include such tokens in their prompts, the model exhibits unlearning behaviors, leading to unintended utility degradation. To address this vulnerability, we propose Scope-aware Unlearning (SU), a lightweight enhancement that introduces a scope term into the unlearning objective, encouraging the model to localize the forgetting effect. Our method requires no additional data processing, integrates seamlessly with existing fine-tuning frameworks, and significantly improves robustness against SA. Extensive experiments validate the effectiveness of both SA and SU.
中文: 本文揭示了基于微调的遗忘技术存在漏洞,恶意请求可通过隐蔽攻击降低模型效用,并提出无需额外处理的轻量级范围感知遗忘方法,将遗忘效应局部化以增强鲁棒性。
English: This paper exposes a vulnerability in fine-tuning-based unlearning where malicious requests can degrade model utility through a Stealthy Attack, and proposes Scope-aware Unlearning as a lightweight defense to localize forgetting effects without extra processing.
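Although the paper's exact objective is not reproduced here, a scope-aware unlearning loss plausibly takes an additive form (our notation): a standard forgetting term plus a regularizer that pins down behavior on benign, out-of-scope inputs:

```latex
\mathcal{L}_{\mathrm{SU}}(\theta)
= \underbrace{\mathcal{L}_{\mathrm{forget}}(\theta;\, \mathcal{D}_f)}_{\text{standard unlearning}}
\;+\; \lambda\, \underbrace{\mathcal{L}_{\mathrm{scope}}(\theta;\, \mathcal{D}_{\mathrm{benign}})}_{\text{keeps behavior unchanged off-scope}}
```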
Authors:Michel Meintz, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
Abstract:
Image generative models have become increasingly popular, but training them requires large datasets that are costly to collect and curate. To circumvent these costs, some parties may exploit existing models by using the generated images as training data for their own models. In general, watermarking is a valuable tool for detecting unauthorized use of generated images. However, when these images are used to train a new model, watermarking can only enable detection if the watermark persists through training and remains identifiable in the outputs of the newly trained model -- a property known as radioactivity. We analyze the radioactivity of watermarks in images generated by diffusion models (DMs) and image autoregressive models (IARs). We find that existing watermarking methods for DMs fail to retain radioactivity, as watermarks are either erased during encoding into the latent space or lost in the noising-denoising process (during the training in the latent space). Meanwhile, despite IARs having recently surpassed DMs in image generation quality and efficiency, no radioactive watermarking methods have been proposed for them. To overcome this limitation, we propose the first watermarking method tailored for IARs and with radioactivity in mind -- drawing inspiration from techniques in large language models (LLMs), which share IARs' autoregressive paradigm. Our extensive experimental evaluation highlights our method's effectiveness in preserving radioactivity within IARs, enabling robust provenance tracking, and preventing unauthorized use of their generated images.
中文摘要:本研究解决了图像生成模型中放射性水印的技术难题,发现扩散模型在训练过程中无法保留水印,同时借鉴大语言模型技术,首次为图像自回归模型提出了有效的放射性水印方法。
English Summary: This study addresses the challenge of radioactive watermarking in image generative models, revealing that diffusion models fail to retain watermarks during training while proposing the first effective radioactive watermarking method for image autoregressive models inspired by large language models.
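The LLM techniques the method draws on typically bias generation toward a context-keyed "green list" of tokens; a toy PyTorch sketch of that general mechanism (our simplification, not the paper's exact scheme):

```python
import torch

def biased_next_token(logits, prev_token, vocab_size, gamma=0.5, delta=2.0):
    """Toy 'green list' watermark step in the LLM style (our simplification,
    not the paper's exact rule): key a pseudorandom token subset on the
    previous token and nudge sampling toward it.
    """
    g = torch.Generator().manual_seed(int(prev_token))
    green = torch.randperm(vocab_size, generator=g)[: int(gamma * vocab_size)]
    logits = logits.clone()
    logits[green] += delta                         # imperceptible statistical bias
    return torch.distributions.Categorical(logits=logits).sample()
```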
Authors:Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, Kui Ren
Abstract:
Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of multi-turn dialogue contexts, which leads to incorrect cache hits when similar queries appear in different conversational settings. This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Evaluation of real-world conversations shows that ContextCache improves precision and recall compared to existing methods. Additionally, cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for LLM conversational applications.
中文: ContextCache是一种上下文感知的语义缓存系统,通过自注意力机制整合历史对话上下文,显著提升了多轮对话的精确率和召回率,同时将响应延迟降低至直接调用大语言模型的十分之一,大幅节省计算成本。
English: ContextCache is a context-aware semantic caching system that enhances multi-turn dialogue efficiency by integrating historical context through self-attention mechanisms, significantly improving precision and recall while reducing latency by 10 times compared to direct LLM calls.
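A minimal sketch of the two-stage lookup, with our own thresholds and a cosine stand-in for the paper's self-attention matching:

```python
import numpy as np

def cache_lookup(query_vec, history_vec, cache, tau_fast=0.85, tau_ctx=0.9):
    """Two-stage lookup sketch in the spirit of ContextCache (thresholds are ours).

    cache: list of (query_vec, history_vec, response) triples.
    Stage 1 filters candidates by current-query similarity; stage 2 re-scores
    with the dialogue-history representation (the paper uses self-attention
    here; we substitute a weighted cosine for brevity).
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = [entry for entry in cache if cos(query_vec, entry[0]) >= tau_fast]
    best_resp, best_score = None, -1.0
    for q, h, resp in candidates:
        score = 0.5 * cos(query_vec, q) + 0.5 * cos(history_vec, h)
        if score > best_score:
            best_resp, best_score = resp, score
    return best_resp if best_score >= tau_ctx else None   # None -> invoke the LLM
```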
Authors:Weiqing Wang, Taejin Park, Ivan Medennikov, Jinhan Wang, Kunal Dhawan, He Huang, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
Abstract:
We propose a self-speaker adaptation method for streaming multi-talker automatic speech recognition (ASR) that eliminates the need for explicit speaker queries. Unlike conventional approaches requiring target speaker embeddings or enrollment audio, our technique dynamically adapts individual ASR instances through speaker-wise speech activity prediction. The key innovation involves injecting speaker-specific kernels generated via speaker supervision activations into selected ASR encoder layers. This enables instantaneous speaker adaptation to target speakers while handling fully overlapped speech even in a streaming scenario. Experiments show state-of-the-art performance in both offline and streaming scenarios, demonstrating that our self-adaptive method effectively addresses severe speech overlap through streamlined speaker-focused recognition. The results validate the proposed self-speaker adaptation approach as a robust solution for multi-talker ASR under severe overlapping speech conditions.
Chinese: 本文提出了一种用于流式多说话人语音识别的自说话人适配方法,通过说话人活动预测动态调整各识别实例,无需显式说话人查询或注册音频即可实现最先进性能。
English: This paper introduces a self-speaker adaptation method for streaming multi-talker ASR that dynamically adjusts individual recognition instances through speaker activity prediction, achieving state-of-the-art performance without requiring explicit speaker queries or enrollment data.
Authors:Louis Kerner, Michel Meintz, Bihe Zhao, Franziska Boenisch, Adam Dziedzic
Abstract:
State-of-the-art text-to-image models like Infinity generate photorealistic images at an unprecedented speed. These models operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data -- potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models' own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images, enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework for Infinity. Our method embeds a watermark directly at the bit level of the token stream across multiple scales (also referred to as resolutions) during Infinity's image generation process. Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model's outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs.
中文摘要:BitMark是为Infinity模型设计的鲁棒比特级水印框架,通过在图像生成过程中嵌入不可察觉的信号来防止模型崩溃,即使生成内容被用于训练后续模型也能保持可检测性。
English Summary: BitMark is a robust bitwise watermarking framework for Infinity models that embeds imperceptible signals during image generation to prevent model collapse by enabling detection of generated content, even when reused for training subsequent models.
Authors:Zhuonan Liang, Dongnan Liu, Jianan Fan, Yaxuan Song, Qiang Qu, Yu Yao, Peng Fu, Weidong Cai
Abstract:
Object counting models suffer when deployed across domains with differing density distributions, since density shifts are inherently task-relevant and violate standard domain adaptation assumptions. To address this, we propose a theoretical framework of conditional feature alignment. We first formalize the notion of conditional divergence by partitioning each domain into subsets (e.g., object vs. background) and measuring divergences per condition. We then derive a joint error bound showing that, under discrete label spaces treated as condition sets, aligning distributions conditionally leads to tighter bounds on the combined source-target decision error than unconditional alignment. These insights motivate a general conditional adaptation principle: by preserving task-relevant variations while filtering out nuisance shifts, one can achieve superior cross-domain generalization for counting. We provide both a theoretical contribution, defining conditional divergence and proving its benefit in lowering the joint error, and a practical adaptation strategy that preserves task-relevant information in unsupervised domain-adaptive counting. We demonstrate the effectiveness of our approach through extensive experiments on multiple counting datasets with varying density distributions. The results show that our method outperforms existing unsupervised domain adaptation methods, empirically validating the theoretical insights on conditional feature alignment.
中文: 提出的条件特征对齐框架通过按条件子集对齐分布来解决物体计数中的领域适应挑战,理论上缩小了误差界限,并在多个密度分布数据集上实证优于现有方法。
English: The proposed conditional feature alignment framework addresses domain adaptation challenges in object counting by aligning distributions per condition subset, which theoretically tightens error bounds and empirically outperforms existing methods across diverse density datasets.
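In our notation, a conditional divergence of the kind described weights per-condition divergences over the condition set C (e.g., object vs. background), for example:

```latex
d_{\mathrm{cond}}(P, Q) \;=\; \sum_{c \in \mathcal{C}} \pi_c \, d\big(P(\cdot \mid c),\; Q(\cdot \mid c)\big)
```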
Authors:Hao Zhang, Shuo Shao, Song Li, Zhenyu Zhong, Yan Liu, Zhan Qin, Kui Ren
Abstract:
End-point monitoring solutions are widely deployed in today's enterprise environments to support advanced attack detection and investigation. These monitors continuously record system-level activities as audit logs and provide deep visibility into security events. Unfortunately, existing methods of semantic analysis based on audit logs have low granularity, only reaching the system call level, making it difficult to effectively classify highly covert behaviors. Additionally, existing works mainly match audit log streams with rule knowledge bases describing behaviors, which heavily rely on expertise and lack the ability to detect unknown attacks and provide interpretive descriptions. In this paper, we propose SmartGuard, an automated method that combines abstracted behaviors from audit event semantics with large language models. SmartGuard extracts specific behaviors (function level) from incoming system logs and constructs a knowledge graph, divides events by threads, and combines event summaries with graph embeddings to achieve information diagnosis and provide explanatory narratives through large language models. Our evaluation shows that SmartGuard achieves an average F1 score of 96% in assessing malicious behaviors and demonstrates good scalability across multiple models and unknown attacks. It also possesses excellent fine-tuning capabilities, allowing experts to assist in timely system updates.
中文:SmartGuard是一种结合审计日志语义抽象行为与大语言模型的自动化安全方法,通过提取行为特征和构建知识图谱,能高效检测恶意活动并具备应对未知攻击的良好扩展性。
English: SmartGuard is an automated security method that integrates abstracted audit log behaviors with large language models, achieving high accuracy in detecting malicious activities and scalability against unknown attacks through behavior extraction and knowledge graph construction.
Authors:Xun Wang, Jing Xu, Franziska Boenisch, Michael Backes, Christopher A. Choquette-Choo, Adam Dziedzic
Abstract:
Prompting has become a dominant paradigm for adapting large language models (LLMs). While discrete (textual) prompts are widely used for their interpretability, soft (parameter) prompts have recently gained traction in APIs. This is because they can encode information from more training samples while minimizing the user's token usage, leaving more space in the context window for task-specific input. However, soft prompts are tightly coupled to the LLM they are tuned on, limiting their generalization to other LLMs. This constraint is particularly problematic for efficiency and privacy: (1) tuning prompts on each LLM incurs high computational costs, especially as LLMs continue to grow in size. Additionally, (2) when the LLM is hosted externally, soft prompt tuning often requires sharing private data with the LLM provider. For instance, this is the case with the NVIDIA NeMo API. To address these issues, we propose POST (Privacy Of Soft prompt Transfer), a framework that enables private tuning of soft prompts on a small model and subsequently transfers these prompts to a larger LLM. POST uses knowledge distillation to derive a small model directly from the large LLM to improve prompt transferability, tunes the soft prompt locally, optionally with differential privacy guarantees, and transfers it back to the larger LLM using a small public dataset. Our experiments show that POST reduces computational costs, preserves privacy, and effectively transfers high-utility soft prompts.
中文摘要:POST框架通过在小模型上私有调优软提示并将其迁移至大模型,有效解决了直接在大模型上调优带来的高计算成本和隐私泄露问题。
English Summary: POST is a framework that enables private and efficient tuning of soft prompts on a small model and transfers them to a larger LLM, addressing computational costs and privacy concerns associated with direct tuning on large models.
Authors:Jingyi Cui, Qi Zhang, Yifei Wang, Yisen Wang
Abstract:
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting features learned by large language models (LLMs). They aim to recover interpretable monosemantic features from complex superposed polysemantic ones through feature reconstruction via sparsely activated neural networks. Despite the wide applications of SAEs, it remains unclear under what conditions an SAE can fully recover the ground truth monosemantic features from the superposed polysemantic ones. In this paper, through theoretical analysis, we for the first time propose the necessary and sufficient conditions for identifiable SAEs (SAEs that learn unique and ground truth monosemantic features), including 1) extreme sparsity of the ground truth feature, 2) sparse activation of SAEs, and 3) enough hidden dimensions of SAEs. Moreover, when the identifiable conditions are not fully met, we propose a reweighting strategy to improve the identifiability. Specifically, following the theoretically suggested weight selection principle, we prove that the gap between the loss functions of SAE reconstruction and monosemantic feature reconstruction can be narrowed, so that the reweighted SAEs have better reconstruction of the ground truth monosemantic features than the uniformly weighted ones. In experiments, we validate our theoretical findings and show that our weighted SAE significantly improves feature monosemanticity and interpretability.
Chinese: 本文首次提出了稀疏自编码器实现可识别特征恢复的充要条件,并在条件未完全满足时提出一种重加权策略,有效提升了真实单义特征的重建效果。
English: This paper establishes the necessary and sufficient conditions for sparse autoencoders to achieve identifiable feature recovery and proposes a reweighting strategy that enhances monosemantic feature reconstruction when these conditions are not fully met.
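A reweighted SAE objective of the kind described can be sketched as follows; variable names and the L1 sparsity penalty are our assumptions:

```python
import torch

def weighted_sae_loss(x, x_hat, z, w, l1_coef=1e-3):
    """Reweighted SAE objective sketch (variable names and the L1 form are ours).

    x: inputs, x_hat: SAE reconstructions, z: sparse hidden activations,
    w: per-dimension weights chosen via the paper's selection principle
    (treated here as given).
    """
    recon = ((w * (x - x_hat)) ** 2).sum(dim=-1).mean()   # weighted reconstruction error
    sparsity = z.abs().sum(dim=-1).mean()                 # sparse activation penalty
    return recon + l1_coef * sparsity
```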
Authors:Chao Wang, Kai-Kit Wong, Zan Li, Liang Jin, Chan-Byoung Chae
Abstract:
The Fluid Antenna System (FAS), which enables flexible Multiple-Input Multiple-Output (MIMO) communications, introduces new spatial degrees of freedom for next-generation wireless networks. Unlike traditional MIMO, FAS involves joint port selection and precoder design, a combinatorial NP-hard optimization problem. Moreover, fully leveraging FAS requires acquiring Channel State Information (CSI) across its ports, a challenge exacerbated by the system's near-continuous reconfigurability. These factors make traditional system design methods impractical for FAS due to nonconvexity and prohibitive computational complexity. While deep learning (DL)-based approaches have been proposed for MIMO optimization, their limited generalization and fitting capabilities render them suboptimal for FAS. In contrast, Large Language Models (LLMs) extend DL's capabilities by offering general-purpose adaptability, reasoning, and few-shot learning, thereby overcoming the limitations of task-specific, data-intensive models. This article presents a vision for LLM-driven FAS design, proposing a novel flexible communication framework. To demonstrate the potential, we examine LLM-enhanced FAS in multiuser scenarios, showcasing how LLMs can revolutionize FAS optimization.
中文摘要:流体天线系统为无线网络带来新的空间自由度,但其组合优化难题和信道获取挑战可通过大语言模型的通用适应性与小样本学习能力实现突破性优化设计。
English Summary: The Fluid Antenna System (FAS) introduces new spatial degrees of freedom for wireless networks but faces NP-hard optimization challenges, which can be overcome by Large Language Models (LLMs) offering adaptable reasoning and few-shot learning capabilities for revolutionary FAS design.
Authors:Yimin Deng, Yuxia Wu, Yejing Wang, Guoshuai Zhao, Li Zhu, Qidong Liu, Derong Xu, Zichuan Fu, Xian Wu, Yefeng Zheng, Xiangyu Zhao, Xueming Qian
Abstract:
Temporal knowledge graph reasoning aims to predict future events with knowledge of existing facts and plays a key role in various downstream tasks. Previous methods focused on either graph structure learning or semantic reasoning, failing to integrate dual reasoning perspectives to handle different prediction scenarios. Moreover, they lack the capability to capture the inherent differences between historical and non-historical events, which limits their generalization across different temporal contexts. To this end, we propose a Multi-Expert Structural-Semantic Hybrid (MESH) framework that employs three kinds of expert modules to integrate both structural and semantic information, guiding the reasoning process for different events. Extensive experiments on three datasets demonstrate the effectiveness of our approach.
中文摘要:提出的MESH框架通过专业专家模块整合结构与语义推理,有效提升时序知识图谱预测能力,解决了现有方法的局限,并在多个数据集上验证了其优越性能。
English Summary: The proposed MESH framework integrates structural and semantic reasoning through specialized expert modules to enhance temporal knowledge graph prediction, addressing previous limitations and demonstrating strong performance across multiple datasets.
Authors:Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, Yebin Liu
Abstract:
Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of the initial action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the initial action and the Gaussian perception, both generated by the GAF, to further obtain more precise actions. Extensive experiments demonstrate significant improvements, with GAF achieving +11.5385 dB PSNR, +0.3864 SSIM and -0.5574 LPIPS improvements in reconstruction quality, while boosting the average success rate by +7.3% in robotic manipulation tasks over state-of-the-art methods.
中文: 本文提出采用V-4D-A框架和Gaussian Action Field(GAF),通过运动感知的4D表示实现直接动作推理,在场景重建质量和机器人操作成功率上相比现有方法均取得显著提升。
English: This paper introduces a V-4D-A framework using Gaussian Action Field (GAF) that enables direct action reasoning from motion-aware 4D representations, significantly improving both scene reconstruction quality and robotic manipulation success rates over existing methods.
Authors:Hanjiang Hong, Kai-Kit Wong, Hao Xu, Xinghao Guo, Farshad Rostami Ghadi, Yu Chen, Yin Xu, Chan-Byoung Chae, Baiyang Liu, Kin-Fai Tong, Yangyang Zhang
Abstract:
The explosive growth of teletraffic, fueled by the convergence of cyber-physical systems and data-intensive applications, such as the Internet of Things (IoT), autonomous systems, and immersive communications, demands a multidisciplinary suite of innovative solutions across the physical and network layers. Fluid antenna systems (FAS) represent a transformative advancement in antenna design, offering enhanced spatial degrees of freedom through dynamic reconfigurability. By exploiting spatial flexibility, FAS can adapt to varying channel conditions and optimize wireless performance, making it a highly promising candidate for next-generation communication networks. This paper provides a comprehensive survey of the state of the art in FAS research. We begin by examining key application scenarios in which FAS offers significant advantages. We then present the fundamental principles of FAS, covering channel measurement and modeling, single-user configurations, and the multi-user fluid antenna multiple access (FAMA) framework. Following this, we delve into key network-layer techniques such as quality-of-service (QoS) provisioning, power allocation, and content placement strategies. We conclude by identifying prevailing challenges and outlining future research directions to support the continued development of FAS in next-generation wireless networks.
中文摘要:流体天线系统(FAS)通过动态可重构性提供变革性的无线性能,成为下一代网络的有力解决方案,本综述系统探讨了其原理、应用场景及未来研究方向。
English Summary: Fluid antenna systems (FAS) offer transformative wireless performance through dynamic reconfigurability, making them a promising solution for next-generation networks, as this comprehensive survey examines their principles, applications, and future challenges.
Authors:Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, Yebin Liu
Abstract:
Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at https://semanticsplat.github.io.
Chinese Summary: SemanticSplat提出了一种前馈式方法,通过将3D高斯与语义属性相结合,利用特征融合和蒸馏框架从稀疏视图实现整体三维场景理解,有效克服了现有方法的局限性。
English Summary: SemanticSplat introduces a feed-forward method that unifies 3D Gaussians with semantic attributes to achieve holistic 3D scene understanding from sparse views, overcoming limitations in existing approaches through feature fusion and a distillation framework.
Authors:Yuchen Feng, Bowen Shen, Naibin Gu, Jiaxuan Zhao, Peng Fu, Zheng Lin, Weiping Wang
Abstract:
Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we begin with the observation that a specific LLM exhibits notable diversity after being pruned on different calibration datasets, based on which we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction includes pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain the model on routers, experts and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters.
中文: DIVE方法通过在不同数据集上剪枝并重组前馈网络模块,增强了混合专家架构中专家多样性,以更少的训练成本实现高效重构,在相同激活参数下优于现有方法。
English: The DIVE method enhances expert diversity in MoE LLM reconstruction by pruning models on varied datasets and reassembling FFN modules, achieving efficient training with minimal accuracy loss compared to existing approaches.
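The pruning-based expert construction can be sketched with a simple activation-magnitude criterion; both the criterion and the names below are our assumptions:

```python
import numpy as np

def prune_by_activation(w, calib_acts, keep_ratio=0.5):
    """Keep FFN hidden units with the highest mean activation magnitude
    on one calibration set (the criterion is our assumption)."""
    scores = np.abs(calib_acts).mean(axis=0)       # importance per hidden unit
    k = max(1, int(keep_ratio * scores.size))
    keep = np.sort(np.argsort(scores)[-k:])
    return w[keep]

def build_diverse_experts(w, calib_act_sets):
    """DIVE-style sketch (our naming): pruning the same dense FFN on
    different calibration domains yields diverse experts to reassemble."""
    return [prune_by_activation(w, acts) for acts in calib_act_sets]
```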
Authors:Xinghao Guo, Yin Xu, Dazhi He, Cixiao Zhang, Hanjiang Hong, Kai-Kit Wong, Chan-Byoung Chae, Wenjun Zhang, Yiyan Wu
Abstract:
Fluid antenna (FA), as an emerging antenna technology, fully exploits spatial diversity. This paper integrates FA with the receive spatial modulation (RSM) scheme and proposes a novel FA-empowered RSM (FA-RSM) system. In this system, the transmitter is equipped with an FA that simultaneously activates multiple ports to transmit precoded signals. We address three key challenges in the FA-RSM system: port selection, theoretical analysis, and detection. First, for port selection, an optimal algorithm from a capacity maximization perspective is proposed, followed by two low-complexity alternatives. Second, for theoretical analysis, performance evaluation metrics are provided for port selection, which demonstrate that increasing the number of activated ports enhances system performance. Third, regarding detection, two low-complexity detectors are proposed. Simulation results confirm that the FA-RSM system significantly outperforms the conventional RSM system. The proposed low-complexity port selection algorithms incur only minimal performance degradation. Moreover, while activating additional ports improves performance, the gain gradually saturates due to inherent spatial correlation, highlighting the importance of effective port selection in reducing system complexity and cost. Finally, both proposed detectors achieve near-optimal detection performance with low computational complexity, emphasizing the receiver-friendly nature of the FA-RSM system.
中文摘要:本文提出了一种流体天线赋能的接收空间调制(FA-RSM)系统,通过优化端口选择和低复杂度检测器设计,在保证性能的同时显著降低了系统复杂度,较传统RSM系统具有明显优势。
English Summary: This paper introduces a fluid antenna-enabled receive spatial modulation (FA-RSM) system that enhances performance through optimized port selection and low-complexity detectors, significantly outperforming conventional RSM systems.
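A low-complexity port selection of the greedy kind can be sketched with a log-det capacity surrogate; the metric choice here is ours, not necessarily the paper's:

```python
import numpy as np

def greedy_port_selection(H, n_active, snr=1.0):
    """Greedy low-complexity port selection sketch (capacity surrogate is ours).

    H: (n_ports, n_rx) channel matrix, one row per FAS port. Each round adds
    the port that most increases log det(I + snr * H_S H_S^H).
    """
    selected = []
    for _ in range(n_active):
        best_port, best_cap = None, -np.inf
        for p in range(H.shape[0]):
            if p in selected:
                continue
            Hs = H[selected + [p]]
            _, cap = np.linalg.slogdet(np.eye(len(Hs)) + snr * Hs @ Hs.conj().T)
            if cap > best_cap:
                best_port, best_cap = p, cap
        selected.append(best_port)
    return selected
```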
Authors:Naibin Gu, Peng Fu, Xiyu Liu, Ke Ma, Zheng Lin, Weiping Wang
Abstract:
Parameter-efficient fine-tuning (PEFT) has become a common method for fine-tuning large language models, where a base model can serve multiple users through PEFT module switching. To enhance user experience, base models require periodic updates. However, once updated, PEFT modules fine-tuned on previous versions often suffer substantial performance degradation on newer versions. Re-tuning these numerous modules to restore performance would incur significant computational costs. Through a comprehensive analysis of the changes that occur during base model updates, we uncover an interesting phenomenon: continual training primarily affects task-specific knowledge stored in Feed-Forward Networks (FFN), while having less impact on the task-specific pattern in the Attention mechanism. Based on these findings, we introduce Trans-PEFT, a novel approach that enhances the PEFT module by focusing on the task-specific pattern while reducing its dependence on certain knowledge in the base model. Further theoretical analysis supports our approach. Extensive experiments across 7 base models and 12 datasets demonstrate that Trans-PEFT trained modules can maintain performance on updated base models without re-tuning, significantly reducing maintenance overhead in real-world applications.
中文:Trans-PEFT是一种新颖方法,通过专注于注意力机制中的任务特定模式来增强参数高效微调模块,使其能在更新的基础模型上保持性能而无需重新调优,显著降低了计算成本。
English: Trans-PEFT is a novel method that enhances parameter-efficient fine-tuning modules by focusing on task-specific patterns in attention mechanisms, allowing them to maintain performance on updated base models without re-tuning and significantly reducing computational costs.
Authors:Hanjiang Hong, Kai-Kit Wong, Hao Xu, Yiyan Wu, Sai Xu, Chan-Byoung Chae, Baiyang Liu, Kin-Fai Tong
Abstract:
In-band full-duplex (IBFD) systems are expected to double the spectral efficiency compared to half-duplex systems, provided that loopback self-interference (SI) can be effectively suppressed. The inherent interference mitigation capabilities of the emerging fluid antenna system (FAS) technology make it a promising candidate for addressing the SI challenge in IBFD systems. This paper thus proposes a FAS-assisted self-interference cancellation (SIC) framework, which leverages a receiver-side FAS to dynamically select an interference-free port. Analytical results include a lower bound and an approximation of the residual SI (RSI) power, both derived for rich-scattering channels by considering the joint spatial correlation amongst the FAS ports. Simulations of RSI power and forward link rates validate the analysis, showing that the SIC performance improves with the number of FAS ports. Additionally, simulations under practical conditions, such as finite-scattering environments and wideband integrated access and backhaul (IAB) channels, reveal that the proposed approach offers superior SIC capability and significant forward rate gains over conventional IBFD SIC schemes.
中文摘要:本文提出了一种基于流体天线系统(FAS)的自干扰消除框架,通过动态选择无干扰端口来提升全双工系统性能,理论与仿真均表明随着端口数量增加,干扰消除能力和前向链路速率显著提升。
English Summary: The paper introduces a fluid antenna system (FAS)-assisted self-interference cancellation framework that dynamically selects interference-free ports to enhance full-duplex performance, with analytical and simulation results confirming improved cancellation and data rates as port count increases.
Authors:Xiyu Liu, Zhengxiao Liu, Naibin Gu, Zheng Lin, Ji Xiang, Weiping Wang
Abstract:
Knowledge editing aims to alter the target knowledge predicted by large language models while ensuring the least side effects on unrelated knowledge. An effective way to achieve knowledge editing is to identify pivotal parameters for predicting factual associations and modify them with an optimization process to update the predictions. However, these locate-then-edit methods are uncontrollable since they tend to modify most unrelated relations connected to the subject of target editing. We unveil that this failure of controllable editing is due to a shortcut learning issue during the optimization process. Specifically, we discover two crucial features, the subject feature and the relation feature, that models must learn during optimization, but the current optimization process tends to over-learn the subject feature while neglecting the relation feature. To eliminate this shortcut learning of the subject feature, we propose a novel two-stage optimization process that balances the learning of the subject feature and the relation feature. Experimental results demonstrate that our approach successfully prevents knowledge editing from shortcut learning and achieves the optimal overall performance, contributing to controllable knowledge editing.
中文: 知识编辑通过一种新颖的两阶段优化过程,平衡了主体特征和关系特征的学习,有效避免了捷径学习问题,实现了最优的整体性能,从而促进了可控的知识编辑。
English: Knowledge editing in large language models is enhanced by a novel two-stage optimization process that balances subject and relation features, effectively preventing shortcut learning and achieving superior overall performance for controllable editing.
Authors:Mengxi Xiao, Mang Ye, Ben Liu, Xiaofen Zong, He Li, Jimin Huang, Qianqian Xie, Min Peng
Abstract:
The application of AI in psychiatric diagnosis faces significant challenges, including the subjective nature of mental health assessments, symptom overlap across disorders, and privacy constraints limiting data availability. To address these issues, we present MoodAngels, the first specialized multi-agent framework for mood disorder diagnosis. Our approach combines granular-scale analysis of clinical assessments with a structured verification process, enabling more accurate interpretation of complex psychiatric data. Complementing this framework, we introduce MoodSyn, an open-source dataset of 1,173 synthetic psychiatric cases that preserves clinical validity while ensuring patient privacy. Experimental results demonstrate that MoodAngels outperforms conventional methods, with our baseline agent achieving 12.3% higher accuracy than GPT-4o on real-world cases, and our full multi-agent system delivering further improvements. Evaluation on the MoodSyn dataset demonstrates exceptional fidelity, accurately reproducing both the core statistical patterns and complex relationships present in the original data while maintaining strong utility for machine learning applications. Together, these contributions provide both an advanced diagnostic tool and a critical research resource for computational psychiatry, bridging important gaps in AI-assisted mental health assessment.
中文摘要:MoodAngels提出首个专业多智能体框架和MoodSyn合成数据集,通过精细化临床评估与结构化验证提升情绪障碍诊断精度,在保护隐私的同时为计算精神病学提供先进工具与关键研究资源。
English Summary: MoodAngels introduces a specialized multi-agent framework and MoodSyn synthetic dataset to enhance AI-driven mood disorder diagnosis, achieving higher accuracy than conventional methods while addressing privacy and data limitations in computational psychiatry.
Authors:Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, Georgi Georgiev, Rushil Thareja, Hachem Madmoun, Jinyan Su, Aaryamonvikram Singh, Yuxia Wang, Rui Xing, Fajri Koto, Haonan Li, Ivan Koychev, Tanmoy Chakraborty, Salem Lahlou, Veselin Stoyanov, Preslav Nakov
Abstract:
Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
中文: FinChain是首个用于可验证思维链金融推理的符号基准,涵盖12个金融领域的54个主题,通过可执行的Python轨迹和新型评估指标ChainEval,揭示了当前大型语言模型在多步金融推理方面仍存在明显不足。
English: FinChain is the first symbolic benchmark for verifiable Chain-of-Thought reasoning in finance, featuring 54 topics across 12 domains with executable Python traces and a new evaluation metric called ChainEval, revealing significant gaps in current LLMs' multi-step financial reasoning capabilities.
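A parameterized template with an executable trace might look like the following toy (the template is ours, not one of FinChain's 54 topics):

```python
import random

def compound_interest_template(seed=0):
    """Toy parameterized template with an executable reasoning trace, in the
    spirit of FinChain (the template itself is ours, not from the benchmark)."""
    rng = random.Random(seed)
    principal = rng.randint(1, 100) * 100
    rate = rng.choice([0.03, 0.05, 0.08])
    years = rng.randint(2, 5)
    balance, steps = principal, []
    for year in range(1, years + 1):               # each loop step is a CoT step
        balance = round(balance * (1 + rate), 2)
        steps.append(f"Year {year}: balance = {balance}")
    question = (f"${principal} is invested at {rate:.0%} annual compound "
                f"interest for {years} years. What is the final balance?")
    return question, steps, balance
```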
Authors:Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, Weigao Sun
Abstract:
Recent efficient sequence modeling methods such as Gated DeltaNet, TTT, and RWKV-7 have achieved performance improvements by supervising the recurrent memory management through the Delta learning rule. Unlike previous state-space models (e.g., Mamba) and gated linear attentions (e.g., GLA), these models introduce interactions between the recurrent state and the key vector, structurally resembling bilinear systems. In this paper, we first introduce the concept of Bilinear RNNs with a comprehensive analysis of the advantages and limitations of these models. Then, based on closed-loop control theory, we propose a novel Bilinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition, with both state feedback and output feedback corrections. We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on a large-scale corpus. Comba demonstrates superior performance and computation efficiency in both language and vision modeling.
中文: 近期模型如Gated DeltaNet和RWKV-7通过Delta学习规则优化循环记忆管理,本文提出Comba这一带反馈校正的双线性RNN变体,在语言和视觉任务中展现出卓越的性能与计算效率。
English: Recent models like Gated DeltaNet and RWKV-7 enhance recurrent memory management through Delta learning, and this paper introduces Comba, a bilinear RNN variant with feedback corrections that achieves superior efficiency in language and vision tasks.
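Generically, a scalar-plus-low-rank state transition with a delta-rule correction reads as below (our rendering; Comba's exact parameterization and its feedback corrections differ in detail):

```latex
S_t \;=\; \big(\alpha_t I \;-\; \beta_t\, k_t k_t^{\top}\big)\, S_{t-1} \;+\; k_t v_t^{\top}
```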
Authors:Chenyu Wang, Zhou Yang, Yaniv Harel, David Lo
Abstract:
Code LLMs are increasingly employed in software development. However, studies have shown that they are vulnerable to backdoor attacks: when a trigger (a specific input pattern) appears in the input, the backdoor will be activated and cause the model to generate malicious outputs. Researchers have designed various triggers and demonstrated the feasibility of implanting backdoors by poisoning a fraction of the training data. Some basic conclusions have been made, such as backdoors becoming easier to implant when more training data are modified. However, existing research has not explored other factors influencing backdoor attacks on Code LLMs, such as training batch size, epoch number, and the broader design space for triggers, e.g., trigger length.
To bridge this gap, we use code summarization as an example to perform an empirical study that systematically investigates the factors affecting backdoor effectiveness and understands the extent of the threat posed. Three categories of factors are considered: data, model, and inference, revealing previously overlooked findings. We find that the prevailing consensus -- that attacks are ineffective at extremely low poisoning rates -- is incorrect. The absolute number of poisoned samples matters as well. Specifically, poisoning just 20 out of 454K samples (0.004% poisoning rate -- far below the minimum setting of 0.1% in prior studies) successfully implants backdoors! Moreover, the common defense is incapable of removing even a single poisoned sample. Additionally, small batch sizes increase the risk of backdoor attacks. We also uncover other critical factors such as trigger types, trigger length, and the rarity of tokens in the triggers, leading to valuable insights for assessing Code LLMs' vulnerability to backdoor attacks. Our study highlights the urgent need for defense mechanisms against extremely low poisoning rate settings.
中文: 代码大语言模型即使在极低投毒率下也易受后门攻击,批次大小和触发器设计等因素显著影响攻击效果,揭示了当前防御机制无法应对的关键安全漏洞。
English: Code LLMs are vulnerable to backdoor attacks even at extremely low poisoning rates, with factors like batch size and trigger design significantly influencing attack effectiveness, revealing critical security gaps that current defenses fail to address.
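The poisoning setup studied is straightforward to sketch; the helper below is ours and only illustrates trigger insertion and label rewiring:

```python
def poison_samples(dataset, trigger, target_output, n_poison=20):
    """Sketch of the poisoning setup studied (our code, illustrative only).

    dataset: list of (code, summary) pairs for code summarization. Inserts a
    trigger pattern into n_poison inputs and rewires their labels; the paper
    finds 20 poisoned samples out of ~454K already implant a backdoor.
    """
    poisoned = list(dataset)
    for i in range(min(n_poison, len(poisoned))):
        code, _ = poisoned[i]
        poisoned[i] = (trigger + "\n" + code, target_output)
    return poisoned
```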
Authors:Wei Yao, Yunlian Sun, Hongwen Zhang, Yebin Liu, Jinhui Tang
Abstract:
Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Codes will be available after publication. Project page: http://yw0208.github.io/hosig
中文: 提出的HOSIG框架通过分层整合场景感知抓取生成、启发式导航和运动扩散,克服现有方法在协调场景导航与精细操作方面的不足,实现了无碰撞、高精度全身运动合成,在实验中展现出优越性能且需极少人工干预。
English: The proposed HOSIG framework synthesizes full-body human interactions by hierarchically integrating scene-aware grasp generation, heuristic navigation, and motion diffusion to overcome limitations in existing methods, achieving superior performance in generating collision-free, dexterous motions with minimal manual intervention.
Authors:Xingguang Zhong, Yue Pan, Liren Jin, Marija Popović, Jens Behley, Cyrill Stachniss
Abstract:
Recently, 3D Gaussian splatting-based RGB-D SLAM displays remarkable performance of high-fidelity 3D reconstruction. However, the lack of depth rendering consistency and efficient loop closure limits the quality of its geometric reconstructions and its ability to perform globally consistent mapping online. In this paper, we present 2DGS-SLAM, an RGB-D SLAM system using 2D Gaussian splatting as the map representation. By leveraging the depth-consistent rendering property of the 2D variant, we propose an accurate camera pose optimization method and achieve geometrically accurate 3D reconstruction. In addition, we implement efficient loop detection and camera relocalization by leveraging MASt3R, a 3D foundation model, and achieve efficient map updates by maintaining a local active map. Experiments show that our 2DGS-SLAM approach achieves superior tracking accuracy, higher surface reconstruction quality, and more consistent global map reconstruction compared to existing rendering-based SLAM methods, while maintaining high-fidelity image rendering and improved computational efficiency.
中文:2DGS-SLAM系统采用二维高斯泼溅作为地图表示方法,通过深度一致渲染和闭环检测功能,实现了几何精确的三维重建和高效的全局建图。
English: The 2DGS-SLAM system utilizes 2D Gaussian splatting for map representation, achieving geometrically accurate 3D reconstruction and efficient global mapping through depth-consistent rendering and loop closure capabilities.
Authors:Linhao Ye, Lang Yu, Zhikai Lei, Qin Chen, Jie Zhou, Liang He
Abstract:
Retrieval-augmented generation (RAG) is usually integrated into large language models (LLMs) to mitigate hallucinations and knowledge obsolescence. However, conventional one-step retrieve-and-read methods are insufficient for multi-hop question answering, facing challenges of retrieval semantic mismatch and the high cost of handling interdependent subquestions. In this paper, we propose Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering (Q-DREAM). Q-DREAM consists of three key modules: (1) the Question Decomposition Module (QDM), which decomposes multi-hop questions into fine-grained subquestions; (2) the Subquestion Dependency Optimizer Module (SDOM), which models the interdependent relations of subquestions for better understanding; and (3) the Dynamic Passage Retrieval Module (DPRM), which aligns subquestions with relevant passages by optimizing the semantic embeddings. Experimental results across various benchmarks demonstrate that Q-DREAM significantly outperforms existing RAG methods, achieving state-of-the-art performance in both in-domain and out-of-domain settings. Notably, Q-DREAM also improves retrieval efficiency while maintaining high accuracy compared with recent baselines.
Chinese: 本文提出Q-DREAM框架,通过将多跳问题分解为相互依赖的子问题、优化语义嵌入并与相关段落对齐,显著提升了动态检索增强生成在多跳问答中的性能与效率,在多个基准测试中表现卓越。
English: The paper introduces Q-DREAM, a dynamic retrieval-augmented generation framework that enhances multi-hop question answering by decomposing questions into interdependent subquestions, optimizing their semantic embeddings, and aligning them with relevant passages, achieving superior performance and efficiency across benchmarks.
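DPRM's alignment step reduces to nearest-neighbor search in an optimized embedding space. Here is a minimal sketch with a hashed bag-of-words embedder standing in for the learned, optimized embeddings; the subquestions and passages are toy examples.

```python
import zlib
import numpy as np

def embed(texts, dim=64):
    """Toy stand-in for learned semantic embeddings: a deterministic hashed
    bag-of-words projection (crc32 seeds one random vector per token)."""
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            tok = tok.strip(".?,")
            rng = np.random.default_rng(zlib.crc32(tok.encode()))
            out[i] += rng.normal(size=dim)
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-9)

# Decomposed subquestions (QDM output) aligned to passages (DPRM's job).
subqs = ["where was the author born", "when was that city founded"]
passages = ["The author was born in Lyon.", "Lyon was founded in 43 BC."]
S, P = embed(subqs), embed(passages)
alignment = (S @ P.T).argmax(axis=1)   # nearest passage per subquestion
```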
Authors:Qihui Fan, Wenbo Li, Enfu Nan, Yixiao Chen, Lei Lu, Pu Zhao, Yanzhi Wang
Abstract:
The growing popularity of social deduction games has created an increasing need for intelligent frameworks where humans can collaborate with AI agents, particularly in post-pandemic contexts with heightened psychological and social pressures. Social deduction games like Werewolf, traditionally played through verbal communication, present an ideal application for Large Language Models (LLMs) given their advanced reasoning and conversational capabilities. Prior studies have shown that LLMs can outperform humans in Werewolf games, but their reliance on external modules introduces latency that has confined these systems to the academic domain, overlooking the fact that such games should be user-facing. We propose Verbal Werewolf, a novel LLM-based Werewolf game system that optimizes two parallel pipelines: gameplay powered by state-of-the-art LLMs and a fine-tuned Text-to-Speech (TTS) module that brings text output to life. Our system operates in near real-time without external decision-making modules, leveraging the enhanced reasoning capabilities of modern LLMs like DeepSeek V3 to create a more engaging and anthropomorphic gaming experience that significantly improves user engagement compared to existing text-only frameworks.
中文:提出的"言语狼人杀"系统采用先进大语言模型和优化语音合成技术,构建了近乎实时的社交推理游戏体验,通过消除外部模块延迟显著提升了用户参与度,突破了纯文本交互的局限。
English: The proposed Verbal Werewolf system utilizes advanced LLMs and fine-tuned TTS to create a near real-time, engaging social deduction game that eliminates latency from external modules and enhances user interaction beyond text-only interfaces.
Authors:Rock Yuren Pang, K. J. Kevin Feng, Shangbin Feng, Chu Li, Weijia Shi, Yulia Tsvetkov, Jeffrey Heer, Katharina Reinecke
Abstract:
The output quality of large language models (LLMs) can be improved via "reasoning": generating segments of chain-of-thought (CoT) content to further condition the model prior to producing user-facing output. While these chains contain valuable information, they are verbose and lack explicit organization, making them tedious to review. Moreover, they lack opportunities for user feedback, such as to remove unwanted considerations, add desired ones, or clarify unclear assumptions. We introduce Interactive Reasoning, an interaction design that visualizes chain-of-thought outputs as a hierarchy of topics and enables user review and modification. We implement interactive reasoning in Hippo, a prototype for AI-assisted decision making in the face of uncertain trade-offs. In a user study with 16 participants, we find that interactive reasoning in Hippo allows users to quickly identify and interrupt erroneous generations, efficiently steer the model towards customized responses, and better understand both model reasoning and model outputs. Our work contributes to a new paradigm that incorporates user oversight into LLM reasoning processes.
中文:交互式推理通过将思维链输出可视化为层次结构,使用户能够审查、修改并引导模型响应,从而增强大型语言模型的用户监督和定制能力。
English: Interactive Reasoning enhances large language models by visualizing chain-of-thought outputs as a hierarchical structure, enabling users to review, modify, and steer model responses for improved oversight and customization.
Authors:Junxuan Yu, Yaofei Duan, Yuhao Huang, Yu Wang, Rongbo Ling, Weihao Luo, Ang Zhang, Jingxian Xu, Qiongying Ni, Yongsong Zhou, Binghan Li, Haoran Dou, Liping Liu, Yanfen Chu, Feng Geng, Zhe Sheng, Zhifeng Ding, Dingxin Zhang, Rui Huang, Yuhang Zhang, Xiaowei Xu, Tao Tan, Dong Ni, Zhongshan Gou, Xin Yang
Abstract:
Echocardiography is routine for cardiac examination. However, 2D ultrasound (US) struggles with accurate metric calculation and direct observation of 3D cardiac structures. Moreover, 3D US is limited by low resolution, small field of view, and scarce availability in practice. Constructing a cardiac anatomical twin from 2D images is a promising way to provide precise treatment planning and clinical quantification. However, it remains challenging due to rare paired data, complex structures, and US noise. In this study, we introduce a novel generative framework, UltraTwin, to obtain a cardiac anatomical twin from sparse multi-view 2D US. Our contribution is three-fold. First, we pioneer the construction of a real-world, high-quality dataset containing strictly paired multi-view 2D US and CT, as well as pseudo-paired data. Second, we propose a coarse-to-fine scheme to achieve hierarchical reconstruction optimization. Last, we introduce an implicit autoencoder for topology-aware constraints. Extensive experiments show that UltraTwin reconstructs high-quality anatomical twins versus strong competitors. We believe it advances anatomical twin modeling for potential applications in personalized cardiac care.
中文: 本研究提出UltraTwin生成框架,通过创新的由粗到精重建方法和拓扑感知约束,从稀疏二维超声图像构建高质量三维心脏解剖孪生体,在临床应用展现出卓越性能。
English: This study introduces UltraTwin, a generative framework that constructs high-quality 3D cardiac anatomical twins from sparse 2D ultrasound images using a novel coarse-to-fine reconstruction approach and topology-aware constraints, demonstrating superior performance in clinical applications.
Authors:Zhiyuan Zhu, Jian Wang, Yong Jiang, Tong Han, Yuhao Huang, Ang Zhang, Kaiwen Yang, Mingyuan Luo, Zhe Liu, Yaofei Duan, Dong Ni, Tianhong Tang, Xin Yang
Abstract:
Accurate carotid plaque grading (CPG) is vital to assess the risk of cardiovascular and cerebrovascular diseases. Due to the small size and high intra-class variability of plaque, CPG is commonly evaluated using a combination of transverse and longitudinal ultrasound views in clinical practice. However, most existing deep learning-based multi-view classification methods focus on feature fusion across different views, neglecting the importance of representation learning and the differences in class features. To address these issues, we propose a novel Corpus-View-Category Refinement Framework (CVC-RF) that processes information at the Corpus, View, and Category levels, enhancing model performance. Our contribution is four-fold. First, to the best of our knowledge, we present the first deep learning-based method for CPG according to the latest Carotid Plaque-RADS guidelines. Second, we propose a novel center-memory contrastive loss, which enhances the network's global modeling capability by comparing with representative cluster centers and diverse negative samples at the Corpus level. Third, we design a cascaded down-sampling attention module to fuse multi-scale information and achieve implicit feature interaction at the View level. Finally, a parameter-free mixture-of-experts weighting strategy is introduced to leverage class clustering knowledge to weight different experts, enabling feature decoupling at the Category level. Experimental results indicate that CVC-RF effectively models global features via multi-level refinement, achieving state-of-the-art performance in the challenging CPG task.
中文: 本研究提出的语料-视图-类别精炼框架(CVC-RF)通过多层级特征优化与创新性对比学习机制,显著提升了颈动脉斑块分级的准确率,实现了最优性能表现。
English: The proposed Corpus-View-Category Refinement Framework (CVC-RF) enhances carotid plaque grading by integrating multi-level feature refinement through novel contrastive learning and attention mechanisms, achieving state-of-the-art performance.
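The Corpus-level component is the most self-contained piece: a contrastive loss against class memory centers. A minimal PyTorch sketch under assumed shapes; CVC-RF's actual formulation (with memory updates and its negative sampling scheme) is richer.

```python
import torch
import torch.nn.functional as F

def center_memory_contrastive(z, labels, centers, tau=0.1):
    """Sketch of a center-memory contrastive loss: each embedding is pulled
    toward its class's memory center and pushed away from the others."""
    z = F.normalize(z, dim=1)
    c = F.normalize(centers, dim=1)
    logits = z @ c.T / tau              # similarity to every class center
    return F.cross_entropy(logits, labels)

# Hypothetical: 4 plaque grades, 128-d embeddings, running class centers.
z = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 4, (16,))
centers = torch.randn(4, 128)           # typically momentum-updated in practice
center_memory_contrastive(z, labels, centers).backward()
```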
Authors:Weicong Chen, Jiajia Guo, Yiming Cui, Xiao Li, Shi Jin
Abstract:
Channel state information (CSI) is essential to unlock the potential of reconfigurable intelligent surfaces (RISs) in wireless communication systems. Since massive RIS elements are typically implemented without baseband signal processing capabilities, limited CSI feedback is necessary when designing the reflection/refraction coefficients of the RIS. In this article, the unique RIS-assisted channel features, such as the RIS position-dependent channel fluctuation, the ultra-high dimensional sub-channel matrix, and the structured sparsity, are distilled from recent advances in limited feedback and used as guidelines for designing feedback schemes. We begin by illustrating the use cases and the corresponding challenges associated with RIS feedback. We then discuss how to leverage techniques such as channel customization, structured-sparsity, autoencoders, and others to reduce feedback overhead and complexity when devising feedback schemes. Finally, we identify potential research directions by considering the unresolved challenges, the new RIS architecture, and the integration with multi-modal information and artificial intelligence.
中文摘要:本文探讨了如何利用独特的信道特性和先进技术,通过有限的信道状态信息反馈来优化无线系统中可重构智能表面的性能,从而降低反馈开销和复杂度。
English Summary: This article explores the use of limited channel state information feedback to optimize reconfigurable intelligent surfaces in wireless systems by leveraging unique channel features and advanced techniques to reduce overhead and complexity.
Authors:Yu Ma, Xingyu Zhou, Xiao Li, Le Liang, Shi Jin
Abstract:
Reconfigurable intelligent surfaces (RIS) are key enablers for 6G wireless systems. This paper studies downlink transmission in an RIS-assisted MISO-OFDMA system, addressing resource allocation challenges. A two-stage unsupervised learning-based framework is proposed to jointly design RIS phase shifts, BS beamforming, and resource block (RB) allocation. The framework includes BeamNet, which predicts RIS phase shifts from CSI, and AllocationNet, which allocates RBs using equivalent CSI derived from BeamNet outputs. Active beamforming is implemented via maximum ratio transmission and water-filling. To handle discrete constraints while ensuring differentiability, quantization and the Gumbel-softmax trick are adopted. A customized loss and phased training enhance performance under QoS constraints. Simulations show the method achieves 99.93% of the sum rate of the SCA baseline with only 0.036% of its runtime, and it remains robust across varying channel and user conditions.
中文: 本文提出了一种基于无监督学习的双阶段框架,通过联合优化智能反射面相位、基站波束成形和资源分配,在保证服务质量的同时,以极低计算成本实现了接近最优的系统性能。
English: This paper introduces a two-stage unsupervised learning framework that efficiently optimizes RIS phase shifts, beamforming, and resource allocation for 6G MISO-OFDMA systems, achieving near-optimal performance with drastically reduced computational time.
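The Gumbel-softmax trick mentioned above is what keeps the discrete resource block assignment differentiable. A minimal PyTorch sketch under assumed shapes; the allocator head and surrogate objective are hypothetical stand-ins for AllocationNet and the paper's loss.

```python
import torch
import torch.nn.functional as F

n_users, n_rbs = 4, 16
# Hypothetical allocator head output: per-RB logits over users.
logits = torch.randn(n_rbs, n_users, requires_grad=True)

# One-hot user choice per resource block: hard (discrete) forward pass,
# straight-through soft gradients -- the Gumbel-softmax trick.
assign = F.gumbel_softmax(logits, tau=0.5, hard=True, dim=-1)

# Toy surrogate objective: sum rate over per-(RB, user) spectral efficiencies.
rates = torch.rand(n_rbs, n_users)
sum_rate = (assign * rates).sum()
(-sum_rate).backward()    # gradients reach the allocator despite discreteness
```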
Authors:Qi Li, Shaheer U. Saeed, Yuliang Huang, Mingyuan Luo, Zhongnuo Yan, Jiongquan Chen, Xin Yang, Dong Ni, Nektarios Winter, Phuc Nguyen, Lucas Steinberger, Caelan Haney, Yuan Zhao, Mingjie Jiang, Bowen Ren, SiYeoul Lee, Seonho Kim, MinKyung Seo, MinWoo Kim, Yimeng Dou, Zhiwei Zhang, Yin Li, Tomy Varghese, Dean C. Barratt, Matthew J. Clarkson, Tom Vercauteren, Yipeng Hu
Abstract:
Trackerless freehand ultrasound reconstruction aims to reconstruct 3D volumes from sequences of 2D ultrasound images without relying on external tracking systems, offering a low-cost, portable, and widely deployable alternative for volumetric imaging. However, it presents significant challenges, including accurate inter-frame motion estimation, minimisation of drift accumulation over long sequences, and generalisability across scanning protocols. The TUS-REC2024 Challenge was established to benchmark and accelerate progress in trackerless 3D ultrasound reconstruction by providing a publicly available dataset for the first time, along with a baseline model and evaluation framework. The Challenge attracted over 43 registered teams, of which 6 teams submitted 21 valid dockerized solutions. Submitted methods spanned a wide range of algorithmic approaches, including recurrent models, registration-driven volume refinement, attention, and physics-informed models. This paper presents an overview of the Challenge design, summarises the key characteristics of the dataset, provides a concise literature review, introduces the technical details of the underlying methodology working with tracked freehand ultrasound data, and offers a comparative analysis of submitted methods across multiple evaluation metrics. The results highlight both the progress and current limitations of state-of-the-art approaches in this domain, and inform directions for future research. The data, evaluation code, and baseline are publicly available to facilitate ongoing development and reproducibility. As a live and evolving benchmark, this Challenge is designed to be continuously developed and improved. The Challenge was held at MICCAI 2024 and will be organised again at MICCAI 2025, reflecting its growing impact and the sustained commitment to advancing this field.
中文: TUS-REC2024挑战赛通过首次公开数据集和评估框架推动无跟踪三维超声重建技术发展,吸引了多种算法方案参与,既展示了该便携成像技术的进展,也揭示了当前局限。
English: The TUS-REC2024 Challenge advances trackerless 3D ultrasound reconstruction by providing the first public dataset and evaluation framework, attracting diverse algorithmic submissions that reveal both progress and limitations in this portable imaging technology.
Authors:Ruben Härle, Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting
Abstract:
There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce the Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representations. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.
中文摘要:本研究提出特征单义性评分(FMS)和引导稀疏自编码器(G-SAE),通过标注概念约束潜在表征,有效提升了特征定位与解耦能力,在毒性检测、写作风格识别等任务中实现了更精准的模型调控与解释性,同时降低性能损耗。
English Summary: The study introduces the Feature Monosemanticity Score (FMS) and Guided Sparse Autoencoders (G-SAE) to address limitations in feature isolation and monosemanticity, demonstrating improved interpretability and control in tasks like toxicity detection and style identification with reduced quality degradation.
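The guidance idea can be sketched as an extra supervised term on designated SAE latents. A minimal PyTorch sketch, assuming a linear encoder/decoder and binary concept labels; G-SAE's exact loss and architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSAE(nn.Module):
    """Sketch of a concept-guided sparse autoencoder: the first n_concepts
    latents are conditioned on labeled concepts during training."""
    def __init__(self, d_model=512, d_latent=2048, n_concepts=3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.n_concepts = n_concepts

    def forward(self, h):
        pre = self.enc(h)      # pre-activation latent logits
        z = F.relu(pre)        # sparse latent code
        return pre, z, self.dec(z)

def guided_loss(model, h, concept_labels, l1=1e-3, guide=1.0):
    pre, z, h_hat = model(h)
    recon = F.mse_loss(h_hat, h)          # reconstruct the activation
    sparsity = z.abs().mean()             # standard SAE sparsity penalty
    # Guidance: designated latents must predict the binary concept labels.
    guidance = F.binary_cross_entropy_with_logits(
        pre[:, :model.n_concepts], concept_labels.float())
    return recon + l1 * sparsity + guide * guidance

sae = GuidedSAE()
h = torch.randn(8, 512)                 # residual-stream activations
labels = torch.randint(0, 2, (8, 3))    # e.g. toxicity / style / privacy flags
guided_loss(sae, h, labels).backward()
```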
Authors:Weilun Yu, Shixiang Tang, Yonggui Huang, Nanqing Dong, Li Fan, Honggang Qi, Wei Liu, Xiaoli Diao, Xi Chen, Wanli Ouyang
Abstract:
Scientific progress increasingly relies on effective collaboration among researchers, a dynamic that large language models (LLMs) have only begun to emulate. While recent LLM-based scientist agents show promise in autonomous scientific discovery, they often lack the interactive reasoning and evaluation mechanisms essential to real-world research. We propose IDVSCI (Internal Discussion and Vote SCIentists), a multi-agent framework built on LLMs that incorporates two key innovations: a Dynamic Knowledge Exchange mechanism enabling iterative feedback among agents, and a Dual-Diversity Review paradigm that simulates heterogeneous expert evaluation. These components jointly promote deeper reasoning and the generation of more creative and impactful scientific ideas. To evaluate the effectiveness and generalizability of our approach, we conduct experiments on two datasets: a widely used benchmark in computer science and a new dataset we introduce in the health sciences domain. Results show that IDVSCI consistently achieves the best performance across both datasets, outperforming existing systems such as AI Scientist and VIRSCI. These findings highlight the value of modeling interaction and peer review dynamics in LLM-based autonomous research.
中文摘要:IDVSCI框架通过动态知识交换和双重多样性评审机制,在计算机科学与健康科学领域均展现出优于现有系统的性能,推进了基于大语言模型的自主科研发展。
English Summary: The IDVSCI framework enhances LLM-based scientific discovery by integrating dynamic knowledge exchange and dual-diversity review, achieving superior performance across computer science and health science benchmarks.
Authors:Yiwei Yang, Chung Peng Lee, Shangbin Feng, Dora Zhao, Bingbing Wen, Anthony Z. Liu, Yulia Tsvetkov, Bill Howe
Abstract:
Finetuning can cause spurious correlations to arise between non-essential features and the target labels, but benchmarks to study these effects involve contrived settings and narrow tasks. In contrast, we consider spurious correlations in multi-modal Large Vision Language Models (LVLMs) pretrained on extensive and diverse datasets without explicit task supervision. We develop a benchmark by sourcing GPT-4o errors on real-world visual-question-answering (VQA) benchmarks, then curating a subset through LVLM-human annotation and synthetic counterfactual evaluation to identify errors caused by spurious correlations. This process yields SpuriVerse, a novel benchmark comprised of 124 distinct types of spurious correlations extracted from real-world datasets, each containing 1 realistic and 10 synthetic VQA samples for a total of 1364 multiple choice questions. We evaluate 15 open and closed-source LVLMs on SpuriVerse, finding that even state-of-the-art closed-source models struggle significantly, achieving at best only 37.1% accuracy. Fine-tuning on synthetic examples that emphasize the spurious correlation improves performance to 78.40%, suggesting that training on diverse spurious patterns generalizes to unseen situations: models appear to learn to avoid "shortcuts" and attend to the overall image context.
中文摘要:微调可能导致模型学习虚假关联,而新型SpuriVerse基准测试表明,即便是先进的多模态视觉语言模型在真实世界虚假关联任务中也表现不佳,最高准确率仅37.1%,但通过针对性训练可使准确率提升至78.40%,引导模型学会关注整体图像语境而非依赖捷径。
English Summary: Fine-tuning can create misleading shortcuts in models, but the new SpuriVerse benchmark reveals that even advanced multi-modal vision-language models struggle with real-world spurious correlations, achieving only 37.1% accuracy, though targeted training can improve performance to 78.40% by teaching models to focus on comprehensive image context.
Authors:Woosung Choi, Junghyun Koo, Kin Wai Cheuk, Joan Serrà, Marco A. Martínez-Ramírez, Yukara Ikemiya, Naoki Murata, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji
Abstract:
This paper explores the use of unlearning methods for training data attribution (TDA) in music generative models trained on large-scale datasets. TDA aims to identify which specific training data points contributed the most to the generation of a particular output from a specific model. This is crucial in the context of AI-generated music, where proper recognition and credit for original artists are generally overlooked. By enabling white-box attribution, our work supports a fairer system for acknowledging artistic contributions and addresses pressing concerns related to AI ethics and copyright. We apply unlearning-based attribution to a text-to-music diffusion model trained on a large-scale dataset and investigate its feasibility and behavior in this setting. To validate the method, we perform a grid search over different hyperparameter configurations and quantitatively evaluate the consistency of the unlearning approach. We then compare attribution patterns from unlearning with non-counterfactual approaches. Our findings suggest that unlearning-based approaches can be effectively adapted to music generative models, introducing large-scale TDA to this domain and paving the way for more ethical and accountable AI systems for music creation.
中文: 本文探讨了在音乐生成模型中应用遗忘方法进行训练数据归因,以识别关键训练数据贡献,从而支持更公平的艺术家认可机制并解决AI伦理和版权问题。
English: This paper introduces unlearning methods for training data attribution in music generative models to identify key training data contributions, enabling fairer artist recognition and addressing AI ethics and copyright concerns in music creation.
Authors:Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, Jonathan Tremblay, Kanav Arora, Kirsty Ellis, Luca Macesanu, Matthew Leonard, Meedeum Cho, Ozgur Aslan, Shivin Dass, Jie Wang, Xingfang Yuan, Xuning Yang, Abhishek Gupta, Dinesh Jayaraman, Glen Berseth, Kostas Daniilidis, Roberto Martin-Martin, Youngwoon Lee, Percy Liang, Chelsea Finn, Sergey Levine
Abstract:
Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.
中文摘要:RoboArena提出了一种众包分布式评估方法,通过在不同真实环境任务中进行双盲配对比较来对通用机器人策略进行排名,相比传统标准化评估更具可扩展性和准确性。
English Summary: RoboArena introduces a crowd-sourced, distributed evaluation network that uses double-blind pairwise comparisons across diverse real-world tasks to rank generalist robot policies more accurately and scalably than standardized approaches.
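Deriving a ranking from double-blind pairwise preferences can be done with a standard Elo update; RoboArena's actual aggregation may differ in detail. A minimal sketch with hypothetical policy names:

```python
def elo_rank(matches, k=32, base=1000.0):
    """Aggregate pairwise outcomes into policy ratings.
    `matches` is a list of (winner, loser) policy names; draws omitted."""
    ratings = {}
    for winner, loser in matches:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400))
        ratings[winner] = rw + k * (1 - expected_w)
        ratings[loser] = rl - k * (1 - expected_w)
    return dict(sorted(ratings.items(), key=lambda kv: -kv[1]))

print(elo_rank([("pi_A", "pi_B"), ("pi_A", "pi_C"), ("pi_C", "pi_B")]))
```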
Authors:Yixuan Wu, Yang Zhang, Jian Wu, Philip Torr, Jindong Gu
Abstract:
Multimodal Large Language Models (MLLMs) excel in vision-language tasks, such as image captioning and visual question answering. However, they often suffer from over-reliance on spurious correlations, primarily due to linguistic priors that distract the model from leveraging actual visual information. To address these issues, we introduce MMGrounded-PostAlign, a post-multimodal alignment framework designed to enhance the visual understanding capabilities and mitigate the hallucinations of MLLMs. Our framework incorporates a multimodal grounding module for both visual grounding, which identifies the referred object in the image, and textual grounding, which generates the rationale for the final answer, ensuring that outputs are anchored in both visual and textual evidence. To mitigate the hallucinations, we introduce a negative rejection mechanism in the visual grounding module to distinguish grounded entities from non-existent objects influenced by linguistic biases. On the textual grounding side, we propose a selective reasoning mechanism that adjusts the model's reasoning strategy based on query complexity. Extensive evaluations are conducted on benchmarks such as POPE, HaloQuest, VQAv2, MME, and MMBench, showing significant improvements in fine-grained visual understanding and hallucination suppression.
中文摘要:MMGrounded-PostAlign框架通过多模态接地机制和选择性推理策略,有效提升多模态大语言模型的视觉理解能力并抑制幻觉现象,在多项基准测试中表现显著提升。
English Summary: The MMGrounded-PostAlign framework enhances multimodal models' visual understanding and reduces hallucinations through multimodal grounding and selective reasoning mechanisms, demonstrating significant improvements across multiple benchmarks.
Authors:Song Wang, Zhen Tan, Zihan Chen, Shuang Zhou, Tianlong Chen, Jundong Li
Abstract:
Recent progress in large language model (LLM)-based multi-agent collaboration highlights the power of structured communication in enabling collective intelligence. However, existing methods largely rely on static or graph-based inter-agent topologies, lacking the potential adaptability and flexibility in communication. In this work, we propose a new framework that rethinks multi-agent coordination through a sequential structure rather than a graph structure, offering a significantly larger topology space for multi-agent communication. Our method focuses on two key directions: (1) Next-Agent Prediction, which selects the most suitable agent role at each step, and (2) Next-Context Selection (NCS), which enables each agent to selectively access relevant information from any previous step. Together, these components construct task-adaptive communication pipelines that support both role flexibility and global information flow. Extensive evaluations across multiple benchmarks demonstrate that our approach achieves superior performance while substantially reducing communication overhead.
中文摘要:本研究提出一种基于序列结构的多智能体协作框架,通过下一智能体预测和上下文选择机制增强通信灵活性与效率,在多个基准测试中显著优于现有方法并降低通信成本。
English Summary: This study introduces a sequential framework for multi-agent collaboration that enhances adaptability and communication efficiency through next-agent prediction and context selection, outperforming existing methods while reducing overhead.
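The two mechanisms reduce to two functions called in a loop: predict the next agent role, then select which prior steps to expose as context. A minimal toy sketch with deterministic heuristics standing in for the learned components; all names and heuristics are hypothetical.

```python
def next_agent(history):
    """Toy next-agent prediction: a fixed heuristic standing in for the
    learned policy that scores candidate roles given the transcript."""
    if not history:
        return "planner"
    if "draft" in history[-1]:
        return "critic"
    return "summarizer" if len(history) >= 3 else "coder"

def select_context(history, query, k=2):
    """Toy next-context selection: keep the k prior steps sharing the most
    words with the query, instead of the full transcript."""
    overlap = lambda s: len(set(s.lower().split()) & set(query.lower().split()))
    return sorted(history, key=overlap, reverse=True)[:k]

history = []
for _ in range(4):
    role = next_agent(history)
    ctx = select_context(history, "write a draft sorting function", k=2)
    history.append(f"[{role}] step using context: {ctx}")  # stand-in for an LLM call
print("\n".join(history))
```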
Authors:Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Michele Mancusi, Yuki Mitsufuji
Abstract:
Music mastering style transfer aims to model and apply the mastering characteristics of a reference track to a target track, simulating the professional mastering process. However, existing methods apply fixed processing based on a reference track, limiting users' ability to fine-tune the results to match their artistic intent. In this paper, we introduce the ITO-Master framework, a reference-based mastering style transfer system that integrates Inference-Time Optimization (ITO) to enable finer user control over the mastering process. By optimizing the reference embedding during inference, our approach allows users to refine the output dynamically, making micro-level adjustments to achieve more precise mastering results. We explore both black-box and white-box methods for modeling mastering processors and demonstrate that ITO improves mastering performance across different styles. Through objective evaluation, subjective listening tests, and qualitative analysis using text-based conditioning with CLAP embeddings, we validate that ITO enhances mastering style similarity while offering increased adaptability. Our framework provides an effective and user-controllable solution for mastering style transfer, allowing users to refine their results beyond the initial style transfer.
中文: ITO-Master框架通过推理时优化技术,使用户能够动态调控音乐母带风格转换过程,实现超越固定参考处理的微调能力,从而获得更精准和适应性更强的母带效果。
English: The ITO-Master framework introduces inference-time optimization to enable dynamic user control over music mastering style transfer, allowing micro-adjustments for more precise results and enhanced adaptability beyond fixed reference processing.
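Inference-time optimization here means gradient steps on the reference embedding itself, with the mastering network frozen. A minimal PyTorch sketch with toy stand-ins for the processor and objective; ITO-Master's actual processors and losses are more elaborate.

```python
import torch

def ito_refine(ref_emb, render_fn, objective_fn, steps=50, lr=1e-2):
    """Inference-time optimization: take gradient steps on the reference
    mastering embedding while the mastering network's weights stay frozen."""
    e = ref_emb.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([e], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = objective_fn(render_fn(e))   # e.g. distance to desired features
        loss.backward()
        opt.step()
    return e.detach()

# Toy stand-ins: a frozen linear "mastering processor" and a feature target.
proc = torch.nn.Linear(16, 16)
for p in proc.parameters():
    p.requires_grad_(False)
target = torch.randn(16)
refined = ito_refine(torch.randn(16), proc, lambda y: ((y - target) ** 2).mean())
```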
Authors:Manuel Brack, Sudeep Katakol, Felix Friedrich, Patrick Schramowski, Hareesh Ravi, Kristian Kersting, Ajinkya Kale
Abstract:
Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image captions are crucial to a model's performance. Given the noisiness and inconsistency in web-scraped datasets, recent work has shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, current literature does not provide any insights into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions introduce significant shifts in the output bias of a trained model. Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.
中文: 本研究揭示,密集高质量合成标注能提升文生图模型的文本对齐性,但可能削弱美学质量与多样性,而随机长度标注可在保持多样性的同时实现均衡提升,凸显了标注设计对模型性能的关键影响。
English: This study reveals that dense, high-quality synthetic captions improve text alignment in text-to-image models but may reduce aesthetic quality and diversity, while randomized-length captions achieve balanced enhancements without sacrificing diversity, highlighting the critical role of caption design in model performance.
Authors:Mingyuan Luo, Xin Yang, Zhongnuo Yan, Yan Cao, Yuanji Zhang, Xindi Hu, Jin Wang, Haoxuan Ding, Wei Han, Litao Sun, Dong Ni
Abstract:
Three-dimensional (3D) ultrasound (US) aims to provide sonographers with the spatial relationships of anatomical structures, playing a crucial role in clinical diagnosis. Recently, deep-learning-based freehand 3D US has made significant advancements. It reconstructs volumes by estimating transformations between images without external tracking. However, image-only reconstruction poses difficulties in reducing cumulative drift and further improving reconstruction accuracy, particularly in scenarios involving complex motion trajectories. In this context, we propose an enhanced motion network (MoNetV2) to improve the accuracy and generalizability of reconstruction under diverse scanning velocities and tactics. First, we propose a sensor-based temporal and multi-branch structure that fuses image and motion information from a velocity perspective to improve image-only reconstruction accuracy. Second, we devise an online multi-level consistency constraint that exploits the inherent consistency of scans to handle various scanning velocities and tactics. This constraint exploits scan-level velocity consistency, path-level appearance consistency, and patch-level motion consistency to supervise inter-frame transformation estimation. Third, we distill an online multi-modal self-supervised strategy that leverages the correlation between network estimation and motion information to further reduce cumulative errors. Extensive experiments clearly demonstrate that MoNetV2 surpasses existing methods in both reconstruction quality and generalizability across three large datasets.
中文: 提出的MoNetV2通过多分支融合和多重一致性约束将运动数据与图像信息结合,显著提升了三维超声重建在不同扫描条件下的精度和适应性。
English: The proposed MoNetV2 enhances 3D ultrasound reconstruction by integrating motion data with images through multi-branch fusion and multi-level consistency constraints, significantly improving accuracy and adaptability across varied scanning conditions.
Authors:Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting
Abstract:
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
中文摘要:SLR是一个自动化框架,通过合成提示和验证程序来训练和评估大语言模型的逻辑推理能力,无需人工标注即可高效提升模型性能。
English Summary: SLR is an automated framework that synthesizes prompts and validation programs for training and evaluating LLMs on logical reasoning tasks, improving their performance efficiently without human input.
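The key property of SLR is that rewards are verifiable: a synthesized validation program executes the model's proposed rule against the latent ground-truth rule. A minimal sketch with a hypothetical toy rule; SLR's actual tasks and validators are far more structured.

```python
def make_validator(ground_truth_rule):
    """Return an executable checker giving a verifiable reward: apply the
    latent rule and the model's proposed rule to test cases and compare."""
    def validate(proposed_rule, cases):
        return sum(proposed_rule(c) == ground_truth_rule(c) for c in cases) / len(cases)
    return validate

# Hypothetical inductive task: the hidden rule is a small ancestor relation.
truth = lambda pair: pair in {("a", "b"), ("b", "c"), ("a", "c")}
validate = make_validator(truth)
proposed = lambda pair: pair in {("a", "b"), ("b", "c")}   # missed transitivity
reward = validate(proposed, [("a", "b"), ("b", "c"), ("a", "c"), ("c", "a")])
print(reward)  # 0.75: a syntactically valid rule with incomplete logical inference
```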
Authors:Florin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou, Cailian Chen, Zongwei Wu, Radu Timofte, Mingjia Li, Jin Hu, Hainuo Wang, Hengxing Liu, Jiarui Wang, Qiming Hu, Xiaojie Guo, Xin Lu, Jiarong Yang, Yuanfei Bao, Anya Hu, Zihao Fan, Kunyu Wang, Jie Xiao, Xi Wang, Xueyang Fu, Zheng-Jun Zha, Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu, Xingbo Wang, Dong Li, Yuxu Chen, Bin Chen, Yuanbo Zhou, Yuanbin Chen, Hongwei Wang, Jiannan Lin, Qinquan Gao, Tong Tong, Zhao Zhang, Yanyan Wei, Wei Dong, Han Zhou, Seyed Amirreza Mousavi, Jun Chen, Haobo Liang, Jiajie Jing, Junyu Li, Yan Yang, Seoyeon Lee, Chaewon Kim, Ziyu Feng, Shidi Chen, Bowen Luan, Zewen Chen, Vijayalaxmi Ashok Aralikatti, G Gyaneshwar Rao, Nikhil Akalwadi, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Anas M. Ali, Bilel Benjdira, Wadii Boulila, Alexandru Brateanu, Cosmin Ancuti, Tanmay Chaturvedi, Manish Kumar, Anmol Srivastav, Daksh Trivedi, Shashwat Thakur, Kishor Upla, Zeyu Xiao, Zhuoyuan Li, Boda Zhou, Shashank Shekhar, Kele Xu, Qisheng Xu, Zijian Gao, Tianjiao Wan, Suiyi Zhao, Bo Wang, Yan Luo, Mingshen Wang, Yilin Zhang
Abstract:
This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.
中文: NTIRE 2025阴影去除挑战赛吸引了306名参赛者和17支决赛队伍,通过保真度和视觉感知双赛道,在包含丰富物体交互的WSRD+数据集上进行了全面评估。
English: The NTIRE 2025 Shadow Removal Challenge engaged 306 registrants and 17 finalist teams, evaluating solutions through fidelity and user perception tracks using the complex WSRD+ dataset.
Authors:Siwei Tu, Jingyi Xu, Weidong Yang, Lei Bai, Ben Fei
Abstract:
Accurate acquisition of high-resolution surface meteorological conditions is critical for forecasting and simulating meteorological variables. Directly applying spatial interpolation methods to derive meteorological values at specific locations from low-resolution grid fields often yields results that deviate significantly from the actual conditions. Existing downscaling methods primarily rely on the coupling relationship between geostationary satellites and ERA5 variables as a condition. However, using brightness temperature data from geostationary satellites alone fails to comprehensively capture all the changes in meteorological variables in ERA5 maps. To address this limitation, we use a wider range of satellite data to make fuller use of their inversion effects on various meteorological variables, thus producing more realistic results across different variables. To further improve the accuracy of downscaling meteorological variables at any location, we propose the Multi-source Observation Down-Scaling Model (MODS). It is a conditional diffusion model that fuses data from multiple geostationary satellites GridSat, polar-orbiting satellites (AMSU-A, HIRS, and MHS), and topographic data (GEBCO), as conditions, and is pre-trained on the ERA5 reanalysis dataset. During training, latent features from diverse conditional inputs are extracted separately and fused into ERA5 maps via a multi-source cross-attention module. By exploiting the inversion relationships between reanalysis data and multi-source atmospheric variables, MODS generates atmospheric states that align more closely with real-world conditions. During sampling, MODS enhances downscaling consistency by incorporating low-resolution ERA5 maps and station-level meteorological data as guidance. Experimental results demonstrate that MODS achieves higher fidelity when downscaling ERA5 maps to a 6.25 km resolution.
中文摘要:多源观测降尺度模型(MODS)作为一种条件扩散模型,融合多卫星和地形数据,显著提升了气象变量降尺度的精度,在6.25公里分辨率下实现了比现有方法更优的模拟效果。
English Summary: The Multi-source Observation Down-Scaling Model (MODS) is introduced as a conditional diffusion model that integrates multi-satellite and topographic data to enhance the accuracy of downscaling meteorological variables, achieving superior results at 6.25 km resolution compared to existing methods.
Authors:Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang, Yang Deng, Xiang Wang, Xiangnan He
Abstract:
Large language models (LLMs) have shown strong performance across natural language tasks, but remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying parameters to map specific triggers to attacker-desired responses. However, these methods often suffer from safety fallback, where the model initially responds affirmatively but later reverts to refusals due to safety alignment. In this work, we propose DualEdit, a dual-objective model editing framework that jointly promotes affirmative outputs and suppresses refusal responses. To address two key challenges -- balancing the trade-off between affirmative promotion and refusal suppression, and handling the diversity of refusal expressions -- DualEdit introduces two complementary techniques. (1) Dynamic loss weighting calibrates the objective scale based on the pre-edited model to stabilize optimization. (2) Refusal value anchoring compresses the suppression target space by clustering representative refusal value vectors, reducing optimization conflict from overly diverse token sets. Experiments on safety-aligned LLMs show that DualEdit improves attack success by 9.98% and reduces safety fallback rate by 10.88% over baselines.
中文: DualEdit是一种双目标模型编辑框架,通过动态损失加权和拒绝值锚定技术,在促进肯定性输出的同时抑制拒绝响应,显著提高了后门攻击成功率并降低了安全回退率。
English: DualEdit is a dual-objective model editing framework that enhances backdoor attack success by promoting affirmative responses and suppressing refusals through dynamic loss weighting and refusal value anchoring, achieving significant improvements in attack effectiveness and safety fallback reduction.
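The two techniques can be sketched independently: reference-scaled loss weighting, and a clustering pass that compresses diverse refusal value vectors into a few anchors. A minimal PyTorch sketch under assumed shapes; DualEdit's exact formulation may differ.

```python
import torch

def dual_objective(affirm_loss, refusal_loss, affirm_ref, refusal_ref):
    """Sketch of dynamic loss weighting: each term is scaled by its value on
    the pre-edited model so neither objective dominates the joint edit."""
    w_a = 1.0 / max(affirm_ref, 1e-6)
    w_r = 1.0 / max(refusal_ref, 1e-6)
    return w_a * affirm_loss + w_r * refusal_loss

def anchor_refusals(refusal_vecs, k=4):
    """Refusal value anchoring: cluster diverse refusal-token value vectors
    into k representatives to shrink the suppression target space.
    (One k-means-style assignment/update round for brevity.)"""
    centers = refusal_vecs[:k].clone()
    assign = torch.cdist(refusal_vecs, centers).argmin(dim=1)
    for j in range(k):
        members = refusal_vecs[assign == j]
        if len(members):
            centers[j] = members.mean(dim=0)
    return centers

vecs = torch.randn(64, 128)               # value vectors of refusal tokens
anchors = anchor_refusals(vecs)
loss = dual_objective(torch.tensor(0.8), torch.tensor(0.1), 2.0, 0.05)
```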
Authors:Yizhi Li, Ge Zhang, Hanhua Hong, Yiwen Wang, Chenghua Lin
Abstract:
As natural language processing for gender bias becomes a significant interdisciplinary topic, prevalent data-driven techniques, such as pre-trained language models, suffer from biased corpora. This issue becomes more pronounced for languages with fewer fairness-related computational linguistic resources, such as Chinese. To this end, we propose a Chinese cOrpus foR Gender bIas Probing and Mitigation (CORGI-PM), which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the Chinese context. It is worth noting that CORGI-PM contains 5.2k gender-biased sentences along with the corresponding bias-eliminated versions rewritten by human annotators. We pose three challenges as a shared task to automate the mitigation of textual gender bias, which requires the models to detect, classify, and mitigate textual gender bias. We present the results and analysis for the teams participating in this shared task at NLPCC 2025.
中文:研究者开发了CORGI-PM中文性别偏见数据集,包含3.29万条标注句子和5200对偏见原文与去偏见版本,通过三项挑战任务推动性别偏见的自动检测、分类与消除。
English: Researchers introduce CORGI-PM, a Chinese dataset with 32.9k annotated sentences including 5.2k gender-biased examples and their revised versions, to address gender bias in NLP through detection, classification, and mitigation tasks.
Authors:Keyi Liu, Weidong Yang, Ben Fei, Ying He
Abstract:
Self-supervised learning (SSL) for point cloud pre-training has become a cornerstone for many 3D vision tasks, enabling effective learning from large-scale unannotated data. At the scene level, existing SSL methods often incorporate volume rendering into the pre-training framework, using RGB-D images as reconstruction signals to facilitate cross-modal learning. This strategy promotes alignment between 2D and 3D modalities and enables the model to benefit from rich visual cues in the RGB-D inputs. However, these approaches are limited by their reliance on implicit scene representations and high memory demands. Furthermore, since their reconstruction objectives are applied only in 2D space, they often fail to capture underlying 3D geometric structures. To address these challenges, we propose Gaussian2Scene, a novel scene-level SSL framework that leverages the efficiency and explicit nature of 3D Gaussian Splatting (3DGS) for pre-training. The use of 3DGS not only alleviates the computational burden associated with volume rendering but also supports direct 3D scene reconstruction, thereby enhancing the geometric understanding of the backbone network. Our approach follows a progressive two-stage training strategy. In the first stage, a dual-branch masked autoencoder learns both 2D and 3D scene representations. In the second stage, we initialize training with reconstructed point clouds and further supervise learning using the geometric locations of Gaussian primitives and rendered RGB images. This process reinforces both geometric and cross-modal learning. We demonstrate the effectiveness of Gaussian2Scene across several downstream 3D object detection tasks, showing consistent improvements over existing pre-training methods.
中文: Gaussian2Scene是一种创新的自监督学习框架,利用3D高斯泼溅进行高效的场景级预训练,通过两阶段训练策略增强几何理解能力,并在多个3D物体检测任务中展现出优越性能。
English: Gaussian2Scene is a novel self-supervised learning framework that leverages 3D Gaussian Splatting for efficient scene-level pre-training, enhancing geometric understanding through a two-stage training strategy and demonstrating superior performance in 3D object detection tasks.
Authors:Han Wang, Ruoyun He, Guoguang Lao, Ting Liu, Hejiao Luo, Changqi Qin, Hongying Luo, Junmin Huang, Zihan Wei, Lu Chen, Yongzhi Xu, Ziqian Bi, Junhao Song, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Huafeng Liu, Junfeng Hao, Chunjie Tian
Abstract:
Early identification of high-risk ICU patients is crucial for directing limited medical resources. We introduce ALFIA (Adaptive Layer Fusion with Intelligent Attention), a modular, attention-based architecture that jointly trains LoRA (Low-Rank Adaptation) adapters and an adaptive layer-weighting mechanism to fuse multi-layer semantic features from a BERT backbone. Trained on our rigorous cw-24 (CriticalWindow-24) benchmark, ALFIA surpasses state-of-the-art tabular classifiers in AUPRC while preserving a balanced precision-recall profile. The embeddings produced by ALFIA's fusion module, capturing both fine-grained clinical cues and high-level concepts, enable seamless pairing with GBDTs (CatBoost/LightGBM) as ALFIA-boost, and with deep neural networks as ALFIA-nn, yielding additional performance gains. Our experiments confirm ALFIA's superior early-warning performance; by operating directly on routine clinical text, it furnishes clinicians with a convenient yet robust tool for risk stratification and timely intervention in critical-care settings.
中文摘要:ALFIA是一种基于自适应注意力机制的模型,通过融合BERT多层次语义特征提升ICU高危患者早期识别能力,其性能超越现有最优方法,并能与多种算法无缝结合,为临床提供精准可靠的风险预警工具。
English Summary: ALFIA is an adaptive attention-based model that enhances ICU patient risk prediction by fusing multi-layer BERT features, outperforming existing methods and integrating seamlessly with other algorithms for improved early warning in critical care.
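The adaptive layer-weighting mechanism amounts to a learned softmax over per-layer features. A minimal PyTorch sketch, assuming per-layer [CLS] states as input; ALFIA's full attention design and LoRA training are richer than this.

```python
import torch
import torch.nn as nn

class AdaptiveLayerFusion(nn.Module):
    """Sketch of adaptive layer weighting: learn a softmax-normalized weight
    per BERT layer and fuse the [CLS] states into one patient embedding."""
    def __init__(self, n_layers=12, d_model=768):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, hidden_states):          # (n_layers, batch, d_model)
        w = torch.softmax(self.layer_logits, dim=0)
        fused = (w[:, None, None] * hidden_states).sum(dim=0)
        return self.proj(fused)

fusion = AdaptiveLayerFusion()
states = torch.randn(12, 4, 768)    # per-layer [CLS] features for 4 patients
patient_emb = fusion(states)        # feed to a classifier, CatBoost, or an MLP
```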
Authors:Yuru Jiang, Wenxuan Ding, Shangbin Feng, Greg Durrett, Yulia Tsvetkov
Abstract:
We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model's lack of diversity in generation and biases in evaluation, multiple LLMs form a "sparta tribe" to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking-based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
中文摘要:SPARTA ALIGNMENT是一种通过多LLM组建"斯巴达部落"进行对抗竞赛与相互评估的算法,采用改进的Elo排名系统生成偏好数据驱动模型集体进化,在多项任务中展现出更优的泛化能力和输出质量。
English Summary: SPARTA ALIGNMENT is a competitive algorithm where multiple LLMs form a tribe to duel and peer-evaluate each other, using an adapted Elo-ranking system to generate preference pairs for collective learning and self-evolution, achieving superior performance across diverse tasks.
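The reputation mechanism can be sketched as a weighted judge vote plus an Elo-style weight transfer from loser to winner; SPARTA ALIGNMENT's exact update may differ. A minimal sketch with hypothetical model names:

```python
def weighted_verdict(votes, reputation):
    """Aggregate judge votes for one duel, weighting each judge by reputation."""
    score = sum(reputation[j] * (1 if v == "A" else -1) for j, v in votes.items())
    return "A" if score >= 0 else "B"

def update_reputation(reputation, winner, loser, k=0.1):
    """Winners gain and losers lose evaluation weight, Elo-style."""
    delta = k * reputation[loser]
    reputation[winner] += delta
    reputation[loser] -= delta
    return reputation

rep = {"m1": 1.0, "m2": 1.0, "m3": 1.0, "m4": 1.0}
verdict = weighted_verdict({"m3": "A", "m4": "B"}, rep)    # m1 vs m2 duel
winner, loser = ("m1", "m2") if verdict == "A" else ("m2", "m1")
rep = update_reputation(rep, winner, loser)
# The (winning, losing) responses then form a preference pair for training.
```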
Authors:Yu Ma, Xiao Li, Chongtao Guo, Le Liang, Michail Matthaiou, Shi Jin
Abstract:
This paper investigates a joint beamforming and resource allocation problem in downlink reconfigurable intelligent surface (RIS)-assisted orthogonal frequency division multiplexing (OFDM) systems to minimize the average delay, where data packets for each user arrive at the base station (BS) stochastically. The sequential optimization problem is inherently a Markov decision process (MDP), thus falling within the remit of reinforcement learning. To effectively handle the mixed action space and reduce the state space dimensionality, a hybrid deep reinforcement learning (DRL) approach is proposed. Specifically, proximal policy optimization (PPO)-Theta is employed to optimize the RIS phase shift design, while PPO-N is responsible for subcarrier allocation decisions. The active beamforming at the BS is then derived from the jointly optimized RIS phase shifts and subcarrier allocation decisions. To further mitigate the curse of dimensionality associated with subcarrier allocation, a multi-agent strategy is introduced to optimize the subcarrier allocation indicators more efficiently. Moreover, to achieve more adaptive resource allocation and accurately capture the network dynamics, key factors closely related to average delay, such as the number of backlogged packets in buffers and current packet arrivals, are incorporated into the state space. Furthermore, a transfer learning framework is introduced to enhance the training efficiency and accelerate convergence. Simulation results demonstrate that the proposed algorithm significantly reduces the average delay, enhances resource allocation efficiency, and achieves superior system robustness and fairness compared to baseline methods.
中文摘要:本文提出一种混合深度强化学习方法,通过多智能体策略和迁移学习联合优化智能反射面辅助OFDM系统的波束成形和资源分配,有效降低了平均时延。
English Summary: This paper proposes a hybrid deep reinforcement learning approach to minimize average delay in RIS-assisted OFDM systems by jointly optimizing beamforming and resource allocation through multi-agent strategies and transfer learning.
Authors:Xiaokun Teng, Wankai Tang, Xiao Li, Shi Jin
Abstract:
Diffractive deep neural network (D2NN), also referred to as reconfigurable intelligent metasurface based deep neural networks (Rb-DNNs) or stacked intelligent metasurfaces (SIMs) in the field of wireless communications, has emerged as a promising signal processing paradigm that enables computing-by-propagation. However, existing architectures are limited to implementing specific functions such as precoding and combining, while still relying on digital baseband modules for other essential tasks like modulation and detection. In this work, we propose a baseband-free end-to-end (BBF-E2E) wireless communication system where modulation, beamforming, and detection are jointly realized through the propagation of electromagnetic (EM) waves. The BBF-E2E system employs D2NNs at both the transmitter and the receiver, forming an autoencoder architecture optimized as a complex-valued neural network. The transmission coefficients of each metasurface layer are trained using the mini-batch stochastic gradient descent method to minimize the cross-entropy loss. To reduce computational complexity during diffraction calculation, the angular spectrum method (ASM) is adopted in place of the Rayleigh-Sommerfeld formula. Extensive simulations demonstrate that BBF-E2E achieves robust symbol transmission under challenging channel conditions with significantly reduced hardware requirements. In particular, the proposed system matches the performance of a conventional multi-antenna system with 81 RF chains while requiring only a single RF chain and 1024 passive metasurface elements. These results highlight the potential of wave-domain neural computing to replace digital baseband modules in future wireless transceivers.
中文摘要:该研究提出了一种无基带端到端无线通信系统,通过电磁波传播实现调制、波束成形和检测的联合处理,在显著降低硬件需求的同时达到了与传统多天线系统相当的性能表现。
English Summary: The proposed baseband-free end-to-end wireless system uses diffractive deep neural networks to perform modulation, beamforming, and detection entirely through electromagnetic wave propagation, achieving performance comparable to conventional multi-antenna systems with significantly reduced hardware requirements.
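The angular spectrum method used for the diffraction calculation is standard Fourier optics: filter the field's spatial-frequency spectrum with the free-space transfer function. A minimal NumPy sketch assuming a square grid; the aperture size and carrier frequency here are hypothetical, not the paper's settings.

```python
import numpy as np

def asm_propagate(field, wavelength, dx, z):
    """Angular spectrum method: propagate a complex field a distance z,
    used in place of the Rayleigh-Sommerfeld integral for speed."""
    n = field.shape[0]                     # assumes a square n x n grid
    fx = np.fft.fftfreq(n, d=dx)
    FX, FY = np.meshgrid(fx, fx)
    # Free-space transfer function; evanescent components are cut off.
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    phase = (2 * np.pi / wavelength) * z * np.sqrt(np.maximum(arg, 0.0))
    H = np.where(arg > 0, np.exp(1j * phase), 0)
    return np.fft.ifft2(np.fft.fft2(field) * H)

# Toy layer-to-layer hop inside a D2NN: 64x64 aperture at a 28 GHz carrier.
lam = 3e8 / 28e9                           # wavelength in meters
u0 = np.ones((64, 64), dtype=complex)      # uniform illumination
u1 = asm_propagate(u0, lam, dx=lam / 2, z=10 * lam)
```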
Authors:Yuya Kobayashi, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji
Abstract:
Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale large text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce the cost, one promising direction is the incorporation of pre-trained models. The existing method of utilizing pre-trained models for a generator significantly reduces the training cost compared with other large-scale GANs, but we found that the model loses generation diversity for a given prompt by a large margin. To build an efficient and high-fidelity text-to-image GAN without this compromise, we propose to use two specialized discriminators with Slicing Adversarial Networks (SANs) adapted for text-to-image tasks. Our proposed model, called SCAD, shows a notable enhancement in diversity for a given prompt with better sample fidelity. We also propose a metric called Per-Prompt Diversity (PPD) to evaluate the diversity of text-to-image models quantitatively. SCAD achieves a zero-shot FID competitive with the latest large-scale GANs at two orders of magnitude less training cost.
Chinese: 近期大规模文本到图像生成对抗网络通过引入预训练模型降低了训练成本,但往往牺牲生成多样性,因此提出SCAD模型,采用专用判别器和切片对抗网络,在显著降低训练成本的同时提升了生成多样性和保真度。
English: Recent large-scale text-to-image GANs reduce training costs by incorporating pre-trained models but often sacrifice generation diversity, prompting the proposal of SCAD, which uses specialized discriminators with SANs to enhance both diversity and fidelity at significantly lower training costs.
Authors:Artemis Panagopoulou, Le Xue, Honglu Zhou, Silvio Savarese, Ran Xu, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, Juan Carlos Niebles
Abstract:
Real-world decision-making often begins with identifying which modality contains the most relevant information for a given query. While recent multimodal models have made impressive progress in processing diverse inputs, it remains unclear whether they can reason contrastively across multiple modalities to select the one that best satisfies a natural language prompt. We argue this capability is foundational, especially in retrieval-augmented and decision-time contexts, where systems must evaluate multiple signals and identify which one conveys the relevant information. To evaluate this skill, we introduce Contra4, a dataset for contrastive cross-modal reasoning across four modalities: image, audio, video, and 3D. Each example presents a natural language question alongside multiple candidate modality instances, and the model must select the one that semantically aligns with the prompt. Contra4 combines human-annotated captions with a mixture-of-models round-trip-consistency filter to ensure high-quality supervision, resulting in 174k training examples and a manually verified test set of 2.3k samples. While task-specific fine-tuning improves performance by 56% relative to the baseline, state-of-the-art models still achieve only 56% absolute accuracy overall and 42% in four-modality settings, underscoring a significant limitation in current multimodal models.
Chinese: 本文提出了Contra4数据集,用于评估跨四种模态的对比性跨模态推理能力,结果表明尽管经过特定任务优化,当前多模态模型在根据自然语言提示选择最相关模态方面仍存在显著不足。
English: This paper introduces Contra4, a dataset designed to evaluate contrastive cross-modal reasoning across four modalities, revealing that current multimodal models still struggle significantly with selecting the most relevant modality based on natural language prompts despite task-specific improvements.
Authors:Xinyi Liu, Lipeng Ma, Yixuan Li, Weidong Yang, Qingyuan Zhou, Jiayi Song, Shuhao Li, Ben Fei
Abstract:
Large Language Models (LLMs) are widely used across various scenarios due to their exceptional reasoning capabilities and natural language understanding. While LLMs demonstrate strong performance in tasks involving mathematics and coding, their effectiveness diminishes significantly when applied to chemistry-related problems. Chemistry problems typically involve long and complex reasoning steps, which contain specific terminology, including specialized symbol systems and complex nomenclature conventions. These characteristics often cause general LLMs to experience hallucinations during the reasoning process due to their lack of specific knowledge. However, existing methods struggle to effectively leverage chemical expertise and formulas. Moreover, current uncertainty estimation methods, designed to mitigate potential reasoning errors, are unable to precisely identify specific steps or key knowledge. In this work, we propose a novel framework called ChemAU, which incorporates an adaptive uncertainty estimation method that applies different uncertainty values based on the position of reasoning steps within the whole reasoning chain. Leveraging this method, ChemAU identifies gaps in chemistry knowledge and precisely supplements chemical expertise with a specialized domain model, thereby correcting and updating the previously flawed reasoning chain. Our experiments with three popular LLMs across three chemistry datasets demonstrate that ChemAU significantly enhances both reasoning accuracy and uncertainty estimation.
中文: 大语言模型在处理化学问题时因专业术语和复杂推理链而表现不佳,但提出的ChemAU框架通过自适应不确定性估计和补充领域知识,显著提升了推理准确性。
English: Large Language Models struggle with chemistry problems due to complex terminology and reasoning chains, but the proposed ChemAU framework enhances accuracy by adaptively estimating uncertainty and supplementing domain knowledge.
Authors:Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov
Abstract:
Previous research has sought to enhance the graph reasoning capabilities of LLMs by supervised fine-tuning on synthetic graph data. While these efforts led to specialized LLMs better at solving graph algorithm problems, we don't need LLMs for shortest path: we need generalization from synthetic graph data to real-world tasks with implicit graph structures. In this work, we propose to unlock generalizable graph learning through post-training alignment on synthetic data. We first design solution-based and process-based rewards for synthetic graph problems: instead of rigidly memorizing response patterns as in direct fine-tuning, we posit that post-training alignment would help LLMs grasp the essentials underlying graph reasoning and alleviate overfitting on synthetic data. We employ post-training alignment algorithms such as GRPO and DPO, aligning both off-the-shelf LLMs and LLMs fine-tuned on synthetic graph data. We then compare them against existing settings on both in-domain synthetic tasks and out-of-domain real-world tasks with implicit graph structures such as multi-hop QA, structured planning, and more. Extensive experiments demonstrate that our post-training alignment recipe leads to statistically significant improvements on 5 datasets, with an average gain of 12.9% over baseline settings. Further analysis reveals that process-based rewards consistently outperform solution-based rewards on synthetic data but not on real-world tasks, and that compositionality and explainable intermediate steps remain a critical challenge even after post-training alignment.
Chinese: 本研究提出通过合成图数据的后训练对齐来增强大语言模型从合成数据到隐含图结构现实任务的泛化能力,在多个数据集上取得显著提升,同时揭示了组合性与可解释推理方面的持续挑战。
English: This study proposes using post-training alignment with synthetic graph data to enhance LLMs' generalization from synthetic to real-world tasks with implicit graph structures, achieving significant improvements across multiple datasets while highlighting challenges in compositionality and explainable reasoning.
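To make the solution-based versus process-based distinction concrete, here is a hedged sketch of both reward types for one synthetic task (shortest path). The exact reward shaping is not specified in the abstract; networkx serves only as a convenient graph oracle, and the 0.5 destination bonus is an arbitrary illustrative choice.

```python
import networkx as nx

def solution_reward(graph, src, dst, answer_length):
    """Solution-based: reward only the correctness of the final answer."""
    return 1.0 if answer_length == nx.shortest_path_length(graph, src, dst) else 0.0

def process_reward(graph, src, dst, claimed_path):
    """Process-based: reward the validity of intermediate reasoning steps."""
    if not claimed_path or claimed_path[0] != src:
        return 0.0
    valid = sum(1 for u, v in zip(claimed_path, claimed_path[1:])
                if graph.has_edge(u, v))              # each step must be an edge
    score = valid / max(len(claimed_path) - 1, 1)
    return score + (0.5 if claimed_path[-1] == dst else 0.0)  # reached-dst bonus

g = nx.path_graph(5)                                   # 0-1-2-3-4
print(solution_reward(g, 0, 4, 4))                     # 1.0
print(process_reward(g, 0, 4, [0, 1, 2, 3, 4]))        # 1.5 (all steps valid)
```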
Authors:Zifeng Zhu, Shangbin Feng, Herun Wan, Ningnan Wang, Minnan Luo, Yulia Tsvetkov
Abstract:
We propose GuessBench, a novel benchmark that evaluates Vision Language Models (VLMs) on modeling the pervasive, noisy, and pluralistic human creativity. GuessBench sources data from "Guess the Build", an online multiplayer Minecraft minigame where one player constructs a Minecraft build given a concept (e.g. caterpillar) and others try to guess it with natural language hints, presenting a pristine testbed for sensemaking creativity in the wild with VLMs acting as guessers. We curate 1500 images from the actual gameplay and design 2000 problems spanning static and dynamic image settings, natural language hints of varying completeness, and more. Extensive experiments with six open/API VLMs and five reasoning enhancement approaches demonstrate that GuessBench presents a uniquely challenging task in creativity modeling: even the state-of-the-art GPT-4o is incorrect on 34% of instances, while we observe a huge performance gap (13.87% vs. 53.93% on average) between open and API models. When used as a resource to improve VLMs, fine-tuning on the reasoning traces for GuessBench problems improves visual perception tasks by 15.36% on average. Further analysis reveals that VLM performance in creativity sensemaking correlates with the frequency of the concept in training data, while the accuracy drops sharply for concepts in underrepresented cultural contexts and low-resource languages.
中文摘要:GuessBench是一个新颖的基准测试,通过《我的世界》"猜建筑"游戏数据评估视觉语言模型对人类创造力的建模能力,揭示了模型间的显著性能差距,并证明基于其推理轨迹的微调可使视觉感知任务平均提升15.36%。
English Summary: GuessBench is a novel benchmark that evaluates Vision Language Models' ability to model human creativity using data from Minecraft's "Guess the Build" game, revealing significant performance gaps between models and showing fine-tuning on its reasoning traces improves visual perception tasks by 15.36%.
Authors:Shangbin Feng, Yike Wang, Weijia Shi, Yulia Tsvetkov
Abstract:
We propose Data Swarms, an algorithm to optimize the generation of synthetic evaluation data and advance quantitative desiderata of LLM evaluation. We first train a swarm of initial data generators using existing data, and define various evaluation objectives to reflect the desired properties of evaluation (e.g., generate more difficult problems for the evaluated models) and quantitatively evaluate data generators. We then employ particle swarm optimization to optimize the swarm of data generators, where they collaboratively search through the model parameter space to find new generators that advance these objectives. We further extend it to Adversarial Swarms, where the data generator swarm generates harder data while the test taker model swarm learns from such data, co-evolving dynamically for better data and models simultaneously. Extensive experiments demonstrate that Data Swarms outperforms eight data generation baselines across five evaluation objectives, while Adversarial Swarms produce more robust learning of synthetic data and stronger generalization. Further analysis reveals that Data Swarms successfully optimizes compositions of multiple evaluation objectives and generalizes to new off-the-shelf LLMs, unseen at optimization time.
Chinese: Data Swarms 算法通过粒子群优化技术优化合成评估数据的生成,提升数据质量和模型鲁棒性,其扩展版本 Adversarial Swarms 则通过数据生成器与测试模型的协同进化,实现两者的同步增强。
English: Data Swarms is an algorithm that optimizes synthetic data generation for LLM evaluation using particle swarm optimization, enhancing data quality and model robustness, while its extension, Adversarial Swarms, co-evolves data generators and test models for simultaneous improvement.
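The optimizer named in the abstract is standard particle swarm optimization, applied to (flattened) data-generator parameters. A minimal sketch, assuming a scalar evaluation objective to be maximized; the swarm size, inertia, and acceleration constants are generic defaults rather than the paper's settings.

```python
import numpy as np

def particle_swarm(objective, init_particles, iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Vanilla PSO over parameter vectors (maximization)."""
    rng = np.random.default_rng(seed)
    x = np.array(init_particles, dtype=float)          # (n_particles, dim)
    v = np.zeros_like(x)
    pbest = x.copy()                                   # per-particle best position
    pbest_val = np.array([objective(p) for p in x])
    gbest = pbest[np.argmax(pbest_val)].copy()         # swarm-wide best
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.array([objective(p) for p in x])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmax(pbest_val)].copy()
    return gbest

# Toy objective standing in for an evaluation desideratum (e.g., how difficult
# the generated problems are for the evaluated models).
best = particle_swarm(lambda p: -np.sum((p - 3.0) ** 2),
                      init_particles=np.random.randn(8, 4))
```

In the paper's setting, each particle would be a data generator's weights and the objective a quantitative evaluation desideratum; the toy quadratic above just exercises the update rule.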
Authors:Shenghe Zheng, Qianjia Cheng, Junchi Yao, Mengsong Wu, Haonan He, Ning Ding, Yu Cheng, Shuyue Hu, Lei Bai, Dongzhan Zhou, Ganqu Cui, Peng Ye
Abstract:
Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to address this issue. Specifically, PHYSICS is curated with exercises from over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model's physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in the physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics.
中文: 本文提出了PHYSICS数据集,包含16,568个涵盖多学科和难度级别的物理问题,并设计了专门的规则+模型评估框架,以解决当前大语言模型在物理推理方面的不足。
English: This paper introduces PHYSICS, a comprehensive dataset of 16,568 physics problems across subjects and difficulty levels, along with a tailored Rule+Model evaluation framework to address current LLMs' limitations in physical reasoning.
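The Rule+Model idea (deterministic rules first, with a model judge only for cases the rules cannot decide) can be sketched as follows. The numeric-with-tolerance rule and the model_judge callback are hypothetical simplifications, not the paper's actual criteria.

```python
import re

NUM = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"

def rule_check(pred, gold, rel_tol=1e-2):
    """Rule stage: compare the leading numeric values, ignoring units/format.
    Returns True/False when decidable, None when rules cannot decide."""
    mp, mg = re.search(NUM, pred), re.search(NUM, gold)
    if mp is None or mg is None:
        return None
    p, g = float(mp.group()), float(mg.group())
    return abs(p - g) < rel_tol if g == 0 else abs(p - g) / abs(g) < rel_tol

def judge(pred, gold, model_judge):
    """Rule+Model: cheap rules first; fall back to a (costlier) model judge."""
    verdict = rule_check(pred, gold)
    return verdict if verdict is not None else model_judge(pred, gold)

print(judge("v = 9.81 m/s^2", "9.8 m/s^2", model_judge=lambda p, g: False))  # True
```

This ordering keeps evaluation efficient: the expensive judge (e.g., an LLM prompted to compare a prediction against a reference) only sees answers that resist simple normalization.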
Authors:Jiahe Chen, Jiaying He, Qian Shao, Qiyuan Chen, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination: the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. At the decoding phase, DLC step-wise employs CLIP to assess the semantic alignment between the input image and the generated text sequence. Then, the Relative Visual Advantage (RVA) of candidate tokens is evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, carefully balances the visual guidance while ensuring the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution to mitigate hallucinations, thereby enhancing the reliability of LVLMs for broader practical use. Code will be released on GitHub.
中文: 本文提出动态对数校准(DLC)方法,通过推理时动态对齐文本生成与视觉证据,有效减少大型视觉语言模型的幻觉现象,在多种基准测试中显著提升模型可靠性并保持高效推理。
English: This paper introduces Dynamic Logits Calibration (DLC), a training-free decoding framework that reduces hallucinations in Large Vision-Language Models by dynamically aligning text generation with visual evidence during inference, improving reliability and efficiency across multiple benchmarks.
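A hedged sketch of the decoding-time adjustment described above: score top-k candidate continuations with CLIP, subtract a dynamically updated contextual baseline to obtain the Relative Visual Advantage, and shift the logits toward visually grounded tokens. The tokenizer and clip_score interfaces are stand-ins, and the adaptive weighting is collapsed into a fixed alpha for brevity.

```python
import torch

def dlc_step(logits, generated_text, tokenizer, clip_score, baseline, k=10, alpha=2.0):
    """One decoding step, DLC-style (illustrative sketch).
    clip_score(text) -> image-text similarity for the current input image;
    baseline is a running average of recent alignment scores."""
    topk = torch.topk(logits, k)
    adjusted = logits.clone()
    for idx in topk.indices.tolist():
        candidate = generated_text + tokenizer.decode([idx])
        rva = clip_score(candidate) - baseline         # Relative Visual Advantage
        adjusted[idx] = logits[idx] + alpha * rva      # favor grounded tokens
    return adjusted

# Toy usage with stub components.
vocab = {0: " cat", 1: " dog", 2: " car"}
class StubTokenizer:
    def decode(self, ids): return vocab[ids[0]]
logits = torch.tensor([1.0, 1.2, 0.8])
adjusted = dlc_step(logits, "A photo of a", StubTokenizer(),
                    clip_score=lambda t: 0.9 if "cat" in t else 0.5,
                    baseline=0.6, k=3)
```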
Authors:Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
Abstract:
We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.
Chinese: 本研究提出了一种逐步式视频到音频生成方法,按顺序为视频中的每个声音事件生成独立音轨;每一步以目标文本提示和已生成音轨为条件进行引导式合成,并借助预训练模型进行训练而无需专门的配对数据集,最终合成的复合音频质量优于现有基线。
English: This study introduces a step-by-step video-to-audio generation method that sequentially produces one audio track per sound event in the video, conditioning each guided synthesis step on a target text prompt and previously generated tracks, and trains on accessible data by leveraging pre-trained video-to-audio models, yielding higher-quality composite audio than existing baselines.
Authors:Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
Abstract:
We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach aims to comprehensively capture all sound events induced by a video through the incremental generation of missing sound events. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of existing sounds. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from adjacent segments of the same video, allowing training with standard single-reference audiovisual datasets that are easily accessible. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall quality of the final composite audio, outperforming existing baselines.
Chinese: 本研究提出了一种分步式视频到音频生成方法,通过逐步合成缺失声音来提升控制性和真实感,利用负向引导避免重复,并基于标准数据集进行训练,显著改善了音频质量和可分离性。
English: This study introduces a step-by-step video-to-audio generation method that enhances controllability and realism by incrementally synthesizing missing sounds, avoiding duplication through negative guidance and leveraging standard datasets for training.
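Both versions above rest on negatively guided generation that discourages duplicating sounds already present in earlier tracks. In the spirit of classifier-free guidance and concept negation, one denoiser step might combine predictions as in the sketch below; the weights and the exact conditioning scheme are assumptions, not the authors' formulation.

```python
import torch

def guided_noise_pred(eps_uncond, eps_cond, eps_prev, w_pos=3.0, w_neg=1.0):
    """Combine denoiser outputs: push toward the target sound event (text/video
    condition) and away from previously generated audio tracks."""
    return (eps_uncond
            + w_pos * (eps_cond - eps_uncond)    # positive guidance: target event
            - w_neg * (eps_prev - eps_uncond))   # negative guidance: no duplicates

# Tensors stand in for latent-audio noise predictions at one diffusion step.
e_u, e_c, e_p = (torch.randn(1, 8, 128) for _ in range(3))
eps = guided_noise_pred(e_u, e_c, e_p)
```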
Authors:Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, Jiangmiao Pang
Abstract:
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action token prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, which adapts the prediction of the vision-language backbone from discrete action tokens to motion features during post-training and aggregates motion features from historical frames into a feature chunk; (3) cross-frame decoding, which maps the feature chunk to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with a 70.9% success rate, and a 12.7% improvement over OpenVLA on LIBERO. Real-world Franka experiments also demonstrate strong performance and robustness.
中文:CronusVLA通过历史帧运动特征编码和跨注意力解码,将单帧视觉语言动作模型高效扩展至多帧观测框架,在仿真与真实机器人任务中均实现了最先进的性能表现。
English: CronusVLA is a unified framework that efficiently extends single-frame vision-language-action models to multi-frame observation by encoding motion features from historical frames and decoding them through cross-attention, achieving state-of-the-art performance in both simulated and real-world robotic tasks.
Authors:Junyoung Seo, Jisang Han, Jaewoo Jung, Siyoon Jin, Joungbin Lee, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, Seungryong Kim, Yuki Mitsufuji
Abstract:
We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditional reconstruction methods struggle with extreme trajectory changes, and existing generative models for dynamic novel view synthesis cannot handle in-the-wild videos. Our approach consists of two steps: estimating temporally consistent geometry, and generative rendering guided by this geometry. By integrating geometric priors, the generative model focuses on synthesizing realistic details where the estimated geometry is uncertain. We eliminate the need for extensive 4D training data through a factorized fine-tuning framework that separately trains spatial and temporal components using multi-view image and video data. Our method outperforms baselines in producing plausible videos from novel camera trajectories, especially in extreme extrapolation scenarios on real-world footage.
中文: Vid-CamEdit是一种创新框架,通过结合几何估计与生成式渲染,能够沿自定义相机路径重新合成视频,无需大量4D训练数据即可有效处理极端轨迹变化。
English: Vid-CamEdit is a novel framework that enables video re-synthesis along custom camera paths by combining geometric estimation with generative rendering, effectively handling extreme trajectory changes without requiring extensive 4D training data.
Authors:Xixian Yong, Jianxun Lian, Xiaoyuan Yi, Xiao Zhou, Xing Xie
Abstract:
Large language models (LLMs) have been widely adopted as the core of agent frameworks in various scenarios, such as social simulations and AI companions. However, the extent to which they can replicate human-like motivations remains an underexplored question. Existing benchmarks are constrained by simplistic scenarios and the absence of character identities, resulting in an information asymmetry with real-world situations. To address this gap, we propose MotiveBench, which consists of 200 rich contextual scenarios and 600 reasoning tasks covering multiple levels of motivation. Using MotiveBench, we conduct extensive experiments on seven popular model families, comparing different scales and versions within each family. The results show that even the most advanced LLMs still fall short in achieving human-like motivational reasoning. Our analysis reveals key findings, including the difficulty LLMs face in reasoning about "love & belonging" motivations and their tendency toward excessive rationality and idealism. These insights highlight a promising direction for future research on the humanization of LLMs. The dataset, benchmark, and code are available at https://aka.ms/motivebench.
中文摘要:MotiveBench作为评估大型语言模型模拟人类动机推理能力的综合基准,揭示了即使先进模型在情感动机理解上仍存在不足,并表现出过度理性化的倾向。
English Summary: MotiveBench is introduced as a comprehensive benchmark to evaluate large language models' ability to replicate human-like motivational reasoning, revealing that even advanced models struggle with emotional motivations and exhibit excessive rationality.
Authors:Yuzhou Yang, Yangming Zhou, Zhiying Zhu, Zhenxing Qian, Xinpeng Zhang, Sheng Li
Abstract:
The proliferation of deceptive content online necessitates robust Fake News Detection (FND) systems. While evidence-based approaches leverage external knowledge to verify claims, existing methods face critical limitations: noisy evidence selection, generalization bottlenecks, and unclear decision-making processes. Recent efforts to harness Large Language Models (LLMs) for FND introduce new challenges, including hallucinated rationales and conclusion bias. To address these issues, we propose RoE-FND (Reason on Experiences FND), a framework that reframes evidence-based FND as a logical deduction task by synergizing LLMs with experiential learning. RoE-FND encompasses two stages: (1) self-reflective knowledge building, where a knowledge base is curated by analyzing past reasoning errors, namely the exploration stage, and (2) dynamic criterion retrieval, which synthesizes task-specific reasoning guidelines from historical cases as experiences during deployment. It further cross-checks rationales against internal experience through a devised dual-channel procedure. Key contributions include: a case-based reasoning framework for FND that addresses multiple existing challenges, a training-free approach enabling adaptation to evolving situations, and empirical validation of the framework's superior generalization and effectiveness over state-of-the-art methods across three datasets.
中文:RoE-FND框架通过将大语言模型与经验学习相结合,将假新闻检测重构为逻辑推理任务,采用无需训练的案例推理方法,在三个数据集上验证了其优于现有技术的泛化能力和有效性。
English: The RoE-FND framework addresses limitations in fake news detection by integrating large language models with experiential learning, reframing it as a logical deduction task and demonstrating superior generalization across datasets through a training-free, case-based reasoning approach.
Authors:Ning Gao, Yilun Chen, Shuai Yang, Xinyi Chen, Yang Tian, Hao Li, Haifeng Huang, Hanqing Wang, Tai Wang, Jiangmiao Pang
Abstract:
Robotic manipulation in real-world settings remains challenging, especially regarding robust generalization. Existing simulation platforms lack sufficient support for exploring how policies adapt to varied instructions and scenarios. Thus, they lag behind the growing interest in instruction-following foundation models like LLMs, whose adaptability is crucial yet remains underexplored in fair comparisons. To bridge this gap, we introduce GenManip, a realistic tabletop simulation platform tailored for policy generalization studies. It features an automatic pipeline via LLM-driven task-oriented scene graph to synthesize large-scale, diverse tasks using 10K annotated 3D object assets. To systematically assess generalization, we present GenManip-Bench, a benchmark of 200 scenarios refined via human-in-the-loop corrections. We evaluate two policy types: (1) modular manipulation systems integrating foundation models for perception, reasoning, and planning, and (2) end-to-end policies trained through scalable data collection. Results show that while data scaling benefits end-to-end methods, modular systems enhanced with foundation models generalize more effectively across diverse scenarios. We anticipate this platform to facilitate critical insights for advancing policy generalization in realistic conditions. Project Page: https://genmanip.axi404.top/.
中文摘要:GenManip仿真平台通过构建大规模多样化任务场景与标准化测评基准,揭示了融合基础模型的模块化系统相比端到端策略在机器人操作泛化能力上的显著优势。
English Summary: The GenManip simulation platform addresses robotic manipulation generalization challenges by enabling systematic evaluation of modular foundation model systems and end-to-end policies, demonstrating superior generalization with enhanced foundation models.
Authors:Sooyung Choi, Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Xing Xie, JinYeong Bak
Abstract:
The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the "black box" of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.
中文: 本研究揭示了与价值观对齐的大语言模型更容易因真实遵循价值观而引发安全风险,并提出了情境对齐方法来增强这类模型的安全性。
English: This study identifies safety risks in value-aligned large language models, revealing they are more prone to harmful behaviors due to genuine value-based text generation, and proposes in-context alignment methods to mitigate these risks.
Authors:Rui Zhang, Yuanbo Wen, Shuyao Cheng, Di Huang, Shaohui Peng, Jiaming Guo, Pengwei Jin, Jiacheng Zhao, Tianrui Ma, Yaoyu Zhu, Yifan Hao, Yongwei Zhao, Shengwen Liang, Ying Wang, Xing Hu, Zidong Du, Huimin Cui, Ling Li, Qi Guo, Yunji Chen
Abstract:
Processor chip design technology serves as a key frontier driving breakthroughs in computer science and related fields. With the rapid advancement of information technology, conventional design paradigms face three major challenges: the physical constraints of fabrication technologies, the escalating demands for design resources, and the increasing diversity of ecosystems. Automated processor chip design has emerged as a transformative solution to address these challenges. While recent breakthroughs in Artificial Intelligence (AI), particularly Large Language Models (LLMs) techniques, have opened new possibilities for fully automated processor chip design, substantial challenges remain in establishing domain-specific LLMs for processor chip design.
In this paper, we propose QiMeng, a novel system for fully automated hardware and software design of processor chips. QiMeng comprises three hierarchical layers. In the bottom layer, we construct a domain-specific Large Processor Chip Model (LPCM) that introduces novel designs in architecture, training, and inference to address key challenges such as the knowledge representation gap, data scarcity, correctness assurance, and the enormous solution space. In the middle layer, leveraging the LPCM's knowledge representation and inference capabilities, we develop the Hardware Design Agent and the Software Design Agent to automate the design of hardware and software for processor chips. Currently, several components of QiMeng have been completed and successfully applied in various top-layer applications, demonstrating significant advantages and providing a feasible solution for efficient, fully automated hardware/software design of processor chips. Future research will focus on integrating all components and performing iterative top-down and bottom-up design processes to establish a comprehensive QiMeng system.
中文摘要:本文提出启梦系统,通过构建领域专用大模型实现处理器芯片硬件与软件的自动化设计,部分组件已成功应用,为解决全流程自动化设计提供了可行方案。
English Summary: The paper introduces QiMeng, an automated system for processor chip design that employs a domain-specific large model to address hardware and software design challenges, with partial components already demonstrating practical success.
Authors:Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang
Abstract:
Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO) - a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.
中文: 近期慢思考语言模型在复杂推理任务中表现出色,但将其扩展至多模态模型因重训练成本高昂而面临挑战,为此提出的RACRO方法通过强化学习使视觉提取与推理目标对齐,显著提升了性能与可扩展性。
English: Recent advances in slow-thinking language models have shown strong reasoning capabilities, but extending these to multi-modal models is challenging due to high retraining costs, leading to the proposed RACRO method that aligns visual extraction with reasoning objectives through reinforcement learning for improved performance and scalability.
Authors:Sicong Han, Chenhao Lin, Zhengyu Zhao, Xiyuan Wang, Xinlei He, Qian Li, Cong Wang, Qian Wang, Chao Shen
Abstract:
Adversarial detection protects models from adversarial attacks by refusing suspicious test samples. However, current detection methods often suffer from weak generalization: their effectiveness tends to degrade significantly when applied to adversarially trained models rather than naturally trained ones, and they generally struggle to achieve consistent effectiveness across both white-box and black-box attack settings. In this work, we observe that an auxiliary model, differing from the primary model in training strategy or model architecture, tends to assign low confidence to the primary model's predictions on adversarial examples (AEs), while preserving high confidence on normal examples (NEs). Based on this discovery, we propose the Prediction Inconsistency Detector (PID), a lightweight and generalizable detection framework that distinguishes AEs from NEs by capturing the prediction inconsistency between the primary and auxiliary models. PID is compatible with both naturally and adversarially trained primary models and outperforms four detection methods across 3 white-box, 3 black-box, and 1 mixed adversarial attacks. Specifically, PID achieves average AUC scores of 99.29% and 99.30% on CIFAR-10 when the primary model is naturally and adversarially trained, respectively, and 98.31% and 96.81% on ImageNet under the same conditions, outperforming existing SOTAs by 4.70% to 25.46%.
Chinese: 提出的预测不一致检测器(PID)利用主模型与辅助模型之间的预测差异来有效区分对抗样本和正常样本,在各种攻击场景和模型训练方法下均实现了卓越的检测性能。
English: The proposed Prediction Inconsistency Detector (PID) leverages prediction discrepancies between primary and auxiliary models to effectively distinguish adversarial examples from normal ones, achieving superior detection performance across various attack scenarios and model training methods.
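The detection rule is simple enough to sketch directly: take the primary model's predicted class, read off the auxiliary model's confidence in that class, and refuse the sample when that confidence falls below a threshold. The threshold and the toy linear models are placeholders; in practice the auxiliary model differs from the primary in training strategy or architecture.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pid_flag(primary, auxiliary, x, tau=0.5):
    """Flag inputs as adversarial when the auxiliary model assigns low
    confidence to the primary model's predicted class (minimal sketch)."""
    pred = primary(x).argmax(dim=1)                        # primary's decision
    aux_conf = F.softmax(auxiliary(x), dim=1)              # auxiliary confidences
    agree = aux_conf.gather(1, pred.unsqueeze(1)).squeeze(1)
    return agree < tau                                     # True -> refuse sample

primary = torch.nn.Linear(32, 10)      # stand-ins for the two classifiers
auxiliary = torch.nn.Linear(32, 10)
flags = pid_flag(primary, auxiliary, torch.randn(4, 32))
```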
Authors:Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, Pavlo Molchanov
Abstract:
Large language models (LLMs) are often praised for exhibiting near-human performance on a wide range of tasks and valued for their ability to hold a general conversation. The rise of agentic AI systems is, however, ushering in a mass of applications in which language models perform a small number of specialized tasks repetitively and with little variation. Here we lay out the position that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI. Our argumentation is grounded in the current level of capabilities exhibited by SLMs, the common architectures of agentic systems, and the economy of LM deployment. We further argue that in situations where general-purpose conversational abilities are essential, heterogeneous agentic systems (i.e., agents invoking multiple different models) are the natural choice. We discuss the potential barriers for the adoption of SLMs in agentic systems and outline a general LLM-to-SLM agent conversion algorithm. Our position, formulated as a value statement, highlights the significance of the operational and economic impact that even a partial shift from LLMs to SLMs will have on the AI agent industry. We aim to stimulate the discussion on the effective use of AI resources and hope to advance the efforts to lower the costs of AI of the present day. Calling for both contributions to and critique of our position, we commit to publishing all such correspondence at https://research.nvidia.com/labs/lpr/slm-agents.
中文: 大型语言模型在代理AI系统中处理重复性专门任务时显得大材小用,因此更经济的小型模型将成为此类应用的主流,而对话需求则适合采用多模型系统。
English: Large language models are overqualified for repetitive specialized tasks in agentic AI systems, making smaller, more economical models the future for such applications while reserving conversational needs for multi-model systems.
Authors:Jinhong Wang, Shuo Tong, Jian liu, Dongqi Tang, Jintai Chen, Haochao Ying, Hongxia Xu, Danny Chen, Jian Wu
Abstract:
Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) underperform in such visual rating ability while also suffering from a lack of relevant datasets and benchmarks. In this work, we collect and present STORM, a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating. STORM encompasses 14 ordinal regression datasets across five common visual rating domains, comprising 655K image-level pairs and the corresponding carefully curated VQAs. Importantly, we also propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides interpretable thoughts, providing MLLMs with a general and trustworthy ordinal thinking paradigm. This benchmark aims to evaluate the all-in-one and zero-shot performance of MLLMs in scenarios requiring understanding of the essential common ordinal relationships of rating labels. Extensive experiments demonstrate the effectiveness of our framework and shed light on better fine-tuning strategies. The STORM dataset, benchmark, and pre-trained models are freely available at the project page to support further research in this area: https://storm-bench.github.io/.
中文: 本文提出了STORM数据集与基准,旨在提升多模态大语言模型在五个领域的视觉排序能力,包含65.5万图像-文本对及一种新颖的从粗到细处理流程,以增强模型在序数回归任务中的可信表现。
English: This paper introduces STORM, a comprehensive dataset and benchmark designed to enhance the visual rating capabilities of multi-modal large language models (MLLMs) in ordinal regression tasks across five domains, featuring 655K image-text pairs and a novel coarse-to-fine processing pipeline for trustworthy performance.
Authors:Petros Raptopoulos, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou
Abstract:
Contract review is a complex and time-intensive task that typically demands specialized legal expertise, rendering it largely inaccessible to non-experts. Moreover, legal interpretation is rarely straightforward: ambiguity is pervasive, and judgments often hinge on subjective assessments. Compounding these challenges, contracts are usually confidential, restricting their use with proprietary models and necessitating reliance on open-source alternatives. To address these challenges, we introduce PAKTON: a fully open-source, end-to-end, multi-agent framework with plug-and-play capabilities. PAKTON is designed to handle the complexities of contract analysis through collaborative agent workflows and a novel retrieval-augmented generation (RAG) component, enabling automated legal document review that is more accessible, adaptable, and privacy-preserving. Experiments demonstrate that PAKTON outperforms both general-purpose and pretrained models in predictive accuracy, retrieval performance, explainability, completeness, and grounded justifications as evaluated through a human study and validated with automated metrics.
中文: PAKTON是一个开源的多智能体框架,通过协作工作流和检索增强生成技术提升合同审查的可用性与隐私保护,在准确性和可解释性上优于现有模型。
English: PAKTON is an open-source multi-agent framework that enhances contract review accessibility and privacy through collaborative workflows and retrieval-augmented generation, outperforming existing models in accuracy and explainability.
Authors:Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang
Abstract:
The choice of a suitable visual language projector (VLP) is critical to the successful training of large visual language models (LVLMs). Mainstream VLPs can be broadly categorized into compressed and uncompressed projectors, each offering distinct advantages in performance and computational efficiency. However, their security implications have not been thoroughly examined. Our comprehensive evaluation reveals significant differences in their security profiles: compressed projectors exhibit substantial vulnerabilities, allowing adversaries to successfully compromise LVLMs even with minimal knowledge of structural information. In stark contrast, uncompressed projectors demonstrate robust security properties and do not introduce additional vulnerabilities. These findings provide critical guidance for researchers in selecting optimal VLPs that enhance the security and reliability of visual language models. The code will be released.
中文: 研究发现,压缩型视觉语言投影器会给大型视觉语言模型带来严重安全漏洞,而非压缩型投影器具备稳健安全性,为选择安全的投影器提供了关键指导。
English: The study finds that compressed visual language projectors (VLPs) introduce significant security vulnerabilities in large visual language models (LVLMs), while uncompressed projectors maintain robust security, offering crucial guidance for selecting secure VLPs.
Authors:Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
Abstract:
Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.
Chinese: TaP框架通过结构化分类法实现自动化、可扩展的多语言偏好数据集生成,使用该数据集微调的大语言模型性能优于基于现有大180倍开源数据集训练的模型。
English: The TaP framework enables automated, scalable generation of multilingual preference datasets through a structured taxonomy, allowing LLMs fine-tuned with these datasets to outperform models trained on significantly larger existing resources.
Authors:Hongyi Pan, Ziliang Hong, Gorkem Durak, Ziyue Xu, Ulas Bagci
Abstract:
Federated learning (FL) has emerged as a promising paradigm for collaboratively training deep learning models across institutions without exchanging sensitive medical data. However, its effectiveness is often hindered by limited data availability and non-independent, identically distributed data across participating clients, which can degrade model performance and generalization. To address these challenges, we propose a generative AI based data augmentation framework that integrates synthetic image sharing into the federated training process for breast cancer diagnosis via ultrasound images. Specifically, we train two simple class-specific Deep Convolutional Generative Adversarial Networks: one for benign and one for malignant lesions. We then simulate a realistic FL setting using three publicly available breast ultrasound image datasets: BUSI, BUS-BRA, and UDIAT. FedAvg and FedProx are adopted as baseline FL algorithms. Experimental results show that incorporating a suitable number of synthetic images improved the average AUC from 0.9206 to 0.9237 for FedAvg and from 0.9429 to 0.9538 for FedProx. We also note that excessive use of synthetic data reduced performance, underscoring the importance of maintaining a balanced ratio of real and synthetic samples. Our findings highlight the potential of generative AI based data augmentation to enhance FL results in the breast ultrasound image classification task.
中文: 本研究提出了一种基于生成式人工智能的数据增强框架,将合成的乳腺超声图像整合到联邦学习中,在提升乳腺癌诊断性能的同时强调了保持真实与合成数据平衡比例的重要性。
English: This study proposes a generative AI-based data augmentation framework that integrates synthetic breast ultrasound images into federated learning, demonstrating improved diagnostic performance for breast cancer while highlighting the need for balanced real-to-synthetic data ratios.
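A minimal sketch of the augmentation step, assuming an already-trained class-specific DCGAN generator: synthesize n_synth images for one class and concatenate them with a client's real data before local training. The explicit real-to-synthetic ratio matters, since the paper reports that excessive synthetic data degrades performance.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def client_loader(real_ds, generator, n_synth, label, batch_size=32, z_dim=100):
    """Mix a client's real dataset with n_synth GAN samples of one class."""
    with torch.no_grad():
        z = torch.randn(n_synth, z_dim, 1, 1)
        images = generator(z)                               # synthetic images
    synth_ds = TensorDataset(images, torch.full((n_synth,), label))
    return DataLoader(ConcatDataset([real_ds, synth_ds]),
                      batch_size=batch_size, shuffle=True)

# Stand-in generator and data: (B,100,1,1) -> (B,1,64,64) grayscale images.
gen = torch.nn.ConvTranspose2d(100, 1, kernel_size=64)
real = TensorDataset(torch.randn(100, 1, 64, 64), torch.zeros(100, dtype=torch.long))
loader = client_loader(real, gen, n_synth=25, label=0)      # 4:1 real-to-synthetic
```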
Authors:Shiyao Wang, Jiaming Zhou, Shiwan Zhao, Yong Qin
Abstract:
Dysarthric speech recognition (DSR) enhances the accessibility of smart devices for dysarthric speakers with limited mobility. Previously, DSR research was constrained by the fact that existing datasets typically consisted of isolated words, command phrases, and a limited number of sentences spoken by a few individuals. This restricted research to command-interaction systems and speaker adaptation. The Speech Accessibility Project (SAP) changed this by releasing a large and diverse English dysarthric dataset, leading to the SAP Challenge to build speaker- and text-independent DSR systems. We enhanced the Whisper model's performance on long dysarthric speech via a novel self-training method. This method increased training data and adapted the model to handle potentially incomplete speech segments encountered during inference. Our system achieved second place in both Word Error Rate and Semantic Score in the SAP Challenge.
中文摘要:言语无障碍项目发布了大规模多样化构音障碍语音数据集,推动开发独立于说话者和文本的识别系统,我们通过自训练方法增强的Whisper模型在该挑战赛中词错误率和语义得分均获第二名。
English Summary: The Speech Accessibility Project introduced a large, diverse dysarthric speech dataset, enabling the development of speaker-independent recognition systems where our enhanced Whisper model using self-training achieved second place in the SAP Challenge.
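A hedged sketch of one self-training round consistent with the description above: pseudo-label unlabeled clips with the current model, keep confident transcripts, and also pseudo-label randomly truncated copies so training reflects the potentially incomplete segments encountered at inference. transcribe_with_confidence is a hypothetical interface, not Whisper's actual API.

```python
import random

def self_training_round(model, unlabeled_clips, conf_threshold=0.9):
    """Collect pseudo-labeled (audio, text) pairs for the next fine-tuning round."""
    new_data = []
    for clip in unlabeled_clips:                  # clip: 1-D waveform array
        text, conf = model.transcribe_with_confidence(clip)  # hypothetical API
        if conf >= conf_threshold:                # keep only confident transcripts
            new_data.append((clip, text))
            # Incomplete-segment augmentation: truncate, then re-label the cut.
            cut = clip[: random.randint(len(clip) // 2, len(clip))]
            cut_text, cut_conf = model.transcribe_with_confidence(cut)
            if cut_conf >= conf_threshold:
                new_data.append((cut, cut_text))
    return new_data
```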
Authors:Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas, David P. Woodruff, Samson Zhou
Abstract:
We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $\mathcal{O}\left(\alpha\log n+\frac{\sqrt{\beta}}{\varepsilon^2}\log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
中文: 本文基于成对碰撞次数提出新的参数化方法,设计出通信高效的分布式相异元素估计算法,在改进上界的同时给出匹配下界,揭示了理论困难问题在实践中高效解决的原因。
English: This paper introduces a novel parameterization based on pairwise collisions to develop a communication-efficient protocol for distributed distinct element estimation, achieving improved bounds and providing matching lower bounds that explain practical efficiency despite theoretical hardness.
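For intuition, the collision-parameterized protocol communicates fewer bits than the worst-case bound exactly when its second term is the smaller one:

```latex
\frac{\sqrt{\beta}}{\varepsilon^{2}} \log n \;<\; \frac{\alpha}{\varepsilon^{2}}
\;\Longleftrightarrow\;
\sqrt{\beta}\,\log n \;<\; \alpha
\;\Longleftrightarrow\;
\beta \;<\; \left(\frac{\alpha}{\log n}\right)^{2}.
```

So whenever there are few pairwise collisions, i.e., $\beta < (\alpha/\log n)^2$, the $\mathcal{O}\left(\alpha\log n+\frac{\sqrt{\beta}}{\varepsilon^2}\log n\right)$ protocol beats the $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ worst case.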
Authors:Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai
Abstract:
We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine it with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that can achieve high-quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that imbues synthesized audio with a spatial presence. At the same time, in order to make up for the incomplete types and annotations of the open-source benchmark, we also open-source an industrial-level benchmark Kling-Audio-Eval. Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment and audio quality.
中文:Kling-Foley是一种多模态视频到音频生成模型,通过扩散变换器和专用模块实现高质量音频合成,在多种声音类型中提升语义和时间对齐效果。
English: Kling-Foley is a multimodal video-to-audio generation model that uses diffusion transformers and specialized modules to produce high-quality, synchronized audio with enhanced semantic and temporal alignment across various sound types.
Authors:Florian Peter Busch, Moritz Willig, Florian Guldan, Kristian Kersting, Devendra Singh Dhami
Abstract:
Not every causal relation between variables is equal, and this can be leveraged for the task of causal discovery. Recent research shows that pairs of variables with particular type assignments induce a preference on the causal direction of other pairs of variables with the same type. Although useful, this assignment of a specific type to a variable can be tricky in practice. We propose a tag-based causal discovery approach where multiple tags are assigned to each variable in a causal graph. Existing causal discovery approaches are first applied to direct some edges, which are then used to determine edge relations between tags. Then, these edge relations are used to direct the undirected edges. Doing so improves upon purely type-based relations, where the assumption of type consistency lacks robustness and flexibility due to being restricted to single types for each variable. Our experimental evaluations show that this boosts causal discovery and that these high-level tag relations fit common knowledge.
中文摘要:提出的基于标签的因果发现方法为变量分配多个标签,利用初始确定的边方向推断标签关系,进而指导剩余因果方向的发现,相比单一类型方法在鲁棒性和准确性上表现更优。
English Summary: The proposed tag-based causal discovery method assigns multiple tags to variables, using initial edge directions to determine tag relations which then guide the discovery of remaining causal directions, outperforming single-type approaches in robustness and accuracy.
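The two-stage procedure is concrete enough for a small sketch: tally cause-to-effect tag pairs among edges a base discovery method has already directed, then orient the remaining undirected edges by majority vote over their tag pairs. The voting rule is an illustrative simplification of the paper's tag-relation estimation.

```python
from collections import Counter
from itertools import product

def direct_by_tags(directed, undirected, tags):
    """Orient undirected edges using tag-level relations learned from the
    edges a base causal-discovery method has already directed."""
    votes = Counter()
    for u, v in directed:                                  # u -> v already known
        for tu, tv in product(tags[u], tags[v]):
            votes[(tu, tv)] += 1                           # tag-pair evidence
    oriented = []
    for u, v in undirected:
        fwd = sum(votes[(tu, tv)] for tu, tv in product(tags[u], tags[v]))
        bwd = sum(votes[(tv, tu)] for tu, tv in product(tags[u], tags[v]))
        if fwd != bwd:                                     # ties stay undirected
            oriented.append((u, v) if fwd > bwd else (v, u))
    return oriented

tags = {"rain": {"weather"}, "wet_road": {"road"},
        "fog": {"weather"}, "accident": {"road"}}
print(direct_by_tags([("rain", "wet_road")], [("fog", "accident")], tags))
# [('fog', 'accident')]: the weather -> road relation transfers
```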
Authors:Zelin Zang, Fei Wang, Liangyu Li, Jinlin Wu, Chunshui Zhao, Zhen Lei, Baigui Sun
Abstract:
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Recent UDA methods based on Vision Transformers (ViTs) have achieved strong performance through attention-based feature alignment. However, we identify a key limitation: foreground object mismatch, where the discrepancy in foreground object size and spatial distribution across domains weakens attention consistency and hampers effective domain alignment. To address this issue, we propose the Progressive Focus Cross-Attention Mechanism (PCaM), which progressively filters out background information during cross-attention, allowing the model to focus on and fuse discriminative foreground semantics across domains. We further introduce an attentional guidance loss that explicitly directs attention toward task-relevant regions, enhancing cross-domain attention consistency. PCaM is lightweight, architecture-agnostic, and easy to integrate into existing ViT-based UDA pipelines. Extensive experiments on Office-Home, DomainNet, VisDA-2017, and remote sensing datasets demonstrate that PCaM significantly improves adaptation performance and achieves new state-of-the-art results, validating the effectiveness of attention-guided foreground fusion for domain adaptation.
中文: 本文提出的PCaM机制通过渐进式过滤背景信息并利用交叉注意力融合跨域前景特征,有效解决了无监督域适应中的前景错配问题,在多个数据集上实现了最优性能。
English: This paper introduces PCaM, a lightweight mechanism that enhances unsupervised domain adaptation by progressively filtering background noise and aligning foreground features through cross-attention, achieving state-of-the-art results across multiple datasets.
Authors:Yun Xing, Yue Cao, Nhat Chung, Jie Zhang, Ivor Tsang, Ming-Ming Cheng, Yang Liu, Lei Ma, Qing Guo
Abstract:
Stereo Depth estimation is a critical task in autonomous driving and robotics, where inaccuracies (such as misidentifying nearby objects as distant) can lead to dangerous situations. Adversarial attacks against stereo depth estimation can help reveal vulnerabilities before deployment. Previous work has shown that repeating optimized textures can effectively mislead stereo depth estimation in digital settings. However, our research reveals that these naively repeated texture structures perform poorly in physical-world implementations, i.e., when deployed as patches, limiting their practical utility for testing stereo depth estimation systems. In this work, for the first time, we discover that introducing regular intervals between repeated textures, creating a striped structure, significantly enhances the patch attack effectiveness. Through extensive experimentation, we analyze how variations of this novel structure influence the performance. Based on these insights, we develop a novel stereo depth attack that jointly optimizes both the striped structure and texture elements. Our generated adversarial patches can be inserted into any scenes and successfully attack state-of-the-art stereo depth estimation methods, i.e., RAFT-Stereo and STTR. Most critically, our patch can also attack commercial RGB-D cameras (Intel RealSense) in real-world conditions, demonstrating their practical relevance for security assessment of stereo systems.
中文摘要:本研究提出了一种带有条纹结构的新型对抗性补丁,显著提升了对立体深度估计系统的攻击效果,在实际场景中成功针对先进算法和商用RGB-D相机进行了有效攻击。
English Summary: This research introduces a novel adversarial patch with striped structures that significantly enhances attack effectiveness on stereo depth estimation systems, successfully targeting both advanced algorithms and commercial RGB-D cameras in real-world scenarios.
Authors:Chengyu Bai, Yuming Li, Zhongyu Zhao, Jintao Chen, Peidong Jia, Qi She, Ming Lu, Shanghang Zhang
Abstract:
Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released.
Chinese: FastInit提出了一种快速噪声初始化方法,通过视频噪声预测网络单步生成优化噪声,无需迭代优化即可显著提升视频生成的效率与帧间时序一致性。
English: FastInit introduces a rapid noise initialization technique that uses a Video Noise Prediction Network to generate refined noise in one step, significantly boosting video generation efficiency and temporal consistency without iterative refinement.
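A minimal stand-in for the training setup: a small VNPNet maps (random noise, prompt embedding) to refined noise in one forward pass and is fit with mean-squared error against iteratively refined targets such as FreeInit's output. The architecture and dimensions below are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VNPNet(nn.Module):
    """Maps random noise + a text embedding to refined noise in one pass."""
    def __init__(self, noise_dim=4096, text_dim=768, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, noise_dim),
        )

    def forward(self, noise, text_emb):
        out = self.net(torch.cat([noise.flatten(1), text_emb], dim=1))
        return out.view_as(noise)

# One supervised step on a (noise, prompt embedding, refined noise) triple.
model = VNPNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
noise = torch.randn(2, 4, 32, 32)          # 4*32*32 = 4096 latent entries
text_emb = torch.randn(2, 768)             # e.g., a frozen text encoder's output
target = torch.randn(2, 4, 32, 32)         # stand-in for FreeInit-refined noise
loss = F.mse_loss(model(noise, text_emb), target)
loss.backward()
opt.step()
```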
Authors:Yuan Zhang, Chun-Kai Fan, Tao Huang, Ming Lu, Sicheng Yu, Junwen Pan, Kuan Cheng, Qi She, Shanghang Zhang
Abstract:
Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose \textbf{AutoV} that learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we developed an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experimental results indicate that AutoV enhances the performance of various LVLMs across multiple popular image understanding tasks. For instance, LLaVA-OV with AutoV achieves $\textbf{1.7}\%$ accuracy gain on LLaVA$^{\text{Wild}}$, and AutoV boosts Qwen2.5-VL by $\textbf{1.9}\%$ on MMMU, highlighting its potential as an optimal visual prompting method for LVLMs.
中文摘要:AutoV是一种创新方法,能够根据文本查询和输入图像自动从多种候选中选择最佳视觉提示,显著提升了大型视觉语言模型在多项图像理解任务中的性能表现。
English summary: AutoV is a novel method that automatically selects the optimal visual prompt from multiple candidates based on textual queries and input images, significantly improving the performance of large vision-language models across various image understanding tasks.
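The labeling pipeline's core step, ranking candidate visual prompts by the LVLM's prediction loss, can be sketched in a few lines; `lvlm_loss` below is a hypothetical stand-in for scoring a prompt with a pre-trained LVLM.

```python
from typing import Callable, List, Tuple

def rank_visual_prompts(
    prompts: List[object],
    query: str,
    lvlm_loss: Callable[[object, str], float],
) -> List[Tuple[int, float]]:
    """Rank candidate visual prompts by the LVLM's prediction loss
    (lower loss = better prompt), mirroring the ranking supervision signal."""
    scored = [(i, lvlm_loss(p, query)) for i, p in enumerate(prompts)]
    return sorted(scored, key=lambda t: t[1])

# Toy usage with a stub loss: the "crop" prompt would be ranked best.
ranking = rank_visual_prompts(
    ["heatmap", "box", "crop"], "what is shown?",
    lvlm_loss=lambda p, q: {"heatmap": 1.2, "box": 0.9, "crop": 0.4}[p])
print(ranking)  # [(2, 0.4), (1, 0.9), (0, 1.2)]
```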
Authors:Elif Keles, Merve Yazol, Gorkem Durak, Ziliang Hong, Halil Ertugrul Aktas, Zheyuan Zhang, Linkai Peng, Onkar Susladkar, Necati Guzelyel, Oznur Leman Boyunaga, Cemal Yazici, Mark Lowe, Aliye Uc, Ulas Bagci
Abstract:
Objective: Our study aimed to evaluate and validate PanSegNet, a deep learning (DL) algorithm for pediatric pancreas segmentation on MRI in children with acute pancreatitis (AP), chronic pancreatitis (CP), and healthy controls. Methods: With IRB approval, we retrospectively collected 84 MRI scans (1.5T/3T Siemens Aera/Verio) from children aged 2-19 years at Gazi University (2015-2024). The dataset includes healthy children as well as patients diagnosed with AP or CP based on clinical criteria. Pediatric and general radiologists manually segmented the pancreas, then confirmed by a senior pediatric radiologist. PanSegNet-generated segmentations were assessed using Dice Similarity Coefficient (DSC) and 95th percentile Hausdorff distance (HD95). Cohen's kappa measured observer agreement. Results: Pancreas MRI T2W scans were obtained from 42 children with AP/CP (mean age: 11.73 +/- 3.9 years) and 42 healthy children (mean age: 11.19 +/- 4.88 years). PanSegNet achieved DSC scores of 88% (controls), 81% (AP), and 80% (CP), with HD95 values of 3.98 mm (controls), 9.85 mm (AP), and 15.67 mm (CP). Inter-observer kappa was 0.86 (controls), 0.82 (pancreatitis), and intra-observer agreement reached 0.88 and 0.81. Strong agreement was observed between automated and manual volumes (R^2 = 0.85 in controls, 0.77 in diseased), demonstrating clinical reliability. Conclusion: PanSegNet represents the first validated deep learning solution for pancreatic MRI segmentation, achieving expert-level performance across healthy and diseased states. This tool, algorithm, along with our annotated dataset, are freely available on GitHub and OSF, advancing accessible, radiation-free pediatric pancreatic imaging and fostering collaborative research in this underserved domain.
Chinese: PanSegNet作为首个经过验证的深度学习算法,在健康与患病儿童的胰腺MRI分割中均达到专家级性能,其工具和标注数据集已开源,可推动无辐射儿科胰腺影像研究的发展。
English: PanSegNet is a validated deep learning algorithm that achieves expert-level performance in segmenting pediatric pancreases on MRI for both healthy children and those with pancreatitis, with its tools and annotated dataset publicly available to advance radiation-free pancreatic imaging research.
Authors:Chunlei Li, Jingyang Hou, Yilei Shi, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Abstract:
Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
中文:MRG-LLM提出了一种新颖的多模态模型,通过基于视觉特征的仿射变换为单个医学图像动态定制提示,在基准数据集上实现了医学报告生成的最先进性能。
English: MRG-LLM introduces a novel multimodal model that dynamically customizes prompts for individual medical images through conditional affine transformations, achieving state-of-the-art performance in medical report generation on benchmark datasets.
Authors:Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li, Lu Qi
Abstract:
Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce a dataset of omnidirectional panoramas featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projections (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial continuity along the circle of latitude, and ii) latitude-dependent variation in information density. We address these challenges through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.
中文摘要:本研究通过引入全景数据集和专门设计的ERP-RoPE位置编码方案,解决了多模态大语言模型在全景图像理解中面临的空间连续性和信息密度变化等关键挑战,推动了全景场景下的密集视觉语言理解。
English Summary: This research introduces a dataset and ERP-RoPE position encoding to enable multimodal language models to achieve dense visual understanding from omnidirectional panoramas, addressing challenges in spatial continuity and information density variation.
Authors:Jipeng Zhang, Kehao Miao, Renjie Pi, Zhaowei Wang, Runtao Liu, Rui Pan, Tong Zhang
Abstract:
Reinforcement Fine-Tuning (RFT) with verifiable rewards has advanced large language models but remains underexplored for Vision-Language (VL) models. The Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback, yet training effective VL-RMs faces two major challenges. First, the bootstrapping dilemma arises as high-quality training data depends on already strong VL models, creating a cycle where self-generated supervision reinforces existing biases. Second, modality bias and negative example amplification occur when VL models hallucinate incorrect visual attributes, leading to flawed preference data that further misguides training. To address these issues, we propose an iterative training framework leveraging vision experts, Chain-of-Thought (CoT) rationales, and Margin-based Rejection Sampling. Our approach refines preference datasets, enhances structured critiques, and iteratively improves reasoning. Experiments across VL-RM benchmarks demonstrate superior performance in hallucination detection and multimodal reasoning, advancing VL model alignment with reinforcement learning.
中文摘要:通过结合视觉专家、思维链推理和边界拒绝采样的迭代训练框架,解决了视觉语言奖励模型中的自举困境和模态偏差问题,显著提升了幻觉检测和多模态推理能力。
English Summary: Reinforcement Fine-Tuning with verifiable rewards is advancing Vision-Language models through an iterative framework that addresses training challenges like the bootstrapping dilemma and modality bias, improving hallucination detection and multimodal reasoning.
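One ingredient named in the abstract, Margin-based Rejection Sampling, amounts to discarding preference pairs whose reward margin is too small to be trustworthy. A minimal sketch, with an illustrative threshold:

```python
from typing import List, Tuple

Pair = Tuple[str, str, float, float]  # (chosen, rejected, r_chosen, r_rejected)

def margin_rejection_sample(pairs: List[Pair], margin: float = 0.5) -> List[Pair]:
    """Keep only preference pairs whose reward margin exceeds a threshold,
    discarding ambiguous pairs that would inject noisy supervision."""
    return [p for p in pairs if p[2] - p[3] >= margin]

pairs = [("a", "b", 0.9, 0.1), ("c", "d", 0.55, 0.45)]
print(margin_rejection_sample(pairs))  # only the ("a", "b", ...) pair survives
```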
Authors:Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, Haoang Li
Abstract:
In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite the progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal due to the lengthy iterations. To address it, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration. Besides, we design mixed-label supervision to mitigate the error accumulation during distillation. Although distillation brings acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, which further improves average inference efficiency. Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics. Our project page is available at https://irpn-eai.github.io/CEED-VLA/.
中文: 本文提出结合一致性蒸馏训练与混合标签监督的方法,并采用提前退出解码策略来加速视觉-语言-动作模型,在保持机器人任务高成功率的同时实现了4倍以上的推理加速。
English: This paper introduces a consistency distillation training method with mixed-label supervision and an early-exit decoding strategy to accelerate Vision-Language-Action models, achieving over 4x faster inference while maintaining high task success rates in robotics.
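The early-exit idea can be illustrated with a toy Jacobi-style fixed-point loop: the whole token draft is re-predicted in parallel at each iteration, and decoding stops once at most `tol` positions still change rather than waiting for exact convergence. The `step` stub below is purely illustrative, not the VLA model.

```python
from typing import Callable, List

def jacobi_decode_early_exit(
    step: Callable[[List[int]], List[int]],
    draft: List[int],
    max_iters: int = 20,
    tol: int = 1,
) -> List[int]:
    """Toy Jacobi-style decoding loop: `step` re-predicts the whole token
    draft in parallel; exit early once at most `tol` positions still change,
    instead of waiting for an exact fixed point."""
    for _ in range(max_iters):
        new_draft = step(draft)
        changed = sum(a != b for a, b in zip(draft, new_draft))
        draft = new_draft
        if changed <= tol:  # relaxed convergence condition (early exit)
            break
    return draft

# Stub `step`: nudges each token one unit toward a target sequence.
target = [3, 1, 4, 1, 5]
step = lambda d: [t if abs(t - x) <= 1 else x + (1 if t > x else -1)
                  for x, t in zip(d, target)]
print(jacobi_decode_early_exit(step, [0, 0, 0, 0, 0]))  # converges to target
```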
Authors:Guoxi Zhang, Jiawei Chen, Tianzhuo Yang, Jiaming Ji, Yaodong Yang, Juntao Dai
Abstract:
The increasing prevalence of large language models (LLMs) is influencing global value systems. However, these models frequently exhibit a pronounced WEIRD (Western, Educated, Industrialized, Rich, Democratic) cultural bias due to lack of attention to minority values. This monocultural perspective may reinforce dominant values and marginalize diverse cultural viewpoints, posing challenges for the development of equitable and inclusive AI systems. In this work, we introduce a systematic framework designed to boost fair and robust cross-cultural consensus among LLMs. We model consensus as a Nash Equilibrium and employ a game-theoretic negotiation method based on Policy-Space Response Oracles (PSRO) to simulate an organized cross-cultural negotiation process. To evaluate this approach, we construct regional cultural agents using data transformed from the World Values Survey (WVS). Beyond the conventional model-level evaluation method, we further propose two quantitative metrics, Perplexity-based Acceptance and Values Self-Consistency, to assess consensus outcomes. Experimental results indicate that our approach generates consensus of higher quality while ensuring more balanced compromise compared to baselines. Overall, it mitigates WEIRD bias by guiding agents toward convergence through fair and gradual negotiation steps.
中文: 本研究提出了一种基于纳什均衡和策略空间响应预言机的博弈论框架,通过模拟跨文化协商过程促进公平共识,从而减少大型语言模型中的WEIRD文化偏见。
English: This study introduces a game-theoretic framework using Nash Equilibrium and Policy-Space Response Oracles to mitigate WEIRD cultural bias in large language models by fostering fair cross-cultural consensus through simulated negotiations.
Authors:Xiaoyao Zhong, Jiabao Jin, Peng Cheng, Mingyu Yang, Haoyang Li, Zhitao Shen, Heng Tao Shen, Jingkuan Song
Abstract:
Recently, Approximate Nearest Neighbor Search in high-dimensional vector spaces has garnered considerable attention due to the rapid advancement of deep learning techniques. We observed that a substantial amount of search and construction logs are generated throughout the lifespan of a graph-based index. However, these two types of valuable logs are not fully exploited due to the static nature of existing indexes. We present the EnhanceGraph framework, which integrates two types of logs into a novel structure called a conjugate graph. The conjugate graph is then used to improve search quality. Through theoretical analyses and observations of the limitations of graph-based indexes, we propose several optimization methods. For the search logs, the conjugate graph stores the edges from local optima to global optima to enhance routing to the nearest neighbor. For the construction logs, the conjugate graph stores the pruned edges from the proximity graph to enhance retrieval of the k nearest neighbors. Our experimental results on several public and real-world industrial datasets show that EnhanceGraph significantly improves search accuracy, with the greatest improvement in recall from 41.74% to 93.42%, without sacrificing search efficiency. In addition, our EnhanceGraph algorithm has been integrated into Ant Group's open-source vector library, VSAG.
中文:EnhanceGraph框架通过整合搜索和构建日志构建共轭图,在不牺牲搜索效率的前提下显著提升了近似最近邻搜索的准确率,实验数据及在蚂蚁集团VSAG库中的集成应用均验证了其有效性。
English: The EnhanceGraph framework leverages search and construction logs to create a conjugate graph that significantly improves the accuracy of approximate nearest neighbor search without compromising efficiency, as demonstrated by experimental results and integration into Ant Group's VSAG library.
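A toy sketch of the conjugate-edge idea on a 1-D graph: plain greedy descent can stall at a local optimum, and a logged edge from that local optimum toward the global optimum's region lets the search continue. The tiny graph and distance function are illustrative assumptions, not the library's data structures.

```python
def greedy_search(graph, dist, start, query):
    """Plain best-first descent on a proximity graph; may stop at a local optimum."""
    cur = start
    while True:
        nxt = min(graph[cur], key=lambda v: dist(v, query), default=cur)
        if dist(nxt, query) >= dist(cur, query):
            return cur
        cur = nxt

def search_with_conjugate(graph, conjugate, dist, start, query):
    """Conjugate-edge idea (sketch): if search halts at a node that past logs
    identified as a local optimum, follow its stored edge toward the logged
    global optimum and resume the descent from there."""
    cur = greedy_search(graph, dist, start, query)
    while cur in conjugate:
        jump = conjugate[cur]
        if dist(jump, query) >= dist(cur, query):
            break
        cur = greedy_search(graph, dist, jump, query)
    return cur

# Two disconnected clusters on a number line; logs say "stuck at 2 -> try 6".
graph = {0: [1], 1: [0, 2], 2: [1], 5: [6], 6: [5, 7], 7: [6]}
conjugate = {2: 6}
dist = lambda v, q: abs(v - q)
print(greedy_search(graph, dist, 0, 7))                      # 2 (local optimum)
print(search_with_conjugate(graph, conjugate, dist, 0, 7))   # 7 (true neighbor)
```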
Authors:Jiamin Wang, Yichen Yao, Xiang Feng, Hang Wu, Yaming Wang, Qingqiu Huang, Yuexin Ma, Xinge Zhu
Abstract:
The generation of temporally consistent, high-fidelity driving videos over extended horizons presents a fundamental challenge in autonomous driving world modeling. Existing approaches often suffer from error accumulation and feature misalignment due to inadequate decoupling of spatio-temporal dynamics and limited cross-frame feature propagation mechanisms. To address these limitations, we present STAGE (Streaming Temporal Attention Generative Engine), a novel auto-regressive framework that pioneers hierarchical feature coordination and multi-phase optimization for sustainable video synthesis. To achieve high-quality long-horizon driving video generation, we introduce Hierarchical Temporal Feature Transfer (HTFT) and a novel multi-stage training strategy. HTFT enhances temporal consistency between video frames throughout the video generation process by modeling the temporal and denoising processes separately and transferring denoising features between frames. The multi-stage training strategy divides training into three stages, using model decoupling and simulation of the auto-regressive inference process to accelerate model convergence and reduce error accumulation. Experiments on the Nuscenes dataset show that STAGE significantly surpasses existing methods in the long-horizon driving video generation task. In addition, we explored STAGE's ability to generate unlimited-length driving videos. We generated 600 frames of high-quality driving videos on the Nuscenes dataset, far exceeding the maximum length achievable by existing methods.
中文: STAGE通过分层时序特征传递和多阶段训练策略,提出了一种创新的自回归框架,有效提升长序列驾驶视频生成中的时序一致性和减少误差累积,在Nuscenes数据集上显著超越了现有方法。
English: STAGE introduces a novel auto-regressive framework with Hierarchical Temporal Feature Transfer and multi-stage training to enhance temporal consistency and reduce errors in long-horizon driving video generation, significantly outperforming existing methods on the Nuscenes dataset.
Authors:Changsheng Wang, Chongyu Fan, Yihua Zhang, Jinghan Jia, Dennis Wei, Parikshit Ram, Nathalie Baracaldo, Sijia Liu
Abstract:
Recent advances in large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. While these multi-step reasoning capabilities represent a major milestone in language model performance, they also introduce new safety risks. In this work, we present the first systematic study to revisit the problem of machine unlearning in the context of LRMs. Machine unlearning refers to the process of removing the influence of sensitive, harmful, or undesired data or knowledge from a trained model without full retraining. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. In particular, even when final answers are successfully erased, sensitive information often persists within the intermediate reasoning steps, i.e., CoT trajectories. To address this challenge, we extend conventional unlearning and propose Reasoning-aware Representation Misdirection for Unlearning ($R^2MU$), a novel method that effectively suppresses sensitive reasoning traces and prevents the generation of associated final answers, while preserving the model's reasoning ability. Our experiments demonstrate that $R^2MU$ significantly reduces sensitive information leakage within reasoning traces and achieves strong performance across both safety and reasoning benchmarks, evaluated on state-of-the-art models such as DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B.
中文: 本研究提出推理感知表征误导遗忘方法(R²MU),通过有效抑制大型推理模型中思维链轨迹的敏感信息同时保持推理能力,解决了传统遗忘方法的不足。
English: This study introduces Reasoning-aware Representation Misdirection for Unlearning (R²MU), a novel method that effectively removes sensitive information from large reasoning models' chain-of-thought trajectories while preserving reasoning capabilities, addressing limitations of conventional unlearning approaches.
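R²MU builds on the representation-misdirection family of unlearning objectives (e.g., RMU). As a rough sketch of that family, not the authors' exact loss, forget-set activations are steered toward a fixed random control direction while retain-set activations are anchored to the frozen model; the coefficients here are illustrative.

```python
import torch
import torch.nn.functional as F

def misdirection_loss(h_forget, h_retain, h_retain_frozen, u, c=6.0, alpha=1.0):
    """Representation-misdirection sketch (RMU-family, not the exact R^2MU
    objective): steer forget-set activations toward a scaled random control
    vector `u`, while keeping retain-set activations near the frozen model's."""
    forget_term = F.mse_loss(h_forget, c * u.expand_as(h_forget))
    retain_term = F.mse_loss(h_retain, h_retain_frozen)
    return forget_term + alpha * retain_term

u = F.normalize(torch.randn(1, 64), dim=-1)  # fixed random unit direction
loss = misdirection_loss(torch.randn(4, 64), torch.randn(4, 64),
                         torch.randn(4, 64), u)
print(loss.item())
```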
Authors:Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
Abstract:
Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora -- counting string appearances and retrieving the enclosing documents -- yet the high storage overhead hinders their application on Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction) and querying (down to a negligible amount). We index 46TB of Internet text in 50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.
中文:Infini-gram mini 是一种高效系统,能够索引和压缩大型文本语料库,实现可扩展的搜索,并揭示语言模型训练数据中存在的显著基准污染问题。
English: Infini-gram mini is an efficient system that indexes and compresses large text corpora, enabling scalable search and revealing significant benchmark contamination in language model training data.
Authors:Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
Abstract:
Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora - counting string appearances and retrieving the enclosing documents - yet the high storage overhead hinders their application on Internet-scale data. We present infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction) and querying (down to a negligible amount). We index 83TB of Internet text in 99 days with a single CPU node with 128 vCPUs (or 19 hours if using 137 such nodes). We show one important use case of infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in GSM8K), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on infini-gram mini indexes.
中文:Infini-gram mini 是一种高效系统,能够索引和压缩大型文本语料库,实现可扩展的搜索,并揭示语言模型训练数据中存在的显著基准污染问题。
English: Infini-gram mini is an efficient system that indexes and compresses large text corpora, enabling scalable search and revealing significant benchmark contamination in language model training data.
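The FM-index answers exact-match counting queries over compressed text; as a toy stand-in for that query interface, a plain suffix array with binary search supports the same substring count (Python 3.10+ for the `key=` argument to `bisect`):

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(text: str):
    # O(n^2 log n) toy construction; real systems build this in compressed form.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text: str, sa, pattern: str) -> int:
    """Count exact matches of `pattern` by binary search over the suffix
    array -- the same counting query an FM-index answers in compressed space."""
    prefix_of = lambda i: text[i:i + len(pattern)]
    lo = bisect_left(sa, pattern, key=prefix_of)
    hi = bisect_right(sa, pattern, key=prefix_of)
    return hi - lo

text = "banana"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "ana"))  # 2
```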
Authors:Chunlei Li, Yilei Shi, Haoxi Hu, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Abstract:
High-resolution computed tomography (CT) imaging is essential for medical diagnosis but requires increased radiation exposure, creating a critical trade-off between image quality and patient safety. While deep learning methods have shown promise in CT super-resolution, they face challenges with complex degradations and limited medical training data. Meanwhile, large-scale pre-trained diffusion models, particularly Stable Diffusion, have demonstrated remarkable capabilities in synthesizing fine details across various vision tasks. Motivated by this, we propose a novel framework that adapts Stable Diffusion for CT blind super-resolution. We employ a practical degradation model to synthesize realistic low-quality images and leverage a pre-trained vision-language model to generate corresponding descriptions. Subsequently, we perform super-resolution using Stable Diffusion with a specialized controlling strategy, conditioned on both low-resolution inputs and the generated text descriptions. Extensive experiments show that our method outperforms existing approaches, demonstrating its potential for achieving high-quality CT imaging at reduced radiation doses. Our code will be made publicly available.
Chinese Summary: 本研究提出了一种新颖的框架,通过适配Stable Diffusion进行CT盲超分辨率处理,利用合成的低质量图像和文本引导控制,在降低辐射剂量的同时有效提升了图像质量。
English Summary: This study introduces a novel framework adapting Stable Diffusion for CT blind super-resolution, effectively enhancing image quality while reducing radiation exposure by leveraging synthesized low-quality images and text-guided controls.
Authors:Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Ji-Rong Wen, Zhicheng Dou
Abstract:
Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task's inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool using. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.
中文: VideoDeepResearch提出了一种智能代理框架,仅通过文本推理模型和模块化工具即可实现卓越的长视频理解能力,在多个基准测试中显著超越了现有最先进的多模态大语言模型。
English: VideoDeepResearch introduces an agentic framework that achieves superior long video understanding using only a text-based reasoning model and modular tools, surpassing state-of-the-art MLLMs by significant margins on multiple benchmarks.
Authors:Xiaobei Yan, Han Qiu, Tianwei Zhang
Abstract:
Bit-flip attacks (BFAs) represent a serious threat to Deep Neural Networks (DNNs), where flipping a small number of bits in the model parameters or binary code can significantly degrade the model accuracy or mislead the model prediction in a desired way. Existing defenses exclusively focus on protecting models for specific attacks and platforms, while lacking effectiveness for other scenarios. We propose ObfusBFA, an efficient and holistic methodology to mitigate BFAs targeting both the high-level model weights and low-level codebase (executables or shared libraries). The key idea of ObfusBFA is to introduce random dummy operations during the model inference, which effectively transforms the delicate attacks into random bit flips, making it much harder for attackers to pinpoint and exploit vulnerable bits. We design novel algorithms to identify critical bits and insert obfuscation operations. We evaluate ObfusBFA against different types of attacks, including the adaptive scenarios where the attacker increases the flip bit budget to attempt to circumvent our defense. The results show that ObfusBFA can consistently preserve the model accuracy across various datasets and DNN architectures while significantly reducing the attack success rates. Additionally, it introduces minimal latency and storage overhead, making it a practical solution for real-world applications.
中文: ObfusBFA通过引入随机虚拟操作,将针对性的比特翻转攻击转化为随机攻击,有效保护深度神经网络在各种攻击和数据集下的准确性,且仅产生极小的延迟和存储开销。
English: ObfusBFA is an efficient defense method that introduces random dummy operations during DNN inference to transform targeted bit-flip attacks into random ones, preserving model accuracy with minimal overhead across various attacks and datasets.
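A toy PyTorch wrapper conveys the flavor of inserting random, numerically neutral dummy operations at inference time; the power-of-two scale/unscale pair and the placement policy are illustrative, not the paper's bit-level obfuscation scheme.

```python
import random
import torch
import torch.nn as nn

class DummyOpWrapper(nn.Module):
    """Wrap a layer and, with probability p, apply a scale/unscale pair of
    dummy operations after it; powers of two keep the values bit-exact, so
    accuracy is preserved while the executed op sequence varies run to run."""
    def __init__(self, layer: nn.Module, p: float = 0.5):
        super().__init__()
        self.layer, self.p = layer, p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.layer(x)
        if random.random() < self.p:
            s = random.choice([2.0, 4.0, 8.0])
            out = (out * s) / s  # numerically neutral dummy operations
        return out

base = nn.Linear(8, 8)
wrapped = DummyOpWrapper(base, p=1.0)
x = torch.randn(2, 8)
print(torch.allclose(wrapped(x), base(x)))  # True: outputs unchanged
```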
Authors:Chun-Mei Feng, Kai Yu, Xinxing Xu, Salman Khan, Rick Siow Mong Goh, Wangmeng Zuo, Yong Liu
Abstract:
Benefiting from image-text contrastive learning, pre-trained vision-language models, e.g., CLIP, allow texts to be directly leveraged as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.
中文: T2I-PAL通过文本生成真实图像并结合热力图与可学习原型,解决了CLIP模型在参数高效微调中的模态差异问题,在无需全标注训练图像的情况下实现了3.47%的性能提升。
English: T2I-PAL addresses the modality gap in CLIP-based parameter-efficient fine-tuning by generating realistic images from text captions and integrating heatmaps with learnable prototypes, achieving a 3.47% performance improvement without requiring fully annotated training data.
Authors:Jiayi Song, Kaiyu Li, Xiangyong Cao, Deyu Meng
Abstract:
Semantic segmentation in remote sensing images is crucial for various applications, yet its performance is heavily reliant on large-scale, high-quality pixel-wise annotations, which are notoriously expensive and time-consuming to acquire. Semi-supervised semantic segmentation (SSS) offers a promising alternative to mitigate this data dependency. However, existing SSS methods often struggle with the inherent distribution mismatch between limited labeled data and abundant unlabeled data, leading to suboptimal generalization. To alleviate this issue, we introduce Vision Foundation Models (VFMs), pre-trained on vast and diverse datasets, into the SSS task, since VFMs possess robust generalization capabilities that can effectively bridge this distribution gap and provide strong semantic priors for SSS. Inspired by this, we introduce RS-MTDF (Multi-Teacher Distillation and Fusion), a novel framework that leverages the powerful semantic knowledge embedded in VFMs to guide semi-supervised learning in remote sensing. Specifically, RS-MTDF employs multiple frozen VFMs (e.g., DINOv2 and CLIP) as expert teachers, utilizing feature-level distillation to align student features with their robust representations. To further enhance discriminative power, the distilled knowledge is seamlessly fused into the student decoder. Extensive experiments on three challenging remote sensing datasets demonstrate that RS-MTDF consistently achieves state-of-the-art performance. Notably, our method outperforms existing approaches across various label ratios on LoveDA and secures the highest IoU in the majority of semantic categories. These results underscore the efficacy of multi-teacher VFM guidance in significantly enhancing both generalization and semantic understanding for remote sensing segmentation. Ablation studies further validate the contribution of each proposed module.
中文: 提出的RS-MTDF框架通过特征蒸馏利用冻结视觉基础模型作为专家教师,有效弥合半监督遥感分割中的分布差异,在多个数据集上实现了最先进的性能。
English: The proposed RS-MTDF framework leverages frozen Vision Foundation Models as expert teachers through feature distillation to bridge distribution gaps in semi-supervised remote sensing segmentation, achieving state-of-the-art performance across multiple datasets.
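The feature-level distillation step can be sketched as cosine alignment between projected student features and frozen-teacher features; the projection dimensions below are assumptions, and a multi-teacher version would simply sum this loss over teachers such as DINOv2 and CLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_loss(student_feats, teacher_feats, proj):
    """Align (projected) student features with frozen-teacher features via
    cosine distance, the core of feature-level distillation."""
    s = F.normalize(proj(student_feats), dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)  # teacher stays frozen
    return (1 - (s * t).sum(dim=-1)).mean()

proj = nn.Linear(256, 768)  # map student dim -> teacher dim (assumed dims)
loss = distill_loss(torch.randn(8, 256), torch.randn(8, 768), proj)
print(loss.item())
```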
Authors:Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
Abstract:
Agentic UAVs represent a new frontier in autonomous aerial intelligence, integrating perception, decision-making, memory, and collaborative planning to operate adaptively in complex, real-world environments. Driven by recent advances in Agentic AI, these systems surpass traditional UAVs by exhibiting goal-driven behavior, contextual reasoning, and interactive autonomy. We provide a comprehensive foundation for understanding the architectural components and enabling technologies that distinguish Agentic UAVs from traditional autonomous UAVs. Furthermore, a detailed comparative analysis highlights advancements in autonomy with AI agents, learning, and mission flexibility. This study explores several high-impact application domains: precision agriculture, construction and mining, disaster response, environmental monitoring, infrastructure inspection, logistics, security, and wildlife conservation, illustrating the broad societal value of agentic aerial intelligence. We further identify key challenges in technical constraints, regulatory limitations, and data-model reliability, and we present emerging solutions across hardware innovation, learning architectures, and human-AI interaction. Finally, a future roadmap is proposed, outlining pathways toward self-evolving aerial ecosystems, system-level collaboration, and sustainable, equitable deployments. This survey establishes a foundational framework for the future development, deployment, and governance of agentic aerial systems (Agentic UAVs) across diverse societal and industrial domains.
中文: 智能无人机融合感知、决策与协作能力,在农业、救灾等领域展现自主适应性,通过技术创新应对技术限制与监管挑战,为未来自进化空中系统奠定基础。
English: Agentic UAVs integrate advanced AI for autonomous perception, decision-making, and collaboration, surpassing traditional drones in adaptability across diverse applications like agriculture and disaster response, while addressing challenges in technology and regulation for future self-evolving ecosystems.
Authors:Haizhao Jing, Haokui Zhang, Zhenhao Shang, Rong Xiao, Peng Wang, Yanning Zhang
Abstract:
Neural Architecture Representation Learning aims to transform network models into feature representations for predicting network attributes, playing a crucial role in deploying and designing networks for real-world applications. Recently, inspired by the success of transformers, transformer-based models integrated with Graph Neural Networks (GNNs) have achieved significant progress in representation learning. However, current methods still have some limitations. First, existing methods overlook hardware attribute information, which conflicts with the current trend of diversified deep learning hardware and limits the practical applicability of models. Second, current encoding approaches rely on static adjacency matrices to represent topological structures, failing to capture the structural differences between computational nodes, which ultimately compromises encoding effectiveness. In this paper, we introduce LeDG-Former, an innovative framework that addresses these limitations through the synergistic integration of language-based semantic embedding and dynamic graph representation learning. Specifically, inspired by large language models (LLMs), we propose a language embedding framework where both neural architectures and hardware platform specifications are projected into a unified semantic space through tokenization and LLM processing, enabling zero-shot prediction across different hardware platforms for the first time. Then, we propose a dynamic graph-based transformer for modeling neural architectures, resulting in improved neural architecture modeling performance. On the NNLQP benchmark, LeDG-Former surpasses previous methods, establishing a new SOTA while demonstrating the first successful cross-hardware latency prediction capability. Furthermore, our framework achieves superior performance on the cell-structured NAS-Bench-101 and NAS-Bench-201 datasets.
中文:LeDG-Former通过融合基于语言的语义嵌入和动态图表示学习,克服了神经架构表示中的现有局限,不仅实现了最先进的性能,还首次具备了跨硬件平台的延迟预测能力。
English: LeDG-Former introduces a novel framework combining language-based semantic embedding and dynamic graph representation learning to overcome limitations in neural architecture representation, achieving state-of-the-art performance and enabling cross-hardware latency prediction for the first time.
Authors:Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques
Abstract:
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).
中文摘要:提出的Self-RedTeam算法通过对抗性自我博弈使语言模型能够自主提升安全性,单个模型在攻击与防御角色间切换,相比静态训练方法实现了更强的鲁棒性。
English Summary: The proposed Self-RedTeam algorithm enables language models to autonomously improve safety through adversarial self-play, where a single model alternates between generating and defending against attacks, achieving stronger robustness than static training methods.
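A minimal sketch of one self-play round under the abstract's zero-sum setup: the same model plays attacker and then defender, and a reward LM adjudicates, with the attacker receiving the negated defender reward. The callables below are stubs, not real models.

```python
def self_play_round(model, reward_lm, seed_prompt):
    """One zero-sum self-play round (sketch): the same model first plays
    attacker (crafting an adversarial prompt), then defender (responding
    to it); a reward LM adjudicates, and the two rewards sum to zero."""
    attack = model(role="attacker", prompt=seed_prompt)
    defense = model(role="defender", prompt=attack)
    r_defender = reward_lm(attack, defense)  # high = safe, helpful response
    r_attacker = -r_defender                 # zero-sum game
    return attack, defense, r_attacker, r_defender

# Stub callables standing in for the policy LM and the reward LM.
model = lambda role, prompt: f"[{role}] {prompt}"
reward_lm = lambda attack, defense: 1.0 if defense.startswith("[defender]") else -1.0
print(self_play_round(model, reward_lm, "seed topic"))
```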
Authors:Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
Abstract:
Large Language Models (LLMs) are increasingly employed to automatically refactor unit tests, aiming to enhance readability, naming, and structural clarity while preserving functional behavior. However, evaluating such refactorings remains challenging: traditional metrics like CodeBLEU are overly sensitive to renaming and structural edits, whereas embedding-based similarities capture semantics but ignore readability and modularity. We introduce CTSES, a composite metric that integrates CodeBLEU, METEOR, and ROUGE-L to balance behavior preservation, lexical quality, and structural alignment. CTSES is evaluated on over 5,000 test suites automatically refactored by GPT-4o and Mistral-Large-2407, using Chain-of-Thought prompting, across two established Java benchmarks: Defects4J and SF110. Our results show that CTSES yields more faithful and interpretable assessments, better aligned with developer expectations and human intuition than existing metrics.
Chinese: CTSES是一种综合了CodeBLEU、METEOR和ROUGE-L的复合指标,能对LLM重构的单元测试进行更忠实、可解释的评估,比现有指标更符合开发者预期和人类直觉。
English: CTSES is a composite metric that integrates CodeBLEU, METEOR, and ROUGE-L to provide balanced evaluations of LLM-refactored unit tests, offering more faithful and interpretable assessments aligned with developer expectations than existing metrics.
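Since CTSES integrates CodeBLEU, METEOR, and ROUGE-L, a linear combination conveys the shape of the metric; the equal weights below are an assumption for illustration, as the paper defines the actual composition.

```python
def ctses(codebleu: float, meteor: float, rouge_l: float,
          weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Composite score blending behavior preservation (CodeBLEU), lexical
    quality (METEOR), and structural alignment (ROUGE-L)."""
    w1, w2, w3 = weights
    return w1 * codebleu + w2 * meteor + w3 * rouge_l

print(round(ctses(0.72, 0.64, 0.80), 3))  # 0.72
```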
Authors:Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
Abstract:
Automatically generated unit tests, whether from search-based tools like EvoSuite or from LLMs, vary significantly in structure and readability. Yet most evaluations rely on metrics like Cyclomatic Complexity and Cognitive Complexity, designed for functional code rather than test code. Recent studies have shown that SonarSource's Cognitive Complexity metric assigns near-zero scores to LLM-generated tests, yet its behavior on EvoSuite-generated tests and its applicability to test-specific code structures remain unexplored. We introduce CCTR, a Test-Aware Cognitive Complexity metric tailored for unit tests. CCTR integrates structural and semantic features like assertion density, annotation roles, and test composition patterns, dimensions ignored by traditional complexity models but critical for understanding test code. We evaluate 15,750 test suites generated by EvoSuite, GPT-4o, and Mistral Large-1024 across 350 classes from Defects4J and SF110. Results show CCTR effectively discriminates between structured and fragmented test suites, producing interpretable scores that better reflect developer-perceived effort. By bridging structural analysis and test readability, CCTR provides a foundation for more reliable evaluation and improvement of generated tests. We publicly release all data, prompts, and evaluation scripts to support replication.
中文: 本文提出CCTR这一专为单元测试设计的测试感知认知复杂度指标,通过整合断言密度、注解角色等结构语义特征,能有效评估生成测试的结构质量并更准确反映开发者的感知工作量。
English: This paper introduces CCTR, a test-aware cognitive complexity metric designed specifically for unit tests, which effectively evaluates structural and semantic features like assertion density and test patterns to better reflect developer-perceived effort in generated tests.
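One of the test-specific features CCTR draws on, assertion density, is easy to sketch; the regex and per-line counting below are illustrative, not the paper's definition.

```python
import re

def assertion_density(test_source: str) -> float:
    """One CCTR-style structural feature (sketch): assertions per
    non-blank line of a test method."""
    lines = [l for l in test_source.splitlines() if l.strip()]
    asserts = sum(1 for l in lines if re.search(r"\bassert\w*\(", l))
    return asserts / max(len(lines), 1)

test = """\
@Test
public void testAdd() {
    Calculator c = new Calculator();
    assertEquals(4, c.add(2, 2));
    assertTrue(c.add(0, 0) == 0);
}
"""
print(round(assertion_density(test), 2))  # 2 asserts over 6 lines -> 0.33
```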
Authors:Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Zhenghao Zhu, Yinyu Wu, Juntao Dai, Yaodong Yang, Sirui Han, Yike Guo
Abstract:
With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs' safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs' safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multi-choice tasks on SafeLawBench, while the average accuracy of 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
中文: SafeLawBench基准通过法律视角构建了系统化的大语言模型安全评估框架,将风险分为三级并进行大规模测试,结果显示即使Claude-3.5-Sonnet和GPT-4o等顶尖模型在多选题准确率也未超过80.5%,20个模型的平均准确率仅为68.8%。
English: The SafeLawBench benchmark introduces a legal framework to systematically evaluate large language model safety through multi-level risk categorization and extensive testing, revealing that even top models like Claude-3.5-Sonnet and GPT-4o fall short of 80.5% accuracy, with an average performance of 68.8% across 20 models.
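The majority-voting mechanism the benchmark found helpful is simple to sketch for multi-choice answers:

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate multiple sampled answers to a multi-choice question by
    majority vote; ties break toward the earliest-reached count."""
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

print(majority_vote(["B", "A", "B", "C", "B"]))  # B
```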
Authors:Yuan Xun, Siyuan Liang, Xiaojun Jia, Xinwei Liu, Xiaochun Cao
Abstract:
Large visual language models (LVLMs) have demonstrated excellent instruction-following capabilities, yet remain vulnerable to stealthy backdoor attacks when finetuned using contaminated data. Existing backdoor defense techniques are usually developed for single-modal visual or language models under fully parameter-adjustable settings or rely on supervisory knowledge during training. However, in real-world scenarios, defenders cannot modify frozen visual encoders or core LLM parameters, nor possess prior knowledge of unknown trigger patterns or target responses. Motivated by the empirical finding that LVLMs readily overfit to fixed, unknown triggers, which can embed malicious associations during adapter-level tuning, we aim to design a defense that operates without access to core weights or attack priors. To this end, we introduce a lightweight, certified-agnostic defense framework, Robust Instruction Tuning, that finetunes only adapter modules and text embedding layers under instruction tuning. Our method integrates two complementary regularizations: (1) Input Diversity Regularization, which perturbs trigger components across training samples to disrupt consistent spurious cues; and (2) Anomalous Activation Regularization, which dynamically sparses adapter weights exhibiting abnormally sharp activations linked to backdoor patterns. These mechanisms jointly guide the model toward learning semantically grounded representations rather than memorizing superficial trigger-response mappings.
Extensive experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our method reduces their attack success rates to nearly zero, with an increase in training cost of less than 15%.
中文摘要:本文提出鲁棒指令调优框架,通过仅微调适配器模块和文本嵌入层,并结合输入多样性和异常激活正则化机制,有效防御大型视觉语言模型的后门攻击,使其无法记忆触发模式。
English Summary: This paper introduces Robust Instruction Tuning, a lightweight defense framework that protects large visual language models from backdoor attacks by fine-tuning only adapter modules and text embedding layers while incorporating input diversity and anomalous activation regularizations to prevent trigger memorization.
Authors:Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis
Abstract:
Agentic AI systems, built upon large language models (LLMs) and deployed in multi-agent configurations, are redefining intelligence, autonomy, collaboration, and decision-making across enterprise and societal domains. This review presents a structured analysis of Trust, Risk, and Security Management (TRiSM) in the context of LLM-based Agentic Multi-Agent Systems (AMAS). We begin by examining the conceptual foundations of Agentic AI and highlight its architectural distinctions from traditional AI agents. We then adapt and extend the AI TRiSM framework for Agentic AI, structured around key pillars: \textit{Explainability, ModelOps, Security, Privacy} and \textit{their Lifecycle Governance}, each contextualized to the challenges of AMAS. A risk taxonomy is proposed to capture the unique threats and vulnerabilities of Agentic AI, ranging from coordination failures to prompt-based adversarial manipulation. To support practical assessment in Agentic AI systems, we introduce two novel metrics: the Component Synergy Score (CSS), which quantifies the quality of inter-agent collaboration, and the Tool Utilization Efficacy (TUE), which evaluates the efficiency of tool use within agent workflows. We further discuss strategies for improving explainability in Agentic AI, as well as approaches to enhancing security and privacy through encryption, adversarial robustness, and regulatory compliance. The review concludes with a research roadmap for the responsible development and deployment of Agentic AI, highlighting key directions to align emerging systems with TRiSM principles, ensuring safety, transparency, and accountability in their operation.
中文摘要:本综述针对基于大语言模型的智能多代理系统,提出了扩展的AI信任、风险与安全管理框架,引入新型评估指标并探讨可解释性与安全增强策略,为负责任开发提供研究路线图。
English Summary: This review adapts the AI Trust, Risk, and Security Management framework for Agentic AI systems, proposing risk taxonomies and novel metrics while outlining strategies for explainability, security, and responsible development.
Authors:Xiao Xiang Zhu, Sining Chen, Fahong Zhang, Yilei Shi, Yuanyuan Wang
Abstract:
We introduce GlobalBuildingAtlas, a publicly available dataset providing global and complete coverage of building polygons, heights and Level of Detail 1 (LoD1) 3D building models. This is the first open dataset to offer high quality, consistent, and complete building data in 2D and 3D form at the individual building level on a global scale. To build this dataset, we developed machine-learning-based pipelines to derive building polygons and building heights (GBA.Height) from global PlanetScope satellite data. A quality-based fusion strategy was then employed to generate higher-quality polygons (GBA.Polygon) based on existing open building polygons, including our own derived ones. With more than 2.75 billion buildings worldwide, GBA.Polygon surpasses the most comprehensive database to date by more than 1 billion buildings. GBA.Height offers the most detailed and accurate global 3D building height maps to date, achieving a spatial resolution of 3x3 meters, 30 times finer than previous global products (90 m), enabling a high-resolution and reliable analysis of building volumes at both local and global scales. Finally, we generated a global LoD1 building model (GBA.LoD1) from the resulting GBA.Polygon and GBA.Height. GBA.LoD1 represents the first complete global LoD1 building models, including 2.68 billion building instances with predicted heights, i.e., with a height completeness of more than 97%, achieving RMSEs ranging from 1.5 m to 8.9 m across different continents. With its height accuracy, comprehensive global coverage and rich spatial details, GlobalBuildingAtlas offers novel insights into the status quo of global buildings, which unlocks unprecedented geospatial analysis possibilities, as showcased by a better illustration of where people live and a more comprehensive monitoring of the progress on the 11th Sustainable Development Goal of the United Nations.
中文: GlobalBuildingAtlas是首个提供全球完整二维和三维建筑数据的公开数据集,包含超过27.5亿栋建筑多边形和高精度三维模型,为可持续城市发展监测等应用开启了前所未有的地理空间分析可能。
English: GlobalBuildingAtlas is the first open dataset offering comprehensive 2D and 3D building data globally, featuring over 2.75 billion building polygons and high-resolution height maps that enable detailed geospatial analysis for sustainable development monitoring.
Authors:Xunzhu Tang, Jacques Klein, Tegawendé F. Bissyandé
Abstract:
Several closed-source LLMs have consistently outperformed open-source alternatives in program repair tasks, primarily due to their superior reasoning capabilities and extensive pre-training. This paper introduces Repairity, a novel three-stage methodology that significantly narrows this performance gap through reasoning extraction and reinforcement learning. Our approach: (1) systematically filters high-quality reasoning traces from closed-source models using correctness verification, (2) transfers this reasoning knowledge to open-source models via supervised fine-tuning, and (3) develops reinforcement learning with LLM-based feedback to further optimize performance. Empirical evaluation across multiple program repair benchmarks demonstrates that Repairity improves the performance of Qwen2.5-Coder-32B-Instruct, a base open source LLM, by 8.68% on average, reducing the capability gap with Claude-Sonnet3.7, a state-of-the-art closed-source model, from 10.05% to 1.35%. Ablation studies confirm that both reasoning extraction and LLM-guided reinforcement learning contribute significantly to these improvements. Our methodology generalizes effectively to additional code-related tasks, enabling organizations to leverage high-quality program repair capabilities while maintaining the customizability, transparency, and deployment flexibility inherent to open-source models.
Chinese: 本文提出Repairity方法,通过从闭源模型中提取推理知识并结合强化学习,显著缩小了开源与闭源大语言模型在程序修复任务中的性能差距,使开源模型平均性能提升8.68%。
English: This paper introduces Repairity, a three-stage method that narrows the performance gap between open-source and closed-source LLMs in program repair by extracting reasoning from closed-source models and using reinforcement learning, achieving an average 8.68% improvement in open-source models.
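Stage (1) of Repairity, filtering reasoning traces by correctness verification, reduces to keeping traces whose associated patch passes a verifier; `verify` below is a stub standing in for, e.g., running the project's test suite.

```python
def filter_reasoning_traces(traces, verify):
    """Correctness-verified filtering (sketch): keep only reasoning traces
    whose final patch passes verification, discarding the rest."""
    return [t for t in traces if verify(t["patch"])]

traces = [{"patch": "fix-a", "cot": "..."}, {"patch": "fix-b", "cot": "..."}]
print(filter_reasoning_traces(traces, verify=lambda p: p == "fix-a"))
```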
Authors:Sebastian Rödling, Matej Zečević, Devendra Singh Dhami, Kristian Kersting
Abstract:
Structural Causal Explanations (SCEs) can be used to automatically generate explanations in natural language to questions about given data that are grounded in a (possibly learned) causal model. Unfortunately, they work only for small data. As a result, they are ill-suited to offering reasons for events, e.g., tracking causal changes over multiple time steps, or to settings with a behavioral component that involves feedback loops through the actions of an agent. To this end, we generalize SCEs to a (recursive) formulation of explanation trees to capture the temporal interactions between reasons. We show the benefits of this more general SCE algorithm on synthetic time-series data and a 2D grid game, and further compare it to the base SCE and other existing methods for causal explanations.
Chinese: 该研究将结构因果解释推广为递归解释树,以处理时间交互和反馈循环问题,并在合成时间序列数据和二维网格游戏中验证了其相较于基线方法的优越性能。
English: The paper generalizes Structural Causal Explanations (SCEs) into recursive explanation trees to address temporal interactions and feedback loops, demonstrating improved performance on synthetic time-series data and a 2D grid game compared to baseline methods.
Authors:Caiyi Sun, Yujing Sun, Xiao Han, Zemin Yang, Jiawei Liu, Xinge Zhu, Siu Ming Yiu, Yuexin Ma
Abstract:
Complex scenes present significant challenges for predicting human behaviour due to the abundance of interaction information, such as human-human and human-environment interactions. These factors complicate the analysis and understanding of human behaviour, thereby increasing the uncertainty in forecasting human motions. Existing motion prediction methods thus struggle in these complex scenarios. In this paper, we propose an effective method for human motion forecasting in interactive scenes. To achieve a comprehensive representation of interactions, we design a hierarchical interaction feature representation so that high-level features capture the overall context of the interactions, while low-level features focus on fine-grained details. Besides, we propose a coarse-to-fine interaction reasoning module that leverages both spatial and frequency perspectives to efficiently utilize hierarchical features, thereby enhancing the accuracy of motion predictions. Our method achieves state-of-the-art performance across four public datasets. Code will be released when this paper is published.
中文: 本文提出了一种有效的人类运动预测方法,通过分层交互特征和由粗到精的推理模块,在复杂交互场景中提高了预测准确性,并在多个公开数据集上达到了最优性能。
English: This paper introduces an effective human motion forecasting method that uses hierarchical interaction features and a coarse-to-fine reasoning module to improve prediction accuracy in complex interactive scenes, achieving state-of-the-art results on multiple datasets.
Authors:Zhiyuan Yu, Zhe Li, Hujun Bao, Can Yang, Xiaowei Zhou
Abstract:
3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets. Video results are available at https://zju3dv.github.io/humanram/.
中文摘要:HumanRAM是一种新颖的前馈方法,通过显式姿态条件和可扩展变换器将人体重建与动画统一起来,仅需单目或稀疏图像输入即可实现优于现有方法的重建精度与动画保真度。
English Summary: HumanRAM is a novel feed-forward approach that unifies human reconstruction and animation using explicit pose conditions and scalable transformers, achieving superior accuracy and fidelity over previous methods from monocular or sparse inputs.
Authors:Guanzhong Chen, Shaoxiong Yang, Chao Li, Wei Liu, Jian Luan, Zenglin Xu
Abstract:
Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks, yet their deployment in real-world applications is hindered by fixed knowledge cutoffs and difficulties in generating controllable, accurate outputs in a single inference. Multi-agent systems (MAS) built from specialized LLM agents offer a promising solution, enabling dynamic collaboration and iterative reasoning. However, optimizing these systems remains a challenge, as conventional methods such as prompt engineering and supervised fine-tuning entail high engineering overhead and limited adaptability. Reinforcement learning (RL), particularly multi-agent reinforcement learning (MARL), provides a scalable framework by refining agent policies based on system-level feedback. Nevertheless, existing MARL algorithms, such as Multi-Agent Proximal Policy Optimization (MAPPO), rely on Critic networks, which can cause training instability and increase computational burden. To address these limitations and target the prototypical Multi-Agent Search System (MASS), we propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), a novel Critic-free algorithm that guides policy updates by estimating relative reward advantages across heterogeneous groups of rollouts. MHGPO eliminates the need for Critic networks, enhancing stability and reducing computational overhead. Additionally, we introduce three group rollout sampling strategies that trade off between efficiency and effectiveness. Experiments on a multi-agent LLM-based search system demonstrate that MHGPO consistently outperforms MAPPO in both task performance and computational efficiency, without requiring warm-up, underscoring its potential for stable and scalable optimization of complex LLM-based MAS.
中文摘要:提出的多智能体异质群体策略优化(MHGPO)算法通过消除评论家网络,解决了现有多智能体强化学习方法的局限性,从而提高了基于大语言模型的多智能体系统的训练稳定性和计算效率。
English Summary: The proposed Multi-Agent Heterogeneous Group Policy Optimization (MHGPO) algorithm addresses limitations of existing multi-agent reinforcement learning methods by eliminating Critic networks, thereby improving training stability and computational efficiency in large language model-based multi-agent systems.
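The Critic-free core of the method -- estimating a rollout's advantage relative to the rewards of its own group rather than a learned value function -- can be sketched as follows; the grouping and normalization details here are assumptions, not MHGPO's exact rule.

```python
import numpy as np

# Critic-free advantage estimation in the spirit of MHGPO: each rollout's
# advantage is its reward relative to its group's statistics, so no value
# network is trained. Normalization details are illustrative assumptions.

def group_relative_advantages(rewards_per_group):
    """rewards_per_group: list of 1-D arrays, one array per rollout group."""
    advantages = []
    for rewards in rewards_per_group:
        r = np.asarray(rewards, dtype=np.float64)
        adv = (r - r.mean()) / (r.std() + 1e-8)   # no learned Critic needed
        advantages.append(adv)
    return advantages

# Example: two heterogeneous groups of rollouts with system-level rewards.
print(group_relative_advantages([[1.0, 0.0, 0.5], [0.2, 0.9]]))
```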
Authors:Salman Rahman, Sheriff Issaka, Ashima Suvarna, Genglin Liu, James Shiffer, Jaeyoung Lee, Md Rizwan Parvez, Hamid Palangi, Shi Feng, Nanyun Peng, Yejin Choi, Julian Michael, Liwei Jiang, Saadia Gabriel
Abstract:
As AI grows more powerful, it will increasingly shape how we understand the world. But with this influence comes the risk of amplifying misinformation and deepening social divides -- especially on consequential topics like public health, where factual accuracy directly impacts well-being. Scalable Oversight aims to ensure AI truthfulness by enabling humans to supervise systems that may exceed human capabilities -- yet humans themselves hold different beliefs and biases that impair their judgment. We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial COVID-19 factuality claims where people hold strong prior beliefs. We conduct two studies: one with human judges holding either mainstream or skeptical beliefs evaluating factuality claims through AI-assisted debate or consultancy protocols, and a second examining the same problem with personalized AI judges designed to mimic these different human belief systems. In our human study, we find that debate -- where two AI advisor systems present opposing evidence-based arguments -- consistently improves judgment accuracy and confidence calibration, outperforming consultancy with a single-advisor system by 10% overall. The improvement is most significant for judges with mainstream beliefs (+15.2% accuracy), though debate also helps skeptical judges who initially misjudge claims move toward accurate views (+4.7% accuracy). In our AI judge study, we find that AI judges with human-like personas achieve even higher accuracy (78.5%) than human judges (70.1%) and default AI judges without personas (69.8%), suggesting their potential for supervising frontier AI models. These findings highlight AI debate as a promising path toward scalable, bias-resilient oversight -- leveraging both diverse human and AI judgments to move closer to truth in contested domains.
中文: AI辩论通过让两个系统就争议性主张进行对立辩论,显著提升了判断准确性并优化了信心校准,尤其对主流信念持有者效果明显,为在公共卫生等争议领域实现可扩展的偏见弹性监督提供了可行路径。
English: AI debate between opposing systems significantly improves judgment accuracy and confidence calibration, particularly for mainstream-belief holders, offering a scalable oversight method that leverages diverse human and AI judgments to approach truth in contested domains like public health.
Authors:Timothy Do, Pranav Saran, Harshita Poojary, Pranav Prabhu, Sean O'Brien, Vasu Sharma, Kevin Zhu
Abstract:
In this paper, we address the persistent challenges that figurative language expressions pose for natural language processing (NLP) systems, particularly in low-resource languages such as Konkani. We present a hybrid model that integrates a pre-trained Multilingual BERT (mBERT) with a bidirectional LSTM and a linear classifier. This architecture is fine-tuned on a newly introduced annotated dataset for metaphor classification, developed as part of this work. To improve the model's efficiency, we implement a gradient-based attention head pruning strategy. For metaphor classification, the pruned model achieves an accuracy of 78%. We also applied our pruning approach to expand on an existing idiom classification task, achieving 83% accuracy. These results demonstrate the effectiveness of attention head pruning for building efficient NLP tools in underrepresented languages.
中文: 本文提出了一种结合多语言BERT、双向LSTM和线性分类器的混合模型,通过基于梯度的注意力头剪枝技术,在孔卡尼语等低资源语言的隐喻分类任务中达到78%准确率,在习语分类中达到83%准确率。
English: This paper introduces a hybrid model combining mBERT with a bidirectional LSTM and linear classifier, enhanced by gradient-based attention head pruning, achieving 78% accuracy in metaphor classification and 83% in idiom classification for low-resource languages like Konkani.
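Gradient-based head importance scoring of the kind used for pruning can be sketched as below; `model_forward` is a hypothetical callable that accepts a per-head mask (as HuggingFace BERT-style models do via `head_mask`), and the paper's exact scoring may differ.

```python
import torch

# Sketch of gradient-based attention head importance scoring: attach a
# per-head mask, accumulate |dLoss/dMask| over batches, then prune the
# lowest-scoring heads. `model_forward` and `batches` are placeholders.

def head_importance(model_forward, batches, n_layers, n_heads, device="cpu"):
    mask = torch.ones(n_layers, n_heads, device=device, requires_grad=True)
    scores = torch.zeros(n_layers, n_heads, device=device)
    for batch in batches:
        loss = model_forward(batch, head_mask=mask)  # model must consume head_mask
        grads, = torch.autograd.grad(loss, mask)     # sensitivity of loss to each head
        scores += grads.abs().detach()
    return scores  # prune heads with the smallest accumulated scores
```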
Authors:Abhay Gupta, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma
Abstract:
Current large language models (LLMs) struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. While prior benchmarks explore long-context comprehension or multi-hop reasoning in isolation, none jointly vary context length and reasoning depth in natural narrative settings. We introduce NovelHopQA, the first benchmark to evaluate 1-4 hop QA over 64k-128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines. We evaluate seven state-of-the-art models and apply oracle-context filtering to ensure all questions are genuinely answerable. Human annotators validate both alignment and hop depth. We additionally present retrieval-augmented generation (RAG) evaluations to test model performance when only selected passages are provided instead of the full context. We noticed consistent accuracy drops with increased hops and context length increase, even for frontier models-revealing that sheer scale does not guarantee robust reasoning. Failure-mode analysis highlights common breakdowns such as missed final-hop integration and long-range drift. NovelHopQA offers a controlled diagnostic setting to test multi-hop reasoning at scale. All code and datasets are available at https://novelhopqa.github.io.
中文:NovelHopQA基准通过小说节选评估长文本中的多跳推理能力,发现即使先进模型在上下文长度和推理深度增加时准确率也会下降,尽管所有问题均可回答。
English: The NovelHopQA benchmark evaluates multi-hop reasoning in long-context scenarios using novel excerpts, revealing that even advanced models struggle with accuracy as context length and reasoning depth increase, despite all questions being answerable.
Authors:Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Nathan Lambert
Abstract:
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and other domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score, on average, about 20 points lower on RewardBench 2 than on the first RewardBench -- while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.
中文摘要:RewardBench 2是一个采用全新人类提示构建的奖励模型评估基准,相比前代基准得分显著降低,但与下游任务性能保持高度相关性。
English Summary: RewardBench 2 is a challenging new benchmark that evaluates reward models using original human prompts, showing significantly lower scores than its predecessor while maintaining strong correlation with downstream task performance.
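Accuracy-based reward model evaluation reduces to a selection test: the model must score the chosen response above every rejected alternative for the same prompt. A generic sketch follows (the benchmark's exact protocol and tie-breaking may differ).

```python
# Generic accuracy-based reward model evaluation: the model is counted
# correct when it ranks the chosen response above all rejected responses.
# `score(prompt, response) -> float` is an assumed scoring callable.

def rm_accuracy(examples, score):
    """examples: list of (prompt, chosen, rejected_list) tuples."""
    correct = 0
    for prompt, chosen, rejected in examples:
        s_chosen = score(prompt, chosen)
        if all(s_chosen > score(prompt, r) for r in rejected):
            correct += 1
    return correct / len(examples)
```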
Authors:Yifan Hao, Xingyuan Pan, Hanning Zhang, Chenlu Ye, Rui Pan, Tong Zhang
Abstract:
Supervised fine-tuning (SFT) on domain-specific data is the dominant approach for adapting foundation models to specialized tasks. However, it has been observed that SFT models tend to forget knowledge acquired during pretraining. In vision models, ensembling a pretrained model with its fine-tuned counterpart has been shown to mitigate this issue. In this work, we demonstrate that the same holds for language models, and, more strikingly, we observe an overadaptation phenomenon: the ensemble model not only retains general knowledge from the foundation model but also outperforms the fine-tuned model even on the fine-tuning domain itself. Despite the empirical success of ensembling, a theoretical understanding of its benefits remains underexplored. We develop a formal theoretical analysis of the overadaptation phenomenon, tracing it to two primary sources of error: bias, caused by insufficient fine-tuning, and variance, introduced by overfitting to fine-tuning data; ensembling mitigates overadaptation by balancing the two. While regularization techniques aim to address this trade-off, we show that ensembling provides a more effective solution. We analyze this phenomenon in over-parameterized linear settings and demonstrate that interpolating between pretrained and fine-tuned weights significantly improves performance. These findings offer theoretical justification for the observed advantages of model ensembling, supported by empirical experiments consistent with our analysis.
中文摘要:将预训练语言模型与其微调版本集成,通过平衡偏差和方差缓解了知识遗忘和过度适应问题,即使在微调领域也优于单独微调模型,同时保留了通用知识。
English Summary: Ensembling a pretrained language model with its fine-tuned version mitigates knowledge forgetting and overadaptation by balancing bias and variance, outperforming the fine-tuned model even on its own domain while preserving general knowledge.
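The interpolation studied here is a weight-space ensemble: a convex combination of the pretrained and fine-tuned checkpoints. A minimal PyTorch sketch, assuming the two state dicts share keys and shapes:

```python
import torch

# Weight-space interpolation between a pretrained and a fine-tuned checkpoint;
# alpha in [0, 1] trades off bias (insufficient fine-tuning) against variance
# (overfitting to the fine-tuning data).

def interpolate_state_dicts(pretrained_sd, finetuned_sd, alpha=0.5):
    return {
        name: (1.0 - alpha) * pretrained_sd[name] + alpha * finetuned_sd[name]
        for name in pretrained_sd
    }

# Usage sketch: model.load_state_dict(interpolate_state_dicts(sd_pre, sd_ft, 0.5))
```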
Authors:Yiming Zhong, Yumeng Liu, Chuyang Xiao, Zemin Yang, Youzhuo Wang, Yufei Zhu, Ye Shi, Yujing Sun, Xinge Zhu, Yuexin Ma
Abstract:
Learning effective visuomotor policies for robotic manipulation is challenging, as it requires generating precise actions while maintaining computational efficiency. Existing methods remain unsatisfactory due to inherent limitations in the essential action representation and the basic network architectures. We observe that representing actions in the frequency domain captures the structured nature of motion more effectively: low-frequency components reflect global movement patterns, while high-frequency components encode fine local details. Additionally, robotic manipulation tasks of varying complexity demand different levels of modeling precision across these frequency bands. Motivated by this, we propose a novel paradigm for visuomotor policy learning that progressively models hierarchical frequency components. To further enhance precision, we introduce continuous latent representations that maintain smoothness and continuity in the action space. Extensive experiments across diverse 2D and 3D robotic manipulation benchmarks demonstrate that our approach outperforms existing methods in both accuracy and efficiency, showcasing the potential of a frequency-domain autoregressive framework with continuous tokens for generalized robotic manipulation.
Chinese: 本文提出了一种频域自回归框架,通过连续潜在表示渐进建模分层频率分量,有效提升了机器人视觉运动策略在操作任务中的精确度和效率。
English: This paper introduces a frequency-domain autoregressive framework with continuous latent representations for robotic visuomotor policy learning, which progressively models hierarchical frequency components to enhance both accuracy and efficiency in manipulation tasks.
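The frequency-domain intuition above -- low-frequency components carry global movement, high-frequency components carry fine detail -- can be illustrated with a DCT on a 1-D action trajectory; this only demonstrates the representation, not the paper's autoregressive model.

```python
import numpy as np
from scipy.fft import dct, idct

# Coarse-to-fine view of an action trajectory in the frequency domain:
# keeping only the low-frequency DCT coefficients recovers the global
# motion pattern; higher bands add local detail.

def coarse_to_fine(actions, keep):
    """actions: (T,) trajectory for one action dimension; keep: #low-freq coeffs."""
    coeffs = dct(actions, norm="ortho")
    coeffs[keep:] = 0.0                       # retain only the coarse band
    return idct(coeffs, norm="ortho")

t = np.linspace(0, 1, 64)
traj = np.sin(2 * np.pi * t) + 0.1 * np.sin(20 * np.pi * t)
coarse = coarse_to_fine(traj, keep=8)         # global movement pattern only
```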
Authors:Changsheng Wang, Yihua Zhang, Jinghan Jia, Parikshit Ram, Dennis Wei, Yuguang Yao, Soumyadeep Pal, Nathalie Baracaldo, Sijia Liu
Abstract:
Machine unlearning offers a promising solution to privacy and safety concerns in large language models (LLMs) by selectively removing targeted knowledge while preserving utility. However, current methods are highly sensitive to downstream fine-tuning, which can quickly recover forgotten information -- even from unrelated tasks. To address this, we introduce invariance into unlearning for the first time, inspired by invariant risk minimization (IRM). Building on this principle, we propose invariant LLM unlearning (ILU), a regularization-based framework that enhances robustness. Notably, ILU generalizes well to diverse fine-tuning tasks, even when trained using a single dataset. A task vector analysis is also provided to further elucidate the rationale behind ILU's effectiveness. Extensive experiments on the WMDP and MUSE benchmarks reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving fine-tuning performance.
中文: 机器遗忘通过选择性消除知识解决大语言模型的隐私与安全问题,但现有方法易受下游微调影响,因此引入不变性大语言模型遗忘(ILU)框架,显著提升鲁棒性并在多样化任务中实现卓越性能。
English: Machine unlearning addresses privacy and safety in LLMs by selectively removing knowledge, but current methods are vulnerable to downstream fine-tuning, prompting the introduction of invariant LLM unlearning (ILU) for enhanced robustness and superior performance across diverse tasks.
Authors:Lang Xiong, Raina Gao, Alyssa Jeong, Yicheng Fu, Sean O'Brien, Vasu Sharma, Kevin Zhu
Abstract:
Sarcasm is a form of humor where expressions convey meanings opposite to their literal interpretations. Classifying and generating sarcasm using large language models is vital for interpreting human communication. Sarcasm poses challenges for computational models due to its nuanced nature. We introduce Sarc7, a benchmark that classifies 7 types of sarcasm -- self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic -- by annotating entries of the MUStARD dataset. Classification was evaluated using zero-shot, few-shot, chain-of-thought (CoT), and a novel emotion-based prompting technique. We propose an emotion-based generation method developed by identifying key components of sarcasm: incongruity, shock value, and context dependency. Our classification experiments show that Gemini 2.5, using emotion-based prompting, outperforms other setups with an F1 score of 0.3664. Human evaluators preferred our emotion-based prompting, with 38.46% more successful generations than zero-shot prompting.
Chinese: 本研究提出了Sarc7基准,用于对七种讽刺类型进行分类,并证明在大型语言模型中使用基于情感的提示方法,相比传统技术能显著提升分类和生成效果。
English: The study introduces Sarc7, a benchmark for classifying seven types of sarcasm, and demonstrates that emotion-based prompting in large language models like Gemini 2.5 significantly improves classification and generation performance over traditional methods.
Authors:Arjun Prasaath Anbazhagan, Parteek Kumar, Ujjwal Kaur, Aslihan Akalin, Kevin Zhu, Sean O'Brien
Abstract:
How does the textual representation of audio relate to what Large Language Models (LLMs) learn about the audio world? This research investigates the extent to which LLMs can be prompted to generate audio, despite their primary training in textual data. We employ a three-tier approach, progressively increasing the complexity of audio generation: 1) Musical Notes, 2) Environmental Sounds, and 3) Human Speech. To bridge the gap between text and audio, we leverage code as an intermediary, prompting LLMs to generate code that, when executed, produces the desired audio output. To evaluate the quality and accuracy of the generated audio, we employ FAD and CLAP scores. Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases. This suggests that while LLMs possess a latent understanding of the auditory world, their ability to translate this understanding into tangible audio output remains rudimentary. Further research into techniques that enhance the quality and diversity of LLM-generated audio could improve the performance of text-based LLMs in generating audio.
中文: 本研究通过代码作为中介探索大语言模型如何根据文本提示生成音频,发现尽管模型能理解基本听觉概念,但随着音频复杂度增加其生成能力会下降,需进一步改进。
English: This study explores how large language models (LLMs) can generate audio through text prompts by using code as an intermediary, finding that while they grasp basic auditory concepts, their audio generation ability declines with complexity and requires further enhancement.
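For the simplest tier (musical notes), the code-as-intermediary idea means the LLM emits a program like the following, which synthesizes a 440 Hz tone and writes it to a WAV file; the parameters here are illustrative, not drawn from the study.

```python
import numpy as np
import wave

# The kind of code-as-intermediary output an LLM might produce for the
# musical-notes tier: synthesize a sine tone and save it as 16-bit PCM WAV.

def write_note(path="note_a4.wav", freq=440.0, seconds=1.0, rate=44100):
    t = np.linspace(0, seconds, int(rate * seconds), endpoint=False)
    samples = (0.5 * np.sin(2 * np.pi * freq * t) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit PCM
        f.setframerate(rate)
        f.writeframes(samples.tobytes())

write_note()
```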
Authors:Thanh-Tung Phan-Nguyen, Khoi-Nguyen Nguyen-Ngoc, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract:
The global fashion e-commerce industry has become integral to people's daily lives, leveraging technological advancements to offer personalized shopping experiences, primarily through recommendation systems that enhance customer engagement through personalized suggestions. To improve customers' experience in online shopping, we propose a novel comprehensive KiseKloset system for outfit retrieval, recommendation, and try-on. We explore two approaches for outfit retrieval: similar item retrieval and text feedback-guided item retrieval. Notably, we introduce a novel transformer architecture designed to recommend complementary items from diverse categories. Furthermore, we enhance the overall performance of the search pipeline by integrating approximate algorithms to optimize the search process. Additionally, addressing the crucial needs of online shoppers, we employ a lightweight yet efficient virtual try-on framework capable of real-time operation, memory efficiency, and maintaining realistic outputs compared to its predecessors. This virtual try-on module empowers users to visualize specific garments on themselves, enhancing the customers' experience and reducing costs associated with damaged items for retailers. We deployed our end-to-end system for online users to test and provide feedback, enabling us to measure their satisfaction levels. The results of our user study revealed that 84% of participants found our comprehensive system highly useful, significantly improving their online shopping experience.
中文摘要:KiseKloset系统通过先进的套装检索、推荐和实时虚拟试穿功能提升在线时尚购物体验,用户测试显示84%的参与者认为该系统显著改善了购物满意度。
English Summary: The KiseKloset system enhances online fashion shopping through advanced outfit retrieval, recommendation, and a real-time virtual try-on module, with user testing showing 84% satisfaction for its effectiveness in improving the shopping experience.
Authors:Ngoc-Do Tran, Minh-Tuan Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract:
The rapid advancement of AI and computer vision has significantly increased the demand for high-quality annotated datasets, particularly for semantic segmentation. However, creating such datasets is resource-intensive, requiring substantial time, labor, and financial investment, and often raises privacy concerns due to the use of real-world data. To mitigate these challenges, we present SynthLab, consisting of a modular platform for visual data synthesis and a user-friendly interface. The modular architecture of SynthLab enables easy maintenance, scalability with centralized updates, and seamless integration of new features. Each module handles distinct aspects of computer vision tasks, enhancing flexibility and adaptability. Meanwhile, its interactive, user-friendly interface allows users to quickly customize their data pipelines through drag-and-drop actions. Extensive user studies involving a diverse range of users across different ages, professions, and expertise levels have demonstrated the flexibility and high accessibility of SynthLab, enabling users without deep technical expertise to harness AI for real-world applications.
中文: 人工智能和计算机视觉的快速发展加大了对高质量标注数据集的需求,为此我们推出了SynthLab,这是一个模块化平台,配有用户友好的界面,能够高效合成和定制数据,使非专业用户也能轻松应用AI技术。
English: The rapid advancement of AI and computer vision has heightened the need for high-quality annotated datasets, leading to the development of SynthLab, a modular platform with a user-friendly interface that enables efficient data synthesis and customization, making AI accessible to users without deep technical expertise.
Authors:Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao
Abstract:
This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
中文: 本研究提出一种自监督方法,通过图像三元组和基于规则的强化学习训练视觉语言模型进行思维链推理,使其无需人工标注即可从视觉比较泛化至复杂多图像及通用视觉任务。
English: This study develops a self-supervised method using image triplets and rule-based reinforcement learning to train Vision-Language Models for Chain-of-Thought reasoning, enabling them to generalize from visual comparisons to complex multi-image and general vision tasks without human annotations.
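The supervision signal comes entirely from how the triplets are built: two augmented views of one image are "same", the mined neighbor is "different". A sketch under assumed helpers `augment` and `find_similar` (any standard augmentation and nearest-neighbor lookup would do):

```python
# Sketch of the self-supervised triplet construction and the rule-based
# reward; `augment` and `find_similar` are assumed helpers, not the
# authors' exact pipeline.

def build_triplet(image, pool, augment, find_similar):
    view_a, view_b = augment(image), augment(image)     # two views, same image
    distractor = find_similar(image, pool)              # similar but distinct image
    return view_a, view_b, distractor

def reward(model_answer: str, is_same_pair: bool) -> float:
    """Rule-based RL signal: +1 when the verdict matches the construction."""
    predicted_same = "same" in model_answer.lower()
    return 1.0 if predicted_same == is_same_pair else 0.0
```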
Authors:Minh-Loi Nguyen, Quang-Khai Le, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract:
Storytelling is a deeply personal and creative process, yet existing methods often treat users as passive consumers, offering generic plots with limited personalization. This undermines engagement and immersion, especially where individual style or appearance is crucial. We introduce TaleForge, a personalized story-generation system that integrates large language models (LLMs) and text-to-image diffusion to embed users' facial images within both narratives and illustrations. TaleForge features three interconnected modules: Story Generation, where LLMs create narratives and character descriptions from user prompts; Personalized Image Generation, merging users' faces and outfit choices into character illustrations; and Background Generation, creating scene backdrops that incorporate personalized characters. A user study demonstrated heightened engagement and ownership when individuals appeared as protagonists. Participants praised the system's real-time previews and intuitive controls, though they requested finer narrative editing tools. TaleForge advances multimodal storytelling by aligning personalized text and imagery to create immersive, user-centric experiences.
Chinese: TaleForge是一个个性化故事生成系统,它融合大语言模型与文生图扩散技术,将用户面部图像嵌入叙事和插画,通过故事生成、角色定制和背景创作三大联动模块提升用户参与感与归属感。
English: TaleForge is a personalized story-generation system that combines large language models and text-to-image diffusion to integrate users' facial images into narratives and illustrations, enhancing engagement and ownership through three interconnected modules for story, character, and background creation.
Authors:Peihao Wang, Zhangyang Wang
Abstract:
We develop a theoretical framework that explains how discrete symbolic structures can emerge naturally from continuous neural network training dynamics. By lifting neural parameters to a measure space and modeling training as Wasserstein gradient flow, we show that under geometric constraints, such as group invariance, the parameter measure $\mu_t$ undergoes two concurrent phenomena: (1) a decoupling of the gradient flow into independent optimization trajectories over some potential functions, and (2) a progressive contraction of the degrees of freedom. These potentials encode algebraic constraints relevant to the task and act as ring homomorphisms under a commutative semi-ring structure on the measure space. As training progresses, the network transitions from high-dimensional exploration to compositional representations that comply with algebraic operations and exhibit fewer degrees of freedom. We further establish data scaling laws for realizing symbolic tasks, linking representational capacity to the group invariance that facilitates symbolic solutions. This framework charts a principled foundation for understanding and designing neurosymbolic systems that integrate continuous learning with discrete algebraic reasoning.
中文摘要:该理论框架通过几何约束和梯度流揭示了离散符号结构如何从连续神经网络训练中自然涌现,形成符合代数运算的组合表征并降低自由度。
English Summary: This theoretical framework demonstrates how discrete symbolic structures emerge from continuous neural training through geometric constraints and gradient flow, leading to compositional representations and reduced degrees of freedom.
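For readers wanting the standard form being invoked: a Wasserstein gradient flow of an energy functional $F$ over the parameter measure $\mu_t$ satisfies the continuity equation below (textbook form, not quoted from the paper).

```latex
% Continuity-equation form of a Wasserstein gradient flow for an energy F
\partial_t \mu_t \;=\; \nabla \cdot \Big( \mu_t \, \nabla \frac{\delta F}{\delta \mu}(\mu_t) \Big)
```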
Authors:Duc-Hung Nguyen, Huu-Phuc Huynh, Minh-Triet Tran, Trung-Nghia Le
Abstract:
Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present GenFlow, a novel modular framework that empowers users of all skill levels to generate images with precision and ease. Featuring a node-based editor for seamless customization and an intelligent assistant powered by natural language processing, GenFlow transforms the complexity of workflow creation into an intuitive and accessible experience. By automating deployment processes and minimizing technical barriers, our framework makes cutting-edge generative art tools available to everyone. A user study demonstrated GenFlow's ability to optimize workflows, reduce task completion times, and enhance user understanding through its intuitive interface and adaptive features. These results position GenFlow as a groundbreaking solution that redefines accessibility and efficiency in the realm of generative art.
中文: GenFlow作为突破性模块化框架,通过节点编辑器和智能助手降低技术门槛,使不同水平用户都能高效创作生成艺术,其直观界面与自适应功能显著提升了工作流效率与可及性。
English: GenFlow is a modular framework that democratizes generative art creation through a node-based editor and AI assistant, enabling users of all skill levels to produce images efficiently while reducing technical barriers and workflow complexity.
Authors:Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract:
Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description presents significant challenges. The ROOMELSA challenge limits access to full 3D scene context, complicating reasoning about object appearance, geometry, and semantics. These challenges are intensified by distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. To address this, we propose SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, alongside a robust majority voting strategy. A dedicated preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. Our hybrid retrieval framework leverages both language and shape cues, achieving competitive performance on the ROOMELSA private test set. These results highlight the importance of combining shape priors with language understanding for robust open-world 3D object retrieval.
中文摘要:SAMURAI框架通过结合CLIP语义匹配与基于形状的重新排序及多数投票策略,有效解决了复杂室内环境中三维物体检索的难题,实现了语言理解与形状先验的协同优化。
English Summary: The proposed SAMURAI framework addresses 3D object retrieval challenges by integrating CLIP-based semantic matching with shape-guided re-ranking and majority voting, achieving competitive performance through combined language and shape cues.
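The hybrid scoring can be sketched as a weighted fusion of a CLIP-style semantic similarity and a silhouette-shape term (IoU of binary masks here); the weight and the specific shape cue are assumptions rather than the paper's exact re-ranking rule.

```python
import numpy as np

# Sketch of a hybrid retrieval score in the spirit of SAMURAI: semantic
# similarity between query text and candidate object, fused with a
# silhouette-shape agreement term. Weighting is an illustrative assumption.

def iou(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def hybrid_score(text_emb, obj_emb, query_mask, obj_mask, w_shape=0.3):
    sem = float(np.dot(text_emb, obj_emb))   # embeddings assumed L2-normalized
    return (1 - w_shape) * sem + w_shape * iou(query_mask, obj_mask)
```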
Authors:Lam-Huy Nguyen, Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, Minh-Triet Tran, Trung-Nghia Le
Abstract:
Enforcing helmet regulations among motorcyclists is essential for enhancing road safety and ensuring the effectiveness of traffic management systems. However, automatic detection of helmet violations faces significant challenges due to environmental variability, camera angles, and inconsistencies in the data. These factors hinder reliable detection of motorcycles and riders and disrupt consistent object classification. To address these challenges, we propose VisionGuard, a synergistic multi-stage framework designed to overcome the limitations of frame-wise detectors, especially in scenarios with class imbalance and inconsistent annotations. VisionGuard integrates two key components: Adaptive Labeling and Contextual Expander modules. The Adaptive Labeling module is a tracking-based refinement technique that enhances classification consistency by leveraging a tracking algorithm to assign persistent labels across frames and correct misclassifications. The Contextual Expander module improves recall for underrepresented classes by generating virtual bounding boxes with appropriate confidence scores, effectively addressing the impact of data imbalance. Experimental results show that VisionGuard improves overall mAP by 3.1% compared to baseline detectors, demonstrating its effectiveness and potential for real-world deployment in traffic surveillance systems, ultimately promoting safety and regulatory compliance.
中文: VisionGuard是一个多阶段框架,通过提升分类一致性和解决数据不平衡问题,有效检测头盔违规行为,使mAP提高3.1%,从而增强交通安全。
English: VisionGuard is a multi-stage framework that enhances helmet violation detection by improving classification consistency and addressing data imbalance, achieving a 3.1% mAP increase for better traffic safety.
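The Adaptive Labeling idea -- letting a track's majority class overwrite isolated per-frame misclassifications -- reduces to a few lines once detections are linked by a tracker; a minimal sketch:

```python
from collections import Counter

# Sketch of tracking-based label refinement: once a tracker links detections
# of the same rider across frames, the track-level majority label replaces
# per-frame predictions, correcting isolated misclassifications.

def refine_track_labels(tracks):
    """tracks: dict track_id -> list of per-frame class predictions."""
    refined = {}
    for track_id, frame_labels in tracks.items():
        majority, _ = Counter(frame_labels).most_common(1)[0]
        refined[track_id] = [majority] * len(frame_labels)
    return refined

# Example: one noisy frame in a 'helmet' track gets corrected.
print(refine_track_labels({7: ["helmet", "helmet", "no_helmet", "helmet"]}))
```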
Authors:Quoc-Duy Tran, Anh-Tuan Vo, Dinh-Khoi Vo, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract:
Humans possess a unique ability to perceive meaningful patterns in ambiguous stimuli, a cognitive phenomenon known as pareidolia. This paper introduces the Shape2Animal framework, which mimics this imaginative capacity by reinterpreting natural object silhouettes, such as clouds, stones, or flames, as plausible animal forms. Our automated framework first performs open-vocabulary segmentation to extract the object silhouette and interprets semantically appropriate animal concepts using vision-language models. It then synthesizes an animal image that conforms to the input shape, leveraging a text-to-image diffusion model, and seamlessly blends it into the original scene to generate visually coherent and spatially consistent compositions. We evaluated Shape2Animal on a diverse set of real-world inputs, demonstrating its robustness and creative potential. Our Shape2Animal can offer new opportunities for visual storytelling, educational content, digital art, and interactive media design. Our project page is here: https://shape2image.github.io
中文:Shape2Animal框架通过视觉语言模型和扩散模型自动将自然物体轮廓重新诠释为动物形态,模拟了人类的空想性错视现象,为视觉叙事和数字艺术开辟了创新应用前景。
English: The Shape2Animal framework mimics human pareidolia by automatically reinterpreting natural object silhouettes as animal forms using vision-language and diffusion models, demonstrating creative applications in visual storytelling and digital art.
Authors:Quang-Binh Nguyen, Trong-Vu Hoang, Ngoc-Do Tran, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract:
While the efficacy of deep learning models heavily relies on data, gathering and annotating data for specific tasks, particularly when addressing novel or sensitive subjects lacking relevant datasets, poses significant time and resource challenges. In response to this, we propose a novel Automated Image Recognition (AIR) framework that harnesses the power of generative AI. AIR empowers end-users to synthesize high-quality, pre-annotated datasets, eliminating the necessity for manual labeling. It also automatically trains deep learning models on the generated datasets with robust image recognition performance. Our framework includes two main data synthesis processes, AIR-Gen and AIR-Aug. AIR-Gen enables end-users to seamlessly generate datasets tailored to their specifications. To improve image quality, we introduce a novel automated prompt engineering module that leverages the capabilities of large language models. We also introduce a distribution adjustment algorithm to eliminate duplicates and outliers, enhancing the robustness and reliability of generated datasets. Meanwhile, AIR-Aug enhances a given dataset, thereby improving the performance of deep classifier models. AIR-Aug is particularly beneficial when users have limited data for specific tasks. Through comprehensive experiments, we demonstrated the efficacy of our generated data in training deep learning models and showcased the system's potential to provide image recognition models for a wide range of objects. We also conducted a user study that achieved an impressive score of 4.4 out of 5.0, underscoring the AI community's positive perception of AIR.
中文:本文提出的自动图像识别(AIR)框架利用生成式AI自动创建高质量预标注数据集并训练深度学习模型,通过实验验证和用户研究证明其能有效解决数据稀缺与标注难题,且获得业界积极认可。
English: The proposed Automated Image Recognition (AIR) framework utilizes generative AI to automatically create high-quality, pre-annotated datasets and train deep learning models, effectively addressing data scarcity and annotation challenges while demonstrating strong performance through experiments and positive user feedback.
Authors:Hassan S. Al Khatib, Subash Neupane, Sudip Mittal, Shahram Rahimi, Nina Marhamati, Sean Bozorgzad
Abstract:
The healthcare industry is moving towards a patient-centric paradigm that requires advanced methods for managing and representing patient data. This paper presents a Patient Journey Ontology (PJO), a framework that aims to capture the entirety of a patient's healthcare encounters. Utilizing ontologies, the PJO integrates different patient data sources like medical histories, diagnoses, treatment pathways, and outcomes; it enables semantic interoperability and enhances clinical reasoning. By capturing temporal, sequential, and causal relationships between medical encounters, the PJO supports predictive analytics, enabling earlier interventions and optimized treatment plans. The ontology's structure, including its main classes, subclasses, properties, and relationships, as detailed in the paper, demonstrates its ability to provide a holistic view of patient care. Quantitative and qualitative evaluations by Subject Matter Experts (SMEs) demonstrate strong capabilities in patient history retrieval, symptom tracking, and provider interaction representation, while identifying opportunities for enhanced diagnosis-symptom linking. These evaluations reveal the PJO's reliability and practical applicability, demonstrating its potential to enhance patient outcomes and healthcare efficiency. This work contributes to the ongoing efforts of knowledge representation in healthcare, offering a reliable tool for personalized medicine, patient journey analysis and advancing the capabilities of Generative AI in healthcare applications.
中文: 本文提出了一种患者旅程本体(PJO),通过整合多样化的患者数据实现语义互操作性,支持预测性分析并增强临床推理,从而提升医疗效果和效率。
English: This paper introduces a Patient Journey Ontology (PJO) that integrates diverse patient data to enable semantic interoperability, support predictive analytics, and enhance clinical reasoning for improved healthcare outcomes and efficiency.
Authors:Trong-Vu Hoang, Quang-Binh Nguyen, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract:
Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and employs a disentangled learning approach with a novel attention regularization objective to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses the learned models from ShowFlow-S to support multi-concept generation without extra conditions, incorporating Subject-Adaptive Matching Attention (SAMA) and a layout consistency strategy as plug-and-play modules. Extensive experiments and user studies validate ShowFlow's effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing.
中文摘要:ShowFlow是一个综合框架,包含用于单概念图像生成的ShowFlow-S(采用KronA-WED适配器和解耦学习)和用于多概念生成的ShowFlow-M(直接复用训练模型并配备即插即用SAMA模块),有效解决了可控图像合成中的身份保持与提示对齐难题。
English Summary: ShowFlow is a comprehensive framework that introduces ShowFlow-S for single-concept image generation using a KronA-WED adapter and disentangled learning, and ShowFlow-M for multi-concept generation which reuses learned models with a plug-and-play SAMA module, effectively addressing identity preservation and prompt alignment challenges in controllable image synthesis.
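The Kronecker adapter at the heart of KronA-WED updates a frozen weight with a Kronecker product of two small trainable factors; the sketch below shows only that core update (the weight/embedding decomposition and scaling are omitted), with illustrative shapes.

```python
import torch

# Core Kronecker-adapter update: a frozen weight W (out x in) is adapted by
# kron(A, B), whose factor shapes multiply to W's shape. The WED decomposition
# and scaling in the paper are beyond this illustration.

def krona_delta(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    return torch.kron(A, B)          # (a1*b1) x (a2*b2) low-parameter update

W = torch.zeros(8, 12)               # frozen base weight
A = torch.randn(2, 3) * 0.01         # small trainable factors
B = torch.randn(4, 4) * 0.01
W_adapted = W + krona_delta(A, B)    # shapes: (2*4) x (3*4) == 8 x 12
```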
Authors:Dinh-Khoi Vo, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract:
Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate interference with regions not intended for modification during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques.
中文: 本文提出CPAM零样本框架,通过自适应注意力机制和遮罩引导策略,在复杂非刚性图像编辑中有效保持物体纹理与背景细节。
English: This paper introduces CPAM, a zero-shot framework that effectively preserves object textures and background details during complex non-rigid image editing through adaptive attention mechanisms and mask guidance strategies.
Authors:Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract:
We introduce OpenEvents V1, a large-scale benchmark dataset designed to advance event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that focus on surface-level descriptions, the OpenEvents V1 dataset emphasizes contextual and temporal grounding through three primary tasks: (1) generating rich, event-aware image captions, (2) retrieving event-relevant news articles from image queries, and (3) retrieving event-relevant images from narrative-style textual queries. The dataset comprises over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for all tasks. OpenEvents V1 establishes a robust foundation for developing multimodal AI systems capable of deep reasoning over complex real-world events. The dataset is publicly available at https://ltnghia.github.io/eventa/openevents-v1.
中文: OpenEvents V1是一个大规模事件导向的视觉语言数据集,包含20万篇新闻文章和40万张图片,旨在通过事件感知描述和跨模态检索等任务推动多模态人工智能系统的发展。
English: OpenEvents V1 is a large-scale event-centric vision-language dataset featuring over 200,000 news articles and 400,000 images, designed for tasks like event-aware captioning and cross-modal retrieval to advance multimodal AI systems.
Authors:Bo Pan, Yang Chen, Yingwei Pan, Ting Yao, Wei Chen, Tao Mei
Abstract:
Perpetual view generation aims to synthesize a long-term video corresponding to an arbitrary camera trajectory solely from a single input image. Recent methods commonly utilize a pre-trained text-to-image diffusion model to synthesize new content for previously unseen regions along the camera movement. However, the underlying 2D diffusion model lacks 3D awareness and produces distorted artifacts. Moreover, such methods are limited to generating views of static 3D scenes, neglecting to capture object movements within the dynamic 4D world. To alleviate these issues, we present DreamJourney, a two-stage framework that leverages the world simulation capacity of video diffusion models to trigger a new perpetual scene view generation task with both camera movements and object dynamics. Specifically, in stage I, DreamJourney first lifts the input image to a 3D point cloud and renders a sequence of partial images from a specific camera trajectory. A video diffusion model is then utilized as a generative prior to complete the missing regions and enhance visual coherence across the sequence, producing a cross-view consistent video that adheres to the 3D scene and camera trajectory. Meanwhile, we introduce two simple yet effective strategies (early stopping and view padding) to further stabilize the generation process and improve visual quality. Next, in stage II, DreamJourney leverages a multimodal large language model to produce a text prompt describing object movements in the current view, and uses the video diffusion model to animate the current view with object movements. Stages I and II are repeated recurrently, enabling perpetual dynamic scene view generation. Extensive experiments demonstrate the superiority of our DreamJourney over state-of-the-art methods both quantitatively and qualitatively. Our project page: https://dream-journey.vercel.app.
中文: DreamJourney提出一个两阶段框架,利用视频扩散模型实现永续动态场景视图生成,通过结合相机运动和物体动态来克服先前方法的局限,显著提升了三维一致性与视觉质量。
English: DreamJourney introduces a two-stage framework that leverages video diffusion models to generate perpetual dynamic scene views, addressing limitations of previous methods by incorporating both camera movements and object dynamics for enhanced 3D consistency and visual quality.
Authors:Ruoyu Wang, Tong Yu, Junda Wu, Yao Liu, Julian McAuley, Lina Yao
Abstract:
Visual Language Navigation (VLN) is a fundamental task within the field of Embodied AI, focusing on the ability of agents to navigate complex environments based on natural language instructions. Despite the progress made by existing methods, they often face several common challenges. First, they rely on pre-trained backbone models for visual perception, which struggle with the dynamic viewpoints in VLN scenarios. Second, performance is limited when using pre-trained LLMs or VLMs without fine-tuning, due to the absence of VLN domain knowledge. Third, while fine-tuning LLMs and VLMs can improve results, their computational costs are higher than those without fine-tuning. To address these limitations, we propose Weakly-supervised Partial Contrastive Learning (WPCL), a method that enhances an agent's ability to identify objects from dynamic viewpoints in VLN scenarios by effectively integrating pre-trained VLM knowledge into the perception process, without requiring VLM fine-tuning. Our method enhances the agent's ability to interpret and respond to environmental cues while ensuring computational efficiency. Experimental results show that our method outperforms baseline methods on multiple benchmarks, validating its effectiveness, robustness, and generalizability.
中文: 提出的弱监督部分对比学习方法无需微调即可将预训练视觉语言模型知识融入导航过程,有效增强智能体在动态视角下的物体识别能力,在多个基准测试中展现出优越性能和计算效率。
English: The proposed Weakly-supervised Partial Contrastive Learning (WPCL) method enhances visual navigation agents' object recognition from dynamic viewpoints by integrating pre-trained vision-language model knowledge without fine-tuning, achieving superior performance and computational efficiency across multiple benchmarks.
Authors:Lulu Xue, Shengshan Hu, Wei Lu, Yan Shen, Dongxu Li, Peijin Guo, Ziqi Zhou, Minghui Li, Yanjun Zhang, Leo Yu Zhang
Abstract:
With growing demands for privacy protection, security, and legal compliance (e.g., GDPR), machine unlearning has emerged as a critical technique for ensuring the controllability and regulatory alignment of machine learning models. However, a fundamental challenge in this field lies in effectively verifying whether unlearning operations have been successfully and thoroughly executed. Despite a growing body of work on unlearning techniques, verification methodologies remain comparatively underexplored and often fragmented. Existing approaches lack a unified taxonomy and a systematic framework for evaluation. To bridge this gap, this paper presents the first structured survey of machine unlearning verification methods. We propose a taxonomy that organizes current techniques into two principal categories -- behavioral verification and parametric verification -- based on the type of evidence used to assess unlearning fidelity. We examine representative methods within each category, analyze their underlying assumptions, strengths, and limitations, and identify potential vulnerabilities in practical deployment. In closing, we articulate a set of open problems in current verification research, aiming to provide a foundation for developing more robust, efficient, and theoretically grounded unlearning verification mechanisms.
中文摘要:本文针对机器遗忘中的验证难题,首次提出结构化分类法将现有方法分为行为验证和参数验证两类,系统分析其优劣与局限,并指明未来研究方向。
English Summary: This paper addresses the verification challenges in machine unlearning by proposing the first structured taxonomy that categorizes methods into behavioral and parametric verification, while analyzing their strengths, limitations, and open research problems.
Authors:Xi Chen, Hengshuang Zhao
Abstract:
Interactive segmentation enables users to extract binary masks of target objects through simple interactions such as clicks, scribbles, and boxes. However, existing methods often support only limited interaction forms and struggle to capture fine details. In this paper, we revisit the classical coarse-to-fine design of FocalClick and introduce significant extensions. Inspired by its multi-stage strategy, we propose a novel pipeline, FocalClick-XL, to address these challenges simultaneously. Following the emerging trend of large-scale pretraining, we decompose interactive segmentation into meta-tasks that capture different levels of information -- context, object, and detail -- assigning a dedicated subnet to each level. This decomposition allows each subnet to undergo scaled pretraining with independent data and supervision, maximizing its effectiveness. To enhance flexibility, we share context- and detail-level information across different interaction forms as common knowledge while introducing a prompting layer at the object level to encode specific interaction types. As a result, FocalClick-XL achieves state-of-the-art performance on click-based benchmarks and demonstrates remarkable adaptability to diverse interaction formats, including boxes, scribbles, and coarse masks. Beyond binary mask generation, it is also capable of predicting alpha mattes with fine-grained details, making it a versatile and powerful tool for interactive segmentation.
中文:FocalClick-XL提出了一种新颖的流程,通过为上下文、对象和细节级别分配专用子网络,在交互式分割中实现了最先进的性能,并能灵活适应多种交互形式,同时生成精细的阿尔法蒙版。
English: FocalClick-XL introduces a novel pipeline with dedicated subnets for context, object, and detail levels, achieving state-of-the-art performance in interactive segmentation and adapting flexibly to various interaction forms while producing fine-grained alpha mattes.
Authors:Jinyang Huang, Xiachong Feng, Qiguang Chen, Hanjie Zhao, Zihui Cheng, Jiesong Bai, Jingxuan Zhou, Min Li, Libo Qin
Abstract:
Code debugging is a crucial task in software engineering, which attracts increasing attention. While remarkable success has been made in the era of large language models (LLMs), current research still focuses on the simple no-library or single-library setting, ignoring the complex multi-library scenario in real-world applications. To address this limitation, we make the first attempt to introduce MLDebugging (Multi-Library Debugging), a comprehensive benchmark designed to assess debugging challenges within multi-library Python code. Specifically, MLDebugging encompasses 126 distinct Python libraries, covering a wide range of multi-library code issues, categorized into seven distinct types. Furthermore, we conduct a thorough evaluation of MLDebugging using both mainstream open-source and closed-source LLMs and highlight that current LLMs still struggle to correctly perform code debugging across multi-library scenarios. We hope this work can uncover the potential of LLMs in multi-library debugging scenario and offer insights for future research.
中文: 本文提出了首个多库调试基准MLDebugging,涵盖126个Python库和七类问题,评估发现当前大语言模型在真实多库场景下的代码调试能力仍有不足。
English: This paper introduces MLDebugging, a benchmark for evaluating code debugging in multi-library Python environments, revealing that current large language models still struggle with such complex scenarios despite covering 126 libraries and seven issue types.
Authors:Yubin Kim, Hyewon Jeong, Chanwoo Park, Eugene Park, Haipeng Zhang, Xin Liu, Hyeonhoon Lee, Daniel McDuff, Marzyeh Ghassemi, Cynthia Breazeal, Samir Tulebaev, Hae Won Park
Abstract:
Large language models (LLMs) deployed as agents introduce significant safety risks in clinical settings due to their potential for error and single points of failure. We introduce Tiered Agentic Oversight (TAO), a hierarchical multi-agent system that enhances AI safety through layered, automated supervision. Inspired by clinical hierarchies in hospitals (e.g., nurse-physician-specialist), TAO routes tasks to specialized agents based on complexity, creating a robust safety framework through automated inter- and intra-tier communication and role-playing. Crucially, this hierarchical structure functions as an effective error-correction mechanism, absorbing up to 24% of individual agent errors before they can compound. Our experiments reveal TAO outperforms single-agent and other multi-agent systems on 4 out of 5 healthcare safety benchmarks, with up to an 8.2% improvement. Ablation studies confirm key design principles of the system: (i) its adaptive architecture is over 3% safer than static, single-tier configurations, and (ii) its lower tiers are indispensable, as their removal causes the most significant degradation in overall safety. Finally, we validated the system's synergy with human doctors in a user study where a physician, acting as the highest-tier agent, provided corrective feedback that improved medical triage accuracy from 40% to 60%. Project Page: https://tiered-agentic-oversight.github.io/
中文: 分层智能体监督(TAO)是一种层级式多智能体系统,通过将任务分配给专业智能体来提升临床环境中的AI安全性,可吸收高达24%的个体错误,并在医疗安全基准测试中优于其他系统。
English: Tiered Agentic Oversight (TAO) is a hierarchical multi-agent system that enhances AI safety in clinical settings by routing tasks through specialized agents, reducing individual agent errors by up to 24% and outperforming other systems on healthcare safety benchmarks.
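TAO's complexity-based routing lends itself to a compact illustration. Below is a minimal sketch of tiered oversight where the tier names, the `estimate_complexity` heuristic, and the `Agent` stub are hypothetical stand-ins, not the authors' implementation:

```python
# Minimal sketch of tiered agent routing, loosely following a
# nurse -> physician -> specialist hierarchy. All names and the
# complexity heuristic are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    tier: int
    def review(self, case: str, prior: str | None = None) -> str:
        # A real system would call an LLM here; we return a stub verdict.
        return f"{self.name} reviewed: {case} (prior: {prior})"

TIERS = [Agent("nurse-agent", 1), Agent("physician-agent", 2),
         Agent("specialist-agent", 3)]

def estimate_complexity(case: str) -> int:
    # Hypothetical heuristic: longer case descriptions escalate higher.
    return min(3, 1 + len(case.split()) // 20)

def tiered_oversight(case: str) -> str:
    verdict = None
    for agent in TIERS[: estimate_complexity(case)]:
        # Each tier can correct ("absorb") errors from the tier below.
        verdict = agent.review(case, prior=verdict)
    return verdict

print(tiered_oversight("patient reports chest pain radiating to left arm"))
```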
Authors:Rongpeng Li, Jianhang Zhu, Jiahao Huang, Zhifeng Zhao, Honggang Zhang
Abstract:
Intelligent Transportation Systems (ITSs) have emerged as a promising solution towards ameliorating urban traffic congestion, with Traffic Signal Control (TSC) identified as a critical component. Although Multi-Agent Reinforcement Learning (MARL) algorithms have shown potential in optimizing TSC through real-time decision-making, their scalability and effectiveness often degrade in large-scale, complex environments. These limitations primarily stem from a fundamental mismatch between the exponential growth of the state space driven by environmental heterogeneities and the limited modeling capacity of current solutions. To address these issues, this paper introduces a novel MARL framework that integrates Dynamic Graph Neural Networks (DGNNs) and Topological Data Analysis (TDA), aiming to enhance the expressiveness of environmental representations and improve agent coordination. Furthermore, inspired by the Mixture of Experts (MoE) architecture in Large Language Models (LLMs), a topology-assisted spatial pattern disentangling (TSD)-enhanced MoE is proposed, which leverages topological signatures to decouple graph features for specialized processing, thus improving the model's ability to characterize dynamic and heterogeneous local observations. The TSD module is also integrated into the policy and value networks of the Multi-agent Proximal Policy Optimization (MAPPO) algorithm, further improving decision-making efficiency and robustness. Extensive experiments conducted on real-world traffic scenarios, together with comprehensive theoretical analysis, validate the superior performance of the proposed framework, highlighting the model's scalability and effectiveness in addressing the complexities of large-scale TSC tasks.
中文: 本文提出了一种融合动态图神经网络和拓扑数据分析的新型多智能体强化学习框架,通过拓扑辅助的专家混合模块解耦交通特征,有效解决了大规模交通信号控制中的可扩展性问题,显著提升了复杂城市交通环境下的决策效率和系统鲁棒性。
English: This paper introduces a novel Multi-Agent Reinforcement Learning framework combining Dynamic Graph Neural Networks and Topological Data Analysis to address scalability challenges in Traffic Signal Control, enhanced by a topology-assisted Mixture of Experts module that improves environmental representation and decision-making efficiency in complex urban traffic scenarios.
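The topology-assisted MoE idea, in which topological signatures steer expert specialization, can be sketched as a gating network that sees topological features alongside node features. The dimensions and gating design below are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TopologyGatedMoE(nn.Module):
    """Sketch of a topology-assisted MoE: the gate conditions on
    topological signatures (e.g., persistence statistics) together with
    node features, so experts can specialize on distinct spatial
    patterns. All sizes are illustrative assumptions."""
    def __init__(self, feat_dim=64, topo_dim=8, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                           nn.Linear(feat_dim, feat_dim))
             for _ in range(n_experts)])
        self.gate = nn.Linear(feat_dim + topo_dim, n_experts)

    def forward(self, x, topo_sig):
        # x: [N, feat_dim] node features; topo_sig: [N, topo_dim]
        weights = torch.softmax(self.gate(torch.cat([x, topo_sig], -1)), -1)
        out = torch.stack([e(x) for e in self.experts], dim=-1)  # [N, D, E]
        return (out * weights.unsqueeze(1)).sum(-1)              # [N, D]

moe = TopologyGatedMoE()
y = moe(torch.randn(10, 64), torch.randn(10, 8))
print(y.shape)  # torch.Size([10, 64])
```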
Authors:Zhiwei Li, Guodong Long, Chunxu Zhang, Honglei Zhang, Jing Jiang, Chengqi Zhang
Abstract:
A core learning challenge for existing Foundation Models (FMs) is striking a tradeoff between generalization and personalization, a dilemma highlighted by various parameter-efficient adaptation techniques. Federated foundation models (FFMs) provide a structural means to decouple shared knowledge from individual-specific adaptations via decentralized processes. Recommendation systems offer a perfect testbed for FFMs, given their reliance on rich implicit feedback reflecting unique user characteristics. This position paper discusses a novel learning paradigm where FFMs not only harness their generalization capabilities but are specifically designed to preserve the integrity of user personality, illustrated thoroughly within recommendation contexts. We envision future personal agents, powered by personalized adaptive FMs, guiding user decisions on content. Such an architecture promises a user-centric, decentralized system where individuals maintain control over their personalized agents.
中文摘要:本文提出了一种新颖的学习范式,利用联邦基础模型在推荐系统中实现泛化与个性化的平衡,通过去中心化的用户控制机制来保护用户个性特征的完整性。
English Summary: This position paper proposes a novel learning paradigm using Federated Foundation Models (FFM) to balance generalization with personalization in recommendation systems, aiming to preserve user personality integrity through decentralized, user-centric control.
Authors:Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao
Abstract:
We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.
中文: PlayerOne是首个以自我为中心的逼真世界模拟器,通过从粗到精的训练流程和部件解耦运动控制技术,能够根据用户提供的场景图像精确构建动态环境并生成同步的自我中心视角视频,实现了对多样化场景的世界一致性建模。
English: PlayerOne is the first egocentric realistic world simulator that accurately constructs dynamic environments from user-provided scene images and generates synchronized egocentric videos using a coarse-to-fine training pipeline with part-disentangled motion control and joint 4D scene reconstruction.
Authors:Yuwei Zhang, Kumar Ayush, Siyuan Qiao, A. Ali Heydari, Girish Narayanswamy, Maxwell A. Xu, Ahmed A. Metwally, Shawn Xu, Jake Garrison, Xuhai Xu, Tim Althoff, Yun Liu, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Cecilia Mascolo, Xin Liu, Daniel McDuff, Yuzhe Yang
Abstract:
We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite the pervasive nature of such data, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks.
Chinese: SensorLM是一系列传感器-语言基础模型,通过分层描述生成技术实现可穿戴传感器数据的自然语言理解,在零样本识别、小样本学习和跨模态检索等实际任务中展现出超越现有技术的卓越性能。
English: SensorLM is a family of sensor-language foundation models that enable natural language understanding of wearable sensor data, achieving superior performance in zero-shot recognition, few-shot learning, and cross-modal retrieval through hierarchical caption generation and extensive real-world experiments.
Authors:Fengjun Pan, Anh Tuan Luu, Xiaobao Wu
Abstract:
Detecting harmful memes is essential for maintaining the integrity of online environments. However, current approaches often struggle with resource efficiency, flexibility, or explainability, limiting their practical deployment in content moderation systems. To address these challenges, we introduce U-CoT+, a novel framework for harmful meme detection. Instead of relying solely on prompting or fine-tuning multimodal models, we first develop a high-fidelity meme-to-text pipeline that converts visual memes into detail-preserving textual descriptions. This design decouples meme interpretation from meme classification, thus avoiding immediate reasoning over complex raw visual content and enabling resource-efficient harmful meme detection with general large language models (LLMs). Building on these textual descriptions, we further incorporate targeted, interpretable human-crafted guidelines to guide models' reasoning under zero-shot CoT prompting. As such, this framework allows for easy adaptation to different harmfulness detection criteria across platforms, regions, and over time, offering high flexibility and explainability. Extensive experiments on seven benchmark datasets validate the effectiveness of our framework, highlighting its potential for explainable and low-resource harmful meme detection using small-scale LLMs. Codes and data are available at: https://anonymous.4open.science/r/HMC-AF2B/README.md.
中文摘要:U-CoT+框架通过先将表情包转换为细节保留的文本描述,再结合人工指导的零样本思维链推理,实现了跨平台可解释的低资源有害表情包检测,在七个基准数据集上验证了其有效性。
English Summary: The U-CoT+ framework introduces a novel approach to harmful meme detection by first converting memes into detailed textual descriptions, then applying zero-shot chain-of-thought prompting with human-crafted guidelines, achieving resource-efficient and explainable detection across multiple benchmarks.
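The decoupling of meme interpretation from classification can be sketched as a two-stage pipeline. The `llm` stub, the prompts, and the guideline text below are hypothetical placeholders for whichever general LLM and platform rules are in use:

```python
# Minimal sketch of the U-CoT+ two-stage flow: a meme is first turned
# into a detailed textual description, then a general LLM reasons over
# it under human-crafted guidelines with zero-shot CoT.
def llm(prompt: str) -> str:
    return "stub response"  # stand-in for any general-purpose LLM call

def describe_meme(image_path: str) -> str:
    # Stage 1: detail-preserving meme-to-text conversion (stubbed here;
    # the paper uses a dedicated high-fidelity pipeline).
    return llm(f"Describe the meme at {image_path}, including overlaid "
               f"text, depicted entities, and their interplay.")

GUIDELINES = """- Flag attacks on protected groups.
- Satire of public figures alone is not harmful."""  # illustrative rules

def classify(image_path: str) -> str:
    # Stage 2: guideline-grounded zero-shot CoT over the description.
    desc = describe_meme(image_path)
    return llm(f"Guidelines:\n{GUIDELINES}\n\nMeme description:\n{desc}\n\n"
               "Is this meme harmful? Let's think step by step.")

print(classify("meme_001.png"))
```

Swapping the guideline block is all that adaptation to a new platform or region requires, which is where the framework's claimed flexibility comes from.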
Authors:Ken Gu, Zhihan Zhang, Kate Lin, Yuwei Zhang, Akshay Paruchuri, Hong Yu, Mehran Kazemi, Kumar Ayush, A. Ali Heydari, Maxwell A. Xu, Girish Narayanswamy, Yun Liu, Ming-Zher Poh, Yuzhe Yang, Mark Malhotra, Shwetak Patel, Hamid Palangi, Xuhai Xu, Daniel McDuff, Tim Althoff, Xin Liu
Abstract:
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness (the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies) remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as tables grow. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
中文: RADAR是一个用于系统评估语言模型对表格数据中常见数据伪影处理能力的基准,尽管前沿模型在无伪影数据上表现良好,但引入伪影后性能显著下降,揭示了其鲁棒数据分析能力的关键不足。
English: RADAR is a benchmark designed to systematically evaluate the data awareness of language models when handling common data artifacts in real-world tabular data, revealing significant performance degradation in frontier models despite their decent performance on clean tables.
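The programmatic-perturbation idea behind RADAR can be illustrated on a toy table: start from clean data, inject artifacts such as missing values and outliers, and query a model on both versions. The column names and artifact parameters below are illustrative, not the benchmark's code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A small clean table standing in for real-world tabular data.
df = pd.DataFrame({"steps": rng.integers(2000, 15000, 30).astype(float),
                   "sleep_h": rng.normal(7.0, 1.0, 30)})

def add_missing(t: pd.DataFrame, col: str, frac=0.1) -> pd.DataFrame:
    t = t.copy()
    idx = t.sample(frac=frac, random_state=0).index
    t.loc[idx, col] = np.nan                          # missing values
    return t

def add_outliers(t: pd.DataFrame, col: str, k=3) -> pd.DataFrame:
    t = t.copy()
    idx = t.sample(n=k, random_state=1).index
    t.loc[idx, col] = t[col].mean() + 10 * t[col].std()  # gross outliers
    return t

# Perturbed copy used to probe whether a model notices the artifacts.
perturbed = add_outliers(add_missing(df, "sleep_h"), "steps")
print(perturbed.describe())
```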
Authors:Haoyu Wang, Peihao Wang, Mufei Li, Shikun Liu, Siqi Miao, Zhangyang Wang, Pan Li
Abstract:
Modern large language models (LLMs) are inherently auto-regressive, requiring input to be serialized into flat sequences regardless of their structural dependencies. This serialization hinders the model's ability to leverage structural inductive biases, especially in tasks such as retrieval-augmented generation (RAG) and reasoning on data with native graph structures, where inter-segment dependencies are crucial. We introduce Graph-KV with the potential to overcome this limitation. Graph-KV leverages the KV-cache of text segments as condensed representations and governs their interaction through structural inductive biases. In this framework, 'target' segments selectively attend only to the KV-caches of their designated 'source' segments, rather than all preceding segments in a serialized sequence. This approach induces a graph-structured block mask, sparsifying attention and enabling a message-passing-like step within the LLM. Furthermore, strategically allocated positional encodings for source and target segments reduce positional bias and context window consumption. We evaluate Graph-KV across three scenarios: (1) seven RAG benchmarks spanning direct inference, multi-hop reasoning, and long-document understanding; (2) Arxiv-QA, a novel academic paper QA task with full-text scientific papers structured as citation ego-graphs; and (3) paper topic classification within a citation network. By effectively reducing positional bias and harnessing structural inductive biases, Graph-KV substantially outperforms baselines, including standard costly sequential encoding, across various settings. Code and the Graph-KV data are publicly available.
中文摘要:Graph-KV通过利用KV缓存和结构归纳偏置,使文本片段能够选择性交互而非序列化处理,有效克服了大语言模型输入序列化的局限,在多种推理和检索增强任务中显著提升了性能。
English Summary: Graph-KV introduces a novel approach that overcomes the limitations of serialized input in large language models by leveraging KV-caches and structural inductive biases to enable selective attention between text segments, significantly improving performance across various reasoning and retrieval-augmented tasks.
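The graph-structured block mask at the heart of Graph-KV can be sketched directly: each target segment may attend only to the KV blocks of its designated source segments (plus itself). The segment sizes and edges below are illustrative:

```python
import torch

# Sketch of Graph-KV's structural attention: a boolean block mask built
# from segment-level edges replaces full serialized attention.
def graph_block_mask(seg_lens, edges):
    """seg_lens: tokens per segment; edges: (source, target) pairs."""
    starts = torch.cumsum(torch.tensor([0] + seg_lens[:-1]), 0)
    total = sum(seg_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    allowed = {t: {t} for t in range(len(seg_lens))}  # self-attention
    for src, tgt in edges:
        allowed[tgt].add(src)
    for tgt, srcs in allowed.items():
        t0, t1 = starts[tgt], starts[tgt] + seg_lens[tgt]
        for src in srcs:
            s0, s1 = starts[src], starts[src] + seg_lens[src]
            mask[t0:t1, s0:s1] = True
    return mask  # use as an attention bias: disallowed -> -inf

# Three segments; segment 2 reads from 0 and 1, segment 1 reads from 0.
print(graph_block_mask([4, 4, 4], [(0, 1), (0, 2), (1, 2)]).int())
```

Because whole blocks are masked out, attention is sparsified per segment pair rather than per token, which is what makes the message-passing analogy apt.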
Authors:Yinchao Zhang, Su Yao, Yong Feng, Kang Chen, Tong Li, Zhuotao Liu, Yi Zhao, Lexuan Zhang, Xiangyu Gao, Feng Xiong, Qi Li, Ke Xu
Abstract:
The paradigm of Intelligent DataPlane (IDP) embeds deep learning (DL) models on the network dataplane to enable intelligent traffic analysis at line-speed. However, the current use of the match-action table (MAT) abstraction on the dataplane is misaligned with DL inference, leading to several key limitations, including accuracy degradation, limited scale, and lack of generality. This paper proposes Pegasus to address these limitations. Pegasus translates DL operations into three dataplane-oriented primitives to achieve generality: Partition, Map, and SumReduce. Specifically, Partition "divides" high-dimensional features into multiple low-dimensional vectors, making them more suitable for the dataplane; Map "conquers" computations on the low-dimensional vectors in parallel with the technique of fuzzy matching, while SumReduce "combines" the computation results. Additionally, Pegasus employs Primitive Fusion to merge computations, improving scalability. Finally, Pegasus adopts full precision weights with fixed-point activations to improve accuracy. Our implementation on a P4 switch demonstrates that Pegasus can effectively support various types of DL models, including Multi-Layer Perceptron (MLP), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and AutoEncoder models on the dataplane. Meanwhile, Pegasus outperforms state-of-the-art approaches with an average accuracy improvement of up to 22.8%, along with up to 248x larger model size and 212x larger input scale.
Chinese: 本文提出Pegasus系统,通过将深度学习操作转化为数据平面友好的原语,解决了智能数据平面在精度、规模和通用性上的限制,实现了精度提升、模型规模和输入规模的大幅扩展。
English: The paper introduces Pegasus, a system that translates deep learning operations into dataplane-friendly primitives to overcome limitations in intelligent dataplane implementations, achieving significant improvements in accuracy, model size, and input scale.
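The divide/conquer/combine flavor of Pegasus's three primitives can be checked on a toy linear layer in plain NumPy; on an actual P4 switch the Map step is realized with fuzzy-matching tables, which this host-side sketch does not model:

```python
import numpy as np

# Sketch of the three primitives for a linear layer y = xW:
# Partition splits the feature vector, Map computes partial products per
# low-dimensional slice, and SumReduce combines them. Sizes illustrative.
rng = np.random.default_rng(0)
x, W = rng.normal(size=16), rng.normal(size=(16, 4))

def partition(v, width):                 # "divide"
    return [v[i:i + width] for i in range(0, len(v), width)]

def map_partial(slices, W, width):       # "conquer" (parallel on-switch)
    return [s @ W[i * width:(i + 1) * width] for i, s in enumerate(slices)]

def sum_reduce(partials):                # "combine"
    return np.sum(partials, axis=0)

y = sum_reduce(map_partial(partition(x, 4), W, 4))
print(np.allclose(y, x @ W))  # True: primitives recover the dense result
```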
Authors:Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A. Tailor, Ahmed Metwally, A. Ali Heydari, Yuwei Zhang, Jake Garrison, Samy Abdel-Ghaffar, Xuhai Xu, Ken Gu, Jacob Sunshine, Ming-Zher Poh, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Yuzhe Yang, James M. Rehg, Xin Liu, Daniel McDuff
Abstract:
Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challenge for self-supervised learning (SSL) models that typically assume complete data inputs. This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM), a novel SSL approach that learns robust representations directly from incomplete data without requiring explicit imputation. AIM's core novelty lies in its use of learnable mask tokens to model both existing ("inherited") and artificially introduced missingness, enabling it to robustly handle fragmented real-world data during inference. Pre-trained on an extensive dataset of 40M hours of day-long multimodal sensor data, our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression and generative modeling. Furthermore, LSM-2 with AIM exhibits superior scaling performance, and critically, maintains high performance even under targeted missingness scenarios, reflecting clinically coherent patterns, such as the diagnostic value of nighttime biosignals for hypertension prediction. This makes AIM a more reliable choice for real-world wearable data applications.
中文摘要:本文提出的第二代大型传感器模型(LSM-2)采用自适应继承掩码(AIM)技术,无需数据填补即可直接从缺失的可穿戴传感器数据中学习鲁棒表征,在多项任务中实现最优性能,并在真实数据缺失场景下保持卓越的可靠性。
English Summary: The paper introduces LSM-2 with Adaptive and Inherited Masking (AIM), a self-supervised learning approach that learns robust representations directly from incomplete wearable sensor data without imputation, achieving state-of-the-art performance across diverse tasks while maintaining reliability under real-world missing data scenarios.
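The core of AIM, treating inherited and artificially introduced missingness uniformly via a learnable mask token, can be sketched in a few lines of PyTorch. The tokenization, masking ratios, and downstream encoder are illustrative assumptions:

```python
import torch

# Sketch of Adaptive and Inherited Masking (AIM): real ("inherited")
# missingness and artificially masked positions are both replaced by a
# learnable mask token before the encoder, so no imputation is needed.
B, T, D = 2, 16, 32
tokens = torch.randn(B, T, D)
inherited = torch.rand(B, T) < 0.2                   # sensor dropout in data
artificial = (torch.rand(B, T) < 0.3) & ~inherited   # SSL masking target
mask_token = torch.nn.Parameter(torch.zeros(D))

masked = torch.where((inherited | artificial).unsqueeze(-1),
                     mask_token.expand(B, T, D), tokens)
# `masked` now feeds the encoder; the SSL loss is computed only on the
# artificially masked positions, since inherited gaps have no ground truth.
print(masked.shape, artificial.sum().item(), "positions supervised")
```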
Authors:Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang
Abstract:
The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.
中文总结:研究者提出了EOC-Bench这一创新基准,通过3,277个精细标注的问答对系统评估动态第一人称场景中的物体具身认知能力,填补现有基准在交互动态变化评估上的空白,为提升多模态大模型的物体认知能力奠定基础。
English Summary: The authors introduce EOC-Bench, a novel benchmark with 3,277 annotated QA pairs designed to evaluate object-centric embodied cognition in dynamic egocentric scenarios, addressing limitations in existing benchmarks by assessing temporal changes and interaction effects across multiple dimensions.
Authors:Jiayu Wang, Yifei Ming, Zixuan Ke, Caiming Xiong, Shafiq Joty, Aws Albarghouthi, Frederic Sala
Abstract:
Reinforcement learning (RL) has become the dominant paradigm for endowing language models with advanced reasoning capabilities. Despite the substantial empirical gains demonstrated by RL-based training methods like GRPO, a granular understanding of their advantages is still lacking. To address this gap, we introduce a fine-grained analytic framework to dissect the impact of RL on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training: (1) plan-following and execution, (2) problem decomposition, and (3) improved reasoning and knowledge utilization. Using this framework, we gain insights beyond mere accuracy. For instance, providing models with explicit step-by-step plans surprisingly degrades performance on the most challenging benchmarks, yet RL-tuned models exhibit greater robustness, experiencing markedly smaller performance drops than their base counterparts. This suggests that RL may not primarily enhance the execution of external plans but rather empower models to formulate and follow internal strategies better suited to their reasoning processes. Conversely, we observe that RL enhances the model's capacity to integrate provided knowledge into its reasoning process, leading to performance improvements across diverse tasks. We also study problem difficulty, developing new ways to exploit hard problems that improve training. Our findings lay a foundation for more principled training and evaluation of reasoning models.
Chinese: 强化学习通过使语言模型能够制定内部策略并更好地整合知识来提升其推理能力,而非仅仅执行外部计划,这一点通过细粒度分析框架得以揭示。
English: Reinforcement learning enhances language models' reasoning by enabling them to formulate internal strategies and better integrate knowledge, rather than merely executing external plans, as revealed through a fine-grained analytic framework.
Authors:Sihui Ji, Hao Luo, Xi Chen, Yuanpeng Tu, Yiyang Wang, Hengshuang Zhao
Abstract:
We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for the given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips, and leverage layer embeddings to distinguish each clip and the corresponding layer-wise prompts. In this way, we seamlessly support the aforementioned variants in one unified framework. To compensate for the lack of high-quality layer-wise training videos, we design a multi-stage training strategy that accommodates static images with high-quality layer annotations. Specifically, we first train the model with low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on the mixture of image data with high-quality layered images along with copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
中文摘要:LayerFlow是一个统一的层感知视频生成框架,通过分层嵌入和多阶段训练策略,能够根据分层提示生成透明前景、干净背景及融合场景的视频内容。
English Summary: LayerFlow is a unified framework for layer-aware video generation that uses layer embeddings and a multi-stage training strategy to produce transparent foregrounds, clean backgrounds, and blended scenes from per-layer prompts.
Authors:Hengyu Liu, Yuehao Wang, Chenxin Li, Ruisi Cai, Kevin Wang, Wuyang Li, Pavlo Molchanov, Peihao Wang, Zhangyang Wang
Abstract:
3D Gaussian splatting (3DGS) has enabled various applications in 3D scene representation and novel view synthesis due to its efficient rendering capabilities. However, 3DGS demands relatively significant GPU memory, limiting its use on devices with restricted computational resources. Previous approaches have focused on pruning less important Gaussians, effectively compressing 3DGS but often requiring a fine-tuning stage and lacking adaptability for the specific memory needs of different devices. In this work, we present an elastic inference method for 3DGS. Given a desired model size as input, our method selects and transforms a subset of Gaussians, achieving substantial rendering performance without additional fine-tuning. We introduce a tiny learnable module that controls Gaussian selection based on the input percentage, along with a transformation module that adjusts the selected Gaussians to complement the performance of the reduced model. Comprehensive experiments on ZipNeRF, MipNeRF and Tanks & Temples scenes demonstrate the effectiveness of our approach. Code is available at https://flexgs.github.io.
Chinese: 本文提出了一种针对3D高斯溅射的弹性推理方法,可根据指定模型尺寸动态选择和调整高斯元素,无需微调即可实现高效渲染。
English: This paper introduces an elastic inference method for 3D Gaussian splatting that dynamically selects and adjusts Gaussians based on specified model size, achieving efficient rendering without fine-tuning.
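The elastic-inference idea, a tiny scorer selecting a budget-sized subset of Gaussians plus a transformation that compensates the survivors, can be sketched as follows; both modules below are illustrative stand-ins for the paper's learned components:

```python
import torch

# Sketch of elastic 3DGS inference: a learned scorer ranks Gaussians,
# a budget keeps the top fraction, and a transformation module adjusts
# the survivors to compensate for the removed ones.
N, D = 10000, 14                      # Gaussians x packed parameters
gaussians = torch.randn(N, D)
scorer = torch.nn.Linear(D, 1)        # learnable selection module
adjust = torch.nn.Linear(D, D)        # learnable transformation module

def elastic_subset(g, keep_ratio):
    k = int(keep_ratio * g.shape[0])
    scores = scorer(g).squeeze(-1)
    idx = scores.topk(k).indices      # keep the k highest-scoring
    return g[idx] + adjust(g[idx])    # residual adjustment of survivors

small = elastic_subset(gaussians, keep_ratio=0.25)
print(small.shape)  # torch.Size([2500, 14])
```

The appeal of this design is that `keep_ratio` is a runtime input, so one trained model can serve devices with different memory budgets without re-optimization.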
Authors:Zhitao Zeng, Zhu Zhuo, Xiaojun Jia, Erli Zhang, Junde Wu, Jiaan Zhang, Yuxuan Wang, Chang Han Low, Jian Jiang, Zilong Zheng, Xiaochun Cao, Yutong Ban, Qi Dou, Yang Liu, Yueming Jin
Abstract:
Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges, requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, in which a single universal model can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, followed by standardizing labels and performing hierarchical vision-language alignment to facilitate comprehensive coverage of gradually finer-grained surgical tasks, from visual perception and temporal analysis to high-level reasoning. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and instruction-tuned on 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 popular and widely-used datasets in the surgical domain, covering several crucial downstream tasks. Based on SurgVLM-Bench, we evaluate the performance of our SurgVLM (3 SurgVLM variants: SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B) and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max).
中文: 基础模型在生物医学领域已取得变革性成功,但在手术中的应用仍待探索,为此我们开发了SurgVLM——首个基于大规模手术数据库构建的视觉语言基础模型,能够统一处理从视觉感知到高阶推理的多样化手术任务。
English: Foundation models have revolutionized biomedical fields but remain underutilized in surgery, prompting the development of SurgVLM, a large vision-language model trained on a comprehensive surgical database to address unique challenges like visual perception and reasoning.
Authors:Raj Patel, Himanshu Tripathi, Jasper Stone, Noorbakhsh Amiri Golilarz, Sudip Mittal, Shahram Rahimi, Vini Chaudhary
Abstract:
The rapid adoption of machine learning (ML) technologies has driven organizations across diverse sectors to seek efficient and reliable methods to accelerate model development-to-deployment. Machine Learning Operations (MLOps) has emerged as an integrative approach addressing these requirements by unifying relevant roles and streamlining ML workflows. As the MLOps market continues to grow, securing these pipelines has become increasingly critical. However, the unified nature of the MLOps ecosystem introduces vulnerabilities, making it susceptible to adversarial attacks where a single misconfiguration can lead to compromised credentials, severe financial losses, damaged public trust, and the poisoning of training data. Our paper presents a systematic application of the MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) framework, a comprehensive and continuously updated catalog of AI-focused attacks, to systematically assess attacks across different phases of the MLOps ecosystem. We begin by examining the preparatory phases during which adversaries acquire the essential intelligence required to initiate their attacks. We then present a structured taxonomy of attack techniques explicitly mapped to corresponding phases of the MLOps ecosystem, supported by examples drawn from red-teaming exercises and real-world incidents. This is followed by a taxonomy of mitigation strategies aligned with these attack categories, offering actionable early-stage defenses to strengthen the security of the MLOps ecosystem. Given the rapid evolution and adoption of MLOps, we further highlight key research gaps that require immediate attention. Our work emphasizes the importance of implementing robust security protocols from the outset, empowering practitioners to safeguard the MLOps ecosystem against evolving cyber attacks.
中文: 本文系统应用MITRE ATLAS框架评估MLOps生态系统中的对抗性威胁,提出攻击技术与缓解策略的结构化分类法,并强调关键研究空白以加强安全防护。
English: This paper systematically applies the MITRE ATLAS framework to assess adversarial threats across MLOps ecosystems, presenting structured taxonomies of attack techniques and mitigation strategies while highlighting critical research gaps to strengthen security protocols.
Authors:Jialong Zuo, Shengpeng Ji, Minghui Fang, Mingze Li, Ziyue Jiang, Xize Cheng, Xiaoda Yang, Chen Feiyang, Xinyu Duan, Zhou Zhao
Abstract:
Zero-Shot Voice Conversion (VC) aims to transform the source speaker's timbre into an arbitrary unseen one while retaining speech content. Most prior work focuses on preserving the source's prosody, while fine-grained timbre information may leak through prosody, and transferring target prosody to synthesized speech is rarely studied. In light of this, we propose R-VC, a rhythm-controllable and efficient zero-shot voice conversion model. R-VC employs data perturbation techniques and discretizes source speech into Hubert content tokens, eliminating much content-irrelevant information. By leveraging a Mask Generative Transformer for in-context duration modeling, our model adapts the linguistic content duration to the desired target speaking style, facilitating the transfer of the target speaker's rhythm. Furthermore, R-VC introduces a powerful Diffusion Transformer (DiT) with shortcut flow matching during training, conditioning the network not only on the current noise level but also on the desired step size, enabling high timbre similarity and quality speech generation in fewer sampling steps, even in just two, thus minimizing latency. Experimental results show that R-VC achieves comparable speaker similarity to state-of-the-art VC methods with a smaller dataset, and surpasses them in terms of speech naturalness, intelligibility and style transfer performance.
中文摘要:R-VC是一种节奏可控的零样本语音转换模型,通过时长建模实现目标说话人节奏迁移,并采用带捷径流匹配的扩散变换器实现高效高质量语音合成。
English Summary: R-VC is a rhythm-controllable zero-shot voice conversion model that transfers target speaker's rhythm through duration modeling and generates high-quality speech efficiently using a diffusion transformer with shortcut flow matching.
Authors:Peijin Guo, Minghui Li, Hewen Pan, Bowen Chen, Yang Wu, Zikang Guo, Leo Yu Zhang, Shengshan Hu, Shengqing Hu
Abstract:
Accurate prediction of molecular metabolic stability (MS) is critical for drug research and development but remains challenging due to the complex interplay of molecular interactions. Despite recent advances in graph neural networks (GNNs) for MS prediction, current approaches face two critical limitations: (1) incomplete molecular modeling due to atom-centric message-passing mechanisms that disregard bond-level topological features, and (2) prediction frameworks that lack reliable uncertainty quantification. To address these challenges, we propose TrustworthyMS, a novel contrastive learning framework designed for uncertainty-aware metabolic stability prediction. First, a molecular graph topology remapping mechanism synchronizes atom-bond interactions through edge-induced feature propagation, capturing both localized electronic effects and global conformational constraints. Second, contrastive topology-bond alignment enforces consistency between molecular topology views and bond patterns via feature alignment, enhancing representation robustness. Third, uncertainty modeling through Beta-Binomial uncertainty quantification enables simultaneous prediction and confidence calibration under epistemic uncertainty. Through extensive experiments, our results demonstrate that TrustworthyMS outperforms current state-of-the-art methods in terms of predictive performance.
中文: TrustworthyMS提出了一种新颖的对比学习框架,通过整合原子-键相互作用、拓扑-键对齐和不确定性量化,克服了分子代谢稳定性预测的现有局限,实现了优于现有方法的性能。
English: TrustworthyMS introduces a novel contrastive learning framework that overcomes limitations in molecular metabolic stability prediction by integrating atom-bond interactions, topology-bond alignment, and uncertainty quantification, achieving superior performance over existing methods.
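The Beta-Binomial uncertainty component can be illustrated with a frozen Beta posterior: if the network emits pseudo-counts rather than a point probability, the posterior yields both a prediction and a calibrated credible interval. The counts below are made-up example outputs, not model values:

```python
from scipy import stats

# Sketch of Beta-Binomial uncertainty quantification: pseudo-counts
# (alpha, beta) parameterize a Beta posterior over the probability that
# a molecule is metabolically stable.
alpha, beta = 14.0, 3.0               # evidence for stable vs. unstable
posterior = stats.beta(alpha, beta)

p_stable = alpha / (alpha + beta)     # posterior mean prediction
lo, hi = posterior.interval(0.95)     # 95% credible interval
print(f"P(stable) = {p_stable:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
# Wide intervals (small alpha + beta) flag epistemically uncertain inputs.
```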
Authors:Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Kam-Fai Wong
Abstract:
As Large Language Models (LLMs) evolve into increasingly autonomous agents, fundamental questions about their epistemic foundations remain unresolved: What defines an agent? How should it make decisions? And what objectives should guide its behavior? In this position paper, we argue that true autonomy requires agents to be grounded in a coherent epistemic framework that governs what they know, what they need to know, and how to acquire that knowledge efficiently. We propose a unified theory that treats internal reasoning and external actions as equivalent epistemic tools, enabling agents to systematically coordinate introspection and interaction. Building on this framework, we advocate for aligning an agent's tool use decision-making boundary with its knowledge boundary, thereby minimizing unnecessary tool use and maximizing epistemic efficiency. This perspective shifts the design of agents from mere action executors to knowledge-driven intelligence systems, offering a principled path toward building foundation agents capable of adaptive, efficient, and goal-directed behavior.
Chinese: 该立场文件主张,自主智能体必须建立在统一的认知框架上,将内部推理与外部行动视为等效工具,通过决策边界与知识边界对齐实现高效知识获取和目标导向行为。
English: This position paper argues that autonomous agents must be grounded in a coherent epistemic framework that unifies internal reasoning and external actions as equivalent tools, enabling efficient knowledge acquisition and goal-directed behavior by aligning decision-making boundaries with knowledge boundaries.
Authors:Martin Kuo, Jianyi Zhang, Aolin Ding, Louis DiValentin, Amin Hass, Benjamin F Morris, Isaac Jacobson, Randolph Linderman, James Kiessling, Nicolas Ramos, Bhavna Gopal, Maziyar Baran Pouyan, Changwei Liu, Hai Li, Yiran Chen
Abstract:
Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM of potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms existing defense techniques, reducing the Attack Success Rate (ASR) by 51.2%, all while maintaining comparable LLM capability.
中文摘要:提出的STREAM防御机制通过安全推理调节器,有效保护大语言模型免受多轮对话攻击,在保持模型功能的同时显著降低了攻击成功率。
English Summary: The proposed STREAM defense mechanism effectively protects large language models from multi-turn dialogue attacks by using a safety reasoning moderator, significantly reducing attack success rates while preserving model functionality.
Authors:Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam
Abstract:
Despite the rapid advancement of LLM-based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss dangers in agents' step-by-step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce \sys, a universal, training-free, memory-augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. \sys constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator's assessment of new cases. Moreover, we developed \data, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. \data comprises 2,293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of \data is its nuanced approach to ambiguous risk situations, employing "Strict" and "Lenient" judgment standards. Experiments demonstrate that \sys not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible.
中文摘要:提出的 \sys 框架通过记忆增强推理和全新基准测试,显著提升了基于大语言模型的智能体安全评估能力,达到了人类水平的判断精度。
English Summary: The proposed \sys framework enhances LLM-based safety and security evaluation by using memory-augmented reasoning and a comprehensive new benchmark, achieving human-level accuracy.
Authors:Di Zhang, Weida Wang, Junxian Li, Xunzhi Wang, Jiatong Li, Jianbo Wu, Jingdi Lei, Haonan He, Peng Ye, Shufei Zhang, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou
Abstract:
This paper addresses the challenges of underthinking and overthinking in long chain-of-thought (CoT) reasoning for Large Reasoning Models (LRMs) by introducing Reasoning Control Fields (RCF), a novel test-time approach that injects structured control signals to guide reasoning from a tree search perspective. RCF enables models to adjust reasoning effort according to given control conditions when solving complex tasks. Additionally, we present the Control-R-4K dataset, which consists of challenging problems annotated with detailed reasoning processes and corresponding control fields. To further enhance reasoning control, we propose a Conditional Distillation Finetuning (CDF) method, which trains models, particularly Control-R-32B, to effectively adjust reasoning effort during test time. Experimental results on benchmarks such as AIME2024 and MATH500 demonstrate that our approach achieves state-of-the-art performance at the 32B scale while enabling a controllable long CoT reasoning process (L-CoT). Overall, this work introduces an effective paradigm for controllable test-time scaling of reasoning.
中文: 本文提出推理控制场(RCF)方法,通过注入结构化控制信号解决长思维链推理中的欠思考与过思考问题,使模型能够根据控制条件调整推理强度,并在测试时实现可控的长链推理过程。
English: This paper introduces Reasoning Control Fields (RCF) to manage underthinking and overthinking in long chain-of-thought reasoning, enabling models to adjust reasoning effort and achieve state-of-the-art performance through a novel test-time control approach.
Authors:Radhika Juglan, Marta Ligero, Zunamys I. Carrero, Asier Rabasco, Tim Lenz, Leo Misera, Gregory Patrick Veldhuizen, Paul Kuntke, Hagen H. Kitzler, Sven Nebelung, Daniel Truhn, Jakob Nikolas Kather
Abstract:
Deep learning (DL) methods are increasingly outperforming classical approaches in brain imaging, yet their generalizability across diverse imaging cohorts remains inadequately assessed. As age and sex are key neurobiological markers in clinical neuroscience, influencing brain structure and disease risk, this study evaluates three existing three-dimensional architectures, namely Simple Fully Connected Network (SFCN), DenseNet, and Shifted Window (Swin) Transformers, for age and sex prediction using T1-weighted MRI from four independent cohorts: UK Biobank (UKB, n=47,390), Dallas Lifespan Brain Study (DLBS, n=132), Parkinson's Progression Markers Initiative (PPMI, n=108 healthy controls), and Information eXtraction from Images (IXI, n=319). We found that SFCN consistently outperformed more complex architectures, with an AUC of 1.00 [1.00-1.00] in UKB (internal test set) and 0.85-0.91 in external test sets for sex classification. For the age prediction task, SFCN demonstrated a mean absolute error (MAE) of 2.66 (r=0.89) in UKB and 4.98-5.81 (r=0.55-0.70) across external datasets. Pairwise DeLong and Wilcoxon signed-rank tests with Bonferroni corrections confirmed SFCN's superiority over Swin Transformer across most cohorts (p<0.017, for three comparisons). Explainability analysis further demonstrates that model attention is regionally consistent across cohorts and specific to each task. Our findings reveal that simpler convolutional networks outperform denser, more complex attention-based DL architectures in brain image analysis by demonstrating better generalizability across different datasets.
中文: 研究表明,在多个脑成像队列中,简单的全连接网络(SFCN)在通过MRI预测年龄和性别时,始终优于更复杂的深度学习架构,展现出更好的泛化能力和任务相关的区域关注一致性。
English: This study demonstrates that the simpler Simple Fully Connected Network (SFCN) consistently outperforms more complex deep learning architectures in predicting age and sex from brain MRI data across multiple cohorts, showing superior generalizability and task-specific regional attention consistency.
Authors:Yifan Wang, Weinan Gan, Longtao Xiao, Jieming Zhu, Heng Chang, Haozhao Wang, Rui Zhang, Zhenhua Dong, Ruiming Tang, Ruixuan Li
Abstract:
Generative recommendation (GR) typically encodes behavioral or semantic aspects of item information into discrete tokens, leveraging the standard autoregressive (AR) generation paradigm to make predictions. However, existing methods tend to overlook the intrinsic relationship between the two: the semantics usually provide a reasonable explanation ("why") for the behavior ("what"), and ignoring this link may constrain the full potential of GR. To this end, we present Chunk AutoRegressive Modeling (CAR), a new generation paradigm following the decision pattern that users usually consider semantic aspects of items (e.g. brand) and then take actions on target items (e.g. purchase). Our CAR, for the first time, incorporates semantics (SIDs) and behavior (UID) into a single autoregressive transformer from an "act-with-think" dual perspective via chunk-level autoregression. Specifically, CAR packs SIDs and UID into a conceptual chunk for unified item representation, allowing each decoding step to make a holistic prediction. Experiments show that our CAR significantly outperforms existing methods based on traditional AR, improving Recall@5 by 7.93% to 22.30%. Furthermore, we verify the scaling effect between model performance and the SIDs bit number, demonstrating that CAR preliminarily emulates a slow-thinking-style mechanism akin to the reasoning processes observed in large language models (LLMs).
中文: 生成式推荐常忽略用户行为与物品语义的内在联系,因此我们提出分块自回归建模(CAR),将两者整合到统一框架中,实现更准确且可解释的预测,并显著提升性能。
English: Generative recommendation often misses the link between user behavior and item semantics, so we introduce Chunk AutoRegressive Modeling (CAR), which integrates both into a unified framework for more accurate and explainable predictions, showing significant performance gains.
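The chunk construction behind CAR's "act-with-think" view can be sketched as simple sequence packing: each item contributes its semantic IDs followed by its behavior ID, and decoding proceeds one whole chunk at a time. The token spellings below are illustrative:

```python
# Sketch of CAR's chunk construction: each item is packed as a chunk of
# semantic IDs (SIDs, the "think") followed by its unique behavior ID
# (UID, the "act"); the transformer predicts chunk-by-chunk rather than
# token-by-token.
def pack_chunks(history):
    """history: list of (sids, uid) per interacted item."""
    seq = []
    for sids, uid in history:
        seq.extend(sids)   # semantic tokens first: brand, category, ...
        seq.append(uid)    # then the behavioral item token
    return seq

history = [(["<brand:nike>", "<cat:shoes>"], "<item:1042>"),
           (["<brand:apple>", "<cat:phone>"], "<item:2311>")]
print(pack_chunks(history))
# At decoding time the model emits a whole (SIDs, UID) chunk per step,
# so the predicted semantics explain "why" the predicted item is chosen.
```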
Authors:Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
Abstract:
Agentic search, such as Deep Research systems in which agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
中文:智能搜索系统如深度研究能自主浏览并整合网络信息,但其复杂性超越了现有评估方法,为此开发了Mind2Web 2基准和“代理即裁判”框架进行自动评估,顶尖系统仅用一半时间即可达到人类50-70%的效能。
English: Agentic search systems like Deep Research autonomously browse and synthesize web information, but their complexity exceeds current evaluation methods, leading to the creation of Mind2Web 2 benchmark and an Agent-as-a-Judge framework for automated assessment, showing top systems achieve 50-70% of human efficiency in half the time.
Authors:Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
Abstract:
Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for "what-if" movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps. The significant improvement comes from a synergistic approach, "map-then-reason", that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
Chinese: MindCube基准测试显示现有视觉语言模型在空间推理方面存在明显不足,但采用"先建图后推理"的方法结合强化学习,可将其准确率从37.8%显著提升至70.7%。
English: The MindCube benchmark reveals that current Vision Language Models struggle with spatial reasoning, but a "map-then-reason" approach combining cognitive mapping with reinforcement learning dramatically improves their accuracy from 37.8% to 70.7%.
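The "map-then-reason" scaffold can be sketched as two chained VLM calls: one to externalize a cognitive map, one to answer conditioned on it. The `vlm` stub and prompt wording are hypothetical placeholders:

```python
# Sketch of "map-then-reason": the model first externalizes a cognitive
# map of the scene, then answers the spatial question using that map.
def vlm(images, prompt: str) -> str:
    return "stub response"            # stand-in for any VLM call

def map_then_reason(images, question: str) -> str:
    cog_map = vlm(images, "List every object you can infer and its "
                          "position on a top-down grid of the scene.")
    return vlm(images, f"Cognitive map:\n{cog_map}\n\nUsing this map, "
                       f"answer: {question}")

print(map_then_reason(["view1.png", "view2.png"],
                      "If I turn left, what would be in front of me?"))
```

In the paper this scaffold is also trained into the model (and further improved with RL), rather than applied purely at prompting time as in this sketch.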
Authors:Minghao Qin, Yan Shu, Peitian Zhang, Kun Lun, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu
Abstract:
Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X^2L, which flexibly preserves critical video information for each LVU task. Video-X^2L involves two key operations. The first one is called bi-level KV compression. During the MLLM's pre-filling stage, Video-X^2L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second one is called selective KV re-loading. During the MLLM's decoding stage, Video-X^2L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for other less important ones. This allows the MLLM to fully utilize task-specific information while maintaining the overall compactness. Video-X^2L is simple yet effective: it is free from additional training and directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L with a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experimental results show that Video-X^2L outperforms existing KV-compression methods by a wide margin while substantially reducing computation cost.
中文: Video-X^2L通过双层KV压缩和选择性重加载技术,有效解决了多模态大语言模型处理长视频的计算难题,在无需额外训练的情况下大幅提升性能并节约成本。
English: Video-X^2L addresses the computational challenge of long-video understanding in MLLMs by employing bi-level KV compression and selective re-loading, enhancing performance without extra training while significantly reducing costs.
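The bi-level KV scheme can be sketched with stand-in compression (strided subsampling) and a crude relevance score: every chunk keeps a compact H-KV by default, and only the chunks most relevant to the query re-load their detailed L-KV. All components below are illustrative simplifications:

```python
import torch

# Sketch of bi-level KV compression with selective re-loading.
def compress(kv, ratio):
    # Keep every `ratio`-th token as a crude stand-in for KV compression.
    return kv[:, ::ratio]

chunks = [torch.randn(1, 64, 32) for _ in range(6)]   # per-chunk KVs
l_kvs = [compress(c, 2) for c in chunks]              # fine-grained L-KVs
h_kvs = [compress(c, 8) for c in chunks]              # compact H-KVs

query = torch.randn(32)
relevance = torch.stack([c.mean(dim=(0, 1)) @ query for c in chunks])
critical = set(relevance.topk(2).indices.tolist())    # re-load top-2 only

kv_cache = torch.cat([l_kvs[i] if i in critical else h_kvs[i]
                      for i in range(len(chunks))], dim=1)
print(kv_cache.shape)  # mixed fine/compact cache used for decoding
```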
Authors:Antonio Calagna, Stefano Maxenti, Leonardo Bonati, Salvatore D'Oro, Tommaso Melodia, Carla Fabiana Chiasserini
Abstract:
Open Radio Access Network (RAN) is a key paradigm to attain unprecedented flexibility of the RAN via disaggregation and Artificial Intelligence (AI)-based applications called xApps. In dense areas with many active RAN nodes, compute resources are engineered to support potentially hundreds of xApps monitoring and controlling the RAN to achieve operator's intents. However, such resources might become underutilized during low-traffic periods, where most cells are sleeping and, given the reduced RAN complexity, only a few xApps are needed for its control. In this paper, we propose CORMO-RAN, a data-driven orchestrator that dynamically activates compute nodes based on xApp load to save energy, and performs lossless migration of xApps from nodes to be turned off to active ones while ensuring xApp availability during migration. CORMO-RAN tackles the trade-off among service availability, scalability, and energy consumption while (i) preserving xApps' internal state to prevent RAN performance degradation during migration; (ii) accounting for xApp diversity in state size and timing constraints; and (iii) implementing several migration strategies and providing guidelines on best strategies to use based on resource availability and requirements. We prototype CORMO-RAN as an rApp, and experimentally evaluate it on an O-RAN private 5G testbed hosted on a Red Hat OpenShift cluster with commercial radio units. Results demonstrate that CORMO-RAN is effective in minimizing energy consumption of the RAN Intelligent Controller (RIC) cluster, yielding up to 64% energy saving when compared to existing approaches.
中文:CORMO-RAN是一种动态编排器,通过在低流量期间智能管理计算节点并迁移xApps来优化开放无线接入网络的能效,在保持服务可用性的同时实现高达64%的节能效果。
English: CORMO-RAN is a dynamic orchestrator that optimizes energy efficiency in Open RAN by intelligently managing compute nodes and migrating xApps during low-traffic periods, achieving up to 64% energy savings while maintaining service availability.
Authors:Minghao Qin, Xiangrui Liu, Zhengyang Liang, Yan Shu, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu
Abstract:
Multi-modal large language models (MLLMs) have made significant progress in video understanding over the past few years. However, processing long video inputs remains a major challenge due to high memory and computational costs. This makes it difficult for current models to achieve both strong performance and high efficiency in long video understanding. To address this challenge, we propose Video-XL-2, a novel MLLM that delivers superior cost-effectiveness for long-video understanding based on task-aware KV sparsification. The proposed framework operates with two key steps: chunk-based pre-filling and bi-level key-value decoding. Chunk-based pre-filling divides the visual token sequence into chunks, applying full attention within each chunk and sparse attention across chunks. This significantly reduces computational and memory overhead. During decoding, bi-level key-value decoding selectively reloads either dense or sparse key-values for each chunk based on its relevance to the task. This approach further improves memory efficiency and enhances the model's ability to capture fine-grained information. Video-XL-2 achieves state-of-the-art performance on various long video understanding benchmarks, outperforming existing open-source lightweight models. It also demonstrates exceptional efficiency, capable of processing over 10,000 frames on a single NVIDIA A100 (80GB) GPU and thousands of frames in just a few seconds.
中文: Video-XL-2 是一种新颖的多模态大语言模型,通过任务感知的KV稀疏化技术解决了长视频理解的挑战,在单个GPU上快速处理数千帧视频,实现了顶尖性能与卓越效率。
English: Video-XL-2 is a novel multi-modal large language model that addresses the challenge of long video understanding through task-aware KV sparsification, achieving state-of-the-art performance and exceptional efficiency by processing thousands of frames rapidly on a single GPU.
Authors:Kuanning Wang, Yuqian Fu, Tianyu Wang, Yanwei Fu, Longfei Liang, Yu-Gang Jiang, Xiangyang Xue
Abstract:
Accurate 6D pose estimation is key for robotic manipulation, enabling precise object localization for tasks like grasping. We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base by integrating both visual and geometric cues. RAG-6DPose comprises three stages: 1) building a multi-modal CAD knowledge base by extracting 2D visual features from multi-view CAD rendered images and attaching 3D points; 2) retrieving relevant CAD features from the knowledge base based on the current query image via our ReSPC module; and 3) incorporating retrieved CAD information to refine pose predictions via retrieval-augmented decoding. Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach, particularly in handling occlusions and novel viewpoints. Supplementary material is available on our project website: https://sressers.github.io/RAG-6DPose .
中文:RAG-6DPose通过融合多模态CAD知识库的视觉与几何线索,提升了机器人操作中的6D姿态估计精度,在遮挡和新视角场景下表现尤为鲁棒。
English: RAG-6DPose enhances 6D pose estimation for robotic manipulation by integrating a multi-modal CAD knowledge base with visual and geometric cues, improving accuracy in occluded and novel-view scenarios.
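The retrieval stage can be sketched as cosine-similarity lookup into a CAD knowledge base whose entries pair multi-view visual descriptors with 3D points. The feature extractor and knowledge-base layout below are assumptions, and the real ReSPC module is a learned retriever rather than plain cosine matching:

```python
import numpy as np

# Sketch of the retrieval step: the query image feature is matched
# against per-view CAD features (each tied to 3D points); the top
# entries then condition the pose decoder.
rng = np.random.default_rng(0)
kb_feats = rng.normal(size=(500, 128))     # multi-view CAD descriptors
kb_points = rng.normal(size=(500, 3))      # attached 3D CAD points

def retrieve(query_feat, k=8):
    sims = kb_feats @ query_feat
    sims /= np.linalg.norm(kb_feats, axis=1) * np.linalg.norm(query_feat)
    top = np.argsort(-sims)[:k]            # k most similar entries
    return kb_feats[top], kb_points[top]   # fed to the pose decoder

feats, pts = retrieve(rng.normal(size=128))
print(feats.shape, pts.shape)              # (8, 128) (8, 3)
```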
Authors:Liuhuo Wan, Chuan Yan, Mark Huasong Meng, Kailong Wang, Haoyu Wang, Guangdong Bai, Jin Song Dong
Abstract:
Nowadays team workspaces are widely adopted for multi-user collaboration and digital resource management. To further broaden real-world applications, mainstream team workspaces platforms, such as Google Workspace and Microsoft OneDrive, allow third-party applications (referred to as add-ons) to be integrated into their workspaces, significantly extending the functionality of team workspaces. The powerful multi-user collaboration capabilities and integration of add-ons make team workspaces a central hub for managing shared resources and protecting them against unauthorized access. Due to the collaboration features of team workspaces, add-ons involved in collaborations may bypass the permission isolation enforced by the administrator, unlike in single-user permission management.
This paper aims to investigate the permission management landscape of team workspace add-ons. To this end, we perform an in-depth analysis of the access control mechanism enforced in this ecosystem, considering both multi-user and cross-app features. We identify three potential security risks that can be exploited to cause permission escalation. We then systematically reveal the landscape of permission escalation risks in the current ecosystem. Specifically, we propose an automated tool, TAI, to systematically test all possible interactions within this ecosystem. Our evaluation reveals that permission escalation vulnerabilities are widespread, with 41 interactions identified as problematic. Our findings should serve as an alert to both team workspace platforms and third-party developers.
中文: 本文研究团队工作空间插件的权限管理现状,通过自动化测试发现当前生态系统中普遍存在权限提升漏洞,共识别出41个存在问题的交互场景。
English: This paper investigates security risks in team workspace add-ons, identifying widespread permission escalation vulnerabilities through automated testing that reveal 41 problematic interactions in current ecosystems.
Authors:Yuxin Chen, Jianglan Wei, Chenfeng Xu, Boyi Li, Masayoshi Tomizuka, Andrea Bajcsy, Ran Tian
Abstract:
World models enable robots to "imagine" future observations given current observations and planned actions, and have been increasingly adopted as generalized dynamics models to facilitate robot learning. Despite their promise, these models remain brittle when encountering novel visual distractors such as objects and background elements rarely seen during training. Specifically, novel distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification. In this work, we propose Reimagination with Observation Intervention (ReOI), a simple yet effective test-time strategy that enables world models to predict more reliable action outcomes in open-world scenarios where novel and unanticipated visual distractors are inevitable. Given the current robot observation, ReOI first detects visual distractors by identifying which elements of the scene degrade in physically implausible ways during world model prediction. Then, it modifies the current observation to remove these distractors and bring the observation closer to the training distribution. Finally, ReOI "reimagines" future outcomes with the modified observation and reintroduces the distractors post-hoc to preserve visual consistency for downstream planning and verification. We validate our approach on a suite of robotic manipulation tasks in the context of action verification, where the verifier needs to select desired action plans based on predictions from a world model. Our results show that ReOI is robust to both in-distribution and out-of-distribution visual distractors. Notably, it improves task success rates by up to 3x in the presence of novel distractors, significantly outperforming action verification that relies on world model predictions without imagination interventions.
Chinese Summary: 提出的ReOI方法通过检测并移除预测过程中的视觉干扰物,随后再重新引入,显著提升了世界模型在开放环境中的可靠性,使机器人任务成功率在面对新型干扰时最高提升3倍。
English Summary: The proposed Reimagination with Observation Intervention (ReOI) method enhances world model robustness by detecting and removing visual distractors during prediction, then reintroducing them post-hoc, achieving up to 3x higher task success rates against novel distractors in robotic manipulation.
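At a high level, ReOI's test-time loop can be written as a four-step pipeline: detect implausibly degrading regions, inpaint them out, reimagine with the world model, and paste the distractors back. The skeleton below wires these steps together with caller-supplied callables; the function names and the mask-based compositing are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable
import numpy as np

def reoi_predict(obs: np.ndarray,
                 action: np.ndarray,
                 world_model: Callable[[np.ndarray, np.ndarray], np.ndarray],
                 detect_distractors: Callable[[np.ndarray], np.ndarray],
                 inpaint: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> np.ndarray:
    """Reimagination with Observation Intervention (illustrative skeleton).
    detect_distractors returns a boolean pixel mask of suspected distractors."""
    mask = detect_distractors(obs)             # 1) find implausible elements
    clean_obs = inpaint(obs, mask)             # 2) remove them from the observation
    imagined = world_model(clean_obs, action)  # 3) reimagine the action outcome
    imagined[mask] = obs[mask]                 # 4) reintroduce distractors post-hoc
    return imagined
```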
Authors:Sirui Li, Shuai Wang, Zhijun Liu, Zhongjie Jiang, Yannan Wang, Haizhou Li
Abstract:
Speech pre-processing techniques such as denoising, de-reverberation, and separation, are commonly employed as front-ends for various downstream speech processing tasks. However, these methods can sometimes be inadequate, resulting in residual noise or the introduction of new artifacts. Such deficiencies are typically not captured by metrics like SI-SNR but are noticeable to human listeners. To address this, we introduce SpeechRefiner, a post-processing tool that utilizes Conditional Flow Matching (CFM) to improve the perceptual quality of speech. In this study, we benchmark SpeechRefiner against recent task-specific refinement methods and evaluate its performance within our internal processing pipeline, which integrates multiple front-end algorithms. Experiments show that SpeechRefiner exhibits strong generalization across diverse impairment sources, significantly enhancing speech perceptual quality. Audio demos can be found at https://speechrefiner.github.io/SpeechRefiner/.
Chinese: SpeechRefiner是一种后处理工具,采用条件流匹配技术来提升语音感知质量,有效解决前端处理产生的残留噪声和伪影问题,并在多种损伤源上展现出强大的泛化能力。
English: SpeechRefiner is a post-processing tool that uses Conditional Flow Matching to enhance the perceptual quality of speech by addressing residual noise and artifacts from front-end processing, demonstrating strong generalization across various impairment sources.
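Conditional Flow Matching is the one concrete training recipe named above. A minimal rectified-flow-style CFM loss, assuming a velocity network v(x_t, t, cond) and treating the degraded front-end output as the condition, is sketched below; the actual SpeechRefiner objective may use a different probability path and conditioning.

```python
import torch

def cfm_loss(v_net, clean: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Rectified-flow-style conditional flow matching loss.
    clean: (B, T) target speech; cond: (B, T) degraded front-end output."""
    noise = torch.randn_like(clean)        # x0 ~ N(0, I)
    t = torch.rand(clean.size(0), 1)       # uniform time in [0, 1]
    x_t = (1 - t) * noise + t * clean      # linear interpolation path
    target_v = clean - noise               # constant target velocity along the path
    return torch.mean((v_net(x_t, t, cond) - target_v) ** 2)
```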
Authors:Boyang Wang, Yuhao Song, Jinyuan Cao, Peng Yu, Hongcheng Guo, Zhoujun Li
Abstract:
Children's emotional development fundamentally relies on secure attachment relationships, yet current AI companions lack the theoretical foundation to provide developmentally appropriate emotional support. We introduce DinoCompanion, the first attachment-theory-grounded multimodal robot for emotionally responsive child-AI interaction. We address three critical challenges in child-AI systems: the absence of developmentally-informed AI architectures, the need to balance engagement with safety, and the lack of standardized evaluation frameworks for attachment-based capabilities. Our contributions include: (i) a multimodal dataset of 128 caregiver-child dyads containing 125,382 annotated clips with paired preference-risk labels, (ii) CARPO (Child-Aware Risk-calibrated Preference Optimization), a novel training objective that maximizes engagement while applying epistemic-uncertainty-weighted risk penalties, and (iii) AttachSecure-Bench, a comprehensive evaluation benchmark covering ten attachment-centric competencies with strong expert consensus (κ = 0.81). DinoCompanion achieves state-of-the-art performance (57.15%), outperforming GPT-4o (50.29%) and Claude-3.7-Sonnet (53.43%), with exceptional secure base behaviors (72.99%, approaching human expert levels of 78.4%) and superior attachment risk detection (69.73%). Ablations validate the critical importance of multimodal fusion, uncertainty-aware risk modeling, and hierarchical memory for coherent, emotionally attuned interactions.
中文: DinoCompanion是首个基于依恋理论的多模态机器人,通过新型数据集、风险校准训练方法和全面评估基准解决儿童-AI互动中的关键挑战,在安全依恋行为和风险检测方面达到领先水平。
English: DinoCompanion is the first attachment-theory-based multimodal robot designed to provide emotionally responsive child-AI interaction, addressing key challenges through a novel dataset, a risk-calibrated training method, and a comprehensive evaluation benchmark, achieving state-of-the-art performance in secure attachment behaviors and risk detection.
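The abstract describes CARPO as pairing an engagement-maximizing preference term with an epistemic-uncertainty-weighted risk penalty. A hedged sketch of that shape, using a DPO-style preference loss and a variance-based uncertainty weight, is below; the exact CARPO formulation is not given in the abstract, so every term here is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def carpo_like_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
                    risk: torch.Tensor, risk_var: torch.Tensor,
                    beta: float = 0.1, lam: float = 1.0) -> torch.Tensor:
    """Preference term (DPO-style) plus an uncertainty-weighted risk penalty.
    logp_*: (B,) policy log-probs; risk: (B,) predicted risk of the chosen
    response; risk_var: (B,) epistemic variance of that risk estimate."""
    pref = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
    penalty = (risk_var * risk).mean()   # penalize risky outputs more when uncertain
    return pref + lam * penalty
```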
Authors:Chengqing Yu, Fei Wang, Chuanguang Yang, Zezhi Shao, Tao Sun, Tangwen Qian, Wei Wei, Zhulin An, Yongjun Xu
Abstract:
Multivariate Time Series Forecasting (MTSF) involves predicting future values of multiple interrelated time series. Recently, deep learning-based MTSF models have gained significant attention for their promising ability to mine semantics (global and local information) within MTS data. However, these models are pervasively susceptible to missing values caused by malfunctioning data collectors. These missing values not only disrupt the semantics of MTS, but their distribution also changes over time. Nevertheless, existing models lack robustness to such issues, leading to suboptimal forecasting performance. To this end, in this paper, we propose Multi-View Representation Learning (Merlin), which can help existing models achieve semantic alignment between incomplete observations with different missing rates and complete observations in MTS. Specifically, Merlin consists of two key modules: offline knowledge distillation and multi-view contrastive learning. The former utilizes a teacher model to guide a student model in mining semantics from incomplete observations, similar to those obtainable from complete observations. The latter improves the student model's robustness by learning from positive/negative data pairs constructed from incomplete observations with different missing rates, ensuring semantic alignment across different missing rates. Therefore, Merlin is capable of effectively enhancing the robustness of existing models against varying missing rates while preserving forecasting accuracy. Experiments on four real-world datasets demonstrate the superiority of Merlin.
中文: 本文提出多视图表征学习(Merlin)方法,通过离线知识蒸馏和多视图对比学习,增强多元时间序列预测模型对不同缺失率数据的鲁棒性,实现语义对齐并保持预测精度。
English: This paper introduces Multi-View Representation Learning (Merlin), a method that enhances the robustness of multivariate time series forecasting models against varying missing rates through offline knowledge distillation and multi-view contrastive learning, ensuring semantic alignment and maintaining forecasting accuracy.
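Merlin's two modules map onto two familiar losses: a distillation term matching student features on incomplete inputs to teacher features on complete inputs, and an InfoNCE term pulling together views of the same series under different missing rates. The combination below is an illustrative sketch under assumed shapes, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def merlin_losses(student, teacher, x: torch.Tensor, masks, tau: float = 0.1):
    """x: (B, T, C) complete series; masks: list of (B, T, C) binary masks
    encoding different missing rates. Returns (distill_loss, contrastive_loss)."""
    with torch.no_grad():
        z_t = F.normalize(teacher(x), dim=-1)                  # (B, d) teacher target
    views = [F.normalize(student(x * m), dim=-1) for m in masks]
    distill = sum(F.mse_loss(v, z_t) for v in views) / len(views)

    z1, z2 = views[0], views[1]             # two missing-rate views of the same series
    logits = z1 @ z2.t() / tau              # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))       # positives on the diagonal
    contrastive = F.cross_entropy(logits, labels)
    return distill, contrastive
```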
Authors:Wen Huang, Xuechen Liu, Xin Wang, Junichi Yamagishi, Yanmin Qian
Abstract:
Generalization remains a critical challenge in speech deepfake detection (SDD). While various approaches aim to improve robustness, generalization is typically assessed through performance metrics like equal error rate without a theoretical framework to explain model performance. This work investigates sharpness as a theoretical proxy for generalization in SDD. We analyze how sharpness responds to domain shifts and find it increases in unseen conditions, indicating higher model sensitivity. Based on this, we apply Sharpness-Aware Minimization (SAM) to reduce sharpness explicitly, leading to better and more stable performance across diverse unseen test sets. Furthermore, correlation analysis confirms a statistically significant relationship between sharpness and generalization in most test settings. These findings suggest that sharpness can serve as a theoretical indicator for generalization in SDD and that sharpness-aware training offers a promising strategy for improving robustness.
中文: 本研究将锐度确立为语音深度伪造检测泛化能力的理论指标,证明通过锐度感知最小化降低锐度可提升模型在不同测试条件下的鲁棒性和稳定性。
English: This study establishes sharpness as a theoretical indicator of generalization in speech deepfake detection, demonstrating that reducing sharpness through Sharpness-Aware Minimization improves model robustness and stability across diverse test conditions.
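Sharpness-Aware Minimization itself is a standard two-step update: ascend a distance rho toward the locally sharpest point, evaluate the gradient there, and apply it at the original weights. A minimal PyTorch rendition for a single loss closure (the SDD-specific training setup is not shown) follows.

```python
import torch

def sam_step(model, loss_fn, optimizer, rho: float = 0.05):
    """One Sharpness-Aware Minimization step; loss_fn(model) returns a scalar loss."""
    optimizer.zero_grad()
    loss_fn(model).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.clone() for p in params]
    scale = rho / (torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12)
    with torch.no_grad():                 # ascend to a nearby high-loss point
        for p, g in zip(params, grads):
            p.add_(g, alpha=scale.item())
    optimizer.zero_grad()
    loss_fn(model).backward()             # gradient evaluated at perturbed weights
    with torch.no_grad():                 # restore the original weights
        for p, g in zip(params, grads):
            p.sub_(g, alpha=scale.item())
    optimizer.step()                      # apply the SAM gradient at the original point
```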
Authors:Yuchen Ma, Dennis Frauen, Emil Javurek, Stefan Feuerriegel
Abstract:
Prior-data fitted networks (PFNs) have recently been proposed as a promising way to train tabular foundation models. PFNs are transformers that are pre-trained on synthetic data generated from a prespecified prior distribution and that enable Bayesian inference through in-context learning. In this paper, we introduce CausalFM, a comprehensive framework for training PFN-based foundation models in various causal inference settings. First, we formalize the construction of Bayesian priors for causal inference based on structural causal models (SCMs) in a principled way and derive necessary criteria for the validity of such priors. Building on this, we propose a novel family of prior distributions using causality-inspired Bayesian neural networks that enable CausalFM to perform Bayesian causal inference in various settings, including back-door, front-door, and instrumental variable adjustment. Finally, we instantiate CausalFM and explicitly train a foundation model for estimating conditional average treatment effects (CATEs) using back-door adjustment. We show that CausalFM performs competitively for CATE estimation using various synthetic and semi-synthetic benchmarks. In sum, our framework can be used as a general recipe to train foundation models for various causal inference settings. In contrast to the current state-of-the-art in causal inference, CausalFM offers a novel paradigm with the potential to fundamentally change how practitioners perform causal inference in medicine, economics, and other disciplines.
中文: CausalFM提出了一种基于先验数据拟合网络的创新框架,可在多种因果推断场景中进行贝叶斯推理,通过情境学习实现优异性能,有望革新医学、经济学等领域的实践方法。
English: CausalFM introduces a novel framework using prior-data fitted networks for Bayesian causal inference across various settings, achieving competitive performance through in-context learning and potentially transforming practices in fields like medicine and economics.
Authors:Yuchen Ma, Dennis Frauen, Emil Javurek, Stefan Feuerriegel
Abstract:
Prior-data fitted networks (PFNs) have recently been proposed as a promising way to train tabular foundation models. PFNs are transformers that are pre-trained on synthetic data generated from a prespecified prior distribution and that enable Bayesian inference through in-context learning. In this paper, we introduce CausalFM, a comprehensive framework for training PFN-based foundation models in various causal inference settings. First, we formalize the construction of Bayesian priors for causal inference based on structural causal models (SCMs) in a principled way and derive necessary criteria for the validity of such priors. Building on this, we propose a novel family of prior distributions using causality-inspired Bayesian neural networks that enable CausalFM to perform Bayesian causal inference in various settings, including for back-door, front-door, and instrumental variable adjustment. Finally, we instantiate CausalFM and explicitly train models to perform in-context learning in these settings. We show that CausalFM achieves competitive in-context learning performance even when compared to baselines that are specifically trained for the task at hand. In sum, our framework can be used as a general recipe to train foundation models for various causal inference settings. In contrast to the current state-of-the-art in causal inference, CausalFM offers a novel paradigm with the potential to fundamentally change how practitioners perform causal inference in medicine, economics, and other disciplines.
中文: CausalFM提出了一种基于先验数据拟合网络的创新框架,可在多种因果推断场景中进行贝叶斯推理,通过情境学习实现优异性能,有望革新医学、经济学等领域的实践方法。
English: CausalFM introduces a novel framework using prior-data fitted networks for Bayesian causal inference across various settings, achieving competitive performance through in-context learning and potentially transforming practices in fields like medicine and economics.
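PFN pre-training hinges on sampling synthetic datasets from a prior over structural causal models. A toy back-door sampler in that spirit, with a confounder Z affecting both treatment A and outcome Y, is below; the functional forms and noise scales are arbitrary illustrative choices, far simpler than CausalFM's causality-inspired Bayesian neural network priors.

```python
import numpy as np

def sample_backdoor_scm(n: int, rng: np.random.Generator):
    """Draw one synthetic dataset from a random back-door SCM
    (Z -> A, Z -> Y, A -> Y). Returns Z, A, Y and the implied true CATE."""
    w_az, w_ya, w_yz = rng.normal(size=3)                   # random SCM coefficients
    z = rng.normal(size=n)                                  # confounder
    a = (rng.random(n) < 1.0 / (1.0 + np.exp(-w_az * z))).astype(float)
    y = w_ya * a + w_yz * z + 0.1 * rng.normal(size=n)
    return z, a, y, np.full(n, w_ya)                        # constant effect in this toy

z, a, y, cate = sample_backdoor_scm(512, np.random.default_rng(0))
```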
Authors:Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang
Abstract:
Developing 3D-VL generalists capable of understanding 3D scenes and following natural language instructions to perform a wide range of tasks has been a long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL models still lag behind their 2D counterparts in capability and robustness, falling short of the generalist standard. A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation. We propose LEO-VL, a 3D-VL model built upon condensed feature grid (CFG), an efficient scene representation that bridges 2D perception and 3D spatial structure while significantly reducing token overhead. This efficiency unlocks large-scale training towards 3D-VL generalist, for which we curate over 700k high-quality 3D-VL data spanning four domains of real-world indoor scenes and five tasks such as captioning and dialogue. LEO-VL achieves state-of-the-art performance on a variety of 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Ablation studies confirm the efficiency of our representation, the importance of task and scene diversity, and the validity of our data curation principle. Furthermore, we introduce SceneDPO, a novel post-training objective that enhances the robustness of 3D-VL models. We hope our findings contribute to the advancement of scalable and robust 3D-VL generalists.
中文: 摘要介绍了LEO-VL这一3D视觉语言模型,它采用压缩特征网格(CFG)实现高效场景表示,并通过SceneDPO增强鲁棒性,在多个基准测试中取得领先性能,同时解决了3D-VL学习中的可扩展性和性能平衡问题。
English: The abstract introduces LEO-VL, a 3D vision-language model that utilizes the condensed feature grid (CFG) for efficient scene representation and SceneDPO for enhanced robustness, achieving state-of-the-art results on multiple benchmarks while addressing scalability and performance challenges in 3D-VL learning.
Authors:Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang
Abstract:
Developing vision-language models (VLMs) capable of understanding 3D scenes has been a longstanding goal in the 3D-VL community. Despite recent progress, 3D VLMs still fall short of their 2D counterparts in capability and robustness. A key bottleneck is that current scene representations struggle to balance performance and efficiency: competitive performance comes at the cost of heavy token overhead, which in turn hampers the scalability of 3D-VL learning. To address this, we propose the condensed feature grid (CFG), an efficient scene representation featuring significantly reduced token overhead and strong perception capability. Building on CFG, we introduce LEO-VL, a 3D VLM trained on 700k 3D-VL data spanning four real-world indoor domains and five tasks such as captioning and dialogue. To enhance the robustness of 3D VLM, we further propose SceneDPO for post-training, which involves contrasts across answers and scenes. LEO-VL achieves state-of-the-art performance on various 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Our extensive experiments highlight the efficiency of our representation, the benefit of task and scene diversity, consistent scaling effects, and the advantages of SceneDPO compared to SFT and GRPO. We hope our findings advance the efficiency, scalability, and robustness of future 3D VLMs.
中文: 摘要介绍了LEO-VL这一3D视觉语言模型,它采用压缩特征网格(CFG)实现高效场景表示,并通过SceneDPO增强鲁棒性,在多个基准测试中取得领先性能,同时解决了3D-VL学习中的可扩展性和性能平衡问题。
English: The abstract introduces LEO-VL, a 3D vision-language model that utilizes the condensed feature grid (CFG) for efficient scene representation and SceneDPO for enhanced robustness, achieving state-of-the-art results on multiple benchmarks while addressing scalability and performance challenges in 3D-VL learning.
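The condensed feature grid can be pictured as pooling dense, 2D-lifted point features into a small, spatially indexed grid so the language model sees far fewer tokens. The sketch below scatter-averages per-point features into a coarse voxel grid; the resolution and pooling rule are illustrative assumptions rather than the actual CFG construction.

```python
import torch

def condense_to_grid(points: torch.Tensor, feats: torch.Tensor, res: int = 8):
    """points: (N, 3) in [0, 1]^3; feats: (N, d) per-point 2D features.
    Returns (res**3, d) averaged voxel features -- far fewer tokens than N."""
    idx = (points.clamp(0, 1 - 1e-6) * res).long()              # (N, 3) voxel coords
    flat = idx[:, 0] * res * res + idx[:, 1] * res + idx[:, 2]  # (N,) voxel ids
    grid = torch.zeros(res ** 3, feats.size(1))
    count = torch.zeros(res ** 3, 1)
    grid.index_add_(0, flat, feats)
    count.index_add_(0, flat, torch.ones(len(flat), 1))
    return grid / count.clamp(min=1)

tokens = condense_to_grid(torch.rand(10000, 3), torch.randn(10000, 256))  # 512 tokens
```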
Authors:Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, Yuyu Luo
Abstract:
Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance. To address this impediment, we propose a unified rotary position embedding (Unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using this Unified RoPE, we introduce TransXSSM, a hybrid architecture that coherently integrates Transformer and SSM layers under this unified positional encoding scheme. At a 4K sequence length, TransXSSM exhibits training and inference speeds that are 42.3% and 29.5% faster, respectively, relative to standard Transformer models. It also delivers higher accuracy: under comparable settings, it surpasses a Transformer baseline by over 4% on language modeling benchmarks. TransXSSM furthermore scales more effectively: TransXSSM-1.3B gains 7.22% in average accuracy over its 320M version (versus about 6% gains for equivalent Transformers or SSMs). Our results show that unified positional encoding resolves positional incompatibility in hybrid models, enabling efficient, high-performance long-context modeling.
中文摘要:提出的统一旋转位置编码方法解决了Transformer与状态空间模型之间的位置编码不兼容问题,实现了TransXSSM混合架构,在长序列建模中获得了更快的速度与更高的准确率。
English Summary: The proposed Unified RoPE method resolves positional encoding incompatibility between Transformers and State Space Models, enabling TransXSSM—a hybrid architecture that achieves faster speeds and higher accuracy in long-sequence modeling.
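Standard RoPE, which Unified RoPE extends to the state-space components, rotates each feature pair by a position-dependent angle. A minimal reference implementation of that rotation is below; applying the same rotation consistently to SSM states is the paper's contribution and is not reproduced here.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, dim) with even dim. Encodes absolute position by rotating
    feature pairs (x_i, x_{i+dim/2}) through position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (L, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = apply_rope(torch.randn(16, 64))  # same rotation for attention and SSM inputs
```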
Authors:Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei
Abstract:
Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods of visual affordance prediction often rely on manually annotated data or condition only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed <instruction, visual affordance> pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite being trained only on rendered objects in simulation. Using affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: https://unsup-affordance.github.io/
中文: UAD是一种无监督方法,无需人工标注即可从基础模型中提取功能知识到任务条件模型中,使机器人经过模拟数据训练后能够将功能预测泛化到真实场景和多样化人类活动中。
English: UAD is an unsupervised method that distills affordance knowledge from foundation models into a task-conditioned model without manual annotations, enabling robots to generalize affordance predictions to real-world scenes and diverse human activities after training on simulated data.
Authors:Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, Weizhu Chen
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems, together with the weakly verified answers in existing distillation-oriented synthetic datasets, limits their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model's capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.
中文摘要:自我感知弱点驱动问题合成(SwS)框架通过识别模型在强化学习训练中持续失败的问题,并针对这些弱点合成新问题,使7B和32B模型在八个主流推理基准上分别平均提升10.0%和7.7%。
English Summary: The Self-aware Weakness-driven problem Synthesis (SwS) framework enhances reinforcement learning for large language models by identifying their persistent errors during training and generating targeted problems to address these weaknesses, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across reasoning benchmarks.
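The weakness-identification step in SwS reduces to bookkeeping over RL rollouts: questions the model keeps failing across sampling iterations are flagged for concept extraction and augmentation. A minimal version of that selection rule (the thresholds are illustrative) is:

```python
from collections import defaultdict

def find_weak_questions(rollout_log, fail_rate: float = 0.9, min_attempts: int = 8):
    """rollout_log: iterable of (question_id, is_correct) pairs collected across
    RL sampling iterations. Returns ids the model consistently fails to learn."""
    attempts, fails = defaultdict(int), defaultdict(int)
    for qid, correct in rollout_log:
        attempts[qid] += 1
        fails[qid] += int(not correct)
    return [q for q in attempts
            if attempts[q] >= min_attempts and fails[q] / attempts[q] >= fail_rate]
```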
Authors:Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, Zixin Zhang, Bin Wang, Bo Li, Buyun Ma, Changxin Miao, Changyi Wan, Chen Xu, Dapeng Shi, Dingyuan Hu, Enle Liu, Guanzhe Huang, Gulin Yan, Hanpeng Hu, Haonan Jia, Jiahao Gong, Jiaoren Wu, Jie Wu, Jie Yang, Junzhe Lin, Kaixiang Li, Lei Xia, Longlong Gu, Ming Li, Nie Hao, Ranchen Ming, Shaoliang Pang, Siqi Liu, Song Yuan, Tiancheng Cao, Wen Li, Wenqing He, Xu Zhao, Xuelin Zhang, Yanbo Yu, Yinmin Zhong, Yu Zhou, Yuanwei Liang, Yuanwei Lu, Yuxiang Yang, Zidong Yang, Zili Zhang, Binxing Jiao, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Daxin Jiang, Shuchang Zhou, Chen Hu
Abstract:
Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance on AQAA tasks.
Chinese: Step-Audio-AQAA 是一种端到端的大型音频语言模型,通过双码本分词器和神经声码器直接生成高保真语音响应,在音频问答任务中性能优于现有模型。
English: Step-Audio-AQAA is an end-to-end large audio-language model that integrates a dual-codebook tokenizer and a neural vocoder to directly generate high-fidelity speech responses, outperforming existing models in audio query-answer tasks.
Authors:Matteo Bordin, Madhukara S. Holla, Sakthivel Velumani, Salvatore D'Oro, Tommaso Melodia
Abstract:
The application of small-factor, 5G-enabled Unmanned Aerial Vehicles (UAVs) has recently gained significant interest in various aerial and Industry 4.0 applications. However, ensuring reliable, high-throughput, and low-latency 5G communication in aerial applications remains a critical and underexplored problem. This paper presents 5G Aero, a compact UAV optimized for 5G connectivity, aimed at fulfilling stringent 3rd Generation Partnership Project (3GPP) requirements. We conduct a set of experiments in an indoor environment, evaluating the UAV's ability to establish high-throughput, low-latency communications in both Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS) conditions. Our findings demonstrate that the 5G Aero meets the required 3GPP standards for Command and Control (C2) packet latency in both LoS and NLoS conditions and for video latency in LoS communications, and it maintains acceptable latency levels for video transmission in NLoS conditions. Additionally, we show that the 5G module installed on the UAV introduces a negligible 1% decrease in flight time, showing that 5G technologies can be integrated into commercial off-the-shelf UAVs with minimal impact on battery lifetime. This paper contributes to the literature by demonstrating the practical capabilities of current 5G networks to support advanced UAV operations in telecommunications, offering insights into potential enhancements and optimizations for UAV performance in 5G networks.
中文: 本文推出的5G Aero无人机在视距与非视距环境下均满足3GPP标准对控制及视频传输的时延要求,且对续航影响微乎其微,证实了5G技术支持高级无人机应用的可行性。
English: This paper introduces the 5G Aero UAV, which successfully meets 3GPP latency standards for both control and video transmission in various conditions while maintaining minimal impact on flight time, demonstrating 5G's viability for advanced drone operations.
Authors:Jingshun Huang, Haitao Lin, Tianyu Wang, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue
Abstract:
This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these approaches primarily follow a complex multi-stage pipeline that first segments part instances in the point cloud and then estimates the Normalized Part Coordinate Space (NPCS) representation for 6D poses. Such pipelines suffer from high computational costs and low performance in real-time robotic tasks. To address these limitations, we propose YOEO, a single-stage method that simultaneously outputs instance segmentation and NPCS representations in an end-to-end manner. We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid. We further utilize a clustering algorithm to distinguish points based on their estimated centroid distances. Finally, we separate the NPCS region of each instance and align the separated regions with the real point cloud to recover the final pose and size. Experimental results on the GAPart dataset demonstrate the pose estimation capabilities of our proposed single-shot method. We also deploy our synthetically-trained model in a real-world setting, providing real-time visual feedback at 200Hz, enabling a physical Kinova robot to interact with unseen articulated objects. This showcases the utility and effectiveness of our proposed method.
中文: 本文提出YOEO单阶段方法,可同时实现实例分割和NPCS表征,用于关节物体的类别级姿态估计,在机器人操作任务中达到200Hz实时性能。
English: This paper introduces YOEO, a single-stage method that simultaneously performs instance segmentation and NPCS representation for real-time category-level pose estimation of articulated objects, achieving 200Hz performance in robotic manipulation tasks.
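The voting-and-clustering core of this kind of single-stage pipeline is compact: every point adds its predicted offset to vote for its part centroid, and votes are grouped by proximity. The sketch below uses scikit-learn's DBSCAN as a stand-in for the paper's centroid-distance clustering; the shapes and eps value are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def vote_and_cluster(points: np.ndarray, offsets: np.ndarray, eps: float = 0.05):
    """points, offsets: (N, 3). Each point votes for its part centroid; votes
    are clustered to separate part instances. Returns per-point instance labels."""
    votes = points + offsets                                   # (N, 3) centroid votes
    return DBSCAN(eps=eps, min_samples=10).fit_predict(votes)  # -1 marks noise

pts = np.random.rand(2000, 3).astype(np.float32)
offs = np.zeros_like(pts)        # ideal offsets would point each part to its centroid
instance_ids = vote_and_cluster(pts, offs)
```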
Authors:Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein
Abstract:
Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.
中文: 本文提出了一种基于几何的长期空间记忆框架,通过存储和检索三维信息来增强视频世界模型的场景一致性,评估显示其在质量和连贯性方面优于现有基准方法。
English: This paper introduces a geometry-grounded long-term spatial memory framework to enhance scene consistency in video world models, overcoming limitations of temporal context windows and demonstrating improved quality and coherence in evaluations.
Authors:Hongcheng Guo, Zheyong Xie, Shaosheng Cao, Boyang Wang, Weiting Liu, Zheyu Ye, Zhoujun Li, Zuozhu Liu
Abstract:
As interest in using Large Language Models (LLMs) for interactive and emotionally rich experiences grows, virtual pet companionship emerges as a novel yet underexplored application. Existing approaches focus on basic pet role-playing interactions without systematically benchmarking LLMs for comprehensive companionship. In this paper, we introduce Pet-Bench, a dedicated benchmark that evaluates LLMs across both self-interaction and human-interaction dimensions. Unlike prior work, Pet-Bench emphasizes self-evolution and developmental behaviors alongside interactive engagement, offering a more realistic reflection of pet companionship. It features diverse tasks such as intelligent scheduling, memory-based dialogues, and psychological conversations, with over 7,500 interaction instances designed to simulate complex pet behaviors. Evaluation of 28 LLMs reveals significant performance variations linked to model size and inherent capabilities, underscoring the need for specialized optimization in this domain. Pet-Bench serves as a foundational resource for benchmarking pet-related LLM abilities and advancing emotionally immersive human-pet interactions.
中文: 本文提出Pet-Bench这一专门基准,通过自主进化和交互任务评估大语言模型在虚拟宠物陪伴中的表现,对28个模型的测试揭示了性能差异,凸显了该领域专门优化的必要性。
English: This paper introduces Pet-Bench, a specialized benchmark for evaluating Large Language Models in virtual pet companionship through self-evolution and interactive tasks, revealing performance variations among 28 models that highlight the need for domain-specific optimization.
Authors:Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari
Abstract:
Deep learning approaches process data in a layer-by-layer way with intermediate (or latent) features. We aim to design a general solution for optimizing latent manifolds to improve performance on classification, segmentation, completion, and/or reconstruction through probabilistic models. This paper proposes a variational inference model which leads to a clustered embedding. We introduce additional variables in the latent space, called nebula anchors, that guide the latent variables to form clusters during training. To prevent the anchors from clustering among themselves, we employ a variational constraint that enforces the latent features within an anchor to form a Gaussian distribution, resulting in a generative model we refer to as Nebula Variational Coding (NVC). Since each latent feature can be labeled with the closest anchor, we also propose to apply metric learning in a self-supervised way to make the separation between clusters more explicit. As a consequence, the latent variables of our variational coder form clusters which adapt to the semantics of the training data, e.g., the categorical labels of each sample. We demonstrate experimentally that it can be used within different architectures designed to solve different problems, including text sequences, images, 3D point clouds, and volumetric data, validating the advantage of our proposed method.
中文: 本文提出了一种名为星云变分编码(NVC)的变分推理模型,通过引入星云锚点引导潜在特征形成聚类,并利用概率优化和自监督度量学习提升多种任务的性能。
English: This paper introduces Nebula Variational Coding (NVC), a variational inference model that uses nebula anchors to guide latent features into clusters, enhancing performance across various tasks through probabilistic optimization and self-supervised metric learning.
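The anchor mechanism can be illustrated as a VAE whose KL term pulls each latent toward its nearest anchor rather than toward a single standard Gaussian. The sketch below, with unit-variance anchors and hard nearest-anchor assignment, is one reading of that idea, not the paper's exact NVC objective.

```python
import torch

def nearest_anchor_kl(mu: torch.Tensor, logvar: torch.Tensor,
                      anchors: torch.Tensor) -> torch.Tensor:
    """mu, logvar: (B, d) encoder outputs; anchors: (K, d) learnable anchor means.
    Computes KL(N(mu, var) || N(nearest_anchor, I)), averaged over the batch."""
    d2 = torch.cdist(mu, anchors) ** 2              # (B, K) squared distances
    nearest = anchors[d2.argmin(dim=1)]             # (B, d) assigned anchors
    kl = 0.5 * (logvar.exp() + (mu - nearest) ** 2 - 1.0 - logvar).sum(dim=1)
    return kl.mean()
```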
Authors:Minjun Zhu, Qiujie Xie, Yixuan Weng, Jian Wu, Zhen Lin, Linyi Yang, Yue Zhang
Abstract:
The emergence of the Artificial Intelligence (AI) Scientist represents a paradigm shift in scientific discovery, with large language models (LLMs) taking the lead as the primary executor in the entire scientific workflow from idea generation to experiment implementation. Recent AI Scientist studies demonstrate sufficient capabilities for independent scientific discovery, with generated research reports gaining acceptance at the ICLR 2025 workshop and at ACL 2025, arguing that a human-level AI Scientist, capable of uncovering phenomena previously unknown to humans, may be imminent. Despite this substantial progress, the AI Scientist has yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools. Based on extensive quantitative evidence from existing benchmarks in complex engineering tasks and a systematic evaluation assessing 28 research papers generated by five advanced AI Scientist systems, we argue that the fundamental bottleneck for AI Scientists lies in their capability to execute the requisite verification procedures. Current AI Scientist systems lack the execution capabilities needed to run rigorous experiments and produce high-quality scientific papers. To better illustrate the root cause of this implementation gap, we provide an in-depth discussion of the fundamental limitations of the AI Scientist. This position paper aims to call on the community to bridge the implementation gap.
中文摘要:人工智能科学家范式在自动化科研流程中展现出潜力,但其根本瓶颈在于验证执行能力的缺失,导致虽能产出被认可的研究成果,却难以实现突破性科学发现。
English Summary: The AI Scientist paradigm shows promise in automating scientific workflows but is hindered by a critical implementation gap, primarily due to insufficient verification capabilities that prevent groundbreaking achievements despite some accepted research outputs.
Authors:Gustav Müller-Franzes, Lorena Escudero Sánchez, Nicholas Payne, Alexandra Athanasiou, Michael Kalogeropoulos, Aitor Lopez, Alfredo Miguel Soro Busto, Julia Camps Herrero, Nika Rasoolzadeh, Tianyu Zhang, Ritse Mann, Debora Jutz, Maike Bode, Christiane Kuhl, Wouter Veldhuis, Oliver Lester Saldanha, JieFu Zhu, Jakob Nikolas Kather, Daniel Truhn, Fiona J. Gilbert
Abstract:
Detecting breast cancer early is of the utmost importance to effectively treat the millions of women afflicted by breast cancer worldwide every year. Although mammography is the primary imaging modality for screening breast cancer, there is an increasing interest in adding magnetic resonance imaging (MRI) to screening programmes, particularly for women at high risk. Recent guidelines by the European Society of Breast Imaging (EUSOBI) recommended breast MRI as a supplemental screening tool for women with dense breast tissue. However, acquiring and reading MRI scans requires significantly more time from expert radiologists. This highlights the need to develop new automated methods to detect cancer accurately using MRI and Artificial Intelligence (AI), which have the potential to support radiologists in breast MRI interpretation and classification and help detect cancer earlier. For this reason, the ODELIA consortium has made this multi-centre dataset publicly available to assist in developing AI tools for the detection of breast cancer on MRI.
Chinese: 乳腺癌的早期检测至关重要,虽然磁共振成像越来越多地与乳腺X光摄影结合用于高风险女性筛查,但它需要放射科医生投入更多时间,因此推动开发人工智能工具辅助诊断,ODELIA联盟为此公开了多中心数据集以支持相关研究。
English: Early detection of breast cancer is crucial, and while MRI is increasingly used alongside mammography for high-risk women, it demands more radiologist time, prompting the development of AI tools to aid in interpretation, with the ODELIA consortium releasing a public dataset to support this effort.
Authors:Xiao Yu, Baolin Peng, Ruize Xu, Michel Galley, Hao Cheng, Suman Nath, Jianfeng Gao, Zhou Yu
Abstract:
Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear which behaviors are effective and which are missing for long-horizon AI agent tasks. In this work, we propose Dyna-Think, a thinking framework that integrates planning with an internal world model, together with reasoning and acting, to enhance AI agent performance. To enable Dyna-Think, we propose Dyna-Think Imitation Learning (DIT) and Dyna-Think Dyna Training (DDT). To initialize a policy with Dyna-Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance Dyna-Think, DDT uses a two-stage training process to first improve the agent's world modeling ability via objectives such as state prediction or critique generation, and then improve the agent's action via policy training. We evaluate our methods on OSWorld, and demonstrate that Dyna-Think improves the agent's in-domain and out-of-domain performance, achieving similar best-of-n performance compared to R1 while generating 2x fewer tokens on average. Our extensive empirical studies reveal that 1) using critique generation for world model training is effective for improving policy performance; and 2) AI agents with better performance correlate with better world modeling abilities. We believe our results suggest a promising research direction to integrate world model simulation into AI agents to enhance their reasoning, planning, and acting capabilities.
中文:Dyna-Think框架通过整合规划、世界模型和推理来提升AI智能体性能,在实现与R1相当效果的同时显著减少了令牌使用量。
English: The Dyna-Think framework enhances AI agent performance by integrating planning, world modeling, and reasoning, achieving results comparable to R1 with significantly reduced token usage.
Authors:Xiao Yu, Baolin Peng, Ruize Xu, Michel Galley, Hao Cheng, Suman Nath, Jianfeng Gao, Zhou Yu
Abstract:
Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear which behaviors are effective and which are missing for long-horizon AI agent tasks. In this work, we propose Dyna-Think, a thinking framework that integrates planning with an internal world model, together with reasoning and acting, to enhance AI agent performance. To enable Dyna-Think, we propose Dyna-Think Imitation Learning (DIT) and Dyna-Think Dyna Training (DDT). To initialize a policy with Dyna-Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance Dyna-Think, DDT uses a two-stage training process to first improve the agent's world modeling ability via objectives such as state prediction or critique generation, and then improve the agent's action via policy training. We evaluate our methods on OSWorld and WindowsAgentArena, and demonstrate that Dyna-Think improves the agent's in-domain and out-of-domain performance, achieving similar best-of-n performance compared to R1 while generating 2x fewer tokens on average. Our extensive empirical studies reveal that 1) using critique generation for world model training is effective for improving policy performance; and 2) AI agents with better performance correlate with better world modeling abilities. We believe our results suggest a promising research direction to integrate world model simulation into AI agents to enhance their reasoning, planning, and acting capabilities.
中文:Dyna-Think框架通过整合规划、世界模型和推理来提升AI智能体性能,在实现与R1相当效果的同时显著减少了令牌使用量。
English: The Dyna-Think framework enhances AI agent performance by integrating planning, world modeling, and reasoning, achieving results comparable to R1 with significantly reduced token usage.
Authors:Konstantinos Bourazas, Savvas Papaioannou, Panayiotis Kolios
Abstract:
In this work we introduce a novel adaptive anomaly detection framework specifically designed for monitoring sequential random finite set (RFS) observations. Our approach effectively distinguishes between In-Control data (normal) and Out-Of-Control data (anomalies) by detecting deviations from the expected statistical behavior of the process. The primary contributions of this study include the development of an innovative RFS-based framework that not only learns the normal behavior of the data-generating process online but also dynamically adapts to behavioral shifts to accurately identify abnormal point patterns. To achieve this, we introduce a new class of RFS-based posterior distributions, named Power Discounting Posteriors (PD), which facilitate adaptation to systematic changes in data while enabling anomaly detection of point pattern data through a novel predictive posterior density function. The effectiveness of the proposed approach is demonstrated by extensive qualitative and quantitative simulation experiments.
中文: 本文提出了一种针对序列随机有限集观测的自适应异常检测框架,通过引入幂折扣后验分布实现在线学习系统变化并精确识别异常点模式。
English: This paper presents a novel adaptive anomaly detection framework for sequential random finite set observations, utilizing Power Discounting Posteriors to dynamically identify abnormal point patterns through online learning and behavioral adaptation.
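The power-discounting idea can be grounded in a tiny conjugate example: for event counts from a point pattern, a Gamma posterior over a Poisson rate is raised to a power alpha before each update, so older evidence decays and the monitor tracks behavioral shifts. This Gamma-Poisson analogue is purely illustrative; it is not the paper's RFS-based PD construction.

```python
def pd_update(a: float, b: float, count: int, alpha: float = 0.95):
    """One power-discounted conjugate update for a Poisson rate with a
    Gamma(a, b) posterior. Smaller alpha forgets past observations faster."""
    a = alpha * (a - 1.0) + 1.0   # Gamma(a, b)^alpha is Gamma(alpha*(a-1)+1, alpha*b)
    b = alpha * b
    return a + count, b + 1.0     # standard Gamma-Poisson update with the new count

a, b = 1.0, 1.0                   # vague prior
for n in [3, 4, 2, 15, 14]:       # an abrupt shift in point-pattern intensity
    a, b = pd_update(a, b, n)
    print(f"posterior mean rate = {a / b:.2f}")   # adapts toward the new regime
```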
Authors:Savvas Papaioannou, Panayiotis Kolios, Christos G. Panayiotou, Marios M. Polycarpou
Abstract:
Automated inspection with Unmanned Aerial Systems (UASs) is a transformative capability set to revolutionize various application domains. However, this task is inherently complex, as it demands the seamless integration of perception, planning, and control which existing approaches often treat separately. Moreover, it requires accurate long-horizon planning to predict action sequences, in contrast to many current techniques, which tend to be myopic. To overcome these limitations, we propose a 3D inspection approach that unifies perception, planning, and control within a single data-driven predictive control framework. Unlike traditional methods that rely on known UAS dynamic models, our approach requires only input-output data, making it easily applicable to off-the-shelf black-box UASs. Our method incorporates back-face elimination, a visibility determination technique from 3D computer graphics, directly into the control loop, thereby enabling the online generation of accurate, long-horizon 3D inspection trajectories.
Chinese: 我们提出的三维检测方法将感知、规划与控制整合在统一的数据驱动框架中,无需无人机动态模型即可为现成设备生成长期精确检测轨迹。
English: Our proposed 3D inspection approach integrates perception, planning, and control within a unified data-driven framework, enabling long-horizon trajectory generation for off-the-shelf UASs without requiring dynamic models.
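Back-face elimination, the visibility test folded into the control loop, is a one-line geometric check: a triangle whose outward normal points away from the camera cannot be visible. A standard implementation under an assumed mesh layout is:

```python
import numpy as np

def visible_faces(vertices: np.ndarray, faces: np.ndarray, cam: np.ndarray):
    """vertices: (V, 3); faces: (F, 3) vertex indices with consistent winding;
    cam: (3,) camera position. Returns a boolean mask of front-facing faces."""
    tri = vertices[faces]                                             # (F, 3, 3)
    normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])  # (F, 3)
    to_cam = cam - tri.mean(axis=1)                                   # face -> camera
    return np.einsum('fd,fd->f', normals, to_cam) > 0.0

verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
tris = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
mask = visible_faces(verts, tris, cam=np.array([2.0, 2.0, 2.0]))
```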
Authors:Yusuke Kanamori, Yuki Okamoto, Taisei Takano, Shinnosuke Takamichi, Yuki Saito, Hiroshi Saruwatari
Abstract:
In text-to-audio (TTA) research, the relevance between input text and output audio is an important evaluation aspect. Traditionally, it has been evaluated from both subjective and objective perspectives. However, subjective evaluation is costly in terms of money and time, and it is unclear how well objective metrics correlate with subjective evaluation scores. In this study, we construct RELATE, an open-source dataset of subjective relevance evaluations. We also benchmark a model for automatically predicting the subjective evaluation score from synthesized audio. Our model outperforms a conventional CLAPScore model, and this trend extends to many sound categories.
中文: 本研究构建了RELATE开源数据集用于主观评估文本与音频相关性,并提出一种模型,该模型在多种声音类别中预测主观评分时优于传统CLAPScore模型。
English: This study introduces RELATE, an open-source dataset for subjective evaluation of text-audio relevance, and proposes a model that outperforms CLAPScore in predicting subjective scores across various sound categories.
Authors:Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu
Abstract:
We present a new benchmark for evaluating Deep Search--a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an average performance score of 32.96 on our benchmark. With further analysis, we highlight retrieval as the main bottleneck: existing methods struggle to conduct deep searches and retrieve all necessary evidence. Consequently, they often reason over partial context, leading to significant performance degradation.
中文: 本文提出了一个用于评估深度搜索的新基准,这是一种需要跨多样化企业源进行多跳推理的复杂检索增强生成方法,结果显示即使最优方法也因检索瓶颈而表现不佳,平均得分仅为32.96。
English: This paper introduces a new benchmark for evaluating Deep Search, a complex form of retrieval-augmented generation that requires multi-hop reasoning across diverse enterprise sources, revealing that even top-performing methods struggle with retrieval as the main bottleneck, achieving only a 32.96 average score.
Authors:Yiwei He, Xiangtai Li, Zhenglin Huang, Yi Dong, Hao Fei, Jiangning Zhang, Baoyuan Wu, Guangliang Cheng
Abstract:
The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.
Chinese: BiMi是一种双语多模态框架,通过定位局部图像编辑和跨语言不一致性来增强虚假信息检测,利用GRPO和在线检索等先进技术,在准确性和解释质量上超越现有方法。
English: BiMi is a bilingual multimodal framework that enhances misinformation detection by identifying localized image edits and cross-lingual inconsistencies, outperforming existing methods in accuracy and explanation quality through advanced techniques like GRPO and online retrieval.
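GRPO, applied here to improve explanation quality, scores each sampled explanation relative to its sampling group instead of relying on a learned value function. The group-relative advantage computation at its core is:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G responses sampled from one prompt.
    GRPO normalizes within the group, so no value network is required."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

adv = grpo_advantages(torch.tensor([0.2, 0.9, 0.4, 0.7]))
# positive advantages reinforce above-group-average explanations
```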
Authors:Jiangping Huang, Dongming Jin, Weisong Sun, Yang Liu, Zhi Jin
Abstract:
This paper envisions a knowledge-guided multi-agent framework named KGMAF for automated requirements development. KGMAF aims to address gaps in current automation systems for software engineering (SE), which prioritize code development and overlook the complexities of requirements tasks. KGMAF is composed of six specialized agents and an artifact pool to improve efficiency and accuracy. Specifically, KGMAF outlines the functionality, actions, and knowledge of each agent and provides the conceptual design of the artifact pool. Our case study highlights the potential of KGMAF in real-world scenarios. Finally, we outline several research opportunities for implementing and enhancing automated requirements development using multi-agent systems. We believe that KGMAF will play a pivotal role in shaping the future of automated requirements development in the era of LLMs.
中文: 本文提出了KGMAF,这是一个知识引导的多智能体框架,旨在通过六个专门智能体和一个工件池来提升软件工程中自动化需求开发的效率和准确性。
English: This paper introduces KGMAF, a knowledge-guided multi-agent framework designed to enhance automated requirements development in software engineering by employing six specialized agents and an artifact pool to improve efficiency and accuracy.
Authors:Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Francesco Orabona, Sanmi Koyejo, David Donoho
Abstract:
Science progresses by iteratively advancing and correcting humanity's understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated "Refutations and Critiques" (R&C) Track. This R&C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.
中文: 本立场文件主张在机器学习会议中设立专门的"反驳与批评"环节,以系统性纠正错误研究并促进该领域的自我修正机制。
English: This position paper advocates for establishing a dedicated "Refutations and Critiques" track at machine learning conferences to systematically address flawed research and foster self-correction within the field.
Authors:Liangbin Xie, Yu Li, Shian Du, Menghan Xia, Xintao Wang, Fanghua Yu, Ziyan Chen, Pengfei Wan, Jiantao Zhou, Chao Dong
Abstract:
Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for the cascaded VSR models of the latter stage, which are currently underexplored. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.
中文: 本研究提出一种两阶段视频生成框架,通过优化退化策略、时间采样和高效注意力机制,使轻量级级联视频超分辨率模型能够增强低分辨率语义内容,为高分辨率输出建立了简洁有效的基准。
English: This study develops a two-stage video generation framework where a lightweight cascaded video super-resolution model enhances low-resolution semantic content through optimized degradation strategies, temporal sampling, and efficient attention mechanisms, establishing a simple yet effective baseline for high-resolution output.
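Generating aligned training pairs is the first design principle above: the high-resolution target is degraded so that the low-resolution input statistically resembles what the base generator actually emits. A generic degradation of that kind, bicubic downscaling plus mild noise, is sketched below; the paper's two strategies are tailored to the base model and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def make_vsr_pair(hr: torch.Tensor, scale: int = 4, noise_std: float = 0.02):
    """hr: (B, C, T, H, W) high-res clip in [0, 1]. Returns an (lr, hr) pair with
    bicubic downsampling and mild noise standing in for generator artifacts."""
    b, c, t, h, w = hr.shape
    frames = hr.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    lr = F.interpolate(frames, scale_factor=1 / scale, mode='bicubic',
                       align_corners=False)
    lr = (lr + noise_std * torch.randn_like(lr)).clamp(0, 1)
    lr = lr.reshape(b, t, c, h // scale, w // scale).permute(0, 2, 1, 3, 4)
    return lr, hr

lr, hr = make_vsr_pair(torch.rand(1, 3, 8, 256, 256))   # lr: (1, 3, 8, 64, 64)
```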
Authors:Liangbin Xie, Yu Li, Shian Du, Menghan Xia, Xintao Wang, Fanghua Yu, Ziyan Chen, Pengfei Wan, Jiantao Zhou, Chao Dong
Abstract:
Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for latter cascaded VSR models, which are underexplored currently. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.
中文: 本研究提出一种两阶段视频生成框架,通过优化退化策略、时间采样和高效注意力机制,使轻量级级联视频超分辨率模型能够增强低分辨率语义内容,为高分辨率输出建立了简洁有效的基准。
English: This study develops a two-stage video generation framework where a lightweight cascaded video super-resolution model enhances low-resolution semantic content through optimized degradation strategies, temporal sampling, and efficient attention mechanisms, establishing a simple yet effective baseline for high-resolution output.
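The noise-augmentation analysis above lends itself to a small illustration. Below is a minimal, hypothetical sketch of building one VSR training pair: box downsampling stands in for the paper's two base-model-matched degradation strategies (whose details are not given here), and Gaussian noise with a randomly sampled level models the LR noise augmentation.

```python
import numpy as np

def make_training_pair(hr, scale=4, max_sigma=0.1, rng=None):
    """Toy degradation pipeline: box-downsample an HR frame, then apply
    noise augmentation to the LR input at a randomly sampled level.
    Stand-in for the paper's learned, generator-aligned degradations."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = hr.shape
    # Box-filter downsampling by averaging scale x scale blocks.
    lr = hr[: h - h % scale, : w - w % scale].reshape(
        h // scale, scale, w // scale, scale, c).mean(axis=(1, 3))
    sigma = rng.uniform(0.0, max_sigma)            # noise-augmentation level
    lr_noisy = np.clip(lr + rng.normal(0.0, sigma, lr.shape), 0.0, 1.0)
    return lr_noisy.astype(np.float32), sigma, hr

hr = np.random.rand(64, 64, 3).astype(np.float32)
lr, sigma, target = make_training_pair(hr)
print(lr.shape, round(float(sigma), 3))            # (16, 16, 3) and the level
```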
Authors:Shu Yang, Junchao Wu, Xuansheng Wu, Derek Wong, Ninhao Liu, Di Wang
Abstract:
Large Reasoning Models (LRMs) have achieved remarkable performance on complex tasks by engaging in extended reasoning before producing final answers, yet this strength introduces the risk of overthinking, where excessive token generation occurs even for simple tasks. While recent work in efficient reasoning seeks to reduce reasoning length while preserving accuracy, it remains unclear whether such optimization is truly a free lunch. Drawing on the intuition that compressing reasoning may reduce the robustness of model responses and lead models to omit key reasoning steps, we investigate whether efficient reasoning strategies introduce behavioral inconsistencies. To systematically assess this, we introduce $ICBENCH$, a benchmark designed to measure inconsistency in LRMs across three dimensions: inconsistency across task settings (ITS), inconsistency between training objectives and learned behavior (TR-LB), and inconsistency between internal reasoning and self-explanations (IR-SE). Applying $ICBENCH$ to a range of open-source LRMs, we find that while larger models generally exhibit greater consistency than smaller ones, they all display widespread "scheming" behaviors, including self-disagreement, post-hoc rationalization, and the withholding of reasoning cues. Crucially, our results demonstrate that efficient reasoning strategies such as No-Thinking and Simple Token-Budget consistently increase all three defined types of inconsistency. These findings suggest that although efficient reasoning enhances token-level efficiency, further investigation is imperative to ascertain whether it concurrently introduces the risk of models evading effective supervision.
中文摘要:大型推理模型在简单任务上存在过度思考的风险,尽管高效推理策略旨在减少令牌使用,但它们可能引发行为不一致,从而削弱模型的鲁棒性和透明度。
English Summary: Large Reasoning Models risk overthinking on simple tasks, and while efficient reasoning strategies aim to reduce token usage, they may introduce behavioral inconsistencies that compromise model robustness and transparency.
Authors:Pengxiang Li, Yuwei Wu, Zhi Gao, Xiaomeng Fan, Wei Wu, Zhipeng Lu, Yunde Jia, Mehrtash Harandi
Abstract:
Learning in hyperbolic spaces has attracted increasing attention due to its superior ability to model hierarchical structures of data. Most existing hyperbolic learning methods use fixed distance measures for all data, assuming a uniform hierarchy across all data points. However, real-world hierarchical structures exhibit significant diversity, making this assumption overly restrictive. In this paper, we propose a geometry-aware distance measure in hyperbolic spaces, which dynamically adapts to varying hierarchical structures. Our approach derives the distance measure by generating tailored projections and curvatures for each pair of data points, effectively mapping them to an appropriate hyperbolic space. We introduce a revised low-rank decomposition scheme and a hard-pair mining mechanism to mitigate the computational cost of pair-wise distance computation without compromising accuracy. We present an upper bound on the low-rank approximation error using Talagrand's concentration inequality, ensuring theoretical robustness. Extensive experiments on standard image classification (MNIST, CIFAR-10 and CIFAR-100), hierarchical classification (5-level CIFAR-100), and few-shot learning tasks (mini-ImageNet, tiered-ImageNet) demonstrate the effectiveness of our method. Our approach consistently outperforms learning methods that use fixed distance measures, with notable improvements on few-shot learning tasks, where it achieves over 5\% gains on mini-ImageNet. The results reveal that adaptive distance measures better capture diverse hierarchical structures, with visualization showing clearer class boundaries and improved prototype separation in hyperbolic spaces.
Chinese: 本文提出了一种双曲空间中的几何感知距离度量方法,通过为每对数据点生成定制化投影和曲率来动态适应多样化的层次结构,在少样本学习任务中显著优于固定距离方法,取得了超过5%的性能提升。
English: This paper introduces a geometry-aware distance measure in hyperbolic spaces that dynamically adapts to diverse hierarchical structures by generating tailored projections and curvatures for each data pair, outperforming fixed-distance methods with notable gains in few-shot learning tasks.
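As a concrete reference point for the adaptive-distance idea, here is a minimal sketch: the standard Poincare-ball distance parameterized by a curvature c, together with a hypothetical per-pair curvature function that stands in for the learned projections and curvatures described above.

```python
import numpy as np

def poincare_dist(x, y, c):
    """Geodesic distance in the Poincare ball of curvature -c (c > 0)."""
    sq = lambda v: np.sum(v * v, axis=-1)
    num = 2.0 * c * sq(x - y)
    den = (1.0 - c * sq(x)) * (1.0 - c * sq(y))
    return np.arccosh(1.0 + num / den) / np.sqrt(c)

def pair_curvature(x, y, a=1.0, b=1.0):
    """Hypothetical per-pair curvature head: the paper learns tailored
    projections and curvatures; a fixed positive map of the pair's
    norms stands in for that learned component here."""
    s = np.sum(x * x, axis=-1) + np.sum(y * y, axis=-1)
    return a / (1.0 + b * s)  # stays positive, varies per pair

x = np.array([0.1, 0.2]); y = np.array([-0.3, 0.05])
c = pair_curvature(x, y)
print(float(c), float(poincare_dist(x, y, c)))
```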
Authors:Pengxiang Li, Wei Wu, Zhi Gao, Xiaomeng Fan, Peilin Yu, Yuwei Wu, Zhipeng Lu, Yunde Jia, Mehrtash Harandi
Abstract:
We propose a hyperbolic set-to-set distance measure for computing dissimilarity between sets in hyperbolic space. While point-to-point distances in hyperbolic space effectively capture hierarchical relationships between data points, many real-world applications require comparing sets of hyperbolic data points, where the local structure and the global structure of the sets carry crucial semantic information. The proposed \underline{h}yperbolic \underline{s}et-\underline{to}-\underline{s}et \underline{d}istance measure (HS2SD) integrates both global and local structural information: global structure through geodesic distances between Einstein midpoints of hyperbolic sets, and local structure through topological characteristics of the two sets. To efficiently compute topological differences, we prove that a finite Thue-Morse sequence of degree and adjacency matrices can serve as a robust approximation of the topological structure of a set. By accounting for these topological differences, HS2SD provides a more nuanced understanding of the relationships between two hyperbolic sets. Empirical evaluation on entity matching, standard image classification, and few-shot image classification demonstrates that our distance measure outperforms existing methods by effectively modeling the hierarchical and complex relationships inherent in hyperbolic sets.
中文摘要:提出的双曲集合间距离(HS2SD)通过整合双曲集合的全局和局部结构信息,在实体匹配和图像分类任务中优于现有方法,能更有效地建模层次化关系。
English Summary: The proposed hyperbolic set-to-set distance (HS2SD) captures both global and local structural information between hyperbolic sets, outperforming existing methods in tasks like entity matching and image classification by better modeling hierarchical relationships.
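The global-structure term compares Einstein midpoints via geodesic distance. A minimal sketch in the Klein model, where the Einstein midpoint has a closed form via Lorentz factors, is below; the local topological term built from Thue-Morse sequences of degree and adjacency matrices is omitted.

```python
import numpy as np

def einstein_midpoint(X):
    """Einstein midpoint of a point set in the Klein ball (||x|| < 1):
    a Lorentz-factor-weighted average."""
    gamma = 1.0 / np.sqrt(1.0 - np.sum(X * X, axis=-1))  # Lorentz factors
    return (gamma[:, None] * X).sum(0) / gamma.sum()

def klein_dist(x, y):
    """Geodesic distance between two points of the Klein model."""
    num = 1.0 - np.dot(x, y)
    den = np.sqrt((1.0 - np.dot(x, x)) * (1.0 - np.dot(y, y)))
    return np.arccosh(num / den)

A = np.array([[0.1, 0.0], [0.2, 0.1], [0.0, -0.1]])
B = np.array([[-0.3, 0.2], [-0.1, 0.4]])
# Global term of a set-to-set distance: midpoint-to-midpoint geodesic.
print(float(klein_dist(einstein_midpoint(A), einstein_midpoint(B))))
```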
Authors:Ce Li, Xiaofan Liu, Zhiyan Song, Ce Chi, Chen Zhao, Jingjing Yang, Zhendong Wang, Kexin Yang, Boshen Shi, Xing Wang, Chao Deng, Junlan Feng
Abstract:
The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One such challenge is the lack of an effective evaluation benchmark that fairly reflects the performance of LLMs across broad table reasoning abilities. In this paper, we fill this gap by presenting a comprehensive table reasoning evaluation benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities, for a total of 26 sub-tasks. We construct a high-quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes, TCoT, PoT and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this framework and prove its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing complex, real-world table-related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on huggingface.co/datasets/JT-LM/JIUTIAN-TReB and the framework on github.com/JT-LM/jiutian-treb.
Chinese: 本文提出了TReB这一全面评估大语言模型表格推理能力的基准,涵盖26个子任务,尽管测试了20多个先进模型,实验结果表明现有模型在处理复杂现实表格任务方面仍有显著提升空间。
English: This paper introduces TReB, a comprehensive benchmark for evaluating large language models' table reasoning abilities across 26 sub-tasks, revealing significant room for improvement despite testing over 20 state-of-the-art models.
Authors:Haoming Chen, Lichen Yuan, TianFang Sun, Jingyu Gong, Xin Tan, Zhizhong Zhang, Yuan Xie
Abstract:
3D semantic occupancy prediction was previously considered to require precise geometric relationships to enable effective training. However, in complex indoor environments, the large-scale and widespread collection of data, along with the necessity for fine-grained annotations, becomes impractical due to the complexity of data acquisition setups and privacy concerns. In this paper, we demonstrate that 3D spatially-accurate training can be achieved using only indoor Internet data, without the need for any pre-knowledge of intrinsic or extrinsic camera parameters. In our framework, we collect a web dataset, YouTube-Occ, which comprises house tour videos from YouTube, providing abundant real house scenes for 3D representation learning. Building on this web dataset, we establish a fully self-supervised model that leverages accessible 2D prior knowledge to reach powerful 3D indoor perception. Specifically, we harness the advantages of the prosperous vision foundation models, distilling the 2D region-level knowledge into the occupancy network by grouping similar pixels into superpixels. Experimental results show that our method achieves state-of-the-art zero-shot performance on two popular benchmarks (NYUv2 and OccScanNet).
中文: 本文提出了一种仅利用网络室内视频的自监督方法,无需相机参数即可实现精确的3D语义占据预测,并在多个基准测试中展现了领先的零样本性能。
English: This paper introduces a self-supervised method using only web-sourced indoor videos to achieve accurate 3D semantic occupancy prediction, eliminating the need for camera parameters and demonstrating state-of-the-art zero-shot performance on benchmarks.
Authors:Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
Abstract:
This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as ``dark data,'' such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance's impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed-loop framework by dynamically adapting both data selection and data cleansing processes to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.
中文: 本文提出TTSOps框架,通过动态选择数据清洗方法和优化语料库与模型的交互,从嘈杂的网络规模"暗数据"中自动构建多说话人语音合成系统,在语音自然度和说话人多样性方面均优于传统方法。
English: This paper introduces TTSOps, an automated closed-loop framework that constructs multi-speaker TTS systems from noisy web-scale "dark data" by dynamically selecting data cleansing methods and optimizing corpus-model interactions, outperforming conventional approaches in speech naturalness and speaker diversity.
Authors:Jiale Xu, Rui Zhang, Yi Xiong, Cong Guo, Zihan Liu, Yangjie Zhou, Weiming Hu, Hao Wu, Changxu Shao, Ziqing Wang, Yongjie Yuan, Junping Zhao, Minyi Guo, Jingwen Leng
Abstract:
Large Language Models are increasingly being deployed in datacenters. Serving these models requires careful memory management, as their memory usage includes static weights, dynamic activations, and key-value caches. While static weights are constant and predictable, dynamic components such as activations and KV caches change frequently during runtime, presenting significant challenges for efficient memory management. Modern LLM serving systems typically handle runtime memory and KV caches at distinct abstraction levels: runtime memory management relies on static tensor abstractions, whereas KV caches utilize a page table-based virtualization layer built on top of the tensor abstraction. This virtualization dynamically manages KV caches to mitigate memory fragmentation. However, this dual-level approach fundamentally isolates runtime memory and KV cache management, resulting in suboptimal memory utilization under dynamic workloads, which can lead to a nearly 20% drop in throughput.
To address these limitations, we propose eLLM, an elastic memory management framework inspired by the classical memory ballooning mechanism in operating systems. The core components of eLLM include: (1) Virtual Tensor Abstraction, which decouples the virtual address space of tensors from the physical GPU memory, creating a unified and flexible memory pool; (2) an Elastic Memory Mechanism that dynamically adjusts memory allocation through runtime memory inflation and deflation, leveraging CPU memory as an extensible buffer; and (3) a Lightweight Scheduling Strategy employing SLO-aware policies to optimize memory utilization and effectively balance performance trade-offs under stringent SLO constraints. Comprehensive evaluations demonstrate that eLLM significantly outperforms state-of-the-art systems, achieving up to 2.32x higher decoding throughput and supporting 3x larger batch sizes for 128K-token inputs.
中文: 大型语言模型因静态权重与动态组件的分离管理面临内存挑战,而eLLM通过统一弹性框架显著提升了吞吐量和批处理能力。
English: Large Language Models face memory management challenges due to isolated handling of static weights and dynamic components, which eLLM addresses through a unified elastic framework that boosts throughput and batch capacity.
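The ballooning analogy can be made concrete with a toy accounting model. The sketch below tracks a single page budget shared by runtime tensors and the KV cache, deflating the KV balloon when activations need room; the real system operates on GPU virtual-to-physical page mappings and CPU buffers, which this deliberately omits.

```python
class ElasticPool:
    """Toy ballooning pool: one page budget shared by activations and the
    KV cache. Only the inflate/deflate accounting idea is shown; eLLM
    itself remaps GPU pages and spills to CPU memory."""
    def __init__(self, total_pages):
        self.total = total_pages
        self.kv = 0          # pages currently lent to the KV cache
        self.runtime = 0     # pages reserved for activations

    def reserve_runtime(self, pages):
        # Assumes the request fits once the KV balloon is deflated.
        free = self.total - self.kv - self.runtime
        if pages > free:                     # deflate the KV balloon
            self.kv = max(0, self.kv - (pages - free))
        self.runtime += pages

    def grow_kv(self, pages):
        free = self.total - self.kv - self.runtime
        self.kv += min(pages, free)          # inflate only into free pages

pool = ElasticPool(total_pages=100)
pool.grow_kv(80)
pool.reserve_runtime(40)    # forces the KV balloon to shrink
print(pool.kv, pool.runtime)  # 60 40
```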
Authors:Jiayin Wang, Zhiquang Guo, Weizhi Ma, Min Zhang
Abstract:
As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.
中文: 本研究主张通过语义游戏评估大语言模型的测试时学习能力,结果表明虽然模型表现出可测量的学习能力,但其进步稳定性和速度均不及人类,揭示了模型作为学习机器潜力之外仍存在显著智能差距。
English: This study advocates for evaluating large language models' test-time learning ability using semantic games, revealing that while models show measurable learning, their progress is less stable and slower than humans, highlighting a significant intellectual gap despite their potential as learning machines.
Authors:Hui Wang, Yifan Yang, Shujie Liu, Jinyu Li, Lingwei Meng, Yanqing Liu, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin
Abstract:
Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models. Audio samples are available at: https://aka.ms/StreamMel.
中文: StreamMel是一种创新的单阶段流式文本转语音框架,通过建模连续梅尔频谱实现低延迟合成,在保持与离线系统相当质量的同时支持实时生成。
English: StreamMel is a novel single-stage streaming TTS framework that generates continuous mel-spectrograms with low latency, achieving quality comparable to offline systems while enabling real-time synthesis.
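A toy version of the interleaving idea: emit a fixed-size chunk of mel frames after each text token, so synthesis can begin before the full text has arrived. The frames-per-token ratio and the schedule here are assumptions for illustration, not StreamMel's actual configuration.

```python
def interleave(text_tokens, mel_frames, frames_per_token=4):
    """Hypothetical interleaving schedule: after each text token, emit a
    fixed-size chunk of mel frames, enabling low-latency streaming."""
    seq, f = [], 0
    for tok in text_tokens:
        seq.append(("text", tok))
        for _ in range(frames_per_token):
            if f < len(mel_frames):
                seq.append(("mel", mel_frames[f])); f += 1
    seq.extend(("mel", m) for m in mel_frames[f:])  # flush any remainder
    return seq

print([kind for kind, _ in interleave(["h", "i"], list(range(5)))])
```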
Authors:Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya
Abstract:
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC's strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their cross-lingual performance.
Chinese: 本文发现大语言模型中间层存在内在表征对齐,并提出推理时语言控制方法,通过潜在注入实现精确的跨语言控制,在保持语义完整性的同时有效缓解语言混淆问题。
English: This paper identifies inherent representation alignment in LLMs' middle layers and introduces Inference-Time Language Control (ITLC), a latent injection method enabling precise cross-lingual control while preserving semantic integrity and mitigating language confusion.
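Latent injection of the kind described can be prototyped with a forward hook. The sketch below estimates a language direction as a mean difference of mid-layer hidden states over parallel data (a common recipe; the paper's estimator may differ) and adds it at a hypothetical middle layer. The model variable, layer index, and scale are placeholders, not the paper's settings.

```python
import torch

def language_direction(h_src, h_tgt):
    """Estimate a language-specific direction from paired hidden states
    of parallel text in two languages (assumed estimator)."""
    return (h_tgt - h_src).mean(dim=0)

def make_injection_hook(direction, alpha=1.0):
    """Forward hook that adds the scaled direction to a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (hypothetical model and middle-layer index):
# handle = model.model.layers[16].register_forward_hook(
#     make_injection_hook(direction, alpha=0.8))
# ... generate ...
# handle.remove()

d = language_direction(torch.randn(32, 4096), torch.randn(32, 4096))
print(d.shape)
```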
Authors:Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya
Abstract:
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC's strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their monolingual and cross-lingual performance.
Chinese: 本文发现大语言模型中间层存在内在表征对齐,并提出推理时语言控制方法,通过潜在注入实现精确的跨语言控制,在保持语义完整性的同时有效缓解语言混淆问题。
English: This paper identifies inherent representation alignment in LLMs' middle layers and introduces Inference-Time Language Control (ITLC), a latent injection method enabling precise cross-lingual control while preserving semantic integrity and mitigating language confusion.
Authors:Xintong Wang, Jingheng Pan, Yixiao Liu, Xiaohu Zhao, Chenyang Lyu, Minghao Wu, Chris Biemann, Longyue Wang, Linlong Xu, Weihua Luo, Kaifu Zhang
Abstract:
Vision-Language Translation (VLT) is a challenging task that requires accurately recognizing multilingual text embedded in images and translating it into the target language with the support of visual context. While recent Large Vision-Language Models (LVLMs) have demonstrated strong multilingual and visual understanding capabilities, there is a lack of systematic evaluation and understanding of their performance on VLT. In this work, we present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics. (1) We identify critical limitations in existing datasets, particularly in semantic and cultural fidelity, and introduce AibTrans -- a multilingual, parallel, human-verified dataset with OCR-corrected annotations. (2) We benchmark 11 commercial LVLMs/LLMs and 6 state-of-the-art open-source models across end-to-end and cascaded architectures, revealing their OCR dependency and contrasting generation versus reasoning behaviors. (3) We propose Density-Aware Evaluation to address metric reliability issues under varying contextual complexity, introducing the DA Score as a more robust measure of translation quality. Building upon these findings, we establish a new evaluation benchmark for VLT. Notably, we observe that fine-tuning LVLMs on high-resource language pairs degrades cross-lingual performance, and we propose a balanced multilingual fine-tuning strategy that effectively adapts LVLMs to VLT without sacrificing their generalization ability.
Chinese: 本研究通过引入人工验证数据集解决数据质量问题,对多种模型进行基准测试揭示其OCR依赖性,并提出改进的评估指标与微调策略,系统评估了视觉语言翻译任务并提升跨语言性能。
English: This study systematically evaluates Vision-Language Translation (VLT) by addressing data quality issues with a new human-verified dataset, benchmarking diverse models to reveal their OCR dependency, and proposing improved evaluation metrics and fine-tuning strategies for enhanced cross-lingual performance.
Authors:Rajeev Yasarla, Shizhong Han, Hong Cai, Fatih Porikli
Abstract:
Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. In order to encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to better train the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.
中文:DySS提出了一种基于摄像头的3D物体检测新方法,通过状态空间学习和动态查询机制,在自动驾驶感知任务中实现了卓越性能与实时效率。
English: DySS introduces a novel camera-based 3D object detection method using state-space learning and dynamic queries, achieving superior performance and real-time efficiency in autonomous driving perception.
Authors:Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Litian Liu, Shweta Mahajan, Apratim Bhattacharyya, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli
Abstract:
End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.
中文摘要:RoCA是一种新颖的端到端自动驾驶框架,通过高斯过程建模驾驶令牌的联合概率分布,无需额外推理计算即可显著提升跨领域泛化能力和适应性能。
English Summary: RoCA is a novel framework that enhances end-to-end autonomous driving by modeling joint probabilistic distributions of driving tokens with Gaussian processes, improving cross-domain generalization and adaptation without extra inference costs.
Authors:Haowen Wang, Xiaoping Yuan, Zhao Jin, Zhen Zhao, Zhengping Che, Yousong Xue, Jin Tian, Yakun Huang, Jian Tang
Abstract:
Articulated objects are ubiquitous in everyday life, and accurate 3D representations of their geometry and motion are critical for numerous applications. However, in the absence of human annotation, existing approaches still struggle to build a unified representation for objects that contain multiple movable parts. We introduce DeGSS, a unified framework that encodes articulated objects as deformable 3D Gaussian fields, embedding geometry, appearance, and motion in one compact representation. Each interaction state is modeled as a smooth deformation of a shared field, and the resulting deformation trajectories guide a progressive coarse-to-fine part segmentation that identifies distinct rigid components, all in an unsupervised manner. The refined field provides a spatially continuous, fully decoupled description of every part, supporting part-level reconstruction and precise modeling of their kinematic relationships. To evaluate generalization and realism, we enlarge the synthetic PartNet-Mobility benchmark and release RS-Art, a real-to-sim dataset that pairs RGB captures with accurately reverse-engineered 3D models. Extensive experiments demonstrate that our method outperforms existing methods in both accuracy and stability.
中文: DeGSS是一个统一框架,将铰接物体建模为可变形3D高斯场,将几何、外观和运动集成在一个紧凑表示中,实现无监督部件分割,并在准确性和稳定性上超越现有方法。
English: DeGSS is a unified framework that models articulated objects as deformable 3D Gaussian fields, integrating geometry, appearance, and motion into a single compact representation, enabling unsupervised part segmentation and outperforming existing methods in accuracy and stability.
Authors:Yunxiao Shi, Yinhao Zhu, Shizhong Han, Jisoo Jeong, Amin Ansari, Hong Cai, Fatih Porikli
Abstract:
Occupancy prediction infers fine-grained 3D geometry and semantics from camera images of the surrounding environment, making it a critical perception task for autonomous driving. Existing methods either adopt dense grids as scene representation, which is difficult to scale to high resolution, or learn the entire scene using a single set of sparse queries, which is insufficient to handle the various object characteristics. In this paper, we present ODG, a hierarchical dual sparse Gaussian representation to effectively capture complex scene dynamics. Building upon the observation that driving scenes can be universally decomposed into static and dynamic counterparts, we define dual Gaussian queries to better model the diverse scene objects. We utilize a hierarchical Gaussian transformer to predict the occupied voxel centers and semantic classes along with the Gaussian parameters. Leveraging the real-time rendering capability of 3D Gaussian Splatting, we also impose rendering supervision with available depth and semantic map annotations injecting pixel-level alignment to boost occupancy learning. Extensive experiments on the Occ3D-nuScenes and Occ3D-Waymo benchmarks demonstrate our proposed method sets new state-of-the-art results while maintaining low inference cost.
中文: 本文提出的ODG采用分层双稀疏高斯表示,将驾驶场景分解为静态和动态部分来有效建模复杂场景,在保持低推理成本的同时于多个基准测试中取得了最优性能。
English: The paper introduces ODG, a hierarchical dual sparse Gaussian representation that effectively models complex driving scenes by decomposing them into static and dynamic components, achieving state-of-the-art performance on benchmarks with low inference cost.
Authors:Qijian Tian, Xin Tan, Jingyu Gong, Yuan Xie, Lizhuang Ma
Abstract:
We propose a feed-forward Gaussian Splatting model that unifies 3D scene and semantic field reconstruction. Combining 3D scenes with semantic fields facilitates the perception and understanding of the surrounding environment. However, key challenges include embedding semantics into 3D representations, achieving generalizable real-time reconstruction, and ensuring practical applicability by using only images as input without camera parameters or ground truth depth. To this end, we propose UniForward, a feed-forward model to predict 3D Gaussians with anisotropic semantic features from only uncalibrated and unposed sparse-view images. To enable the unified representation of the 3D scene and semantic field, we embed semantic features into 3D Gaussians and predict them through a dual-branch decoupled decoder. During training, we propose a loss-guided view sampler to sample views from easy to hard, eliminating the need for ground truth depth or masks required by previous methods and stabilizing the training process. The whole model can be trained end-to-end using a photometric loss and a distillation loss that leverages semantic features from a pre-trained 2D semantic model. At the inference stage, our UniForward can reconstruct 3D scenes and the corresponding semantic fields in real time from only sparse-view images. The reconstructed 3D scenes achieve high-quality rendering, and the reconstructed 3D semantic field enables the rendering of view-consistent semantic features from arbitrary views, which can be further decoded into dense segmentation masks in an open-vocabulary manner. Experiments on novel view synthesis and novel view segmentation demonstrate that our method achieves state-of-the-art performances for unifying 3D scene and semantic field reconstruction.
Chinese: 我们提出了UniForward前馈模型,仅通过稀疏视角图像即可实时重建3D场景和语义场,无需相机参数或深度真值,在统一三维场景与语义场重建方面实现了最先进的性能。
English: We introduce UniForward, a feed-forward model that reconstructs 3D scenes and semantic fields in real time from sparse-view images, achieving state-of-the-art performance without requiring camera parameters or depth data.
Authors:Peilin Yu, Yuwei Wu, Zhi Gao, Xiaomeng Fan, Shuo Yang, Yunde Jia
Abstract:
Feature augmentation generates novel samples in the feature space, providing an effective way to enhance the generalization ability of learning algorithms with hyperbolic geometry. Most hyperbolic feature augmentation is confined to closed environments, assuming the number of classes is fixed (\emph{i.e.}, seen classes) and generating features only for these classes. In this paper, we propose a hyperbolic dual feature augmentation method for open environments, which augments features for both seen and unseen classes in the hyperbolic space. To obtain a more precise approximation of the real data distribution for efficient training, (1) we adopt a neural ordinary differential equation module, enhanced by meta-learning, to estimate the feature distributions of both seen and unseen classes; (2) we then introduce a regularizer to preserve the latent hierarchical structures of data in the hyperbolic space; (3) we also derive an upper bound for the hyperbolic dual augmentation loss, allowing us to train a hyperbolic model using infinite augmentations for seen and unseen classes. Extensive experiments on five open-environment tasks: class-incremental learning, few-shot open-set recognition, few-shot learning, zero-shot learning, and general image classification, demonstrate that our method effectively enhances the performance of hyperbolic algorithms in open environments.
中文: 本文提出了一种双曲双重特征增强方法,在开放环境中对可见和未见类进行特征增强,通过神经常微分方程和元学习提升双曲模型在多种任务中的性能。
English: This paper introduces a hyperbolic dual feature augmentation method that enhances both seen and unseen classes in open environments, utilizing neural ODEs and meta-learning to improve hyperbolic model performance across various tasks.
Authors:Yifei Su, Ning Liu, Dong Chen, Zhen Zhao, Kun Wu, Meng Li, Zhiyuan Xu, Zhengping Che, Jian Tang
Abstract:
Generative modeling-based visuomotor policies have been widely adopted in robotic manipulation, attributed to their ability to model multimodal action distributions. However, the high inference cost of multi-step sampling limits their applicability in real-time robotic systems. To address this issue, existing approaches accelerate the sampling process in generative modeling-based visuomotor policies by adapting acceleration techniques originally developed for image generation. Despite this progress, a major distinction remains: image generation typically involves producing independent samples without temporal dependencies, whereas robotic manipulation involves generating time-series action trajectories that require continuity and temporal coherence. To effectively exploit temporal information in robotic manipulation, we propose FreqPolicy, a novel approach that first imposes frequency consistency constraints on flow-based visuomotor policies. Our work enables the action model to capture temporal structure effectively while supporting efficient, high-quality one-step action generation. We introduce a frequency consistency constraint that enforces alignment of frequency-domain action features across different timesteps along the flow, thereby promoting convergence of one-step action generation toward the target distribution. In addition, we design an adaptive consistency loss to capture structural temporal variations inherent in robotic manipulation tasks. We assess FreqPolicy on 53 tasks across 3 simulation benchmarks, proving its superiority over existing one-step action generators. We further integrate FreqPolicy into the vision-language-action (VLA) model and achieve acceleration without performance degradation on the 40 tasks of Libero. We also demonstrate its efficiency and effectiveness in real-world robotic scenarios, reaching an inference frequency of 93.5 Hz. The code will be publicly available.
中文: FreqPolicy通过引入频率一致性约束到基于流的视觉运动策略中,在保持机器人操作任务时间连贯性的同时,实现了高效的单步动作生成。
English: FreqPolicy introduces frequency consistency constraints to flow-based visuomotor policies, enabling efficient one-step action generation while maintaining temporal coherence in robotic manipulation tasks.
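A minimal stand-in for the frequency consistency constraint: compare rFFT magnitudes of two action chunks along the time axis. The paper's actual loss aligns frequency-domain features across flow timesteps and adds adaptive weighting, which this sketch does not model.

```python
import torch

def frequency_consistency_loss(actions_a, actions_b):
    """Penalize mismatch between the frequency-domain content of two
    action trajectories, e.g. predictions taken at different flow
    timesteps. Simplified: magnitude spectra over the time axis."""
    # actions_*: (batch, horizon, action_dim) time-series action chunks
    spec_a = torch.fft.rfft(actions_a, dim=1).abs()
    spec_b = torch.fft.rfft(actions_b, dim=1).abs()
    return torch.mean((spec_a - spec_b) ** 2)

a = torch.randn(8, 16, 7)           # e.g. 16-step chunks of 7-DoF actions
b = a + 0.05 * torch.randn_like(a)  # a slightly perturbed counterpart
print(float(frequency_consistency_loss(a, b)))
```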
Authors:Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, Jianzong Wang
Abstract:
One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods cannot take both effectiveness and efficiency into account. In this paper, we propose MoQAE, a novel mixed-precision quantization method via mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture of experts (MoE) method to select the optimal configuration. To avoid the inefficiency caused by inputting tokens one by one into the router in the traditional MoE method, we input the tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process to train MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
中文:提出的MoQAE方法通过混合精度量化与分块路由机制,结合轻量化微调过程,有效解决了大语言模型中键值缓存的内存效率问题,在精度和效率方面均优于现有量化方法。
English: The proposed MoQAE method addresses the memory inefficiency of KV cache in large language models by introducing a mixed-precision quantization approach with chunk-based routing and lightweight fine-tuning, achieving superior performance in both efficiency and accuracy compared to existing methods.
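The chunk-wise routing idea can be sketched as follows: a small router pools a chunk of tokens, picks one bit-width "expert" for the whole chunk (avoiding per-token routing overhead), and the KV chunk is fake-quantized at that precision. Dimensions, bit choices, and the quantizer below are illustrative assumptions.

```python
import torch

class ChunkRouter(torch.nn.Module):
    """Toy quantization-aware router: scores a pooled chunk of tokens
    and selects one bit-width configuration per chunk."""
    def __init__(self, dim, bit_choices=(2, 4, 8)):
        super().__init__()
        self.bits = bit_choices
        self.score = torch.nn.Linear(dim, len(bit_choices))

    def forward(self, tokens):                   # tokens: (chunk_len, dim)
        logits = self.score(tokens.mean(dim=0))  # pool the whole chunk
        return self.bits[int(logits.argmax())]

def fake_quant(x, bits):
    """Uniform symmetric fake-quantization of a KV tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

router = ChunkRouter(dim=64)
kv_chunk = torch.randn(128, 64)
bits = router(kv_chunk)
print(bits, float((fake_quant(kv_chunk, bits) - kv_chunk).abs().mean()))
```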
Authors:Yunxiao Shi, Hong Cai, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Amin Ansari, Fatih Porikli
Abstract:
3D occupancy provides fine-grained 3D geometry and semantics for scene understanding, which is critical for autonomous driving. Most existing methods, however, carry high compute costs, requiring dense 3D feature volumes and cross-attention to effectively aggregate information. More recent works have adopted Bird's Eye View (BEV) or sparse points as scene representations with much reduced cost, but both still suffer from their respective shortcomings. More concretely, BEV struggles with small objects, which often experience significant information loss after being projected to the ground plane. On the other hand, points can flexibly model small objects in 3D but are inefficient at capturing flat surfaces or large objects. To address these challenges, in this paper, we present a novel 3D occupancy prediction approach, BePo, which combines BEV and sparse points based representations. We propose a dual-branch design: a query-based sparse points branch and a BEV branch. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which enriches the weakened signals of difficult objects on the BEV plane. The outputs of both branches are finally fused to generate the predicted 3D occupancy. We conduct extensive experiments on the Occ3D-nuScenes and Occ3D-Waymo benchmarks that demonstrate the superiority of our proposed BePo. Moreover, BePo also delivers competitive inference speed when compared to the latest efficient approaches.
中文: 本文提出BePo方法,通过结合鸟瞰图和稀疏点云的双分支设计,在提升小物体检测能力的同时保持高效推理速度,在多个基准测试中优于现有方法。
English: The paper introduces BePo, a dual-branch 3D occupancy prediction method that integrates Bird's Eye View and sparse points representations to enhance small object detection while maintaining efficient inference speed, outperforming existing approaches on benchmarks.
Authors:Shiying Duan, Pei Ren, Nanxiang Jiang, Zhengping Che, Jian Tang, Yifan Sun, Zhaoxin Fan, Wenjun Wu
Abstract:
Dual-arm robots play a crucial role in improving efficiency and flexibility in complex multitasking scenarios. While existing methods have achieved promising results in task planning, they often fail to fully optimize task parallelism, limiting the potential of dual-arm collaboration. To address this issue, we propose RoboPARA, a novel large language model (LLM)-driven framework for dual-arm task parallelism planning. RoboPARA employs a two-stage process: (1) Dependency Graph-based Planning Candidates Generation, which constructs directed acyclic graphs (DAGs) to model task dependencies and eliminate redundancy, and (2) Graph Re-Traversal-based Dual-Arm Parallel Planning, which optimizes DAG traversal to maximize parallelism while maintaining task coherence. In addition, we introduce the Cross-Scenario Dual-Arm Parallel Task dataset (X-DAPT dataset), the first dataset specifically designed to evaluate dual-arm task parallelism across diverse scenarios and difficulty levels. Extensive experiments on the X-DAPT dataset demonstrate that RoboPARA significantly outperforms existing methods, achieving higher efficiency and reliability, particularly in complex task combinations. The code and dataset will be released upon acceptance.
Chinese: RoboPARA是一种新颖的基于大语言模型的双臂任务并行规划框架,通过依赖图规划和图重遍历两阶段方法优化双臂协作效率,在X-DAPT数据集上的实验表明其显著优于现有方法,尤其在复杂任务组合中表现优异。
English: RoboPARA is a novel LLM-driven framework that enhances dual-arm robot task parallelism through a two-stage process of dependency graph planning and optimized graph traversal, significantly outperforming existing methods in efficiency and reliability as demonstrated on the new X-DAPT dataset.
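A toy version of the DAG-based stage: group tasks by dependency level (tasks within a level are mutually independent) and pack up to two of them per time slot, one per arm. RoboPARA's graph re-traversal additionally optimizes for task coherence, which this sketch ignores.

```python
from collections import defaultdict, deque

def parallel_schedule(tasks, deps, n_arms=2):
    """Level-by-level traversal of a task DAG: tasks in the same
    dependency level can run concurrently on up to n_arms arms."""
    indeg = {t: 0 for t in tasks}
    out = defaultdict(list)
    for a, b in deps:                    # edge a -> b: a must precede b
        out[a].append(b); indeg[b] += 1
    frontier = deque(t for t in tasks if indeg[t] == 0)
    schedule = []
    while frontier:
        level = [frontier.popleft() for _ in range(len(frontier))]
        for i in range(0, len(level), n_arms):   # n_arms tasks per slot
            schedule.append(level[i:i + n_arms])
        for t in level:
            for nxt in out[t]:
                indeg[nxt] -= 1
                if indeg[nxt] == 0:
                    frontier.append(nxt)
    return schedule

tasks = ["grasp_cup", "grasp_pot", "pour", "place_cup"]
deps = [("grasp_cup", "pour"), ("grasp_pot", "pour"), ("pour", "place_cup")]
print(parallel_schedule(tasks, deps))
# [['grasp_cup', 'grasp_pot'], ['pour'], ['place_cup']]
```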
Authors:Sirui Lu, Zhijing Jin, Terry Jingchen Zhang, Pavel Kos, J. Ignacio Cirac, Bernhard Schölkopf
Abstract:
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and toolbox. We analyze current LLM capabilities for physics -- from mathematical reasoning to code generation -- identifying critical gaps in physical intuition, constraint satisfaction, and reliable reasoning. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments. Realizing this vision requires addressing fundamental challenges: ensuring physical consistency, and developing robust verification methods. We call for collaborative efforts between physics and AI communities to help advance scientific discovery in physics.
中文摘要:大型语言模型在与领域工具结合后有望推动物理学研究,但需弥补物理直觉和推理一致性方面的不足才能实现可靠的科学应用。
English Summary: LLM agents show promise for accelerating physics research through integration with domain tools, but require addressing gaps in physical intuition and reasoning consistency to achieve reliable scientific applications.
Authors:Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano, Radu Timofte, Alex Costanzino, Matteo Poggi, Samuele Salti, Stefano Mattoccia, Zhe Zhang, Yang Yang, Wu Chen, Anlong Ming, Mingshuai Zhao, Mengying Yu, Shida Gao, Xiangfeng Wang, Feng Xue, Jun Shi, Yong Yang, Yong A, Yixiang Jin, Dingzhe Li, Aryan Shukla, Liam Frija-Altarac, Matthew Toews, Hui Geng, Tianjiao Wan, Zijian Gao, Qisheng Xu, Kele Xu, Zijian Zang, Jameer Babu Pinjari, Kuldeep Purohit, Mykola Lavreniuk, Jing Cao, Shenyi Li, Kui Jiang, Junjun Jiang, Yong Huang
Abstract:
This paper reports on the NTIRE 2025 challenge on HR Depth From images of Specular and Transparent surfaces, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2025. This challenge aims to advance research on depth estimation, specifically addressing two of the main open issues in the field: high resolution and non-Lambertian surfaces. The challenge proposes two tracks on stereo and single-image depth estimation, attracting about 177 registered participants. In the final testing stage, 4 participating teams per track submitted their models and fact sheets.
中文: NTIRE 2025挑战赛针对高分辨率镜面和透明表面的深度估计,设立了立体和单图像两个赛道,吸引了177名参与者,最终有8份提交进入测试阶段。
English: The NTIRE 2025 challenge focuses on high-resolution depth estimation from specular and transparent surfaces through stereo and single-image tracks, attracting 177 participants with 8 final submissions.
Authors:Yu Feng, Weikai Lin, Yuge Cheng, Zihan Liu, Jingwen Leng, Minyi Guo, Chen Chen, Shixuan Sun, Yuhao Zhu
Abstract:
3D Gaussian Splatting (3DGS) has vastly advanced the pace of neural rendering, but it remains computationally demanding on today's mobile SoCs. To address this challenge, we propose Lumina, a hardware-algorithm co-designed system, which integrates two principal optimizations: a novel algorithm, S^2, and a radiance caching mechanism, RC, to improve the efficiency of neural rendering. The S^2 algorithm exploits temporal coherence in rendering to reduce the computational overhead, while RC leverages the color integration process of 3DGS to decrease the frequency of intensive rasterization computations. Coupled with these techniques, we propose an accelerator architecture, LuminCore, to further accelerate cache lookup and address the fundamental inefficiencies in rasterization. We show that Lumina achieves 4.5x speedup and 5.3x energy reduction against a mobile Volta GPU, with a marginal quality loss (< 0.2 dB peak signal-to-noise ratio reduction) across synthetic and real-world datasets.
Chinese: Lumina是一种硬件算法协同设计的系统,通过S²算法和辐射缓存优化3D高斯泼溅,在移动SoC上实现了4.5倍加速和5.3倍能耗降低,且质量损失极小。
English: Lumina is a hardware-algorithm co-designed system that introduces the S² algorithm and radiance caching to accelerate 3D Gaussian Splatting, achieving a 4.5x speedup and 5.3x energy reduction on mobile SoCs with minimal quality loss.
Authors:Zhuoxuan Cai, Jian Zhang, Xinbin Yuan, Peng-Tao Jiang, Wenxiang Chen, Bowen Tang, Lujian Yao, Qiyuan Wang, Jinwen Chen, Bo Li
Abstract:
Recent studies demonstrate that multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. However, existing approaches typically treat quality scoring and reasoning descriptions as separate tasks with disjoint optimization objectives, leading to a trade-off: models adept at quality reasoning descriptions struggle with precise score regression, while score-focused models lack interpretability. This limitation hinders the full potential of MLLMs in visual quality assessment, where accuracy and interpretability should be mutually reinforcing. To address this, we propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. Specifically, in the first stage, we distill high-quality data from a teacher model through expert-designed prompts, initializing reasoning capabilities via cross-entropy loss supervision. In the second stage, we introduce a novel reward with Group Relative Policy Optimization (GRPO) to jointly optimize scoring accuracy and reasoning consistency. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder. Extensive experiments show that Q-Ponder achieves state-of-the-art (SOTA) performance on quality score regression benchmarks, delivering up to 6.5% higher SRCC on cross-domain datasets. Furthermore, Q-Ponder significantly outperforms description-based SOTA models, including its teacher model Qwen-2.5-VL-72B, particularly in description accuracy and reasonableness, demonstrating the generalization potential over diverse tasks.
Chinese: 本研究提出一个统一的双阶段训练框架,通过联合优化视觉质量评分与推理能力,使多模态大语言模型在质量评分准确性和可解释性评估方面均达到最优性能。
English: This study introduces a unified two-stage training framework for multimodal large language models (MLLMs) that jointly optimizes visual quality scoring and reasoning, achieving state-of-the-art performance in both score regression and interpretable assessments.
Authors:Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang
Abstract:
Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed as SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency observed under high-resolution VR using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.
Chinese: SeedVR2是一种基于扩散的单步视频修复模型,通过自适应窗口注意力机制和优化的训练损失函数,能够高效处理高分辨率视频并实现优越性能。
English: SeedVR2 is a one-step diffusion-based video restoration model that introduces adaptive window attention and enhanced training losses to efficiently handle high-resolution videos while achieving competitive performance.
Authors:Haosong Liu, Yuge Cheng, Zihan Liu, Aiyue Chen, Jing Lin, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, Minyi Guo
Abstract:
Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability.
We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).
中文:Astraea框架通过引入轻量级令牌选择机制和内存高效的稀疏注意力策略,在保持生成质量的同时显著提升了视频扩散变换器的推理速度。
English: Astraea is a framework that optimizes video diffusion transformers by introducing a token selection mechanism and sparse attention strategy, achieving significant speed improvements with minimal quality loss.
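The evolutionary search over per-timestep token budgets can be illustrated with a toy optimizer. The fitness function at the bottom is a made-up proxy (favoring more tokens at early denoising steps); in the actual framework the score would come from generation quality under the compute target.

```python
import random

def evolve_budgets(n_steps, total_budget, fitness, pop=20, gens=30, seed=0):
    """Toy evolutionary search for a per-timestep token budget summing
    to a fixed total. `fitness` scores a budget vector, higher = better."""
    rng = random.Random(seed)

    def random_budget():
        w = [rng.random() for _ in range(n_steps)]
        s = sum(w)
        return [total_budget * x / s for x in w]

    def mutate(b):
        b = b[:]
        i, j = rng.randrange(n_steps), rng.randrange(n_steps)
        delta = rng.uniform(0, b[i] * 0.2)
        b[i] -= delta; b[j] += delta         # keep the total budget fixed
        return b

    population = [random_budget() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]     # keep the fitter half
        population = parents + [mutate(rng.choice(parents)) for _ in parents]
    return max(population, key=fitness)

# Hypothetical fitness: later denoising steps tolerate fewer tokens.
best = evolve_budgets(8, 1.0, lambda b: -sum((x - (1 - i / 8) / 4.5) ** 2
                                             for i, x in enumerate(b)))
print([round(x, 2) for x in best])
```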
Authors:Haosong Liu, Yuge Cheng, Wenxuan Miao, Zihan Liu, Aiyue Chen, Jing Lin, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, Minyi Guo
Abstract:
Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their high compute demands pose a major challenge for practical deployment. While studies propose acceleration methods to reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target. At its core, Astraea proposes a lightweight token selection mechanism and a memory-efficient, GPU-friendly sparse attention strategy, enabling linear savings on execution time with minimal impact on generation quality. Meanwhile, to determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, Astraea achieves up to 2.4$\times$ inference speedup on a single GPU with great scalability (up to 13.2$\times$ speedup on 8 GPUs) while delivering up to 10~dB higher video quality than state-of-the-art methods ($<$0.5\% loss on VBench compared to baselines).
中文:Astraea框架通过引入轻量级令牌选择机制和内存高效的稀疏注意力策略,在保持生成质量的同时显著提升了视频扩散变换器的推理速度。
English: Astraea is a framework that optimizes video diffusion transformers by introducing a token selection mechanism and sparse attention strategy, achieving significant speed improvements with minimal quality loss.
Authors:Zhao Jin, Zhengping Che, Zhen Zhao, Kun Wu, Yuheng Zhang, Yinuo Zhao, Zehui Liu, Qiang Zhang, Xiaozhu Ju, Jing Tian, Yousong Xue, Jian Tang
Abstract:
Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models to master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved via fine-tuned dynamic parameters. Meanwhile, the dataset pioneers embedded modular interaction behaviors within assets and pixel-level affordance annotations. Feature-map visualization and optical motion capture are employed to quantitatively demonstrate ArtVIP's visual and physical fidelity, with its applicability validated across imitation learning and reinforcement learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research. Our project is at https://x-humanoid-artvip.github.io/ .
中文摘要:ArtVIP是一个高质量开源铰接物体数据集,通过逼真的视觉和物理特性弥合仿真与现实的差距,推动机器人学习研究。
English Summary: ArtVIP is an open-source dataset of high-quality articulated objects with realistic visuals and physics, designed to bridge the sim-to-real gap in robot learning.
Authors:Ziqi Jia, Anmin Wang, Xiaoyang Qu, Xiaowen Yang, Jianzong Wang
Abstract:
Previous continual learning setups for embodied intelligence focused on executing low-level actions based on human commands, neglecting the ability to learn high-level planning and multi-level knowledge. To address these issues, we propose the Hierarchical Embodied Continual Learning Setups (HEC) that divide the agent's continual learning process into two layers: high-level instructions and low-level actions, and define five embodied continual learning sub-setups. Building on these setups, we introduce the Task-aware Mixture of Incremental LoRA Experts (Task-aware MoILE) method. This approach achieves task recognition by clustering visual-text embeddings and uses both a task-level router and a token-level router to select the appropriate LoRA experts. To effectively address the issue of catastrophic forgetting, we apply Singular Value Decomposition (SVD) to the LoRA parameters obtained from prior tasks, preserving key components while orthogonally training the remaining parts. The experimental results show that our method stands out in reducing the forgetting of old tasks compared to other methods, effectively supporting agents in retaining prior knowledge while continuously learning new tasks.
中文摘要:提出的分层具身持续学习(HEC)框架通过将高层规划与底层动作分离,并采用任务感知的MoILE方法,利用双路由器和基于奇异值分解的参数保留技术,有效解决了传统方法中的灾难性遗忘问题。
English Summary: The proposed Hierarchical Embodied Continual Learning (HEC) framework addresses limitations in existing setups by separating high-level planning from low-level actions and introducing a Task-aware MoILE method that mitigates catastrophic forgetting through dual routers and SVD-based parameter preservation.
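The SVD-based preservation step above is worth a minimal sketch: the merged LoRA update from prior tasks is split into a preserved principal subspace and a residual, and new-task gradients are projected orthogonally to the preserved part. The split rank and the projection rule are assumptions based on the abstract, not Task-aware MoILE's released code.

```python
# Minimal sketch of SVD-based preservation of prior-task LoRA knowledge.
# `delta_w` stands in for a merged LoRA update from earlier tasks (assumed).
import torch

def split_lora_by_svd(delta_w: torch.Tensor, keep: int):
    """Split a LoRA update into a frozen principal part and a trainable
    residual, using singular value decomposition."""
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    frozen = U[:, :keep] @ torch.diag(S[:keep]) @ Vh[:keep, :]
    return frozen, U[:, :keep]   # principal subspace basis for projection

def project_out(grad: torch.Tensor, basis: torch.Tensor):
    """Remove gradient components that would disturb the preserved subspace,
    i.e., train the remaining parameters orthogonally to prior knowledge."""
    return grad - basis @ (basis.T @ grad)

# Toy usage: a 64x32 LoRA update from prior tasks.
delta_w = torch.randn(64, 32)
frozen, basis = split_lora_by_svd(delta_w, keep=8)
new_grad = torch.randn(64, 32)
safe_grad = project_out(new_grad, basis)
print(torch.allclose(basis.T @ safe_grad, torch.zeros(8, 32), atol=1e-5))  # True
```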
Authors:Shizhong Han, Hsin-Pai Cheng, Hong Cai, Jihad Masri, Soyeb Nagori, Fatih Porikli
Abstract:
Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms, and the proposed FALO can be readily deployed on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6$\times$ to 9.8$\times$ faster than the latest SOTA on mobile Graphics Processing Units (GPUs) and mobile Neural Processing Units (NPUs).
中文摘要:FALO是一种硬件友好的激光雷达3D检测方法,通过将体素排列为一维序列并使用ConvDotMix模块处理,在保持顶尖检测精度的同时实现快速推理,适用于资源受限的边缘设备。
English Summary: FALO is a hardware-friendly LiDAR 3D detection method that achieves state-of-the-art accuracy and fast inference speed by processing voxels as 1D sequences through ConvDotMix blocks, making it suitable for resource-constrained edge devices.
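To make the ConvDotMix description concrete, here is a minimal PyTorch sketch of one block on a serialized voxel sequence. The kernel size, branch layout, and placement of the Hadamard product are assumptions inferred from the abstract rather than FALO's actual architecture.

```python
# Minimal sketch of a ConvDotMix-style block on (batch, channels, length) input.
import torch
import torch.nn as nn

class ConvDotMixBlock(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 13):
        super().__init__()
        # Large-kernel depthwise conv: spatial mixing along the 1D sequence.
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        # Two linear branches whose Hadamard product injects
        # higher-order nonlinear interaction among spatial features.
        self.branch_a = nn.Conv1d(dim, dim, 1)
        self.branch_b = nn.Conv1d(dim, dim, 1)
        self.out = nn.Conv1d(dim, dim, 1)    # embedding-dimension mixing
        self.norm = nn.BatchNorm1d(dim)

    def forward(self, x):                    # x: (B, C, L)
        h = self.conv(x)
        h = self.branch_a(h) * self.branch_b(h)  # Hadamard product
        return self.norm(x + self.out(h))        # residual connection

seq = torch.randn(2, 64, 1024)  # 1024 voxels serialized into a 1D sequence
print(ConvDotMixBlock(64)(seq).shape)  # torch.Size([2, 64, 1024])
```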
Authors:Meng Li, Zhen Zhao, Zhengping Che, Fei Liao, Kun Wu, Zhiyuan Xu, Pei Ren, Zhao Jin, Ning Liu, Jian Tang
Abstract:
Robots deployed in dynamic environments must be able to not only follow diverse language instructions but also flexibly adapt when user intent changes mid-execution. While recent Vision-Language-Action (VLA) models have advanced multi-task learning and instruction following, they typically assume static task intent, failing to respond when new instructions arrive during ongoing execution. This limitation hinders natural and robust interaction in dynamic settings, such as retail or household environments, where real-time intent changes are common. We propose SwitchVLA, a unified, execution-aware framework that enables smooth and reactive task switching without external planners or additional switch-specific data. We model task switching as a behavior modulation problem conditioned on execution state and instruction context. Expert demonstrations are segmented into temporally grounded contact phases, allowing the policy to infer task progress and adjust its behavior accordingly. A multi-behavior conditional policy is then trained to generate flexible action chunks under varying behavior modes through conditioned trajectory modeling. Experiments in both simulation and real-world robotic manipulation demonstrate that SwitchVLA enables robust instruction adherence, fluid task switching, and strong generalization, outperforming prior VLA baselines in both task success rate and interaction naturalness.
中文摘要:SwitchVLA是一个执行感知框架,通过行为调节和条件轨迹建模使机器人能够在执行过程中动态适应用户指令变化,在任务成功率和交互自然度上均优于现有方法。
English Summary: SwitchVLA is an execution-aware framework that enables robots to dynamically adapt to changing user instructions mid-execution through behavior modulation and conditioned trajectory modeling, outperforming existing methods in task success and interaction fluency.
Authors:Jun Rao, Zepeng Lin, Xuebo Liu, Xiaopeng Ke, Lian Lian, Dong Jin, Shengjun Cheng, Jun Yu, Min Zhang
Abstract:
Large Language Models (LLMs) often require domain-specific fine-tuning to address targeted tasks, which risks degrading their general capabilities. Maintaining a balance between domain-specific enhancements and general model utility is a key challenge. This paper proposes a novel approach named APT (Weakness Case Acquisition and Iterative Preference Training) to enhance domain-specific performance with self-generated dis-preferred weakness data (bad cases and similar cases). APT uniquely focuses on training the model using only those samples where errors occur, alongside a small, similar set of samples retrieved for this purpose. This targeted training minimizes interference with the model's existing knowledge base, effectively retaining generic capabilities. Experimental results on the Llama-2 and Mistral-V0.3 models across various benchmarks demonstrate that APT ensures no reduction in generic capacity and achieves superior performance on downstream tasks compared to various existing methods. This validates our method as an effective strategy for enhancing domain-specific capabilities without sacrificing the model's broader applicability.
中文: 本文提出APT方法,通过仅针对错误样本进行训练,利用自生成的弱点数据提升大语言模型的领域性能,同时有效保持其通用能力。
English: This paper introduces APT, a method that uses self-generated weakness data to enhance domain-specific performance in LLMs while preserving their general capabilities by focusing training only on error-prone samples.
Authors:Leyla Mirvakhabova, Hong Cai, Jisoo Jeong, Hanno Ackermann, Farhad Zanjani, Fatih Porikli
Abstract:
Recent works on optical flow estimation use neural networks to predict the flow field that maps positions of one image to positions of the other. These networks consist of a feature extractor, a correlation volume, and finally several refinement steps. These refinement steps mimic the iterative refinements performed by classical optimization algorithms and are usually implemented by neural layers (e.g., GRU) which are recurrently executed for a fixed and pre-determined number of steps. However, relying on a fixed number of steps may result in suboptimal performance because it is not tailored to the input data. In this paper, we introduce a novel approach for predicting the derivative of the flow using a continuous model, namely neural ordinary differential equations (ODE). One key advantage of this approach is its capacity to model an equilibrium process, dynamically adjusting the number of compute steps based on the data at hand. By following a particular neural architecture, ODE solver, and associated hyperparameters, our proposed model can replicate the exact same updates as recurrent cells used in existing works, offering greater generality. Through extensive experimental analysis on optical flow benchmarks, we demonstrate that our approach achieves an impressive improvement over baseline and existing models, all while requiring only a single refinement step.
中文: 本文提出一种基于神经微分方程的光流估计方法,能根据输入数据动态调整计算步骤,仅需单次优化即可超越传统固定步长模型的性能。
English: This paper introduces a neural ODE-based method that dynamically adjusts computation steps for optical flow estimation, achieving superior performance with just one refinement step compared to fixed-step approaches.
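The connection drawn above, that a particular solver configuration reproduces recurrent updates, is easy to see in a sketch: with a fixed-step Euler solver, each integration step is one recurrent-style refinement. The tiny ConvNet below is a hypothetical placeholder for the real update network.

```python
# Minimal sketch of refining an optical-flow field by integrating a learned
# derivative with a fixed-step Euler ODE solver.
import torch
import torch.nn as nn

class FlowDerivative(nn.Module):
    """Predicts d(flow)/dt from the current flow and image features."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 + feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1),
        )

    def forward(self, t, flow, feats):  # t kept for ODE-style signature
        return self.net(torch.cat([flow, feats], dim=1))

def integrate(flow0, feats, deriv, n_steps: int = 4, t1: float = 1.0):
    """Fixed-step Euler integration; an adaptive solver could instead choose
    the number of steps based on the input, as the paper advocates."""
    flow, dt = flow0, t1 / n_steps
    for i in range(n_steps):
        flow = flow + dt * deriv(i * dt, flow, feats)
    return flow

feats = torch.randn(1, 32, 48, 64)
flow0 = torch.zeros(1, 2, 48, 64)
print(integrate(flow0, feats, FlowDerivative()).shape)  # torch.Size([1, 2, 48, 64])
```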
Authors:Zeliang Zhang, Gaowen Liu, Charles Fleming, Ramana Rao Kompella, Chenliang Xu
Abstract:
Foundation models (FMs) such as CLIP have demonstrated impressive zero-shot performance across various tasks by leveraging large-scale, unsupervised pre-training. However, they often inherit harmful or unwanted knowledge from noisy internet-sourced datasets, compromising their reliability in real-world applications. Existing model unlearning methods either rely on access to pre-trained datasets or focus on coarse-grained unlearning (e.g., entire classes), leaving a critical gap for fine-grained unlearning. In this paper, we address the challenging scenario of selectively forgetting specific portions of knowledge within a class, without access to pre-trained data, while preserving the model's overall performance. We propose a novel three-stage approach that progressively unlearns targeted knowledge while mitigating over-forgetting. It consists of (1) a forgetting stage to fine-tune CLIP on the samples to be forgotten, (2) a reminding stage to restore performance on retained samples, and (3) a restoring stage to recover zero-shot capabilities using model souping. Additionally, we introduce knowledge distillation to handle the distribution disparity between forgetting samples, retaining samples, and unseen pre-trained data. Extensive experiments on CIFAR-10, ImageNet-1K, and style datasets demonstrate that our approach effectively unlearns specific subgroups while maintaining strong zero-shot performance on semantically similar subgroups and other categories, significantly outperforming baseline unlearning methods, which lose effectiveness under the CLIP unlearning setting.
中文:基础模型如CLIP存在从噪声数据集中继承有害知识的问题,为此我们提出了一种新颖的三阶段遗忘方法,无需预训练数据即可选择性遗忘特定知识,同时保持整体性能。
English: Foundation models like CLIP face challenges with unwanted knowledge from noisy datasets, prompting the development of a novel three-stage unlearning method that selectively forgets specific knowledge without access to pre-training data while preserving overall performance.
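The restoring stage relies on model souping, i.e., weight-space averaging of checkpoints. Here is a minimal sketch, assuming both checkpoints share one architecture; the interpolation weights are illustrative, not the paper's tuned values.

```python
# Minimal sketch of model souping: average checkpoints in weight space.
import torch

def soup(state_dicts, weights=None):
    """Average several checkpoints of the same architecture."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    return {key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}

# Toy usage with two small "checkpoints".
net = torch.nn.Linear(4, 2)
sd_forgot = {k: v + 0.1 for k, v in net.state_dict().items()}  # after unlearning
sd_original = net.state_dict()                                  # pre-trained
net.load_state_dict(soup([sd_forgot, sd_original], weights=[0.7, 0.3]))
```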
Authors:Zheng Liu, He Zhu, Xinyang Li, Yirun Wang, Yujiao Shi, Wei Li, Jingwen Leng, Minyi Guo, Yu Feng
Abstract:
3D Gaussian Splatting (3DGS) is an emerging technique for photorealistic 3D scene rendering. However, rendering city-scale 3DGS scenes on mobile devices, e.g., smartphones, remains a significant challenge due to the limited resources on mobile devices. A natural solution is to offload computation to the cloud; however, naively streaming rendered frames from the cloud to the client introduces high latency and requires bandwidth far beyond the capacity of current wireless networks.
In this paper, we propose Voyager, an effective solution to enable city-scale 3DGS rendering on mobile devices. Our key insight is that, under normal user motion, the number of newly visible Gaussians per second remains roughly constant. Leveraging this, we stream only the necessary Gaussians to the client. Specifically, on the cloud side, we propose asynchronous level-of-detail search to identify the necessary Gaussians for the client. On the client side, we accelerate rendering via a lookup table-based rasterization. Combined with holistic runtime optimizations, our system can deliver low-latency, city-scale 3DGS rendering on mobile devices. Compared to existing solutions, Voyager achieves over 100$\times$ reduction on data transfer and up to 8.9$\times$ speedup while retaining comparable rendering quality.
Chinese: Voyager通过在云端进行异步细节层次搜索、仅向客户端传输必要的高斯体,并在客户端采用基于查找表的光栅化,实现了移动设备上低延迟的城市级3D高斯溅射渲染,数据传输量减少100倍以上,速度最高提升8.9倍。
English: Voyager enables low-latency, city-scale 3D Gaussian splatting rendering on mobile devices by streaming only the necessary Gaussians, combining asynchronous level-of-detail search on the cloud with lookup table-based rasterization on the client, achieving over 100× less data transfer and up to 8.9× speedup at comparable rendering quality.
Authors:Zheng Liu, He Zhu, Xinyang Li, Yirun Wang, Yujiao Shi, Yiming Gan, Wei Li, Jingwen Leng, Minyi Guo, Yu Feng
Abstract:
3D Gaussian splatting (3DGS) is an emerging technique for photorealistic 3D scene rendering. However, rendering city-scale 3DGS scenes on resource-constrained mobile devices in real-time remains a significant challenge due to two compute-intensive stages: level-of-detail (LoD) search and rasterization. In this paper, we propose Voyager, an effective solution to accelerate city-scale 3DGS rendering on mobile devices. Our key insight is that, under normal user motion, the number of newly visible Gaussians within the view frustum remains roughly constant. Leveraging this temporal correlation, we propose a temporal-aware LoD search to identify the necessary Gaussians for the remaining rendering stages. For the remaining rendering process, we accelerate the bottleneck stage, rasterization, via preemptive $\alpha$-filtering. With all optimizations above, our system can deliver low-latency, city-scale 3DGS rendering on mobile devices. Compared to existing solutions, Voyager achieves up to 6.6$\times$ speedup and 85\% energy savings with superior rendering quality.
Chinese: Voyager通过时序感知的细节层次搜索和抢占式α滤波技术,在移动设备上实现了城市级3D高斯溅射渲染的加速,最高提速6.6倍并节省85%能耗,同时保持卓越的渲染质量。
English: Voyager accelerates city-scale 3D Gaussian splatting rendering on mobile devices by employing temporal-aware level-of-detail search and preemptive α-filtering, achieving up to 6.6× speedup and 85% energy savings with high-quality output.
Authors:Junjie Li, Nan Zhang, Xiaoyang Qu, Kai Lu, Guokuan Li, Jiguang Wan, Jianzong Wang
Abstract:
Object Navigation (ObjectNav) is a fundamental task in embodied artificial intelligence. Although significant progress has been made in semantic map construction and target direction prediction in current research, redundant exploration and exploration failures remain inevitable. A critical but underexplored direction is the timely termination of exploration to overcome these challenges. We observe a diminishing marginal effect between exploration steps and exploration rates and analyze the cost-benefit relationship of exploration. Inspired by this, we propose RATE-Nav, a Region-Aware Termination-Enhanced method. It includes a geometric predictive region segmentation algorithm and a region-based exploration estimation algorithm for exploration rate calculation. Leveraging the visual question answering capabilities of vision-language models (VLMs) together with these exploration rates enables efficient termination. RATE-Nav achieves a success rate of 67.8% and an SPL of 31.3% on the HM3D dataset. On the more challenging MP3D dataset, RATE-Nav shows approximately 10% improvement over previous zero-shot methods.
中文摘要:RATE-Nav提出了一种区域感知的终止增强方法,通过视觉语言模型和探索率计算实现高效终止探索,在HM3D和MP3D数据集上的物体导航任务中取得了显著性能提升。
English Summary: RATE-Nav introduces a region-aware termination method that uses visual language models and exploration rate calculations to efficiently end exploration, achieving significant performance improvements in object navigation tasks on both HM3D and MP3D datasets.
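The cost-benefit idea behind RATE-Nav's termination, stopping once the exploration rate shows diminishing marginal gains, can be sketched in a few lines. The plateau window, the threshold, and the toy exploration trace are all assumptions; the paper's region prediction and VLM-based reasoning are not modeled here.

```python
# Minimal sketch of exploration-rate-based termination. The exploration rate
# is explored area over a (given) predicted region area; the stopping rule is
# a diminishing-marginal-gain threshold with illustrative constants.
def should_terminate(explored_areas, region_area,
                     window: int = 10, min_gain: float = 0.002):
    """Stop when the exploration rate plateaus over the last `window` steps."""
    if len(explored_areas) < window + 1:
        return False
    rate_now = explored_areas[-1] / region_area
    rate_then = explored_areas[-1 - window] / region_area
    return (rate_now - rate_then) / window < min_gain

# Toy trace: fast gains early, diminishing marginal effect later.
areas, region, a = [], 100.0, 0.0
for step in range(60):
    a += max(0.1, 3.0 * (0.9 ** step))   # diminishing per-step gain
    areas.append(a)
    if should_terminate(areas, region):
        print("terminate at step", step)
        break
```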
Authors:Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang
Abstract:
Stickers, though small, are a highly condensed form of visual expression, ubiquitous across messaging platforms and embraced by diverse cultures, genders, and age groups. Despite their popularity, sticker retrieval remains an underexplored task due to the significant human effort and subjectivity involved in constructing high-quality sticker query datasets. Although large language models (LLMs) excel at general NLP tasks, they falter when confronted with the nuanced, intangible, and highly specific nature of sticker query generation.
To address this challenge, we propose a threefold solution. First, we introduce Sticktionary, a gamified annotation framework designed to gather diverse, high-quality, and contextually resonant sticker queries. Second, we present StickerQueries, a multilingual sticker query dataset containing 1,115 English and 615 Chinese queries, annotated by over 60 contributors across 60+ hours. Lastly, through extensive quantitative and qualitative evaluation, we demonstrate that our approach significantly enhances query generation quality, retrieval accuracy, and semantic understanding in the sticker domain. To support future research, we publicly release our multilingual dataset along with two fine-tuned query generation models.
中文摘要:本研究提出游戏化标注框架Sticktionary和多语言数据集StickerQueries,有效解决了表情贴纸查询生成与检索的难题,通过全面评估显著提升了查询质量与检索精度。
English Summary: This study introduces Sticktionary, a gamified annotation framework, and StickerQueries, a multilingual dataset, to address the challenges in sticker query generation and retrieval, significantly improving quality and accuracy through extensive evaluation.
Authors:Yu-Fei Shi, Yang Ai, Zhen-Hua Ling
Abstract:
To compare the performance of two speech generation systems, one of the most effective approaches is estimating the preference score between their generated speech. This paper proposes a novel universal preference-score-based pairwise speech quality assessment (UPPSQA) model, aimed at predicting the preference score between paired speech samples to determine which one has better quality. The model first predicts the absolute mean opinion score (MOS) for the two speech samples separately, and then aggregates them into a relative preference score using a preference function. To address the scarcity of preference data, we also construct a new pairwise speech dataset based on a MOS dataset for experiments. Experimental results confirm that, across training scenarios with different data types and label conditions, as well as in both in-domain and out-of-domain test scenarios, the prediction accuracy of UPPSQA outperforms that of the baseline models, demonstrating its universality.
中文: 本文提出了一种通用的成对语音质量评估模型,通过整合单个语音的平均意见分数来预测样本间的偏好得分,并在多种数据和测试场景下验证了其优于基准模型的预测准确性。
English: This paper introduces a universal pairwise speech quality assessment model that predicts preference scores between speech samples by combining individual MOS predictions and demonstrates superior accuracy across diverse data and test scenarios compared to baseline models.
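A minimal sketch of the two-step aggregation described above: predict an absolute MOS per sample, then map the MOS difference to a preference score. The sigmoid preference function, its temperature, and the placeholder MOS regressor are assumptions, not UPPSQA's trained components.

```python
# Minimal sketch of preference-score aggregation from per-sample MOS.
import math

def mos_predictor(features):
    """Placeholder for a trained MOS regressor (returns a 1-5 score)."""
    return 1.0 + 4.0 * min(1.0, max(0.0, sum(features) / len(features)))

def preference_score(feat_a, feat_b, temperature: float = 1.0):
    """P(sample A preferred over sample B) from their predicted MOS."""
    diff = mos_predictor(feat_a) - mos_predictor(feat_b)
    return 1.0 / (1.0 + math.exp(-diff / temperature))

print(round(preference_score([0.9, 0.8, 0.7], [0.3, 0.2, 0.4]), 3))  # > 0.5
```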
Authors:Weiyang Guo, Zesheng Shi, Zhuo Li, Yequan Wang, Xuebo Liu, Wenya Wang, Fangming Liu, Min Zhang, Jing Li
Abstract:
As large language models (LLMs) grow in power and influence, ensuring their safety and preventing harmful output becomes critical. Automated red teaming serves as a tool to detect security vulnerabilities in LLMs without manual labor. However, most existing methods struggle to balance the effectiveness and diversity of red-team generated attack prompts. To address this challenge, we propose \ourapproach, a novel automated red teaming training framework that utilizes reinforcement learning to explore and generate more effective attack prompts while balancing their diversity. Specifically, it consists of three training stages: (1) Cold Start: The red-team model undergoes supervised fine-tuning on a jailbreak dataset obtained through imitation learning. (2) Warm-up Exploration: The model is trained in jailbreak instruction following and exploration, using diversity and consistency as reward signals. (3) Enhanced Jailbreak: Progressive jailbreak rewards are introduced to gradually enhance the jailbreak performance of the red-team model. Extensive experiments on a variety of LLMs show that \ourapproach effectively balances the diversity and effectiveness of jailbreak prompts compared to existing methods. Our work significantly improves the efficiency of red team exploration and provides a new perspective on automated red teaming.
Chinese: 提出的强化学习框架 \ourapproach 通过三阶段训练过程生成多样且有效的攻击提示,改进了自动化红队测试,同时提升了大型语言模型的安全性评估效率和越狱检测能力。
English: The proposed reinforcement learning framework, \ourapproach, enhances automated red teaming by generating diverse and effective attack prompts through a three-stage training process, improving both safety evaluation efficiency and jailbreak detection in large language models.
Authors:Wei Tao, Xiaoyang Qu, Kai Lu, Jiguang Wan, Shenglin He, Jianzong Wang
Abstract:
Since the point cloud data is inherently irregular and unstructured, point cloud semantic segmentation has always been a challenging task. The graph-based method attempts to model the irregular point cloud by representing it as a graph; however, this approach incurs substantial computational cost due to the necessity of constructing a graph for every point within a large-scale point cloud. In this paper, we observe that boundary points possess more intricate spatial structural information and develop a novel graph attention network known as the Boundary-Aware Graph attention Network (BAGNet). On one hand, BAGNet contains a boundary-aware graph attention layer (BAGLayer), which employs edge vertex fusion and attention coefficients to capture features of boundary points, reducing the computation time. On the other hand, BAGNet employs a lightweight attention pooling layer to extract the global feature of the point cloud to maintain model accuracy. Extensive experiments on standard datasets demonstrate that BAGNet outperforms state-of-the-art methods in point cloud semantic segmentation with higher accuracy and less inference time.
中文摘要:BAGNet提出了一种边界感知图注意力网络,通过捕捉边界点的复杂空间特征降低计算成本,在点云语义分割中实现了更高精度和更快推理速度。
English Summary: BAGNet introduces a boundary-aware graph attention network that efficiently captures complex spatial features of boundary points while reducing computational costs, achieving superior accuracy and faster inference in point cloud semantic segmentation.
Authors:Jisoo Jeong, Hong Cai, Jamie Menjay Lin, Fatih Porikli
Abstract:
Conventional training for optical flow and stereo depth models typically employs a uniform loss function across all pixels. However, this one-size-fits-all approach often overlooks the significant variations in learning difficulty among individual pixels and contextual regions. This paper investigates the uncertainty-based confidence maps which capture these spatially varying learning difficulties and introduces tailored solutions to address them. We first present the Difficulty Balancing (DB) loss, which utilizes an error-based confidence measure to encourage the network to focus more on challenging pixels and regions. Moreover, we identify that some difficult pixels and regions are affected by occlusions, resulting from the inherently ill-posed matching problem in the absence of real correspondences. To address this, we propose the Occlusion Avoiding (OA) loss, designed to guide the network into cycle consistency-based confident regions, where feature matching is more reliable. By combining the DB and OA losses, we effectively manage various types of challenging pixels and regions during training. Experiments on both optical flow and stereo depth tasks consistently demonstrate significant performance improvements when applying our proposed combination of the DB and OA losses.
Chinese: 本文提出难度平衡损失和遮挡避免损失,针对光流与立体深度模型中不同像素和区域的学习难度差异进行优化,通过聚焦困难区域显著提升了模型性能。
English: This paper introduces a Difficulty Balancing (DB) loss and an Occlusion Avoiding (OA) loss to address spatially varying learning difficulties in optical flow and stereo depth models, significantly improving performance by focusing on challenging pixels and regions.
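A minimal sketch of a Difficulty Balancing style loss for optical flow follows, assuming an error-based confidence map that down-weights easy pixels; the exact confidence-to-weight mapping (the exponent gamma here) is an assumption based on the abstract.

```python
# Minimal sketch of an error-based, difficulty-weighted flow loss.
import torch

def db_loss(pred_flow, gt_flow, gamma: float = 1.0):
    epe = torch.norm(pred_flow - gt_flow, dim=1)   # per-pixel endpoint error
    confidence = torch.exp(-epe).detach()          # high error -> low confidence
    weight = (1.0 - confidence) ** gamma           # focus on difficult pixels
    return (weight * epe).mean()

pred = torch.randn(2, 2, 32, 32, requires_grad=True)
gt = torch.randn(2, 2, 32, 32)
db_loss(pred, gt).backward()
print(pred.grad.shape)  # torch.Size([2, 2, 32, 32])
```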
Authors:Gijs Luijten, Roberto Maria Scardigno, Lisle Faray de Paiva, Peter Hoyer, Jens Kleesiek, Domenico Buongiorno, Vitoantonio Bevilacqua, Jan Egger
Abstract:
Ultrasound (US) is widely accessible and radiation-free but has a steep learning curve due to its dynamic nature and non-standard imaging planes. Additionally, the constant need to shift focus between the US screen and the patient poses a challenge. To address these issues, we integrate deep learning (DL)-based semantic segmentation for real-time (RT) automated kidney volumetric measurements, which are essential for clinical assessment but are traditionally time-consuming and prone to fatigue-related error. This automation allows clinicians to concentrate on image interpretation rather than manual measurements. Complementing DL, augmented reality (AR) enhances the usability of US by projecting the display directly into the clinician's field of view, improving ergonomics and reducing the cognitive load associated with screen-to-patient transitions. Two AR-DL-assisted US pipelines on HoloLens-2 are proposed: one streams directly via the application programming interface for a wireless setup, while the other supports any US device with video output for broader accessibility. We evaluate RT feasibility and accuracy using the Open Kidney Dataset and open-source segmentation models (nnU-Net, Segmenter, YOLO with MedSAM and LiteMedSAM). Our open-source GitHub pipeline includes model implementations, measurement algorithms, and a Wi-Fi-based streaming solution, enhancing US training and diagnostics, especially in point-of-care settings.
Chinese: 本研究结合深度学习实现实时自动肾脏体积测量,并利用增强现实技术将超声显示投射到医生视野中,通过两种AR-DL辅助方案解决传统超声学习曲线陡峭和认知负荷高的问题,已验证其可行性与准确性。
English: This study integrates deep learning for real-time automated kidney volume measurements and augmented reality to project ultrasound displays into the clinician's view, addressing the steep learning curve and cognitive load of traditional ultrasound through two proposed AR-DL-assisted pipelines evaluated for feasibility and accuracy.
Authors:Yuanfang Ren, Esra Adiyeke, Ziyuan Guan, Zhenhong Hu, Mackenzie J Meni, Benjamin Shickel, Parisa Rashidi, Tezcan Ozrazgat-Baslanti, Azra Bihorac
Abstract:
Despite advances in surgical techniques and care, postoperative complications are prevalent and affect up to 15% of patients who undergo major surgery. The objective of this study is to develop and validate models for predicting postoperative complications and death after major surgery on a large and multicenter dataset, following the previously validated MySurgeryRisk algorithm. This retrospective, longitudinal, and multicenter cohort analysis included 508,097 encounters from 366,875 adult inpatients who underwent major surgeries and were admitted to healthcare institutions within the OneFlorida+ network between 01/01/2012 and 04/29/2023. We applied the validated feature selection and transformation approach of the MySurgeryRisk models and redeveloped eXtreme Gradient Boosting (XGBoost) models for predicting the risk of postoperative acute kidney injury (AKI), need for intensive care unit (ICU) admission, need for mechanical ventilation (MV) therapy, and in-hospital mortality on a development set, and evaluated model performance on a validation set. Area under the receiver operating characteristic curve values were obtained for need for ICU admission, 0.93 (95% Confidence Interval [CI], 0.93-0.93); need for MV, 0.94 (95% CI, 0.94-0.94); AKI, 0.92 (95% CI, 0.92-0.92); and in-hospital mortality, 0.95 (95% CI, 0.94-0.95). Area under the precision-recall curve values were computed for need for ICU admission, 0.62 (95% CI, 0.62-0.63); need for MV, 0.51 (95% CI, 0.49-0.52); AKI, 0.53 (95% CI, 0.53-0.54); and in-hospital mortality, 0.26 (95% CI, 0.24-0.29). The performance of these models is comparable to that of the previously validated MySurgeryRisk models, suggesting enhanced generalizability. Primary procedure code and provider specialty consistently appeared as the top influential variables, providing valuable insights into the factors influencing surgical outcomes.
中文: 本研究基于大型多中心数据集开发并验证了预测术后并发症的XGBoost模型,其性能与现有算法相当,其中主要手术代码和医生专业被确认为关键预测因素。
English: This study developed and validated XGBoost models using a large multicenter dataset to predict postoperative complications, demonstrating high performance comparable to existing algorithms with primary procedure codes and provider specialties as key predictors.
Authors:Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
Abstract:
Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.
中文: 本文提出ShotBench基准测试,专门评估视觉语言模型对电影语言的理解能力,揭示了现有模型的不足,并通过ShotQA数据集训练出ShotVL模型,实现了最先进的性能表现。
English: This paper introduces ShotBench, a specialized benchmark for evaluating Vision-Language Models' understanding of cinematic language, revealing their limitations and proposing ShotVL, a new model trained on the ShotQA dataset that achieves state-of-the-art performance.
Authors:Suorong Yang, Peijia Li, Furao Shen, Jian Zhao
Abstract:
Modern deep architectures often rely on large-scale datasets, but training on these datasets incurs high computational and storage overhead. Real-world datasets often contain substantial redundancies, prompting the need for more data-efficient training paradigms. Data selection has shown promise in mitigating redundancy by identifying the most representative samples, thereby reducing training costs without compromising performance. Existing methods typically rely on static scoring metrics or pretrained models, overlooking the combined effect of selected samples and their evolving dynamics during training. We introduce the concept of epsilon-sample cover, which quantifies sample redundancy based on inter-sample relationships, capturing the intrinsic structure of the dataset. Based on this, we reformulate data selection as a reinforcement learning (RL) process and propose RL-Selector, where a lightweight RL agent optimizes the selection policy by leveraging epsilon-sample cover derived from the evolving dataset distribution as a reward signal. Extensive experiments across benchmark datasets and diverse architectures demonstrate that our method consistently outperforms existing state-of-the-art baselines. Models trained with our selected datasets show enhanced generalization performance with improved training efficiency.
中文摘要:本研究提出RL-Selector方法,通过强化学习框架结合ε样本覆盖概念动态选择代表性数据,在多种基准测试中显著提升训练效率并增强模型泛化性能。
English Summary: The study introduces RL-Selector, a reinforcement learning-based method that uses the epsilon-sample cover concept to dynamically select representative data samples, significantly improving training efficiency and model generalization across various benchmarks.
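The epsilon-sample cover idea can be illustrated with a greedy cover over feature vectors: a sample is redundant once it lies within epsilon of an already selected one. This sketch only illustrates the cover concept itself; RL-Selector derives its reward from cover statistics over the evolving training distribution, which is not modeled here.

```python
# Minimal sketch of a greedy epsilon-sample cover over feature vectors.
import numpy as np

def epsilon_cover(X, eps):
    """Return indices of a greedy epsilon-cover of the rows of X."""
    selected, covered = [], np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        if not covered[i]:
            selected.append(i)
            dist = np.linalg.norm(X - X[i], axis=1)
            covered |= dist <= eps   # everything near i is now redundant
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
cover = epsilon_cover(X, eps=4.0)
print(f"{len(cover)} samples cover all 200 within eps")
```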
Authors:Xiaoyu Li, Zhao Song, Jiahao Zhang
Abstract:
The explosive growth of AI research has driven paper submissions at flagship AI conferences to unprecedented levels, necessitating many venues in 2025 (e.g., CVPR, ICCV, KDD, AAAI, IJCAI, WSDM) to enforce strict per-author submission limits and to desk-reject any excess papers by simple ID order. While this policy helps reduce reviewer workload, it may unintentionally discard valuable papers and penalize authors' efforts. In this paper, we ask an essential research question on whether it is possible to follow submission limits while minimizing needless rejections. We first formalize the current desk-rejection policies as an optimization problem, and then develop a practical algorithm based on linear programming relaxation and a rounding scheme. Under extensive evaluation on 11 years of real-world ICLR (International Conference on Learning Representations) data, our method preserves up to $19.23\%$ more papers without violating any author limits. Moreover, our algorithm is highly efficient in practice, with all results on ICLR data computed within at most 53.64 seconds. Our work provides a simple and practical desk-rejection strategy that significantly reduces unnecessary rejections, demonstrating strong potential to improve current CS conference submission policies.
中文摘要:针对AI会议投稿量激增的问题,本文提出一种高效算法,在严格遵守作者投稿限制的前提下,通过线性规划松弛与取整方案,可在ICLR真实数据上减少高达19.23%的不必要拒稿,为改进会议审稿政策提供了实用解决方案。
English Summary: To address the overwhelming number of submissions at AI conferences, this paper proposes an efficient algorithm that significantly reduces unnecessary paper rejections by up to 19.23% while strictly adhering to author submission limits, as validated on 11 years of ICLR data.
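A minimal sketch of the optimization view described above: maximize the number of retained papers subject to per-author caps via an LP relaxation, then round greedily. The toy instance and the simple rounding rule are illustrative; the paper's rounding scheme is more careful.

```python
# Minimal sketch: desk rejection as LP relaxation + greedy rounding.
from scipy.optimize import linprog

papers = {0: ["alice", "bob"], 1: ["alice"], 2: ["bob", "carol"],
          3: ["alice", "carol"], 4: ["carol"]}
limit = 2  # per-author submission cap (toy value)

authors = sorted({a for coauthors in papers.values() for a in coauthors})
A_ub = [[1.0 if a in papers[p] else 0.0 for p in papers] for a in authors]
b_ub = [limit] * len(authors)
c = [-1.0] * len(papers)  # maximize retained papers (minimize negation)

relaxed = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(papers))

# Greedy rounding: accept papers in decreasing fractional value while
# every coauthor still has remaining budget.
load = {a: 0 for a in authors}
kept = []
for p in sorted(papers, key=lambda p: -relaxed.x[p]):
    if all(load[a] < limit for a in papers[p]):
        kept.append(p)
        for a in papers[p]:
            load[a] += 1
print("retained papers:", sorted(kept))
```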
Authors:Zimo Ji, Daoyuan Wu, Wenyuan Jiang, Pingchuan Ma, Zongjie Li, Shuai Wang
Abstract:
Capture-the-Flag (CTF) competitions are crucial for cybersecurity education and training. As large language models (LLMs) evolve, there is increasing interest in their ability to automate CTF challenge solving. For example, DARPA has organized the AIxCC competition since 2023 to advance AI-powered automated offense and defense. However, this demands a combination of multiple abilities, from knowledge to reasoning and further to actions. In this paper, we highlight the importance of technical knowledge in solving CTF problems and deliberately construct a focused benchmark, CTFKnow, with 3,992 questions to measure LLMs' performance in this core aspect. Our study offers a focused and innovative measurement of LLMs' capability in understanding CTF knowledge and applying it to solve CTF challenges. Our key findings reveal that while LLMs possess substantial technical knowledge, they falter in accurately applying this knowledge to specific scenarios and adapting their strategies based on feedback from the CTF environment.
Based on insights derived from this measurement study, we propose CTFAgent, a novel LLM-driven framework for advancing CTF problem-solving. CTFAgent introduces two new modules: two-stage Retrieval Augmented Generation (RAG) and interactive Environmental Augmentation, which enhance LLMs' technical knowledge and vulnerability exploitation on CTF, respectively. Our experimental results show that CTFAgent achieves over 80% performance improvement on two popular CTF datasets. Moreover, in the recent picoCTF2024 hosted by CMU, CTFAgent ranked in the top 23.6% of nearly 7,000 participating teams. This reflects the benefit of our measurement study and the potential of our framework in advancing LLMs' capabilities in CTF problem-solving.
中文: 本研究提出了CTFKnow基准来评估大语言模型在网络安全夺旗赛中的技术知识掌握程度,发现其虽具备扎实理论基础但在实际应用中存在不足,并开发了CTFAgent增强框架,通过两阶段检索生成和交互式环境增强模块显著提升了模型表现。
English: This study introduces CTFKnow, a benchmark for assessing large language models' technical knowledge in cybersecurity Capture-the-Flag competitions, revealing their limitations in practical application despite strong theoretical understanding, and proposes CTFAgent, an enhanced framework that significantly improves performance through advanced retrieval and interactive modules.
Authors:Zongjie Li, Daoyuan Wu, Shuai Wang, Zhendong Su
Abstract:
The increasing demand for domain-specific and human-aligned Large Language Models (LLMs) has led to the widespread adoption of Supervised Fine-Tuning (SFT) techniques. SFT datasets often comprise valuable instruction-response pairs, making them highly valuable targets for potential extraction. This paper studies this critical research problem for the first time. We start by formally defining and formulating the problem, then explore various attack goals, types, and variants based on the unique properties of SFT data in real-world scenarios. Based on our analysis of extraction behaviors of direct extraction, we develop a novel extraction method specifically designed for SFT models, called Differentiated Data Extraction (DDE), which exploits the confidence levels of fine-tuned models and their behavioral differences from pre-trained base models. Through extensive experiments across multiple domains and scenarios, we demonstrate the feasibility of SFT data extraction using DDE. Our results show that DDE consistently outperforms existing extraction baselines in all attack settings. To counter this new attack, we propose a defense mechanism that mitigates DDE attacks with minimal impact on model performance. Overall, our research reveals hidden data leak risks in fine-tuned LLMs and provides insights for developing more secure models.
中文: 本文提出差异化数据提取(DDE)方法,通过利用精调后模型的置信度差异实现SFT数据的高效提取,在超越现有技术的同时设计了防御机制来应对此类数据泄露风险。
English: This paper introduces Differentiated Data Extraction (DDE), a novel method that exploits fine-tuned LLMs' confidence levels to effectively extract SFT data, demonstrating its superiority over existing techniques while proposing a defense mechanism to mitigate such attacks.
Authors:Elisabetta Biondi, Chiara Boldrini, Andrea Passarella, Marco Conti
Abstract:
Online social networks (OSNs) have transformed the way individuals fulfill their social needs and consume information. As OSNs become increasingly prominent sources for news dissemination, individuals often encounter content that influences their opinions through both direct interactions and broader network dynamics. In this paper, we propose the Friedkin-Johnsen on Cascade (FJC) model, which is, to the best of our knowledge, the first attempt to integrate information cascades and opinion dynamics, specifically using the very popular Friedkin-Johnsen model. Our model, validated over real social cascades, highlights how the convergence of socialization and sharing news on these platforms can disrupt opinion evolution dynamics typically observed in offline settings. Our findings demonstrate that these cascades can amplify the influence of central opinion leaders, making them more resistant to divergent viewpoints, even when challenged by a critical mass of dissenting opinions. This research underscores the importance of understanding the interplay between social dynamics and information flow in shaping public discourse in the digital age.
中文摘要:本文提出的级联弗里德金-约翰逊模型首次将信息级联与观点动力学相结合,揭示了在线社交网络如何通过放大核心意见领袖的影响力并使其抵制不同观点,从而改变传统观点演变模式。
English Summary: The proposed Friedkin-Johnsen on Cascade model integrates information cascades with opinion dynamics, revealing how online social networks amplify opinion leaders' influence and disrupt traditional opinion evolution by making them resistant to opposing views.
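For readers unfamiliar with the underlying dynamics, here is a minimal sketch of classic Friedkin-Johnsen updates, which FJC runs on top of information cascades. The ring network, uniform susceptibility, and iteration count are toy choices; the cascade coupling itself is not modeled.

```python
# Minimal sketch of classic Friedkin-Johnsen opinion dynamics.
import numpy as np

n = 5
W = np.zeros((n, n))                 # row-stochastic influence weights
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5
lam = 0.8                            # susceptibility to social influence
innate = np.array([1.0, 0.2, -0.5, 0.0, 0.9])  # fixed innate opinions

x = innate.copy()
for _ in range(200):                 # iterate toward equilibrium
    x = lam * W @ x + (1 - lam) * innate

print(np.round(x, 3))                # converged expressed opinions
```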
Authors:Shujia Li, Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Yutong Ban
Abstract:
While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly due to the limited availability of large-scale 4D HOI datasets. In our study, we introduce GenHOI, a novel two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. In the initial stage of our framework, we employ an Object-AnchorNet to reconstruct sparse 3D HOI keyframes for unseen objects, learning solely from 3D HOI datasets, thereby mitigating the dependence on large-scale 4D HOI datasets. Subsequently, we introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate sparse 3D HOI keyframes into densely temporally coherent 4D HOI sequences. To enhance the quality of generated 4D HOI sequences, we propose a novel Contact-Aware Encoder within ContactDM to extract human-object contact patterns and a novel Contact-Aware HOI Attention to effectively integrate the contact signals into diffusion models. Experimental results show that we achieve state-of-the-art results on the publicly available OMOMO and 3D-FUTURE datasets, demonstrating strong generalization abilities to unseen objects, while enabling high-fidelity 4D HOI generation.
中文: 本研究提出的GenHOI框架通过物体锚点网络重构未知物体的稀疏3D交互关键帧,并利用接触感知扩散模型将其插值为稠密4D序列,在公开数据集上实现了最优性能并展现出强大的泛化能力。
English: This study introduces GenHOI, a two-stage framework that reconstructs sparse 3D human-object interaction keyframes for unseen objects and interpolates them into dense 4D sequences using a contact-aware diffusion model, achieving state-of-the-art results with strong generalization.
Authors:Md Abrar Jahin, Shahriar Soudeep, Arian Rahman Aditta, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen
Abstract:
Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.
中文: 本研究使用模拟CMS数据系统评估了视觉变换器及其混合模型在夸克-胶子喷注分类中的表现,证明其通过有效捕捉喷注子结构中的长程空间关联,性能显著优于传统卷积神经网络基准。
English: This study systematically evaluates Vision Transformer (ViT) and hybrid models for quark-gluon jet classification using simulated CMS data, demonstrating their superior performance over CNN baselines by effectively capturing long-range spatial correlations in jet substructure.
Authors:Md Abrar Jahin, Adiba Abid, M. F. Mridha
Abstract:
Expert systems often operate in domains characterized by class-imbalanced tabular data, where detecting rare but critical instances is essential for safety and reliability. While conventional approaches, such as cost-sensitive learning, oversampling, and graph neural networks, provide partial solutions, they suffer from drawbacks like overfitting, label noise, and poor generalization in low-density regions. To address these challenges, we propose QCL-MixNet, a novel Quantum-Informed Contrastive Learning framework augmented with k-nearest neighbor (kNN) guided dynamic mixup for robust classification under imbalance. QCL-MixNet integrates three core innovations: (i) a Quantum Entanglement-inspired layer that models complex feature interactions through sinusoidal transformations and gated attention, (ii) a sample-aware mixup strategy that adaptively interpolates feature representations of semantically similar instances to enhance minority class representation, and (iii) a hybrid loss function that unifies focal reweighting, supervised contrastive learning, triplet margin loss, and variance regularization to improve both intra-class compactness and inter-class separability. Extensive experiments on 18 real-world imbalanced datasets (binary and multi-class) demonstrate that QCL-MixNet consistently outperforms 20 state-of-the-art machine learning, deep learning, and GNN-based baselines in macro-F1 and recall, often by substantial margins. Ablation studies further validate the critical role of each architectural component. Our results establish QCL-MixNet as a new benchmark for tabular imbalance handling in expert systems. Theoretical analyses reinforce its expressiveness, generalization, and optimization robustness.
中文摘要:QCL-MixNet是一种量子启发的对比学习框架,通过结合量子特征建模、动态混合策略和混合损失函数,显著提升了不平衡表格数据中的分类性能,在多个现实数据集上超越现有先进方法。
English Summary: QCL-MixNet is a quantum-informed contrastive learning framework that enhances classification in imbalanced tabular data by integrating quantum-inspired feature modeling, dynamic mixup, and a hybrid loss function, achieving superior performance across multiple real-world datasets.
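A minimal sketch of the kNN-guided mixup component follows, assuming minority-class feature vectors are interpolated with their nearest same-class neighbors; the Beta coefficient and the same-class restriction are assumptions based on the abstract, and the quantum-inspired layer and hybrid loss are not modeled.

```python
# Minimal sketch of kNN-guided mixup on minority-class feature vectors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 8))           # minority-class features (toy)

nn = NearestNeighbors(n_neighbors=4).fit(X_min)
_, idx = nn.kneighbors(X_min)              # idx[:, 0] is the sample itself

mixed = []
for i in range(len(X_min)):
    j = rng.choice(idx[i, 1:])             # a random true neighbor
    lam = rng.beta(0.4, 0.4)               # mixup coefficient (assumed prior)
    mixed.append(lam * X_min[i] + (1 - lam) * X_min[j])
print(np.stack(mixed).shape)               # (20, 8)
```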
Authors:Yuwei Du, Jie Feng, Jian Yuan, Yong Li
Abstract:
Human mobility simulation plays a crucial role in various real-world applications. Recently, to address the limitations of traditional data-driven approaches, researchers have explored leveraging the commonsense knowledge and reasoning capabilities of large language models (LLMs) to accelerate human mobility simulation. However, these methods suffer from several critical shortcomings, including inadequate modeling of urban spaces and poor integration with both individual mobility patterns and collective mobility distributions. To address these challenges, we propose CAMS, a CityGPT-Powered Agentic framework for Mobility Simulation, which leverages a language-based urban foundation model to simulate human mobility in urban space. CAMS comprises three core modules: MobExtractor, which extracts template mobility patterns and synthesizes new ones based on user profiles; GeoGenerator, which generates anchor points informed by collective knowledge and produces candidate urban geospatial knowledge using an enhanced version of CityGPT; and TrajEnhancer, which retrieves spatial knowledge based on mobility patterns and generates trajectories aligned with real trajectory preferences via DPO. Experiments on real-world datasets show that CAMS achieves superior performance without relying on externally provided geospatial information. Moreover, by holistically modeling both individual mobility patterns and collective mobility constraints, CAMS generates more realistic and plausible trajectories. In general, CAMS establishes a new paradigm that integrates the agentic framework with urban-knowledgeable LLMs for human mobility simulation.
中文: 提出的CAMS框架利用具备城市知识的大语言模型,通过整合个体移动模式和群体分布来克服人类移动模拟的局限性,在不依赖外部地理空间数据的情况下实现了卓越性能。
English: The proposed CAMS framework leverages urban-knowledgeable large language models to overcome limitations in human mobility simulation by integrating individual patterns and collective distributions, achieving superior performance without external geospatial data.
Authors:Filippo Marostica, Alessio Carpegna, Alessandro Savino, Stefano Di Carlo
Abstract:
This paper presents a comprehensive evaluation of Spiking Neural Network (SNN) neuron models for hardware acceleration by comparing event-driven and clock-driven implementations. We begin our investigation in software, rapidly prototyping and testing various SNN models based on different variants of the Leaky Integrate-and-Fire (LIF) neuron across multiple datasets. This phase enables controlled performance assessment and informs design refinement. Our subsequent hardware phase, implemented on FPGA, validates the simulation findings and offers practical insights into design trade-offs. In particular, we examine how variations in input stimuli influence key performance metrics such as latency, power consumption, energy efficiency, and resource utilization. These results yield valuable guidelines for constructing energy-efficient, real-time neuromorphic systems. Overall, our work bridges software simulation and hardware realization, advancing the development of next-generation SNN accelerators.
中文: 本研究通过软件和FPGA平台比较事件驱动与时钟驱动的脉冲神经网络神经元模型,为设计高能效神经形态系统提供了实用指导。
English: This study evaluates Spiking Neural Network neuron models by comparing event-driven and clock-driven implementations in software and on FPGA, providing guidelines for designing energy-efficient neuromorphic systems.
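A minimal clock-driven sketch of the Leaky Integrate-and-Fire update evaluated in the paper: each timestep the membrane potential leaks, integrates input current, and emits a spike on crossing a threshold. Constants are illustrative; an event-driven implementation would instead update state only when input spikes arrive.

```python
# Minimal clock-driven LIF neuron simulation with illustrative constants.
import numpy as np

def lif_run(current, beta=0.9, threshold=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a train of input currents."""
    v, spikes, trace = v_reset, [], []
    for i in current:
        v = beta * v + i                 # leak + integrate
        if v >= threshold:               # fire
            spikes.append(1)
            v = v_reset                  # reset
        else:
            spikes.append(0)
        trace.append(v)
    return np.array(spikes), np.array(trace)

spikes, trace = lif_run(np.full(20, 0.3))
print("spike count:", spikes.sum())      # fires periodically
```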
Authors:Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Arun Balaji Buduru, Rajesh Sharma
Abstract:
A new class of audio deepfakes-codecfakes (CFs)-has recently caught attention, synthesized by Audio Language Models that leverage neural audio codecs (NACs) in the backend. In response, the community has introduced dedicated benchmarks and tailored detection strategies. As the field advances, efforts have moved beyond binary detection toward source attribution, including open-set attribution, which aims to identify the NAC responsible for generation and flag novel, unseen ones during inference. This shift toward source attribution improves forensic interpretability and accountability. However, open-set attribution remains fundamentally limited: while it can detect that a NAC is unfamiliar, it cannot characterize or identify individual unseen codecs. It treats such inputs as generic ``unknowns'', lacking insight into their internal configuration. This leads to major shortcomings: limited generalization to new NACs and inability to resolve fine-grained variations within NAC families. To address these gaps, we propose Neural Audio Codec Source Parsing (NACSP) - a paradigm shift that reframes source attribution for CFs as structured regression over generative NAC parameters such as quantizers, bandwidth, and sampling rate. We formulate NACSP as a multi-task regression problem for predicting these NAC parameters and establish the first comprehensive benchmark using various state-of-the-art speech pre-trained models (PTMs). To this end, we propose HYDRA, a novel framework that leverages hyperbolic geometry to disentangle complex latent properties from PTM representations. By employing task-specific attention over multiple curvature-aware hyperbolic subspaces, HYDRA enables superior multi-task generalization. Our extensive experiments show HYDRA achieves top results on benchmark CFs datasets compared to baselines operating in Euclidean space.
中文摘要:针对音频深度伪造检测的局限性,神经音频编解码器源解析(NACSP)新范式通过将源属性识别重构为对神经音频编解码器参数的结构化回归,并利用双曲几何的HYDRA框架实现了卓越的多任务泛化能力。
English Summary: A new paradigm called Neural Audio Codec Source Parsing (NACSP) addresses limitations in audio deepfake detection by reframing source attribution as structured regression over neural audio codec parameters, with the proposed HYDRA framework leveraging hyperbolic geometry to achieve superior multi-task generalization.
Authors:André Ferreira, Kunpeng Xie, Caroline Wilpert, Gustavo Correia, Felix Barajas Ordonez, Tiago Gil Oliveira, Maike Bode, Robert Siepmann, Frank Hölzle, Rainer Röhrig, Jens Kleesiek, Daniel Truhn, Jan Egger, Victor Alves, Behrus Puladi
Abstract:
AI requires extensive datasets, while medical data is subject to high data protection. Anonymization is essential, but poses a challenge for some regions, such as the head, as identifying structures overlap with regions of clinical interest. Synthetic data offers a potential solution, but studies often lack rigorous evaluation of realism and utility. Therefore, we investigate to what extent synthetic data can replace real data in segmentation tasks. We employed head and neck cancer CT scans and brain glioma MRI scans from two large datasets. Synthetic data were generated using generative adversarial networks and diffusion models. We evaluated the quality of the synthetic data using MAE, MS-SSIM, Radiomics and a Visual Turing Test (VTT) performed by 5 radiologists, and their usefulness in segmentation tasks using DSC. Radiomics indicates high fidelity of synthetic MRIs but falls short for CT, where the generative models struggle to produce highly realistic tissue, with correlation coefficients of 0.8784 and 0.5461 for MRI and CT tumors, respectively. DSC results indicate limited utility of synthetic data: tumor segmentation achieved DSC=0.064 on CT and 0.834 on MRI, while bone segmentation achieved a mean DSC=0.841. A relation between DSC and correlation is observed but is limited by the complexity of the task. VTT results show the utility of synthetic CTs, but with limited educational applications. Synthetic data can be used independently for the segmentation task, although limited by the complexity of the structures to segment. Advancing generative models to better tolerate heterogeneous inputs and learn subtle details is essential for enhancing their realism and expanding their application potential.
中文: 本研究评估了通过生成对抗网络和扩散模型生成的合成医学数据在分割任务中的应用,发现合成MRI数据具有较高的保真度和实用性,而合成CT数据在真实性和有效性方面表现不足,尤其对于肿瘤等复杂结构。
English: This study evaluates synthetic medical data generated by GANs and diffusion models for segmentation tasks, finding that while synthetic MRI data shows high fidelity and utility, synthetic CT data falls short in realism and effectiveness, particularly for complex structures like tumors.
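For reference, the Dice similarity coefficient (DSC) used to score segmentation utility above is straightforward to compute on binary masks; the toy masks below are illustrative.

```python
# Minimal sketch of the Dice similarity coefficient on binary masks.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

gt = np.zeros((64, 64)); gt[20:40, 20:40] = 1
pred = np.zeros((64, 64)); pred[25:45, 25:45] = 1
print(round(dice(pred, gt), 3))   # partial overlap of two shifted squares
```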
Authors:Junyu Liu, Kaiqi Yan, Tianyang Wang, Qian Niu, Momoko Nagai-Tanima, Tomoki Aoyama
Abstract:
Recent advances in large language models (LLMs) have demonstrated notable performance in medical licensing exams. However, comprehensive evaluation of LLMs across various healthcare roles, particularly in high-stakes clinical scenarios, remains a challenge. Existing benchmarks are typically text-based, English-centric, and focus primarily on medicine, which limits their ability to assess broader healthcare knowledge and multimodal reasoning. To address these gaps, we introduce KokushiMD-10, the first multimodal benchmark constructed from ten Japanese national healthcare licensing exams. This benchmark spans multiple fields, including Medicine, Dentistry, Nursing, Pharmacy, and allied health professions. It contains over 11,588 real exam questions, incorporating clinical images and expert-annotated rationales to evaluate both textual and visual reasoning. We benchmark over 30 state-of-the-art LLMs, including GPT-4o, Claude 3.5, and Gemini, across both text and image-based settings. Despite promising results, no model consistently meets passing thresholds across domains, highlighting the ongoing challenges in medical AI. KokushiMD-10 provides a comprehensive and linguistically grounded resource for evaluating and advancing reasoning-centric medical AI across multilingual and multimodal clinical tasks.
中文: KokushiMD-10是基于十项日本国家医疗执照考试构建的首个多模态基准,评估了30多种先进大语言模型在多个医疗领域的表现,结果显示无模型能稳定通过,凸显了医疗AI持续面临的挑战。
English: KokushiMD-10 is a multimodal benchmark derived from ten Japanese healthcare licensing exams, evaluating over 30 advanced LLMs across multiple medical fields and revealing that none consistently pass, highlighting persistent challenges in medical AI.
Authors:Suhan Guo, Zhenghao Xu, Furao Shen, Jian Zhao
Abstract:
Accurate prediction of contagious disease outbreaks is vital for informed decision-making. Our study addresses the gap between machine learning algorithms and their epidemiological applications, noting that methods optimal for benchmark datasets often underperform with real-world data due to difficulties in incorporating mobility information. We adopt a two-phase approach: first, assessing the significance of mobility data through a pilot study, then evaluating the impact of Graph Convolutional Networks (GCNs) on a transformer backbone. Our findings reveal that while mobility data and GCN modules do not significantly enhance forecasting performance, the inclusion of mortality and hospitalization data markedly improves model accuracy. Additionally, a comparative analysis between GCN-derived spatial maps and lockdown orders suggests a notable correlation, highlighting the potential of spatial maps as sensitive indicators for mobility. Our research offers a novel perspective on mobility representation in predictive modeling for contagious diseases, empowering decision-makers to better prepare for future outbreaks.
中文摘要:本研究发现,引入死亡率和住院数据能显著提高传染病预测准确性,而移动性数据和图卷积网络虽对预测改善有限,但揭示了空间地图作为移动性敏感指标的潜力,为传染病预测模型提供了新视角。
English Summary: This study finds that incorporating mortality and hospitalization data significantly improves contagious disease forecasting accuracy, while mobility data and Graph Convolutional Networks show limited impact but reveal spatial maps' potential as sensitive mobility indicators.
Authors:Yu Huang, Zelin Peng, Yichen Zhao, Piao Yang, Xiaokang Yang, Wei Shen
Abstract:
Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also producing corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R's superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.
中文: 本文提出MedSeg-R端到端框架,利用多模态大语言模型解析复杂临床指令并生成精确的医学图像分割掩码,同时发布了包含上万样本的MedSeg-QA数据集,实验证明该方法在分割精度和可解释性方面均优于现有基准。
English: This paper introduces MedSeg-R, an end-to-end framework that leverages multimodal large language models to interpret complex clinical instructions and generate precise segmentation masks for medical images, addressing the limitations of existing models in active reasoning and mask generation.
Authors:Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun, Junyu Lu, Linbo Zhu
Abstract:
Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. More specifically, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% in accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, with a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
Chinese: 本文提出了一种基于小波变换的多模态意图识别框架,通过频域分析非语言信息提升意图理解能力,在MIntRec数据集上以1.13%的准确率优势实现了最优性能。
English: This paper introduces a Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal cues, achieving state-of-the-art performance with a 1.13% accuracy improvement on MIntRec.
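A minimal sketch of the wavelet-driven fusion idea, assuming time-major per-modality feature matrices and a Haar wavelet; the function name, the simple band-averaging rule, and the shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
import pywt

def wavelet_fuse(video_feats: np.ndarray, audio_feats: np.ndarray) -> np.ndarray:
    """Decompose per-modality feature sequences with a 1-D Haar DWT along time,
    fuse the approximation (low-frequency) and detail (high-frequency) bands
    across modalities, then reconstruct a fused sequence."""
    # video_feats, audio_feats: (T, D) time-major feature matrices
    v_lo, v_hi = pywt.dwt(video_feats, "haar", axis=0)
    a_lo, a_hi = pywt.dwt(audio_feats, "haar", axis=0)
    fused_lo = (v_lo + a_lo) / 2.0          # slow temporal dynamics
    fused_hi = (v_hi + a_hi) / 2.0          # fine-grained transient cues
    return pywt.idwt(fused_lo, fused_hi, "haar", axis=0)

video = np.random.randn(32, 64)   # 32 frames, 64-d visual features
audio = np.random.randn(32, 64)   # aligned audio features
print(wavelet_fuse(video, audio).shape)   # (32, 64)
```

Fusing the approximation and detail bands separately lets slow dynamics and transient cues interact at matched temporal scales, which is the intuition behind frequency-domain fusion.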
Authors:Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, Li Chen
Abstract:
How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.
Chinese: 本研究提出ReSim可控世界模型,通过融合真实驾驶数据与模拟器中的多样化非专家行为,显著提升了驾驶场景仿真的真实性和可控性,适用于专家及危险驾驶行为的模拟。
English: This work introduces ReSim, a controllable world model that enhances driving scenario simulation by integrating real-world data with diverse non-expert behaviors from simulators, improving fidelity and controllability for both expert and hazardous actions.
Authors:Yuxin Chen, Yiran Zhao, Yang Zhang, An Zhang, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Tat-Seng Chua, Michael Qizhe Shieh, Wenxuan Zhang
Abstract:
As large language models (LLMs) continue to advance, their capacity to function effectively across a diverse range of languages has shown marked improvement. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. This has led to the widespread assumption that LLMs may "think" in English. However, more recent results showing strong multilingual performance, even surpassing English performance on specific tasks in other languages, challenge this view. In this work, we find that LLMs progressively develop a core language-agnostic parameter space -- a remarkably small subset of parameters whose deactivation results in significant performance degradation across all languages. This compact yet critical set of parameters underlies the model's ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system. Specifically, we identify language-related neurons -- those consistently activated during the processing of particular languages -- and categorize them as either shared (active across multiple languages) or exclusive (specific to one). As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence. These shared neurons constitute the backbone of the core language-agnostic parameter space, supporting the emergence of abstract thought. Motivated by these insights, we propose neuron-specific training strategies tailored to LLMs' language-agnostic levels at different development stages. Experiments across diverse LLM families support our approach.
中文摘要:大型语言模型会发展出一个核心的语言无关参数空间,该空间支持超越具体语言的抽象思维形成,随着模型发展,共享神经元的重要性日益增强而语言专属神经元逐渐弱化。
English Summary: Large language models develop a core language-agnostic parameter space that enables abstract thought beyond specific languages, with shared neurons gaining prominence during development while language-specific neurons diminish in influence.
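A hedged sketch of how shared versus exclusive language-related neurons could be identified from per-language activations; the firing threshold and counting rule are assumptions, not the paper's procedure:

```python
import numpy as np

def categorize_neurons(acts_by_lang: dict, fire_thresh=0.5, share_min=2):
    """acts_by_lang maps language -> (num_tokens, num_neurons) activations.
    A neuron is 'related' to a language if it fires (>0) on more than
    fire_thresh of that language's tokens; related to >= share_min
    languages -> shared, exactly one -> exclusive."""
    related = {
        lang: (acts > 0).mean(axis=0) > fire_thresh
        for lang, acts in acts_by_lang.items()
    }
    counts = np.sum(list(related.values()), axis=0)
    shared = np.where(counts >= share_min)[0]
    exclusive = np.where(counts == 1)[0]
    return shared, exclusive

acts = {lang: np.random.randn(1000, 4096) for lang in ["en", "zh", "fr"]}
shared, exclusive = categorize_neurons(acts)
print(len(shared), len(exclusive))
```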
Authors:Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, Chen Lv
Abstract:
End-to-end autonomous driving has emerged as a promising paradigm for directly mapping sensor inputs to planning maneuvers using learning-based modular integrations. However, existing imitation learning (IL)-based models generalize poorly to hard cases and lack a corrective feedback loop after deployment. While reinforcement learning (RL) offers a potential solution to tackle hard cases with optimality, it is often hindered by overfitting to specific driving cases, resulting in catastrophic forgetting of generalizable knowledge and sample inefficiency. To overcome these challenges, we propose Reinforced Refinement with Self-aware Expansion (R2SE), a novel learning pipeline that constantly refines the hard-case domain while keeping a generalizable driving policy for model-agnostic end-to-end driving systems. Through reinforcement fine-tuning and policy expansion that facilitate continuous improvement, R2SE features three key components: 1) Generalist Pretraining with hard-case allocation trains a generalist imitation learning (IL) driving system while dynamically identifying failure-prone cases for targeted refinement; 2) Residual Reinforced Specialist Fine-tuning optimizes residual corrections using reinforcement learning (RL) to improve performance in the hard-case domain while preserving global driving knowledge; 3) Self-aware Adapter Expansion dynamically integrates specialist policies back into the generalist model, enhancing continuous performance improvement. Experimental results in closed-loop simulation and on real-world datasets demonstrate improvements in generalization, safety, and long-horizon policy robustness over state-of-the-art E2E systems, highlighting the effectiveness of reinforced refinement for scalable autonomous driving.
中文摘要:提出的R2SE框架通过结合模仿学习的通用策略与强化学习的困难场景处理,采用持续优化机制提升自动驾驶系统的安全性和泛化能力。
English Summary: The proposed R2SE framework enhances autonomous driving by combining imitation learning for general policies with reinforcement learning for hard cases, using continuous refinement to improve safety and generalization.
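The core composition, a generalist policy plus a gated residual correction, can be sketched as follows; all callables, shapes, and the gating behavior are illustrative assumptions rather than the authors' system:

```python
import torch

def r2se_action(generalist, residual_specialist, gate, obs):
    """Compose the generalist IL policy with an RL-trained residual correction,
    gated by a self-aware adapter score in [0, 1]. A schematic sketch of
    R2SE's policy expansion, not the actual pipeline."""
    base = generalist(obs)                 # generalizable maneuver
    corr = residual_specialist(obs)        # hard-case refinement
    g = gate(obs)                          # ~1 on failure-prone cases
    return base + g * corr

obs = torch.randn(8, 16)
act = r2se_action(lambda o: o @ torch.randn(16, 2),
                  lambda o: 0.1 * (o @ torch.randn(16, 2)),
                  lambda o: torch.sigmoid(o.mean(dim=1, keepdim=True)),
                  obs)
print(act.shape)   # torch.Size([8, 2])
```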
Authors:Zongjie Li, Shuai Wang
Abstract:
This position paper proposes a fundamental shift in designing code generation models: treating reasoning depth as a controllable resource. Rather than being an incidental byproduct of prompting, we argue that the trade-off between rapid, direct answers ("fast thinking") and elaborate, chain-of-thought deliberation ("slow thinking") must be explicitly managed. We contend that optimizing reasoning budgets across the entire model lifecycle - from synthetic data creation and benchmarking to real-world deployment - can unlock superior trade-offs among accuracy, latency, and cost. This paper outlines how adaptive control over reasoning can enrich supervision signals, motivate new multi-dimensional benchmarks, and inform cost-aware, security-conscious deployment policies. By viewing fast and slow thinking as complementary modes to be scheduled, we envision coding agents that think deep when necessary and act fast when possible.
中文摘要:本立场文件主张将推理深度作为代码生成模型的可控资源,通过自适应切换快速直接响应与精细思维链推理,在整个模型生命周期中优化准确性、延迟和成本的平衡。
English Summary: This position paper advocates for treating reasoning depth as a controllable resource in code generation models, enabling adaptive switching between fast direct responses and elaborate chain-of-thought reasoning to optimize accuracy, latency, and cost throughout the model lifecycle.
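A toy illustration of reasoning depth as a scheduled resource; the difficulty heuristic, token budgets, and prompt suffix are assumptions made for the example:

```python
def schedule_reasoning(prompt: str, est_difficulty: float, budget_tokens: int):
    """Route easy queries to a direct 'fast' completion and hard ones to
    chain-of-thought 'slow' mode under a token budget. A sketch of the
    position, not a prescribed controller."""
    if est_difficulty < 0.3 or budget_tokens < 256:
        return {"prompt": prompt, "max_tokens": 128}                # fast path
    cot = prompt + "\nLet's reason step by step before the final code."
    return {"prompt": cot, "max_tokens": min(budget_tokens, 2048)}  # slow path

print(schedule_reasoning("Write a function to parse dates.", 0.2, 1024))
```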
Authors:Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, Tat-Seng Chua
Abstract:
Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts, posing significant risks to users and society. To safeguard against the risk of policy-violating content, system-level moderation via external guard models-designed to monitor LLM inputs and outputs and block potentially harmful content-has emerged as a prevalent mitigation strategy. Existing approaches of training guard models rely heavily on extensive human curated datasets and struggle with out-of-distribution threats, such as emerging harmful categories or jailbreak attacks. To address these limitations, we propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies. RSafe operates in two stages: 1) guided reasoning, where it analyzes safety risks of input content through policy-guided step-by-step reasoning, and 2) reinforced alignment, where rule-based RL optimizes its reasoning paths to align with accurate safety prediction. This two-stage training paradigm enables RSafe to internalize safety principles to generalize safety protection capability over unseen or adversarial safety violation scenarios. During inference, RSafe accepts user-specified safety policies to provide enhanced safeguards tailored to specific safety requirements.
中文摘要:尽管进行了安全对齐,大语言模型仍存在漏洞,因此提出了RSafe——一种基于推理的安全防护机制,通过策略引导分析和强化对齐来针对未知威胁提供自适应保护。
English Summary: Despite safety alignment efforts, Large Language Models remain vulnerable, prompting the development of RSafe—a reasoning-based safeguard that uses policy-guided analysis and reinforced alignment to provide adaptive protection against emerging threats.
Authors:Weiqi Yan, Lvhai Chen, Huaijia Kou, Shengchuan Zhang, Yan Zhang, Liujuan Cao
Abstract:
Unsupervised Camouflaged Object Detection (UCOD) has gained attention since it does not rely on extensive pixel-level labels. Existing UCOD methods typically generate pseudo-labels using fixed strategies and train 1x1 convolutional layers as a simple decoder, leading to low performance compared to fully-supervised methods. We emphasize two drawbacks in these approaches: 1) the model is prone to fitting incorrect knowledge because the pseudo-labels contain substantial noise; 2) the simple decoder fails to capture and learn the semantic features of camouflaged objects, especially small-sized ones, due to the low-resolution pseudo-labels and severe confusion between foreground and background pixels. To this end, we propose a UCOD method with a teacher-student framework via Dynamic Pseudo-label Learning, called UCOD-DPL, which contains an Adaptive Pseudo-label Module (APM), a Dual-Branch Adversarial (DBA) decoder, and a Look-Twice mechanism. The APM adaptively combines pseudo-labels generated by fixed strategies and the teacher model, preventing the model from overfitting incorrect knowledge while preserving its ability for self-correction; the DBA decoder applies adversarial learning over different segmentation objectives, guiding the model to overcome the foreground-background confusion of camouflaged objects; and the Look-Twice mechanism mimics the human tendency to zoom in on camouflaged objects, performing secondary refinement on small-sized objects. Extensive experiments show that our method demonstrates outstanding performance, even surpassing some existing fully supervised methods. The code is available now.
中文: 提出的UCOD-DPL方法通过自适应伪标签学习和双分支解码器解决了无监督伪装目标检测中的关键缺陷,其性能甚至超越了部分全监督方法。
English: The proposed UCOD-DPL method addresses limitations in unsupervised camouflaged object detection by introducing adaptive pseudo-label learning and a dual-branch adversarial decoder, even surpassing some existing fully supervised methods.
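A hedged sketch of the adaptive pseudo-label combination in APM, assuming a linear trust schedule toward the teacher; this is a schematic reading, not the authors' module:

```python
import torch

def adaptive_pseudo_label(fixed_pl, teacher_prob, epoch, total_epochs):
    """Blend a fixed-strategy pseudo-label with the teacher's prediction,
    shifting trust toward the teacher as training progresses (the linear
    schedule and 0.5 binarization threshold are assumptions)."""
    alpha = epoch / total_epochs
    blended = (1.0 - alpha) * fixed_pl + alpha * teacher_prob
    return (blended > 0.5).float()          # binary camouflage mask

fixed = torch.rand(1, 1, 64, 64)            # pseudo-label from a fixed strategy
teacher = torch.rand(1, 1, 64, 64)          # teacher model probabilities
print(adaptive_pseudo_label(fixed, teacher, epoch=30, total_epochs=100).shape)
```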
Authors:Gijs Luijten, Lisle Faray de Paiva, Sebastian Krueger, Alexander Brost, Laura Mazilescu, Ana Sofia Ferreira Santos, Peter Hoyer, Jens Kleesiek, Sophia Marie-Therese Schmitz, Ulf Peter Neumann, Jan Egger
Abstract:
As one of the first research teams with full access to Siemens' Cinematic Reality, we evaluate its usability and clinical potential for cinematic volume rendering on the Apple Vision Pro. We visualized venous-phase liver computed tomography and magnetic resonance cholangiopancreatography scans from the CHAOS and MRCP_DLRecon datasets. Fourteen medical experts assessed usability and anticipated clinical integration potential using the System Usability Scale, ISONORM 9242-110-S questionnaire, and an open-ended survey. Their feedback identified feasibility, key usability strengths, and required features to catalyze adoption in real-world clinical workflows. The findings provide insights into the potential of immersive cinematic rendering in medical imaging.
中文摘要:我们团队评估了西门子电影级现实技术在苹果Vision Pro上的医学影像应用,专家反馈表明其具备临床可行性、显著可用性优势及特定功能需求,以推动实际工作流程整合。
English Summary: Our team evaluated Siemens' Cinematic Reality on Apple Vision Pro for medical imaging, finding it feasible with strong usability and specific feature needs for clinical integration based on expert feedback.
Authors:Lisle Faray de Paiva, Gijs Luijten, Ana Sofia Ferreira Santos, Moon Kim, Behrus Puladi, Jens Kleesiek, Jan Egger
Abstract:
Medical imaging segmentation is essential in clinical settings for diagnosing diseases, planning surgeries, and other procedures. However, manual annotation is a cumbersome and labor-intensive task. To address these issues, this study implements and evaluates the usability and clinical applicability of an extended reality (XR)-based segmentation tool for anatomical CT scans, using the Meta Quest 3 headset and Logitech MX Ink stylus. We develop an immersive interface enabling real-time interaction with 2D and 3D medical imaging data in a customizable workspace, designed to mitigate the workflow fragmentation and cognitive demands inherent to conventional manual segmentation tools. The platform combines stylus-driven annotation, mirroring traditional pen-on-paper workflows, with instant 3D volumetric rendering. A user study with a public craniofacial CT dataset demonstrated the tool's foundational viability, achieving a System Usability Scale (SUS) score of 66, within the expected range for medical applications. Participants highlighted the system's intuitive controls (scoring 4.1/5 for self-descriptiveness on ISONORM metrics) and spatial interaction design, with qualitative feedback noting strengths in hybrid 2D/3D navigation and realistic stylus ergonomics. While users identified opportunities to enhance task-specific precision and error management, the platform's core workflow enabled dynamic slice adjustment, reducing cognitive load compared to desktop tools. The results position the XR-stylus paradigm as a promising foundation for immersive segmentation tools, with iterative refinements targeting haptic feedback calibration and workflow personalization to advance adoption in preoperative planning.
中文: 本研究开发了一款基于Meta Quest 3和罗技手写笔的扩展现实医疗影像分割工具,用户测试表明其空间交互直观且能降低认知负荷,但在精度控制和错误处理方面仍需完善。
English: This study develops an extended reality tool using Meta Quest 3 and Logitech stylus for medical CT scan segmentation, demonstrating through user testing its intuitive spatial interaction and reduced cognitive load despite needing improvements in precision and error handling.
Authors:Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, Xiangliang Zhang
Abstract:
Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study on the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Further, our representation-level probing shows that fine-tuning primarily improves reasoning behaviors through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.
中文摘要:FineLogic提出了一种细粒度评估框架,超越最终答案准确性来评估大语言模型的逻辑推理能力,研究发现自然语言监督增强泛化能力而符号监督优化推理结构,且微调主要改进逐步生成过程而非早期答案收敛。
English Summary: FineLogic introduces a fine-grained framework to evaluate LLMs' logical reasoning beyond final-answer accuracy, revealing that natural language supervision enhances generalization while symbolic supervision improves structural reasoning, with fine-tuning primarily refining step-by-step generation processes.
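Stepwise soundness can be illustrated as the fraction of steps entailed by the premises plus all earlier steps; the verifier callable below is an assumed stand-in (e.g., an NLI model or symbolic checker), not the paper's exact metric:

```python
def stepwise_soundness(premises, steps, entails):
    """Score a reasoning chain as the fraction of steps entailed by the
    premises plus all preceding steps. entails(context, step) -> bool is an
    assumed verifier interface."""
    context, sound = list(premises), 0
    for step in steps:
        sound += bool(entails(context, step))
        context.append(step)
    return sound / max(len(steps), 1)

# Toy verifier: a step counts as "entailed" if it reuses a context word.
toy = lambda ctx, s: any(word in s for c in ctx for word in c.split())
print(stepwise_soundness(["all cats are animals"],
                         ["cats are animals", "so the claim holds"], toy))
```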
Authors:Zhiyu Zhang, Wei Chen, Youfang Lin, Huaiyu Wan
Abstract:
Recent Continual Learning (CL)-based Temporal Knowledge Graph Reasoning (TKGR) methods focus on significantly reducing computational cost and mitigating the catastrophic forgetting caused by fine-tuning models with new data. However, existing CL-based TKGR methods still face two key limitations: (1) they typically reorganize individual historical facts in isolation, overlooking the historical context essential for accurately understanding the semantics of these facts; (2) they preserve historical knowledge by simply replaying historical facts, ignoring potential conflicts between historical and emerging facts. In this paper, we propose a Deep Generative Adaptive Replay (DGAR) method, which can generate and adaptively replay historical entity distribution representations from the whole historical context. To address the first challenge, historical context prompts are built as sampling units to preserve the full historical context information. To overcome the second challenge, a pre-trained diffusion model is adopted to generate the historical distribution. During the generation process, the common features between the historical and current distributions are enhanced under the guidance of the TKGR model. In addition, a layer-by-layer adaptive replay mechanism is designed to effectively integrate historical and current distributions. Experimental results demonstrate that DGAR significantly outperforms baselines in reasoning and mitigating forgetting.
中文摘要:本文提出的深度生成自适应回放(DGAR)方法通过生成历史上下文表征并采用自适应回放机制,有效解决了时序知识图谱推理中现有持续学习方法忽视历史语义和知识冲突的问题,显著提升了推理性能并缓解了灾难性遗忘。
English Summary: The proposed Deep Generative Adaptive Replay (DGAR) method addresses limitations in existing continual learning approaches for temporal knowledge graph reasoning by generating historical context representations and implementing adaptive replay to enhance reasoning accuracy while reducing catastrophic forgetting.
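The layer-by-layer adaptive replay can be sketched as a gated blend of generated historical representations with current ones; the diffusion-based generation is assumed to happen upstream, and the per-layer gates are illustrative, not the authors' mechanism:

```python
import torch

def adaptive_replay(hist_reps, curr_reps, gates):
    """Layer-by-layer blend of generated historical entity-distribution
    representations with current ones, one gate per layer (gates would be
    learned in practice; random here for the demo)."""
    return [g * h + (1.0 - g) * c
            for g, h, c in zip(gates, hist_reps, curr_reps)]

layers = 4
hist = [torch.randn(100, 128) for _ in range(layers)]   # generated historical
curr = [torch.randn(100, 128) for _ in range(layers)]   # current distributions
gates = torch.sigmoid(torch.randn(layers))
print(len(adaptive_replay(hist, curr, gates)))          # 4 blended layers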
Authors:Pedro Henrique Martins, João Alves, Patrick Fernandes, Nuno M. Guerreiro, Ricardo Rei, Amin Farajian, Mateusz Klimaszewski, Duarte M. Alves, José Pombal, Nicolas Boizard, Manuel Faysse, Pierre Colombo, François Yvon, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins
Abstract:
This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.
中文摘要:EuroLLM-9B是专为欧洲语言需求开发的大语言模型,全面覆盖24种欧盟官方语言及11种附加语言,通过创新的数据过滤和合成训练方法,在多语言基准测试中展现出领先性能,成为当前最具代表性的欧洲开源语言模型。
English Summary: EuroLLM-9B is a European-developed large language model that comprehensively supports all 24 official EU languages and 11 additional languages, addressing the underrepresentation of European languages in existing models while demonstrating competitive performance in multilingual benchmarks.
Authors:Kamer Cekini, Elisabetta Biondi, Chiara Boldrini, Andrea Passarella, Marco Conti
Abstract:
Lockdown measures, implemented by governments during the initial phases of the COVID-19 pandemic to reduce physical contact and limit viral spread, imposed significant restrictions on in-person social interactions. Consequently, individuals turned to online social platforms to maintain connections. Ego networks, which model the organization of personal relationships according to human cognitive constraints on managing meaningful interactions, provide a framework for analyzing such dynamics. The disruption of physical contact and the predominant shift of social life online potentially altered the allocation of cognitive resources dedicated to managing these digital relationships. This research aims to investigate the impact of lockdown measures on the characteristics of online ego networks, presumably resulting from this reallocation of cognitive resources. To this end, we analyze a large Twitter dataset covering a seven-year period of activity -- five years pre-pandemic and two years post -- and observe clear, though temporary, changes. During lockdown, ego networks expanded, social circles became more structured, and relationships intensified. Simultaneously, negative interactions increased, and users engaged with a broader range of topics, indicating greater thematic diversity. Once restrictions were lifted, these structural, emotional, and thematic shifts largely reverted to pre-pandemic norms -- suggesting a temporary adaptation to an extraordinary social context.
中文: 疫情期间的封锁措施暂时改变了在线自我网络,导致其扩张、结构增强、关系强化及主题多样性增加,但在限制解除后基本恢复至疫情前常态。
English: Lockdown measures during the COVID-19 pandemic temporarily altered online ego networks, causing expansion, increased structure, intensified relationships, and greater thematic diversity, which largely reverted to pre-pandemic norms after restrictions eased.
Authors:Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
Abstract:
As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content like violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in minimal frames to evade detection. While prior research has focused on visual cues and advanced such fine-grained detection, audio features remain underexplored. In this study, we combine audio cues with visual ones for fine-grained child harmful content detection and introduce SNIFR, a novel framework for effective alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state-of-the-art.
Chinese: 本研究提出SNIFR框架,通过融合音频和视觉线索对儿童视频中的有害内容进行细粒度检测,借助跨模态对齐实现了最优性能。
English: This study introduces SNIFR, a novel framework that integrates audio and visual cues for fine-grained detection of harmful content in children's videos, achieving state-of-the-art performance through effective cross-modal alignment.
Authors:Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
Abstract:
In this work, we introduce the task of singing voice deepfake source attribution (SVDSA). We hypothesize that multimodal foundation models (MMFMs) such as ImageBind and LanguageBind will be most effective for SVDSA, as their cross-modality pre-training better equips them to capture subtle source-specific characteristics -- such as unique timbre, pitch manipulation, or synthesis artifacts of each singing voice deepfake source. Our experiments with MMFMs, speech foundation models, and music foundation models verify the hypothesis that MMFMs are the most effective for SVDSA. Furthermore, inspired by related research, we also explore fusion of foundation models (FMs) for improved SVDSA. To this end, we propose a novel framework, COFFE, which employs the Chernoff distance as a novel loss function for effective fusion of FMs. Through COFFE with a symphony of MMFMs, we attain the best performance in comparison to all individual FMs and baseline fusion methods.
中文摘要:本研究提出歌唱声音深度伪造来源识别任务,验证了多模态基础模型对此任务最为有效,并通过提出的COFFE框架实现了最优性能。
English Summary: This study introduces singing voice deepfake source attribution (SVDSA) and demonstrates that multimodal foundation models (MMFMs) are most effective for this task, with the proposed COFFE framework achieving state-of-the-art performance through novel fusion of foundation models.
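The Chernoff distance named by COFFE has a simple closed form for categorical distributions; the sketch below applies it between two fused-branch posteriors, with alpha and the softmax inputs as assumptions (the authors' exact formulation may differ):

```python
import torch

def chernoff_distance(p, q, alpha=0.5, eps=1e-8):
    """Chernoff distance -log sum_i p_i^a * q_i^(1-a) between categorical
    distributions; a=0.5 recovers the Bhattacharyya distance."""
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    coeff = torch.sum(p ** alpha * q ** (1.0 - alpha), dim=-1)
    return -torch.log(coeff).mean()

p = torch.softmax(torch.randn(8, 10), dim=-1)   # posteriors from one FM branch
q = torch.softmax(torch.randn(8, 10), dim=-1)   # posteriors from another
print(chernoff_distance(p, q))
```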
Authors:Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
Abstract:
LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
中文: LLM-as-a-judge框架提出定量LLM评判器,通过回归模型将自动评估分数与人类评分对齐,在有限人工反馈下实现更高的计算和统计效率。
English: The LLM-as-a-judge framework introduces quantitative LLM judges that use regression models to align automated evaluation scores with human judgments, improving computational and statistical efficiency without extensive fine-tuning.
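The post-hoc alignment can be illustrated with a ridge regressor over the judge's numeric score plus an embedding of its textual critique; the feature choice and regressor are assumptions consistent with the abstract, and synthetic data stands in for real judge outputs:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
judge_score = rng.uniform(1, 5, size=(200, 1))         # judge's numeric rating
critique_emb = rng.normal(size=(200, 32))              # embedded textual eval
human = judge_score[:, 0] + rng.normal(0, 0.3, 200)    # human scores (target)

X = np.hstack([judge_score, critique_emb])             # features per example
model = Ridge(alpha=1.0).fit(X[:150], human[:150])     # post-hoc alignment
print("held-out R^2:", round(model.score(X[150:], human[150:]), 3))
```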
Authors:Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, Xiuying Chen
Abstract:
Web agents for online shopping have shown great promise in automating user interactions across e-commerce platforms. Benchmarks for assessing such agents do not reflect the complexity of real-world shopping scenarios, as they often consist of overly simple queries with deterministic paths, such as "Find iPhone 15." Real shopping scenarios are inherently more layered, involving multi-dimensional product attributes, search filters, and user-specific sorting preferences. To address this gap, we introduce DeepShop, a benchmark designed to evaluate web agents in complex and realistic online shopping environments. DeepShop comprises three key components. (1) Query diversity evolution: Starting from real user queries, we generate diverse queries across five popular online shopping domains. (2) Query complexity evolution: We further evolve these queries to increase complexity, considering product attributes, search filters, and sorting preferences, and classify them into three levels: easy, medium, and hard, based on the number of evolutions. (3) Fine-grained and holistic evaluation: We propose an automated evaluation framework that assesses agent performance in terms of fine-grained aspects (product attributes, search filters, and sorting preferences) and reports the overall success rate through holistic evaluation. We conduct a systematic evaluation of retrieval-augmented generation (RAG) methods, web agents, and deep research systems. Results show that RAG struggles with complex queries due to its lack of web interaction, while other methods face significant challenges with filters and sorting preferences, leading to low overall success rates. We also perform cross-category, complexity-based evaluations and error analyses to support the advancement of deep research shopping agents.
中文: DeepShop基准通过生成多样化和多层级的查询,并采用自动细粒度和整体评估框架,旨在复杂在线购物环境中评测网络代理,揭示了现有方法在处理现实购物任务中的显著不足。
English: The DeepShop benchmark is introduced to evaluate web agents in complex online shopping environments by generating diverse and multi-layered queries with automated fine-grained and holistic assessments, revealing current methods' limitations in handling realistic shopping tasks.
Authors:Xiaochong Lan, Jie Feng, Jiahuan Lei, Xinlei Shi, Yong Li
Abstract:
Large language models (LLMs) have exhibited remarkable capabilities and achieved significant breakthroughs across various domains, leading to their widespread adoption in recent years. Building on this progress, we investigate their potential in the realm of local life services. In this study, we establish a comprehensive benchmark and systematically evaluate the performance of diverse LLMs across a wide range of tasks relevant to local life services. To further enhance their effectiveness, we explore two key approaches: model fine-tuning and agent-based workflows. Our findings reveal that even a relatively compact 7B model can attain performance levels comparable to a much larger 72B model, effectively balancing inference cost and model capability. This optimization greatly enhances the feasibility and efficiency of deploying LLMs in real-world online services, making them more practical and accessible for local life applications.
中文: 大语言模型在本地生活服务中展现出巨大潜力,通过微调和智能体工作流,较小的7B模型即可媲美72B大模型的性能,有效平衡成本与能力,提升实际应用可行性。
English: Large language models demonstrate strong potential in local life services, where even a smaller 7B model can match the performance of a much larger 72B model through fine-tuning and agent-based workflows, optimizing cost and capability for practical deployment.
Authors:Mohd Mujtaba Akhtar, Orchid Chetia Phukan, Girish, Swarup Ranjan Behera, Ananda Chandra Nayak, Sanjib Kumar Nayak, Arun Balaji Buduru, Rajesh Sharma
Abstract:
In this work, we focus on non-verbal vocal sounds emotion recognition (NVER). We investigate Mamba-based audio foundation models (MAFMs) for the first time for NVER and hypothesize that MAFMs will outperform attention-based audio foundation models (AAFMs) for NVER by leveraging their state-space modeling to capture intrinsic emotional structures more effectively. Unlike AAFMs, which may amplify irrelevant patterns due to their attention mechanisms, MAFMs extract more stable and context-aware representations, enabling better differentiation of subtle non-verbal emotional cues. Our experiments with state-of-the-art (SOTA) AAFMs and MAFMs validate our hypothesis. Further, motivated by related research in areas such as speech emotion recognition and synthetic speech detection, where fusion of foundation models (FMs) has shown improved performance, we also explore fusion of FMs for NVER. To this end, we propose RENO, which uses the Rényi divergence as a novel loss function for effective alignment of the FMs. It also makes use of self-attention for better intra-representation interaction of the FMs. With RENO, through the heterogeneous fusion of MAFMs and AAFMs, we achieve the best performance in comparison to individual FMs and their fusion, setting a new SOTA relative to previous work.
中文摘要:本研究首次将基于Mamba的音频基础模型应用于非语言情感识别,验证其优于基于注意力的模型,并提出融合框架RENO,通过新型损失函数整合两类模型实现了最优性能。
English Summary: This study introduces MAFMs for non-verbal emotion recognition, demonstrating their superiority over AAFMs through better capture of emotional structures and proposes RENO, a fusion framework achieving state-of-the-art performance by combining both models with a novel loss function.
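The Rényi-divergence alignment term also has a closed form for categorical distributions; a hedged sketch, with alpha and the softmax normalization of branch outputs as assumptions:

```python
import torch

def renyi_divergence(p, q, alpha=0.7, eps=1e-8):
    """Renyi divergence D_a(p||q) = log(sum_i p_i^a * q_i^(1-a)) / (a - 1)
    between categorical distributions, used here as an alignment objective
    between two foundation-model branches."""
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    return (torch.log(torch.sum(p ** alpha * q ** (1.0 - alpha), dim=-1))
            / (alpha - 1.0)).mean()

mafm = torch.softmax(torch.randn(8, 64), dim=-1)   # Mamba-based branch
aafm = torch.softmax(torch.randn(8, 64), dim=-1)   # attention-based branch
print(renyi_divergence(mafm, aafm))
```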
Authors:Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
Abstract:
In this study, we focus on Singing Voice Mean Opinion Score (SingMOS) prediction. Previous research has shown the performance benefit of using state-of-the-art (SOTA) pre-trained models (PTMs). However, speaker recognition speech PTMs (SPTMs) such as x-vector and ECAPA have not been explored, and we hypothesize that they will be the most effective for SingMOS prediction. We believe that their speaker recognition pre-training equips them to capture fine-grained vocal features (e.g., pitch, tone, intensity) from synthesized singing voices far better than other PTMs. Our experiments with SOTA PTMs, including SPTMs and music PTMs, validate this hypothesis. Additionally, we introduce a novel fusion framework, BATCH, which uses the Bhattacharyya distance for fusion of PTMs. Through BATCH with the fusion of speaker recognition SPTMs, we report the best performance in comparison to all individual PTMs and baseline fusion techniques, setting a new SOTA.
中文摘要:本研究证明说话人识别预训练模型能通过捕捉细微声乐特征最有效预测歌声平均意见得分,并提出的BATCH融合框架实现了最优性能。
English Summary: This study demonstrates that speaker recognition pre-trained models (SPTMs) most effectively predict Singing Voice Mean Opinion Scores by capturing fine-grained vocal features, and introduces a BATCH fusion framework achieving state-of-the-art performance.
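The Bhattacharyya distance between diagonal Gaussians fitted to two batches of PTM embeddings gives one plausible reading of BATCH's fusion objective; the dimensions and the diagonal-covariance assumption are illustrative:

```python
import torch

def bhattacharyya_gaussian(x, y, eps=1e-6):
    """Bhattacharyya distance between diagonal Gaussians fitted to two batches
    of embeddings (one per PTM); minimizing it pulls the two representation
    spaces together."""
    mu_x, var_x = x.mean(dim=0), x.var(dim=0) + eps
    mu_y, var_y = y.mean(dim=0), y.var(dim=0) + eps
    var = 0.5 * (var_x + var_y)
    term1 = 0.125 * torch.sum((mu_x - mu_y) ** 2 / var)
    term2 = 0.5 * torch.sum(torch.log(var / torch.sqrt(var_x * var_y)))
    return term1 + term2

xvec = torch.randn(32, 192)     # e.g., x-vector embeddings (assumed dim)
music = torch.randn(32, 192)    # e.g., projected music PTM embeddings
print(bhattacharyya_gaussian(xvec, music))
```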
Authors:Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Shubham Singh, Swarup Ranjan Behera, Vandana Rajan, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma
Abstract:
In this work, we pioneer the study of Machine Unlearning (MU) for Paralinguistic Speech Processing (PSP). We focus on two key PSP tasks: Speech Emotion Recognition (SER) and Depression Detection (DD). To this end, we propose SISA++, a novel extension of the previous state-of-the-art (SOTA) MU method SISA, which merges models trained on different shards via weight averaging. With this modification, we show that SISA++ preserves performance better than SISA after unlearning on benchmark SER (CREMA-D) and DD (E-DAIC) datasets. Also, to guide future research and ease the adoption of MU for PSP, we present "cookbook recipes" - actionable recommendations for selecting optimal feature representations and downstream architectures that can mitigate performance degradation after the unlearning process.
中文摘要:本研究针对副语言语音处理中的机器遗忘问题,提出SISA++新方法,通过加权模型融合在情感识别和抑郁检测任务中实现更优的性能保持,并为该领域提供可操作的实施方案指南。
English Summary: This study introduces SISA++, a novel machine unlearning method for paralinguistic speech processing that enhances performance preservation in emotion recognition and depression detection tasks through weighted model averaging.
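The weight-averaging step that SISA++ adds on top of SISA can be sketched as a uniform average of shard-model state dicts, assuming identical architectures and floating-point parameters:

```python
import copy
import torch
import torch.nn as nn

def merge_shard_models(models):
    """Uniformly average the weights of models trained on different data
    shards -- the merging step SISA++ adds on top of SISA (a sketch; the
    averaging scheme is assumed uniform)."""
    merged = copy.deepcopy(models[0])
    state = merged.state_dict()
    for key in state:
        state[key] = torch.stack(
            [m.state_dict()[key].float() for m in models]).mean(dim=0)
    merged.load_state_dict(state)
    return merged

shards = [nn.Linear(16, 4) for _ in range(5)]    # stand-ins for shard models
print(merge_shard_models(shards).weight.shape)   # torch.Size([4, 16])
```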
Authors:Dennis Fucci, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli
Abstract:
Despite significant advances in ASR, the specific acoustic cues models rely on remain unclear. Prior studies have examined such cues on a limited set of phonemes and outdated models. In this work, we apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. By analyzing plosives, fricatives, and vowels, we assess how feature attributions align with their acoustic properties in the time and frequency domains, also essential for human speech perception. Our findings show that the ASR model relies on vowels' full time spans, particularly their first two formants, with greater saliency in male speech. It also better captures the spectral characteristics of sibilant fricatives than non-sibilants and prioritizes the release phase in plosives, especially burst characteristics. These insights enhance the interpretability of ASR models and highlight areas for future research to uncover potential gaps in model robustness.
中文: 本研究通过特征归因技术发现,现代ASR模型依赖元音的完整时长和前两个共振峰(尤其在男性语音中),能更好捕捉咝擦音而非非咝擦音的频谱特征,并重点关注爆破音的除阻阶段,这些发现提升了模型可解释性并揭示了研究空白。
English: This study uses feature attribution to reveal that a modern ASR model prioritizes vowels' full duration and first two formants (especially in male speech), distinguishes sibilant fricatives better than non-sibilants, and focuses on plosives' release bursts, enhancing model interpretability and identifying research gaps.
Authors:Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli, Andre Martins, Giuseppe Attanasio
Abstract:
Recent studies on interpreting the hidden states of speech models have shown their ability to capture speaker-specific features, including gender. Does this finding also hold for speech translation (ST) models? If so, what are the implications for the speaker's gender assignment in translation? We address these questions from an interpretability perspective, using probing methods to assess gender encoding across diverse ST models. Results on three language directions (English-French/Italian/Spanish) indicate that while traditional encoder-decoder models capture gender information, newer architectures -- integrating a speech encoder with a machine translation system via adapters -- do not. We also demonstrate that low gender encoding capabilities result in systems' tendency toward a masculine default, a translation bias that is more pronounced in newer architectures.
Chinese: 最新研究表明,传统编码器-解码器语音翻译模型能捕捉说话者性别信息,而新型适配器架构却无法实现,导致翻译结果更倾向于默认的男性表达偏见。
English: Recent research reveals that while traditional encoder-decoder speech translation models capture speaker gender information, newer adapter-based architectures fail to do so, leading to a stronger masculine default bias in translations.
Authors:Longyan Wu, Checheng Yu, Jieji Ren, Li Chen, Ran Huang, Guoying Gu, Hongyang Li
Abstract:
Enabling robots with contact-rich manipulation remains a pivotal challenge in robot learning, which is substantially hindered by the data collection gap, including its inefficiency and limited sensor setup. While prior work has explored handheld paradigms, their rod-based mechanical structures remain rigid and unintuitive, providing limited tactile feedback and posing challenges for human operators. Motivated by the dexterity and force feedback of human motion, we propose FreeTacMan, a human-centric and robot-free data collection system for accurate and efficient robot manipulation. Concretely, we design a wearable data collection device with dual visuo-tactile grippers, which can be worn by human fingers for intuitive and natural control. A high-precision optical tracking system is introduced to capture end-effector poses, while synchronizing visual and tactile feedback simultaneously. FreeTacMan achieves multiple improvements in data collection performance compared to prior works, and enables effective policy learning for contact-rich manipulation tasks with the help of the visuo-tactile information. We will release the work to facilitate reproducibility and accelerate research in visuo-tactile manipulation.
中文:FreeTacMan是一种以人为中心、无需机器人的可穿戴数据采集系统,通过双视触觉夹爪采集多模态数据用于接触密集的机器人操作,在采集效率和策略学习上均优于先前工作。
English: FreeTacMan is a human-centric, robot-free wearable data collection system with dual visuo-tactile grippers for contact-rich robot manipulation, improving collection efficiency over prior work and enabling effective policy learning; the work will be released to support reproducibility.
Authors:Longyan Wu, Checheng Yu, Jieji Ren, Li Chen, Yufei Jiang, Ran Huang, Guoying Gu, Hongyang Li
Abstract:
Enabling robots with contact-rich manipulation remains a pivotal challenge in robot learning, which is substantially hindered by the data collection gap, including its inefficiency and limited sensor setup. While prior work has explored handheld paradigms, their rod-based mechanical structures remain rigid and unintuitive, providing limited tactile feedback and posing challenges for human operators. Motivated by the dexterity and force feedback of human motion, we propose FreeTacMan, a human-centric and robot-free data collection system for accurate and efficient robot manipulation. Concretely, we design a wearable gripper with dual visuo-tactile sensors for data collection, which can be worn by human fingers for intuitive control. A high-precision optical tracking system is introduced to capture end-effector poses while synchronizing visual and tactile feedback simultaneously. We leverage FreeTacMan to collect a large-scale multimodal dataset, comprising over 3000k paired visual-tactile images with end-effector poses, 10k demonstration trajectories across 50 diverse contact-rich manipulation tasks. FreeTacMan achieves multiple improvements in data collection performance compared to prior works, and enables effective policy learning for contact-rich manipulation tasks with self-collected dataset. The full suite of hardware specifications and the dataset will be released to facilitate reproducibility and support research in visuo-tactile manipulation.
中文:FreeTacMan是一种以人为中心的可穿戴抓取系统,通过收集多模态数据用于接触密集的机器人操作,其发布的数据集和硬件规格在效率和策略学习上均实现了显著提升。
English: FreeTacMan is a human-centric, wearable gripper system that collects multimodal data for contact-rich robot manipulation, achieving superior efficiency and enabling effective policy learning with its released dataset and hardware.
Authors:Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan, Drishti Singh, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
Abstract:
In this work, we focus on source tracing of synthetic speech generation systems (STSGS). Each source embeds distinctive paralinguistic features -- such as pitch, tone, rhythm, and intonation -- into its synthesized speech, reflecting the underlying design of the generation model. While previous research has explored representations from speech pre-trained models (SPTMs), the use of representations from SPTMs pre-trained for paralinguistic speech processing -- which excel in paralinguistic tasks such as synthetic speech detection and speech emotion recognition -- has not been investigated for STSGS. We hypothesize that representations from a paralinguistic SPTM will be more effective, as its paralinguistic pre-training enables it to capture source-specific paralinguistic cues. Our comparative study of representations from various SOTA SPTMs, including paralinguistic, monolingual, multilingual, and speaker recognition models, validates this hypothesis. Furthermore, we explore fusion of representations and propose TRIO, a novel framework that fuses SPTMs using a gated mechanism for adaptive weighting, followed by a canonical correlation loss for inter-representation alignment and self-attention for feature refinement. By fusing TRILLsson (a paralinguistic SPTM) and x-vector (a speaker recognition SPTM), TRIO outperforms individual SPTMs and baseline fusion methods, and sets a new SOTA for STSGS in comparison to previous work.
Chinese: 本研究提出TRIO框架,通过门控融合和特征对齐技术整合副语言学和说话人识别的语音预训练模型,凭借其捕捉特定副语言特征的能力,在合成语音溯源任务中实现了最优性能。
English: This study introduces TRIO, a novel framework that fuses paralinguistic and speaker recognition speech pre-trained models using adaptive gating and alignment techniques, achieving state-of-the-art performance in synthetic speech source tracing by effectively capturing distinctive paralinguistic features.
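A minimal sketch of TRIO's gated adaptive-weighting stage; the dimensions and single-layer gate are assumptions, and the canonical correlation loss and self-attention refinement stages are omitted:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated, adaptively weighted fusion of two SPTM representations, a hedged
    sketch of the gating stage only."""
    def __init__(self, dim_a, dim_b, dim_out):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)
        self.proj_b = nn.Linear(dim_b, dim_out)
        self.gate = nn.Linear(2 * dim_out, dim_out)

    def forward(self, a, b):
        a, b = self.proj_a(a), self.proj_b(b)
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1.0 - g) * b        # per-dimension adaptive weighting

fusion = GatedFusion(dim_a=1024, dim_b=512, dim_out=256)
trillsson = torch.randn(8, 1024)    # paralinguistic embedding (assumed dim)
xvector = torch.randn(8, 512)       # speaker-recognition embedding
print(fusion(trillsson, xvector).shape)   # torch.Size([8, 256])
```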
Authors:Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Santanu Roy, Arun Balaji Buduru, Rajesh Sharma
Abstract:
In this study, we focus on heart murmur classification (HMC) and hypothesize that combining neural audio codec representations (NACRs), such as EnCodec, with spectral features (SFs), such as MFCC, will yield superior performance. We believe such fusion will exploit their complementary behavior: NACRs excel at capturing fine-grained acoustic patterns such as rhythm changes, while spectral features focus on frequency-domain properties such as harmonic structure and spectral energy distribution, which are crucial for analyzing the complexity of heart sounds. To this end, we propose BAOMI, a novel framework built on a bandit-based cross-attention mechanism for effective fusion. Here, an agent assigns greater weight to the most important heads in the multi-head cross-attention mechanism, helping to mitigate noise. With BAOMI, we report the best performance in comparison to individual NACRs, SFs, and baseline fusion techniques, setting a new state-of-the-art.
Chinese: 本研究提出了BAOMI框架,通过结合神经音频编解码器表示与频谱特征,并采用基于赌博机的交叉注意力机制进行有效融合,在心杂音分类任务中取得了最优性能,创造了新的技术标杆。
English: This study introduces BAOMI, a novel framework that combines neural audio codec representations with spectral features using a bandit-based cross-attention mechanism to achieve state-of-the-art performance in heart murmur classification.
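The bandit over attention heads can be illustrated with an epsilon-greedy agent whose learned values become head weights; a schematic reading of the mechanism, not the authors' implementation:

```python
import numpy as np

class HeadBandit:
    """Epsilon-greedy agent over attention heads: heads whose emphasis yields
    higher validation reward accumulate larger values and hence larger
    softmax weights in cross-attention."""
    def __init__(self, num_heads, eps=0.1, lr=0.1, seed=0):
        self.values = np.zeros(num_heads)
        self.eps, self.lr = eps, lr
        self.rng = np.random.default_rng(seed)

    def select(self):
        if self.rng.random() < self.eps:                  # explore
            return int(self.rng.integers(len(self.values)))
        return int(np.argmax(self.values))                # exploit

    def update(self, head, reward):
        self.values[head] += self.lr * (reward - self.values[head])

    def head_weights(self):
        w = np.exp(self.values - self.values.max())
        return w / w.sum()                                # per-head weights

bandit = HeadBandit(num_heads=8)
for _ in range(100):
    h = bandit.select()
    bandit.update(h, reward=1.0 if h == 3 else 0.0)       # toy: head 3 is best
print(bandit.head_weights().round(2))
```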
Authors:Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Jaya Sai Kiran Patibandla, Arun Balaji Buduru, Rajesh Sharma
Abstract:
The emergence of Mamba as an alternative to attention-based architectures has led to the development of Mamba-based self-supervised learning (SSL) pre-trained models (PTMs) for speech and audio processing. Recent studies suggest that these models achieve comparable or superior performance to state-of-the-art (SOTA) attention-based PTMs for speech emotion recognition (SER). Motivated by prior work demonstrating the benefits of PTM fusion across different speech processing tasks, we hypothesize that leveraging the complementary strengths of Mamba-based and attention-based PTMs will enhance SER performance beyond the fusion of homogeneous attention-based PTMs. To this end, we introduce a novel framework, PARROT, that integrates parallel branch fusion with Optimal Transport and the Hadamard product. Our approach achieves SOTA results against individual PTMs, homogeneous PTM fusion, and baseline fusion techniques, highlighting the potential of heterogeneous PTM fusion for SER.
中文: PARROT框架通过结合Mamba与注意力预训练模型,并采用最优传输和哈达玛积的并行分支融合方法,在语音情感识别中实现了最先进的性能,凸显了异构模型融合的潜力。
English: The PARROT framework, which integrates Mamba-based and attention-based pre-trained models through parallel branch fusion with Optimal Transport and Hadamard Product, achieves state-of-the-art performance in speech emotion recognition, demonstrating the superiority of heterogeneous model fusion.
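A hedged sketch of OT-based alignment followed by a Hadamard interaction, using a small Sinkhorn solver; the shapes, cost normalization, and barycentric mapping are assumptions, not PARROT's actual fusion:

```python
import torch

def sinkhorn(cost, n_iters=50, reg=0.1):
    """Entropy-regularized optimal transport plan between uniform marginals."""
    K = torch.exp(-cost / reg)
    u = torch.full((cost.size(0),), 1.0 / cost.size(0))
    v = torch.full((cost.size(1),), 1.0 / cost.size(1))
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]

def parrot_fuse(mamba_feats, attn_feats):
    """Align token features from a Mamba branch and an attention branch with
    an OT plan, then interact via Hadamard product."""
    cost = torch.cdist(mamba_feats, attn_feats)
    plan = sinkhorn(cost / (cost.mean() + 1e-8))    # normalize for stability
    aligned = plan @ attn_feats / plan.sum(dim=1, keepdim=True)
    return mamba_feats * aligned                    # Hadamard interaction

print(parrot_fuse(torch.randn(10, 64), torch.randn(12, 64)).shape)  # (10, 64)
```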
Authors:Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, Yanghua Xiao
Abstract:
Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, resulting in an exponentially large action space. Sampling actions in such a space can lead to extreme reward sparsity, which brings large reward variance, hindering effective reinforcement learning (RL). To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective language Agents training. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering better policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains of an average of 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.
Chinese: ARIA通过将自然语言动作映射到低维意图空间进行奖励聚合,有效解决了大型语言模型智能体在开放环境中因奖励稀疏导致的训练难题,显著降低了方差并提升了任务表现。
English: ARIA addresses the challenge of extreme reward sparsity in large language model agents by projecting natural language actions into a low-dimensional intention space for reward aggregation, significantly reducing variance and improving performance across multiple tasks.
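Intention-space reward aggregation can be sketched by clustering action embeddings and sharing the mean reward within each cluster; KMeans and the cluster count are assumptions standing in for ARIA's projection:

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_rewards(action_embs, rewards, n_intentions=16):
    """Cluster action embeddings into a low-dimensional 'intention' space and
    share the mean reward within each cluster, densifying sparse signals."""
    labels = KMeans(n_clusters=n_intentions, n_init=10).fit_predict(action_embs)
    dense = np.empty(len(rewards), dtype=float)
    for c in range(n_intentions):
        mask = labels == c
        if mask.any():
            dense[mask] = rewards[mask].mean()   # shared per-intention reward
    return dense

embs = np.random.randn(128, 32)      # embeddings of sampled language actions
rewards = np.random.randn(128)       # sparse raw rewards
print(aggregate_rewards(embs, rewards).shape)    # (128,)
```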
Authors:Javier Conde, Miguel González, María Grandury, Gonzalo Martínez, Pedro Reviriego, Mar Brysbaert
Abstract:
The evaluation of LLMs has so far focused primarily on how well they can perform different tasks such as reasoning, question-answering, paraphrasing, or translating. For most of these tasks, performance can be measured with objective metrics, such as the number of correct answers. However, other language features are not easily quantified, for example, the arousal, concreteness, or gender associated with a given word, as well as the extent to which words are experienced through the senses and associated with specific sensory modalities. Such features have been studied for many years in psycholinguistics, through large-scale experiments with humans that produce ratings for thousands of words. This opens an opportunity to evaluate how well LLMs align with human ratings on these word features, taking advantage of existing studies that cover many different language features in a large number of words. In this paper, we evaluate the alignment of a representative group of LLMs with human ratings on two psycholinguistic datasets: the Glasgow and Lancaster norms. These datasets cover thirteen features over thousands of words. The results show that alignment is generally better on the Glasgow norms evaluated (arousal, valence, dominance, concreteness, imageability, familiarity, and gender) than on the Lancaster norms evaluated (interoceptive, gustatory, olfactory, haptic, auditory, and visual). This suggests a potential limitation of current LLMs in aligning with human sensory associations for words, which may be due to their lack of the embodied cognition present in humans, and illustrates the usefulness of evaluating LLMs with psycholinguistic datasets.
中文摘要:当前大语言模型的评估正从客观任务表现扩展到与人类心理语言学评分的契合度,结果显示模型在感官关联方面存在局限,这源于其缺乏人类的具身认知能力。
English Summary: Current LLM evaluations are expanding from objective task performance to include alignment with human psycholinguistic ratings, revealing limitations in capturing sensory associations due to lacking embodied cognition.
Authors:Sadra Safadoust, Fabio Tosi, Fatma Güney, Matteo Poggi
Abstract:
We introduce WarpRF, a training-free general-purpose framework for quantifying the uncertainty of radiance fields. Built upon the assumption that photometric and geometric consistency should hold among images rendered by an accurate model, WarpRF quantifies its underlying uncertainty from an unseen point of view by leveraging backward warping across viewpoints, projecting reliable renderings to the unseen viewpoint and measuring the consistency with images rendered there. WarpRF is simple and inexpensive, does not require any training, and can be applied to any radiance field implementation for free. WarpRF excels at both uncertainty quantification and downstream tasks, e.g., active view selection and active mapping, outperforming any existing method tailored to specific frameworks.
中文:WarpRF是一种无需训练的多功能框架,通过反向扭曲测量视角间一致性来量化辐射场的不确定性,在主动视图选择等任务中表现卓越且无需额外训练。
English: WarpRF is a training-free framework that quantifies radiance field uncertainty by leveraging backward warping to measure consistency across viewpoints, excelling in tasks like active view selection without requiring any training.
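The consistency check can be illustrated as per-pixel photometric disagreement between the rendering at an unseen view and reliable renderings warped into it; the depth-based backward warping is assumed to happen upstream, and L1 disagreement stands in for the paper's consistency measure:

```python
import numpy as np

def warp_uncertainty(render_unseen, warped_renderings):
    """Per-pixel uncertainty at an unseen viewpoint as the photometric
    disagreement between its rendering and reliable renderings
    backward-warped into that view."""
    diffs = [np.abs(render_unseen - w).mean(axis=-1) for w in warped_renderings]
    return np.mean(diffs, axis=0)                    # (H, W) uncertainty map

unseen = np.random.rand(64, 64, 3)                   # rendering at novel view
warped = [np.random.rand(64, 64, 3) for _ in range(3)]   # warped renderings
print(warp_uncertainty(unseen, warped).shape)        # (64, 64)
```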
Authors:Weihua Xiao, Derek Ekberg, Siddharth Garg, Ramesh Karri
Abstract:
SystemVerilog Assertions (SVAs) are critical for verifying the correctness of hardware designs, but manually writing them from natural language property descriptions, i.e., NL2SVA, remains a labor-intensive and error-prone task. Recent advances in large language models (LLMs) offer opportunities to automate this translation. However, existing models still struggle with understanding domain-specific syntax and semantics. To enhance LLM performance in NL2SVA, we propose a customized retrieval-augmented generation (RAG) framework and a synthetic fine-tuning dataset that together improve LLM performance. To further improve lightweight models on NL2SVA, our fine-tuning dataset provides prompt-guided explanations that teach LLMs the layer-by-layer construction process of concurrent SVAs, enabling supervised fine-tuning that greatly improves syntax and functionality accuracy. To evaluate the performance of LLMs on NL2SVA, we construct the largest evaluation dataset for NL2SVA, comprising 40 Verilog designs and 229 formally verified SVAs with detailed annotations. Experimental results show that our customized RAG framework increases the number of functionality-matched SVAs by 58.42% over GPT-4o-mini, while Qwen2.5-Coder-7B-Instruct fine-tuned on our fine-tuning dataset and integrated with HybridRetrieval achieves a 59.05% improvement over the base Qwen model.
中文摘要:本研究提出定制化检索增强生成框架和合成微调数据集,显著提升大语言模型将自然语言属性描述转换为SystemVerilog断言的能力,相比基线模型在功能准确性上实现超过58%的改进。
English Summary: This study introduces a customized retrieval-augmented generation framework and synthetic fine-tuning dataset to enhance large language models' ability to translate natural language descriptions into SystemVerilog Assertions, achieving over 58% improvement in functionality accuracy compared to baseline models.
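A minimal sketch of the retrieval-augmented prompting pattern the framework builds on, using TF-IDF retrieval as a stand-in for the paper's customized retriever; the corpus format and prompt template here are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_nl2sva_prompt(query, corpus, k=3):
    """Prepend the k most similar (NL description, SVA) pairs as in-context
    examples; a dense or hybrid retriever slots into the same structure."""
    descs = [d for d, _ in corpus]
    vec = TfidfVectorizer().fit(descs + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(descs))[0]
    shots = "\n\n".join(
        f"Property: {corpus[i][0]}\nSVA: {corpus[i][1]}"
        for i in sims.argsort()[::-1][:k]       # top-k most similar examples
    )
    return f"{shots}\n\nProperty: {query}\nSVA:"
```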
Authors:Cheng Jin, Fengtao Zhou, Yunfang Yu, Jiabo Ma, Yihui Wang, Yingxue Xu, Huajun Zhou, Hao Jiang, Luyang Luo, Luhui Mao, Zifan He, Xiuming Zhang, Jing Zhang, Ronald Chan, Herui Yao, Hao Chen
Abstract:
Precision oncology requires accurate molecular insights, yet obtaining these directly from genomics is costly and time-consuming for broad clinical use. Predicting complex molecular features and patient prognosis directly from routine whole-slide images (WSI) remains a major challenge for current deep learning methods. Here we introduce PathLUPI, which uses transcriptomic privileged information during training to extract genome-anchored histological embeddings, enabling effective molecular prediction using only WSIs at inference. Through extensive evaluation across 49 molecular oncology tasks using 11,257 cases among 20 cohorts, PathLUPI demonstrated superior performance compared to conventional methods trained solely on WSIs. Crucially, it achieves AUC $\geq$ 0.80 in 14 of the biomarker prediction and molecular subtyping tasks and C-index $\geq$ 0.70 in survival cohorts of 5 major cancer types. Moreover, PathLUPI embeddings reveal distinct cellular morphological signatures associated with specific genotypes and related biological pathways within WSIs. By effectively encoding molecular context to refine WSI representations, PathLUPI overcomes a key limitation of existing models and offers a novel strategy to bridge molecular insights with routine pathology workflows for wider clinical application.
Chinese: PathLUPI在训练中利用转录组信息生成基因组锚定的组织学嵌入,仅需全切片图像即可实现精确的分子预测和预后评估,在多项肿瘤学任务中展现出卓越性能。
English: PathLUPI leverages transcriptomic data during training to create genome-anchored histological embeddings from whole-slide images, enabling accurate molecular predictions and prognosis assessments with superior performance across multiple oncology tasks.
Authors:Jinlong Li, Dong Zhao, Qi Zang, Zequn Jie, Lin Ma, Nicu Sebe
Abstract:
Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios with changing target distributions. Existing CTTA methods primarily focus on mitigating the challenges of catastrophic forgetting and error accumulation. Though emerging methods combine forgetting adaptation with parameter-efficient fine-tuning, they still struggle to balance competitive performance and efficient model adaptation, particularly in complex tasks like semantic segmentation. In this paper, to tackle the above issues, we propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk. Specifically, we first project a tuning subspace orthogonally, which allows the model to adapt to new domains while preserving the knowledge integrity of the pre-trained source model, alleviating catastrophic forgetting. Then, we design an online prior-knowledge aggregation strategy that employs an aggressive yet efficient image masking scheme to mimic potential target dynamism, enhancing the student model's domain adaptability. This gradually ameliorates the teacher model's knowledge, ensuring high-quality pseudo labels and reducing error accumulation. Extensive experiments demonstrate that our method surpasses previous CTTA methods and achieves competitive performance across various continual TTA benchmarks in semantic segmentation tasks.
中文摘要:本文提出OoPk方法,通过正交投影保持源知识完整性,结合在线先验知识聚合策略提升域适应能力,在语义分割任务中实现了优于现有CTTA方法的性能。
English Summary: This paper introduces OoPk, a novel CTTA method that uses orthogonal projection to preserve source knowledge and an online aggregation strategy to enhance domain adaptability, achieving superior performance in semantic segmentation tasks.
Authors:Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li
Abstract:
Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on ``teaching'', which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints at https://huggingface.co/THU-KEG/LongWriter-Zero-32B
Chinese: 大语言模型在生成长文本时面临长度限制和质量下降的挑战,而我们的激励方法通过强化学习无需合成数据,在长度控制、写作质量和结构优化方面取得最优性能。
English: Large language models face challenges in generating ultra-long texts due to length limits and quality degradation, but our incentivization approach using reinforcement learning without synthetic data achieves state-of-the-art performance by improving length control, writing quality, and structure.
Authors:Raquel Ferrando, Javier Conde, Gonzalo Martínez, Pedro Reviriego
Abstract:
The computational and energy costs of Large Language Models (LLMs) have increased exponentially, driven by the growing model sizes and the massive adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is the computation of a token. Therefore, the tokenizer plays an important role in the efficiency of a model, and tokenizers are carefully optimized to minimize the number of tokens for the text in their training corpus. One of the most popular applications of LLMs is chatbots that interact with users. A key observation is that, for those chatbots, what matters is the performance of the tokenizer on the user text input and the chatbot responses, which are most likely different from the text in the training corpus. So, a question that immediately arises is whether there is a potential benefit in optimizing tokenizers for chatbot conversations. In this paper, this idea is explored for different tokenizers by using a publicly available corpus of chatbot conversations to redesign their vocabularies and evaluate their performance in this domain. The results show that conversation-optimized tokenizers consistently reduce the number of tokens in chatbot dialogues, which can lead to meaningful energy savings in the range of 5% to 10%, while having minimal or even slightly positive impact on tokenization efficiency for the original training corpus.
Chinese: 针对聊天机器人对话优化的分词器可将对话中的令牌数量减少5%至10%,在实现显著节能的同时,对原始训练语料的分词效率影响甚微甚至略有提升。
English: Optimizing tokenizers for chatbot conversations can reduce token counts by 5% to 10%, leading to significant energy savings without compromising efficiency on the original training corpus.
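A sketch of the core experiment using the HuggingFace `tokenizers` library: retrain a byte-level BPE vocabulary on conversation text and compare average token counts against a baseline tokenizer (corpus loading and the baseline tokenizer are left abstract):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_conversation_tokenizer(chat_texts, vocab_size=32000):
    """Learn a BPE vocabulary from chatbot dialogue so that frequent
    conversational strings merge into single tokens."""
    tok = Tokenizer(models.BPE())
    tok.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size)
    tok.train_from_iterator(chat_texts, trainer)
    return tok

def avg_tokens_per_text(tok, texts):
    """Fewer tokens per message means fewer forward passes, hence less energy."""
    return sum(len(tok.encode(t).ids) for t in texts) / len(texts)
```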
Authors:Quang Nguyen, Tri Le, Huy Nguyen, Thieu Vo, Tung D. Ta, Baoru Huang, Minh N. Vu, Anh Nguyen
Abstract:
Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. However, existing approaches face two key challenges. First, they often struggle to interpret complex text instructions or operate ineffectively in densely cluttered environments. Second, most methods require a training or finetuning step to adapt to new domains, limiting their generalization in real-world applications. In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios. Our framework consists of three specialized agents: Planner, responsible for strategizing complex queries; Coder, which generates and executes source code; and Observer, which evaluates the outcomes and provides feedback. Intensive experiments on two large-scale datasets demonstrate that our GraspMAS significantly outperforms existing baselines. Additionally, robot experiments conducted in both simulation and real-world settings further validate the effectiveness of our approach. Our project page is available at https://zquang2202.github.io/GraspMAS
中文总结:本文提出的GraspMAS多智能体系统通过规划器、编码器和观察器的协同工作,有效解决了语言驱动抓取检测在复杂指令理解和跨领域适应方面的难题,实验证明其性能显著优于现有方法。
English Summary: The paper introduces GraspMAS, a multi-agent system that enhances language-driven grasp detection by addressing challenges in interpreting complex instructions and adapting to new domains without retraining, demonstrating superior performance in experiments.
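A skeleton of the three-agent loop described above; `llm` and `run_code` are hypothetical wrappers for the language model and a sandboxed code executor, and the prompts are illustrative only:

```python
def grasp_mas(instruction, image, llm, run_code, max_rounds=3):
    """Planner -> Coder -> Observer loop: plan the query, generate and execute
    detection code, then let the Observer critique the result and retry."""
    feedback, result = "", None
    for _ in range(max_rounds):
        plan = llm(f"Plan grasp-detection steps for: {instruction}\n"
                   f"Previous feedback: {feedback}")
        code = llm(f"Write Python vision code implementing this plan:\n{plan}")
        result = run_code(code, image)            # e.g. a grasp rectangle
        verdict = llm(f"Goal: {instruction}. Candidate grasp: {result}. "
                      f"Answer 'yes' if valid, else explain the problem.")
        if verdict.strip().lower().startswith("yes"):
            break
        feedback = verdict                        # Observer feedback to Planner
    return result
```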
Authors:Filippo Ruffini, Elena Mulero Ayllon, Linlin Shen, Paolo Soda, Valerio Guarrasi
Abstract:
Artificial Intelligence (AI) holds significant promise for improving prognosis prediction in medical imaging, yet its effective application remains challenging. In this work, we introduce a structured benchmark explicitly designed to evaluate and compare the transferability of Convolutional Neural Networks and Foundation Models in predicting clinical outcomes in COVID-19 patients, leveraging diverse publicly available Chest X-ray datasets. Our experimental methodology extensively explores a wide set of fine-tuning strategies, encompassing traditional approaches such as Full Fine-Tuning and Linear Probing, as well as advanced Parameter-Efficient Fine-Tuning methods including Low-Rank Adaptation, BitFit, VeRA, and IA3. The evaluations were conducted across multiple learning paradigms, including both extensive full-data scenarios and more clinically realistic Few-Shot Learning settings, which are critical for modeling rare disease outcomes and rapidly emerging health threats. By implementing a large-scale comparative analysis involving a diverse selection of pretrained models, ranging from general-purpose architectures pretrained on large-scale datasets, such as CLIP and DINOv2, to biomedical-specific models like MedCLIP, BioMedCLIP, and PubMedCLIP, we rigorously assess each model's capacity to effectively adapt and generalize to prognosis tasks, particularly under conditions of severe data scarcity and pronounced class imbalance. The benchmark was designed to capture critical conditions common in prognosis tasks, including variations in dataset size and class distribution, providing detailed insights into the strengths and limitations of each fine-tuning strategy. This extensive and structured evaluation aims to inform the practical deployment and adoption of robust, efficient, and generalizable AI-driven solutions in real-world clinical prognosis prediction workflows.
中文: 本研究建立了一个结构化基准,通过胸片评估AI模型在COVID-19预后预测中的迁移能力,全面比较不同数据场景下的微调策略,为临床AI应用提供实践指导。
English: This study establishes a structured benchmark to evaluate the transferability of AI models for COVID-19 prognosis prediction using chest X-rays, comprehensively comparing fine-tuning strategies across diverse data scenarios to guide clinical AI deployment.
Authors:Tahsin Alamgir Kheya, Mohamed Reda Bouadjenek, Sunil Aryal
Abstract:
Recommendation systems play a crucial role in our daily lives by impacting user experience across various domains, including e-commerce, job advertisements, entertainment, etc. Given the vital role of such systems in our lives, practitioners must ensure they do not produce unfair and imbalanced recommendations. Previous work addressing bias in recommendations overlooked bias in certain item categories, potentially leaving some biases unaddressed. Additionally, most previous work on fair re-ranking focused on binary sensitive attributes. In this paper, we address these issues by proposing a fairness-aware re-ranking approach that helps mitigate bias in different categories of items. This re-ranking approach leverages existing biases to correct disparities in recommendations across various demographic groups. We show how our approach can mitigate bias on multiple sensitive attributes, including gender, age, and occupation. We experimented on three real-world datasets to evaluate the effectiveness of our re-ranking scheme in mitigating bias in recommendations. Our results show how this approach helps mitigate social bias with little to no degradation in performance.
中文: 本文提出了一种公平感知的重排序方法,通过利用现有偏见来纠正推荐系统中针对性别、年龄和职业等多敏感属性的偏差,实验表明该方法能有效减少社会偏见且几乎不影响性能。
English: This paper introduces a fairness-aware re-ranking method that mitigates bias across multiple sensitive attributes like gender, age, and occupation in recommendation systems, demonstrating effective bias reduction with minimal performance impact on real-world datasets.
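For illustration, a generic fairness-constrained greedy re-ranker of the kind this line of work builds on; it caps group exposure rather than leveraging measured biases as the paper does, so treat it as a sketch of the general pattern, not the proposed method:

```python
def fair_rerank(items, k, max_share=0.6):
    """Greedy top-k re-ranking: fill slots by relevance score, but cap any
    one group's share of the list so other groups regain exposure.

    items: dicts with "score" (relevance) and "group" (e.g. an item
    category or demographic group). max_share: cap on a group's fraction of k.
    """
    pool = sorted(items, key=lambda x: -x["score"])
    ranked, counts = [], {}
    for it in pool:
        if len(ranked) == k:
            break
        g = it["group"]
        if (counts.get(g, 0) + 1) / k <= max_share:
            ranked.append(it)
            counts[g] = counts.get(g, 0) + 1
    for it in pool:                       # backfill if the cap left gaps
        if len(ranked) == k:
            break
        if it not in ranked:
            ranked.append(it)
    return ranked
```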
Authors:Yuanchen Bei, Weizhi Zhang, Siwen Wang, Weizhi Chen, Sheng Zhou, Hao Chen, Yong Li, Jiajun Bu, Shirui Pan, Yizhou Yu, Irwin King, Fakhri Karray, Philip S. Yu
Abstract:
AI agents have experienced a paradigm shift, from early dominance by reinforcement learning (RL) to the rise of agents powered by large language models (LLMs), and now further advancing towards a synergistic fusion of RL and LLM capabilities. This progression has endowed AI agents with increasingly strong abilities. Despite these advances, to accomplish complex real-world tasks, agents are required to plan and execute effectively, maintain reliable memory, and coordinate smoothly with other agents. Achieving these capabilities involves contending with ever-present intricate information, operations, and interactions. In light of this challenge, data structurization can play a promising role by transforming intricate and disorganized data into well-structured forms that agents can more effectively understand and process. In this context, graphs, with their natural advantage in organizing, managing, and harnessing intricate data relationships, present a powerful data paradigm for structurization to support the capabilities demanded by advanced AI agents. To this end, this survey presents a first systematic review of how graphs can empower AI agents. Specifically, we explore the integration of graph techniques with core agent functionalities, highlight notable applications, and identify prospective avenues for future research. By comprehensively surveying this burgeoning intersection, we hope to inspire the development of next-generation AI agents equipped to tackle increasingly sophisticated challenges with graphs. Related resources are collected and continuously updated for the community at the GitHub link.
中文: 本综述系统探讨了图如何通过结构化复杂数据来增强AI智能体的规划、记忆和协调能力,并探索了其集成方法、应用场景及未来研究方向。
English: This survey systematically examines how graphs can enhance AI agents by structuring complex data to improve planning, memory, and coordination, while exploring integrations, applications, and future research directions.
Authors:Chuanlei Li, Xu Hu, Minghui Xu, Kun Li, Yue Zhang, Xiuzhen Cheng
Abstract:
Academic paper review typically requires substantial time, expertise, and human resources. Large Language Models (LLMs) present a promising method for automating the review process due to their extensive training data, broad knowledge base, and relatively low usage cost. This work explores the feasibility of using LLMs for academic paper review by proposing an automated review system. The system integrates Retrieval Augmented Generation (RAG), the AutoGen multi-agent system, and Chain-of-Thought prompting to support tasks such as format checking, standardized evaluation, comment generation, and scoring. Experiments conducted on 290 submissions from the WASA 2024 conference using GPT-4o show that LLM-based review significantly reduces review time (average 2.48 hours) and cost (average \$104.28 USD). However, the similarity between LLM-selected papers and actual accepted papers remains low (average 38.6\%), indicating issues such as hallucination, lack of independent judgment, and retrieval preferences. Therefore, it is recommended to use LLMs as assistive tools to support human reviewers, rather than to replace them.
中文: 研究表明,利用大语言模型进行学术论文审稿能显著降低时间和成本,但其与人工选择的低相似度揭示了幻觉和判断力不足等局限,因此建议作为辅助工具而非替代人工审稿。
English: This study demonstrates that using Large Language Models (LLMs) for academic paper review can significantly reduce time and cost, but their low similarity with human selections highlights limitations like hallucinations and judgment issues, suggesting they should assist rather than replace human reviewers.
Authors:Yaru Niu, Yunzhe Zhang, Mingyang Yu, Changyi Lin, Chenhao Li, Yikai Wang, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Zhenzhen Li, Jonathan Francis, Bingqing Chen, Jie Tan, Ding Zhao
Abstract:
Quadrupedal robots have demonstrated impressive locomotion capabilities in complex environments, but equipping them with autonomous versatile manipulation skills in a scalable way remains a significant challenge. In this work, we introduce a cross-embodiment imitation learning system for quadrupedal manipulation, leveraging data collected from both humans and LocoMan, a quadruped equipped with multiple manipulation modes. Specifically, we develop a teleoperation and data collection pipeline, which unifies and modularizes the observation and action spaces of the human and the robot. To effectively leverage the collected data, we propose an efficient modularized architecture that supports co-training and pretraining on structured modality-aligned data across different embodiments. Additionally, we construct the first manipulation dataset for the LocoMan robot, covering various household tasks in both unimanual and bimanual modes, supplemented by a corresponding human dataset. We validate our system on six real-world manipulation tasks, where it achieves an average success rate improvement of 41.9% overall and 79.7% under out-of-distribution (OOD) settings compared to the baseline. Pretraining with human data contributes a 38.6% success rate improvement overall and 82.7% under OOD settings, enabling consistently better performance with only half the amount of robot data. Our code, hardware, and data are open-sourced at: https://human2bots.github.io.
中文: 本研究提出了一种跨具身模仿学习系统,通过结合人类与机器人数据进行协同训练,显著提升了四足机器人在现实任务中的操作能力。
English: This study presents a cross-embodiment imitation learning system that enhances quadrupedal robots' manipulation skills by co-training with human and robot data, achieving significant performance improvements in real-world tasks.
Authors:Bonan Li, Yinhan Hu, Songhua Liu, Xinchao Wang
Abstract:
Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
English Summary: This paper introduces WinWinLay, a training-free method that enhances layout-to-image generation through a Non-local Attention Energy Function for precise object localization and an Adaptive Update scheme to maintain realistic outputs, outperforming existing approaches.
Authors:Shiyu Cheng, Luyao Niu, Bhaskar Ramasubramanian, Andrew Clark, Radha Poovendran
Abstract:
In multi-agent systems, signal temporal logic (STL) is widely used for path planning to accomplish complex objectives with formal safety guarantees. However, as the number of agents increases, existing approaches encounter significant computational challenges. Recognizing that many complex tasks require cooperation among multiple agents, we propose swarm STL specifications to describe the collective tasks that need to be achieved by a team of agents. Next, we address the motion planning problem for all the agents in two stages. First, we abstract a group of cooperating agents as a swarm and construct a reduced-dimension state space whose dimension does not increase with the number of agents. The path planning is performed at the swarm level, ensuring the safety and swarm STL specifications are satisfied. Then, we design low-level control strategies for agents within each swarm based on the path synthesized in the first step. The trajectories of agents generated by the two-step policy ensure satisfaction of the STL specifications. We evaluate our two-stage approach in both single-swarm and multi-swarm scenarios. The results demonstrate that all tasks are completed with safety guarantees. Compared to the baseline multi-agent planning approach, our method maintains computational efficiency as the number of agents increases, since the computational time scales with the number of swarms rather than the number of agents.
Chinese: 本文提出了一种采用群体信号时序逻辑的两阶段运动规划方法,首先在群体层面进行路径规划以确保安全性和任务规范,然后设计个体代理控制策略,随着代理数量增加,计算效率得以保持。
English: This paper introduces a two-stage motion planning method using swarm signal temporal logic to efficiently manage multi-agent systems by first planning at the swarm level to ensure safety and task specifications, then designing individual agent controls, which maintains computational efficiency as the number of agents grows.
Authors:Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo, Dingkang Yang, Zhaoyu Chen, Wenqiang Zhang
Abstract:
Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments demonstrate LingoLoop can increase generated tokens by up to 30 times and energy consumption by a comparable factor on models like Qwen2.5-VL-3B, consistently driving MLLMs towards their maximum generation limits. These findings expose significant MLLMs' vulnerabilities, posing challenges for their reliable deployment. The code will be released publicly following the paper's acceptance.
中文摘要:LingoLoop是一种新型攻击方法,通过利用词性特征和限制输出多样性,迫使多模态大语言模型生成过度冗长和重复的内容,使生成标记数量增加高达30倍,能耗同比上升。
English Summary: LingoLoop is a novel attack that exploits Part-of-Speech characteristics and output diversity constraints to force Multimodal Large Language Models into generating excessively verbose and repetitive outputs, increasing token generation by up to 30 times and energy consumption proportionally.
Authors:Pengfei Wang, Qiujie Dong, Fangtian Liang, Hao Pan, Lei Yang, Congyi Zhang, Guying Lin, Caiming Zhang, Yuanfeng Zhou, Changhe Tu, Shiqing Xin, Alla Sheffer, Xin Li, Wenping Wang
Abstract:
Neural implicit shape representation has drawn significant attention in recent years due to its smoothness, differentiability, and topological flexibility. However, directly modeling the shape of a neural implicit surface, especially as the zero-level set of a neural signed distance function (SDF), with sparse geometric control is still a challenging task. Sparse input shape control typically includes 3D curve networks or, more generally, 3D curve sketches, which are unstructured and cannot be connected to form a curve network, and therefore more difficult to deal with. While 3D curve networks or curve sketches provide intuitive shape control, their sparsity and varied topology pose challenges in generating high-quality surfaces to meet such curve constraints. In this paper, we propose NeuVAS, a variational approach to shape modeling using neural implicit surfaces constrained under sparse input shape control, including unstructured 3D curve sketches as well as connected 3D curve networks. Specifically, we introduce a smoothness term based on a functional of surface curvatures to minimize shape variation of the zero-level set surface of a neural SDF. We also develop a new technique to faithfully model G0 sharp feature curves as specified in the input curve sketches. Comprehensive comparisons with the state-of-the-art methods demonstrate the significant advantages of our method.
中文: 本文提出NeuVAS方法,通过引入基于曲率的平滑项和尖锐特征建模技术,实现了在稀疏三维曲线约束下对神经隐式曲面的变分形状建模,显著优于现有方法。
English: This paper introduces NeuVAS, a variational method for neural implicit shape modeling that effectively handles sparse geometric constraints like 3D curve sketches and networks by incorporating curvature-based smoothness and sharp feature preservation.
Authors:Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jeremy H. M Wong, Hung-yi Lee, Eng Siong Chng, Nancy F. Chen
Abstract:
Creating a unified speech and music model requires expensive pre-training. Model merging can instead create a unified audio model with minimal computational expense. However, direct merging is challenging when the models are not aligned in the weight space. Motivated by Git Re-Basin, we introduce a correlation-permutation approach that aligns a music encoder's internal layers with a speech encoder. We extend previous work to the case of merging transformer layers. The method computes a permutation matrix that maximizes the model's feature-wise cross-correlations layer by layer, enabling effective fusion of these otherwise disjoint models. The merged model retains speech capabilities through this method while significantly enhancing music performance, achieving an improvement of 14.83 points in average score compared to linear interpolation model merging. This work allows the creation of unified audio models from independently trained encoders.
Chinese: 通过相关性置换方法进行模型融合,有效对齐并整合独立训练的语音与音乐编码器,构建统一音频模型,在保持语音能力的同时将音乐性能提升14.83分。
English: Model merging via a correlation-permutation approach effectively aligns and fuses independently trained speech and music encoders, enabling a unified audio model that retains speech capabilities while boosting music performance by 14.83 points over baseline methods.
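A minimal sketch of the correlation-permutation idea for one layer, assuming cached activations from both encoders on shared audio; handling transformer residual streams and attention heads (the paper's extension) requires extra bookkeeping not shown here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_units(feats_a, feats_b):
    """Permutation aligning model B's hidden units to model A's by maximizing
    feature-wise cross-correlation (Git Re-Basin style). feats_*: N x D."""
    a = (feats_a - feats_a.mean(0)) / (feats_a.std(0) + 1e-8)
    b = (feats_b - feats_b.mean(0)) / (feats_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)                    # D x D cross-correlation matrix
    _, perm = linear_sum_assignment(-corr)     # maximize total correlation
    return perm                                # A unit i <-> B unit perm[i]

def merge_layer(W_a, W_b, perm, alpha=0.5):
    """Interpolate two weight matrices after permuting B's output units (rows)
    into A's ordering; B's next layer must permute its input columns to match."""
    return alpha * W_a + (1 - alpha) * W_b[perm]
```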
Authors:Xue Wang, Tian Zhou, Jinyang Gao, Bolin Ding, Jingren Zhou
Abstract:
We present a joint forecasting framework for time series prediction that contrasts with traditional direct or recursive methods. This framework achieves state-of-the-art performance for our designed foundation model, YingLong, and reveals a novel scaling effect: longer outputs significantly enhance model accuracy due to delayed chain-of-thought reasoning in our non-causal approach. YingLong is a non-causal, bidirectional attention encoder-only transformer trained through masked token recovery, aligning more effectively with language understanding tasks than with generation tasks. Additionally, we boost performance by tackling output variance with a multi-input ensemble. We release four foundation models ranging from 6M to 300M parameters, demonstrating superior results in zero-shot tasks on the ETT and Weather datasets, where YingLong achieves the best performance in more than 60% of cases. To ensure generalizability, we assessed the models using the GIFT-Eval benchmark, which comprises 23 time series datasets across 7 domains. YingLong significantly outperformed the best time-series foundation models and end-to-end trained models by 14% and 44% in rank, respectively. The pretrained 300M model is available at https://huggingface.co/qcw1314/YingLong_300m
中文: 我们提出YingLong联合预测框架,采用非因果双向Transformer,通过延迟思维链推理和多输入集成实现最先进性能,在多个数据集上展现出卓越的零样本预测能力。
English: We introduce YingLong, a joint forecasting framework using a non-causal bidirectional transformer that achieves state-of-the-art performance through delayed chain-of-thought reasoning and multi-input ensemble, demonstrating superior zero-shot results across multiple datasets.
Authors:Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian, Ling Chen, Yunchao Wei, Lin Ma
Abstract:
Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768$\times$1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at https://huangjch526.github.io/M4V_project.
中文: M4V框架采用多模态Mamba架构,通过创新的令牌重组设计和奖励学习策略,在保持高质量视频生成的同时,将计算成本显著降低45%。
English: The M4V framework introduces a multi-modal Mamba architecture that significantly reduces computational costs by 45% while maintaining high-quality video generation through innovative token re-composition and reward learning strategies.
Authors:Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
Abstract:
The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.
中文: 本文提出EmoNet-Voice,这是一个包含大规模预训练数据集和专家标注基准的综合语音情感识别资源,通过心理学专家验证的40种细粒度情感分类系统,在保护隐私的同时提升了情感检测的精确度。
English: This paper introduces EmoNet-Voice, a comprehensive synthetic dataset and benchmark for evaluating speech emotion recognition models across 40 fine-grained emotions, validated by psychology experts to ensure accuracy and privacy.
Authors:Huu Hung Nguyen, Duc Manh Tran, Yiran Cheng, Thanh Le-Cong, Hong Jin Kang, Ratnadira Widyasari, Shar Lwin Khin, Ouh Eng Lieh, Ting Zhang, David Lo
Abstract:
Mapping National Vulnerability Database (NVD) records to vulnerability-fixing commits (VFCs) is crucial for vulnerability analysis but challenging due to sparse explicit links in NVD references. This study explores the feasibility of this mapping through an empirical approach. Manual analysis of NVD references showed Git references enable over 86% success, while non-Git references achieve under 14%. Using these findings, we built an automated pipeline extracting 31,942 VFCs from 20,360 NVD records (8.7% of 235,341) with 87% precision, mainly from Git references. To fill gaps, we mined six external security databases, yielding 29,254 VFCs for 18,985 records (8.1%) at 88.4% precision, and GitHub repositories, adding 3,686 VFCs for 2,795 records (1.2%) at 73% precision. Combining these, we mapped 26,710 unique records (11.3% coverage) from 7,634 projects, with overlap between NVD and external databases, plus unique GitHub contributions. Despite success with Git references, 88.7% of records remain unmapped, highlighting the difficulty without Git links. This study offers insights for enhancing vulnerability datasets and guiding future automated security research.
中文摘要:本研究通过Git引用成功实现了国家漏洞数据库记录与漏洞修复提交的高精度映射,但仍有88.7%的记录因缺乏Git链接而无法匹配,揭示了自动化漏洞分析面临的挑战。
English Summary: This study successfully mapped vulnerability-fixing commits to NVD records using Git references with high precision but found 88.7% of records remain unmapped due to lack of Git links, highlighting challenges in automated vulnerability analysis.
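A sketch of the Git-reference extraction step behind the high-precision portion of such a pipeline; the regex targets GitHub commit URLs only, one of several Git hosts the study could cover:

```python
import re

COMMIT_RE = re.compile(
    r"https?://github\.com/([\w.-]+)/([\w.-]+)/commit/([0-9a-f]{7,40})"
)

def extract_candidate_vfcs(reference_urls):
    """Pull (owner, repo, sha) triples from NVD reference URLs that point
    directly at GitHub commits; non-Git references need other heuristics."""
    return [m.groups()
            for url in reference_urls
            if (m := COMMIT_RE.search(url))]
```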
Authors:Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Chunyu Miao, Dongyuan Li, Aiwei Liu, Yue Zhou, Yankai Chen, Weizhi Zhang, Yangning Li, Liancheng Fang, Renhe Jiang, Philip S. Yu
Abstract:
Recent improvements in large language models (LLMs) have led many researchers to focus on building fully autonomous AI agents. This position paper questions whether this approach is the right path forward, as these autonomous systems still have problems with reliability, transparency, and understanding the actual requirements of humans. We suggest a different approach: LLM-based Human-Agent Systems (LLM-HAS), where AI works with humans rather than replacing them. By keeping humans involved to provide guidance, answer questions, and maintain control, these systems can be more trustworthy and adaptable. Looking at examples from healthcare, finance, and software development, we show how human-AI teamwork can handle complex tasks better than AI working alone. We also discuss the challenges of building these collaborative systems and offer practical solutions. This paper argues that progress in AI should not be measured by how independent systems become, but by how well they can work with humans. The most promising future for AI is not in systems that take over human roles, but in those that enhance human capabilities through meaningful partnership.
中文: 本立场文件主张发展基于大语言模型的人机协作系统,而非完全自主的AI代理,强调在医疗、金融等领域通过人机协同能提升系统可靠性、透明度和适应能力,实现增强人类智能的合作伙伴关系。
English: This position paper advocates for LLM-based Human-Agent Systems (LLM-HAS) over fully autonomous AI agents, arguing that human-AI collaboration ensures greater reliability, transparency, and adaptability in complex tasks across fields like healthcare and finance.
Authors:Chunming He, Kai Li, Yachao Zhang, Ziyun Yang, Youwei Pang, Longxiang Tang, Chengyu Fang, Yulun Zhang, Linghe Kong, Xiu Li, Sina Farsiu
Abstract:
Incompletely-Supervised Concealed Object Segmentation (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments, utilizing incompletely annotated data, such as weak and semi-annotations, for model training. This task remains highly challenging due to (1) the limited supervision provided by the incompletely annotated training data, and (2) the difficulty of distinguishing concealed objects from the background, which arises from the intrinsic similarities in concealed scenarios. In this paper, we introduce the first unified method for ISCOS to address these challenges. To tackle the issue of incomplete supervision, we propose a unified mean-teacher framework, SEE, that leverages the vision foundation model, ``\emph{Segment Anything Model (SAM)}'', to generate pseudo-labels using coarse masks produced by the teacher model as prompts. To mitigate the effect of low-quality segmentation masks, we introduce a series of strategies for pseudo-label generation, storage, and supervision. These strategies aim to produce informative pseudo-labels, store the best pseudo-labels generated, and select the most reliable components to guide the student model, thereby ensuring robust network training. Additionally, to tackle the issue of intrinsic similarity, we design a hybrid-granularity feature grouping module that groups features at different granularities and aggregates these results. By clustering similar features, this module promotes segmentation coherence, facilitating more complete segmentation for both single-object and multiple-object images. We validate the effectiveness of our approach across multiple ISCOS tasks, and experimental results demonstrate that our method achieves state-of-the-art performance. Furthermore, SEE can serve as a plug-and-play solution, enhancing the performance of existing models.
中文: 本文提出SEE框架,通过SAM生成高质量伪标签和混合粒度特征分组,解决了不完全监督隐蔽物体分割中的监督不足和内在相似性难题,实现了最先进的性能。
English: This paper introduces SEE, a unified mean-teacher framework for Incompletely-Supervised Concealed Object Segmentation (ISCOS) that leverages SAM to generate high-quality pseudo-labels and employs hybrid-granularity feature grouping to address incomplete supervision and intrinsic similarity challenges, achieving state-of-the-art performance.
Authors:Raghu Vamshi Hemadri, Jitendra Bhandari, Andre Nakkab, Johann Knechtel, Badri P Gopalan, Ramesh Narayanaswamy, Ramesh Karri, Siddharth Garg
Abstract:
Modern chip design is complex, and there is a crucial need for early-stage prediction of key design-quality metrics like timing and routing congestion directly from Verilog code (a commonly used programming language for hardware design). It is especially important yet complex to predict individual lines of code that cause timing violations or downstream routing congestion. Prior works have tried approaches like converting Verilog into an intermediate graph representation and using LLM embeddings alongside other features to predict module-level quality, but did not consider line-level quality prediction. We propose VeriLoC, the first method that predicts design quality directly from Verilog at both the line- and module-level. To this end, VeriLoC leverages recent Verilog code-generation LLMs to extract local line-level and module-level embeddings, and train downstream classifiers/regressors on concatenations of these embeddings. VeriLoC achieves high F1-scores of 0.86-0.95 for line-level congestion and timing prediction, and reduces the mean average percentage error from 14% - 18% for SOTA methods down to only 4%. We believe that VeriLoC embeddings and insights from our work will also be of value for other predictive and optimization tasks for complex hardware design.
Chinese: VeriLoC是一种创新方法,利用Verilog代码生成大语言模型提取嵌入特征,可同时预测代码行级和模块级的设计质量指标,在时序和拥塞预测方面实现高精度,相比现有最优方法显著降低了误差。
English: VeriLoC is a novel method that uses Verilog code-generation LLMs to extract embeddings for predicting both line- and module-level design quality metrics, achieving high accuracy in timing and congestion prediction with significant error reduction compared to state-of-the-art methods.
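A sketch of the downstream-classifier stage, assuming line- and module-level embeddings have already been extracted from a code LLM's hidden states (the extraction itself is model-specific and not shown):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_line_quality_classifier(line_embs, module_embs, labels):
    """Concatenate each line's embedding with its enclosing module's embedding
    and fit a classifier flagging lines likely to cause timing violations or
    routing congestion.

    line_embs, module_embs: arrays of shape (num_lines, d_line / d_module),
    where each line is paired with its module's embedding.
    labels: 1 if the line caused a violation in ground-truth EDA reports.
    """
    X = np.concatenate([line_embs, module_embs], axis=1)
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    return clf.fit(X, labels)
```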
Authors:Haiqi Yang, Zhiyuan Li, Yi Chang, Yuan Wu
Abstract:
Retentive Network (RetNet) represents a significant advancement in neural network architecture, offering an efficient alternative to the Transformer. While Transformers rely on self-attention to model dependencies, they suffer from high memory costs and limited scalability when handling long sequences due to their quadratic complexity. To mitigate these limitations, RetNet introduces a retention mechanism that unifies the inductive bias of recurrence with the global dependency modeling of attention. This mechanism enables linear-time inference, facilitates efficient modeling of extended contexts, and remains compatible with fully parallelizable training pipelines. RetNet has garnered significant research interest due to its consistently demonstrated cross-domain effectiveness, achieving robust performance across machine learning paradigms including natural language processing, speech recognition, and time-series analysis. However, a comprehensive review of RetNet is still missing from the current literature. This paper aims to fill that gap by offering the first detailed survey of the RetNet architecture, its key innovations, and its diverse applications. We also explore the main challenges associated with RetNet and propose future research directions to support its continued advancement in both academic research and practical deployment.
中文:RetNet通过引入保留机制,作为Transformer的高效替代方案,实现了线性时间推理,并在自然语言处理和语音识别等领域表现优异,本文首次对其架构、应用及未来研究方向进行了全面综述。
English: RetNet introduces a retention mechanism as an efficient alternative to Transformers, enabling linear-time inference and robust performance across domains like NLP and speech recognition, with this paper providing the first comprehensive survey of its architecture, applications, and future directions.
Authors:Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li
Abstract:
Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components: a personalized memory module that includes episodic and semantic memory mechanisms, and a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on this framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling to test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.
中文摘要:PersonaAgent首创了个性化大语言模型智能体框架,通过记忆与行动双模块的协同运作及实时用户偏好对齐策略,在个性化任务中显著优于现有基准方法,实现了真正动态定制的用户体验。
English Summary: PersonaAgent is a pioneering personalized LLM agent framework that integrates memory and action modules to dynamically adapt to individual user preferences through real-time persona optimization, significantly outperforming existing methods.
Authors:Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
Abstract:
Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotional labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. These augmented samples introduce speaker variations that differ from dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.
中文摘要:CO-VADA是一种基于语音转换的增强去偏方法,无需修改模型架构或依赖人口统计信息,通过生成增强样本来消除语音情感识别系统中的偏见。
English Summary: CO-VADA is a voice augmentation debiasing approach that mitigates bias in speech emotion recognition systems by generating augmented samples through voice conversion, without requiring model modifications or demographic data.
Authors:Yi-Cheng Lin, Huang-Cheng Chou, Yu-Hsuan Li Liang, Hung-yi Lee
Abstract:
Speech emotion recognition (SER) systems often exhibit gender bias. However, the effectiveness and robustness of existing debiasing methods in such multi-label scenarios remain underexplored. To address this gap, we present EMO-Debias, a large-scale comparison of 13 debiasing methods applied to multi-label SER. Our study encompasses techniques from pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization. Experiments conducted on acted and naturalistic emotion datasets, using WavLM and XLSR representations, evaluate each method under conditions of gender imbalance. Our analysis quantifies the trade-offs between fairness and accuracy, identifying which approaches consistently reduce gender performance gaps without compromising overall model performance. The findings provide actionable insights for selecting effective debiasing strategies and highlight the impact of dataset distributions.
中文: 本研究提出EMO-Debias,通过对13种多标签语音情感识别去偏方法的系统评估,揭示了公平性与准确性的权衡关系,并确定了在保持模型性能的同时有效减少性别偏见的关键策略。
English: This study introduces EMO-Debias, a comprehensive evaluation of 13 debiasing methods for multi-label speech emotion recognition, revealing trade-offs between fairness and accuracy while identifying strategies that reduce gender bias without compromising performance.
Authors:Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee
Abstract:
Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking through planning and refinement stages into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks demonstrate that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic evaluation and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.
Chinese: SuperWriter-Agent通过引入结构化思维规划与优化阶段的新型框架,显著提升长文本生成质量,其训练的70亿参数SuperWriter-LM模型采用分层DPO优化方法,在多项评测中实现最优性能。
English: SuperWriter-Agent is a novel framework that enhances long-form text generation by integrating structured planning and refinement stages, with its trained 7B SuperWriter-LM achieving state-of-the-art performance through hierarchical DPO optimization.
Authors:Kaiyan Chang, Mingzhi Chen, Yunji Chen, Zhirong Chen, Dongrui Fan, Junfeng Gong, Nan Guo, Yinhe Han, Qinfen Hao, Shuo Hou, Xuan Huang, Pengwei Jin, Changxin Ke, Cangyuan Li, Guangli Li, Huawei Li, Kuan Li, Naipeng Li, Shengwen Liang, Cheng Liu, Hongwei Liu, Jiahua Liu, Junliang Lv, Jianan Mu, Jin Qin, Bin Sun, Chenxi Wang, Duo Wang, Mingjun Wang, Ying Wang, Chenggang Wu, Peiyang Wu, Teng Wu, Xiao Xiao, Mengyao Xie, Chenwei Xiong, Ruiyuan Xu, Mingyu Yan, Xiaochun Ye, Kuai Yu, Rui Zhang, Shuoming Zhang, Jiacheng Zhao
Abstract:
Computer System Architecture serves as a crucial bridge between software applications and the underlying hardware, encompassing components like compilers, CPUs, coprocessors, and RTL designs. Its development, from early mainframes to modern domain-specific architectures, has been driven by rising computational demands and advancements in semiconductor technology. However, traditional paradigms in computer system architecture design are confronting significant challenges, including a reliance on manual expertise, fragmented optimization across software and hardware layers, and high costs associated with exploring expansive design spaces. While automated methods leveraging optimization algorithms and machine learning have improved efficiency, they remain constrained by a single-stage focus, limited data availability, and a lack of comprehensive human domain knowledge. The emergence of large language models offers transformative opportunities for the design of computer system architecture. By leveraging the capabilities of LLMs in areas such as code generation, data analysis, and performance modeling, the traditional manual design process can be transitioned to a machine-based automated design approach. To harness this potential, we present the Large Processor Chip Model (LPCM), an LLM-driven framework aimed at achieving end-to-end automated computer architecture design. The LPCM is structured into three levels: Human-Centric; Agent-Orchestrated; and Model-Governed. This paper utilizes 3D Gaussian Splatting as a representative workload and employs the concept of software-hardware collaborative design to examine the implementation of the LPCM at Level 1, demonstrating the effectiveness of the proposed approach. Furthermore, this paper provides an in-depth discussion on the pathway to implementing Level 2 and Level 3 of the LPCM, along with an analysis of the existing challenges.
中文: 计算机系统架构连接软件与硬件,其发展面临依赖人工设计和优化碎片化等挑战,因此提出了基于大语言模型的LPCM框架,旨在实现端到端的自动化设计。
English: Computer System Architecture bridges software and hardware, and its evolution faces challenges like manual design reliance and fragmented optimization, prompting the development of the LPCM framework using large language models for automated, end-to-end design.
Authors:Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
Abstract:
Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric Computed Tomography (CT) remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages. Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance. Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation.
中文摘要:本研究提出了一种结合三维对比视觉语言预训练与潜在扩散模型的新型文本到CT生成方法,在根据文本描述合成具有临床意义的CT体积方面展现出卓越性能。
English Summary: This study introduces a novel text-to-CT generation method combining 3D contrastive vision-language pretraining with latent diffusion models, demonstrating superior performance in synthesizing clinically relevant CT volumes from text descriptions.
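As a note on the dual-encoder alignment step described above, the following is a minimal sketch of a symmetric CLIP-style contrastive loss over paired volume and report embeddings. The random embeddings and the `clip_loss` name are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the symmetric contrastive (CLIP-style) objective used to
# align image and text embeddings in a shared space. Embeddings here are
# random stand-ins for encoder outputs.
def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # pairwise similarities
    labels = np.arange(len(logits))           # matching pairs sit on the diagonal

    def ce(l):                                # cross-entropy, diagonal as target
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Symmetric loss: image-to-text and text-to-image directions.
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
print(clip_loss(rng.normal(size=(4, 32)), rng.normal(size=(4, 32))))
```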
Authors:Xinyi Wang, Lirong Gao, Haobo Wang, Yiming Zhang, Junbo Zhao
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a widely adopted strategy for adapting pre-trained Large Language Models (LLMs) to downstream tasks, significantly reducing memory and computational costs. However, most existing PEFT techniques uniformly deploy LoRA adapters across all layers, disregarding the intrinsic heterogeneity of layer contributions and task-specific rank requirements. This uniform paradigm leads to redundant parameter allocation and suboptimal adaptation efficiency. To address these limitations, we propose FLoE, a novel PEFT framework that introduces two key innovations: (i) a Fisher information-guided importance scoring mechanism to dynamically identify task-critical transformer layers for MoE-based low-rank adaptation, enabling sparse adapter deployment; and (ii) a Bayesian optimization-driven rank allocator that automatically determines optimal LoRA ranks on specific datasets without exhaustive grid search. Extensive experiments across diverse LLMs and benchmarks reveal that FLoE achieves impressive efficiency-accuracy trade-offs, making FLoE particularly advantageous in resource-constrained environments that necessitate rapid adaptation.
Chinese: FLoE是一种创新的参数高效微调框架,通过费舍尔信息引导和贝叶斯优化动态分配稀疏适配器与最优秩,在资源受限环境下为大语言模型实现了卓越的效率-精度平衡。
English: FLoE is a novel parameter-efficient fine-tuning framework that employs Fisher information and Bayesian optimization to dynamically allocate sparse adapters and optimal ranks, achieving superior efficiency-accuracy trade-offs for large language models in resource-limited settings.
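To illustrate the general idea behind Fisher-guided layer selection (not FLoE's exact scoring rule), here is a minimal sketch that ranks layers by the diagonal empirical Fisher, i.e., the mean squared parameter gradient, and keeps only the top-scoring layers for adapter deployment. Layer names and gradients are stand-ins.

```python
import numpy as np

# Minimal sketch: approximate each layer's importance by the mean squared
# gradient of its parameters (diagonal empirical Fisher) and deploy adapters
# only on the highest-scoring layers.
def fisher_scores(layer_grads):
    """layer_grads: {layer_name: array of per-sample parameter gradients}."""
    return {name: float(np.mean(g ** 2)) for name, g in layer_grads.items()}

def select_layers(scores, budget=2):
    """Keep the `budget` layers with the largest Fisher scores."""
    return sorted(scores, key=scores.get, reverse=True)[:budget]

rng = np.random.default_rng(0)
grads = {f"layer_{i}": rng.normal(scale=1.0 + i, size=(8, 16)) for i in range(4)}
print(select_layers(fisher_scores(grads), budget=2))  # two highest-Fisher layers
```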
Authors:Isabelle Krauss, Victor G. Lopez, Matthias A. Müller
Abstract:
Sample-based observability characterizes the ability to reconstruct the internal state of a dynamical system by using limited output information, i.e., when measurements are only infrequently and/or irregularly available. In this work, we investigate the concept of functional observability, which refers to the ability to infer a function of the system state from the outputs, within a sample-based framework. Here, we give necessary and sufficient conditions for a system to be sample-based functionally observable, and formulate conditions on the sampling schemes such that these are satisfied. Furthermore, we provide a numerical example in which we demonstrate the applicability of the obtained results.
中文: 本研究探讨了在有限输出采样下动态系统的函数可观测性,确立了推断状态函数的充要条件并规定了有效采样方案,同时通过数值示例验证了结果的适用性。
English: This work explores functional observability in dynamical systems with limited output sampling, establishing necessary and sufficient conditions for inferring state functions and specifying valid sampling schemes, supported by a numerical demonstration.
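For readers unfamiliar with functional observability, the classical continuous-measurement rank test below is the standard starting point that sample-based conditions refine. The system form x' = Ax, y = Cx, z = Fx and the condition are textbook background, not taken from the abstract.

```latex
% Classical functional observability test for \dot{x} = Ax, y = Cx, z = Fx:
% the functional z = Fx is reconstructible from the output y iff appending F
% to the observability matrix does not increase its rank.
\[
\operatorname{rank}
\begin{pmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \\ F \end{pmatrix}
=
\operatorname{rank}
\begin{pmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{pmatrix}
\]
```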
Authors:Sijie He, Ziye Jia, Qiuming Zhu, Fuhui Zhou, Qihui Wu
Abstract:
Due to their scalability and portability, low-altitude intelligent networks (LAINs) are essential in various fields such as surveillance and disaster rescue. However, in LAINs, unmanned aerial vehicles (UAVs) are characterized by a distributed topology and highly dynamic mobility, and are vulnerable to security threats, which may degrade the routing performance for data transmission. Hence, ensuring the routing stability and security of LAINs is a challenge. In this paper, we focus on the routing process in LAINs with multiple UAV clusters and propose a blockchain-enabled zero-trust architecture to manage the joining and exiting of UAVs. Furthermore, we formulate the routing problem to minimize the end-to-end (E2E) delay, which is an integer linear program and intractable to solve directly. Therefore, considering the distributed nature of LAINs, we reformulate the routing problem as a decentralized partially observable Markov decision process. With the proposed soft hierarchical experience replay buffer, a multi-agent double deep Q-network based adaptive routing algorithm is designed. Finally, simulations are conducted, and numerical results show that the total E2E delay of the proposed mechanism decreases by 22.38% on average compared with the benchmark.
中文摘要:本文针对多无人机集群的低空智能网络,提出基于区块链的零信任架构和多智能体强化学习路由算法,以增强路由安全性并最小化端到端时延,实验表明该机制比基准方案平均降低22.38%的总时延。
English summary: This paper proposes a blockchain-enabled zero-trust architecture and multi-agent reinforcement learning algorithm to enhance routing security and minimize end-to-end delay in low-altitude intelligent networks with multiple UAV clusters, achieving 22.38% average delay reduction compared to benchmarks.
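As background for the double deep Q-network component, here is a minimal sketch of the double-DQN bootstrap target on one transition; `q_online` and `q_target` stand in for the two Q-networks (plain arrays of per-action values here), and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

# Double-DQN target: select the next action with the online network, but
# evaluate it with the target network, reducing overestimation bias.
def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Compute the double-DQN bootstrap target for one transition."""
    if done:
        return reward
    best_action = int(np.argmax(q_online_next))
    return reward + gamma * q_target_next[best_action]

# Example: negative per-hop delay as the reward for a routing decision.
print(double_dqn_target(reward=-0.12,
                        q_online_next=np.array([1.0, 2.5, 0.3]),
                        q_target_next=np.array([0.9, 2.1, 0.4])))
```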
Authors:Matthias Bentert, Fedor V. Fomin, Petr A. Golovach, Laure Morelle
Abstract:
We investigate the problem of constructing fault-tolerant bases in matroids. Given a matroid M and a redundancy parameter k, a k-fault-tolerant basis is a minimum-size set of elements such that, even after the removal of any k elements, the remaining subset still spans the entire ground set. Since matroids generalize linear independence across structures such as vector spaces, graphs, and set systems, this problem unifies and extends several fault-tolerant concepts appearing in prior research.
Our main contribution is a fixed-parameter tractable (FPT) algorithm for the k-fault-tolerant basis problem, parameterized by both k and the rank r of the matroid. This two-variable parameterization by k + r is shown to be tight in the following sense. On the one hand, the problem is already NP-hard for k = 1. On the other hand, it is para-NP-hard for r \geq 3 and polynomial-time solvable for r \leq 2.
中文: 本文针对拟阵中构建k容错基问题,提出了以冗余度k和秩r为参数的双变量固定参数可解算法,并通过NP困难性结果证明了该参数化方案的紧致性。
English: This paper presents a fixed-parameter tractable algorithm for constructing k-fault-tolerant bases in matroids, parameterized by both redundancy k and matroid rank r, while demonstrating tight computational boundaries through NP-hardness results.
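To make the definition concrete, here is a brute-force checker for the fault-tolerance property, written against a generic matroid rank oracle. It is exponential in the candidate size and only illustrates the definition, not the paper's FPT algorithm; the uniform-matroid example is a toy.

```python
from itertools import combinations

# A set S is k-fault-tolerant if, after removing any k of its elements, the
# remainder still spans the ground set (i.e., has full rank).
def is_k_fault_tolerant(candidate, ground_set, rank, k):
    full_rank = rank(frozenset(ground_set))
    for removed in combinations(candidate, k):
        if rank(frozenset(candidate) - set(removed)) < full_rank:
            return False
    return True

# Toy example: the uniform matroid U_{2,4}, where rank(X) = min(|X|, 2).
ground = {1, 2, 3, 4}
rank = lambda X: min(len(X), 2)
print(is_k_fault_tolerant({1, 2, 3}, ground, rank, k=1))  # True: any 2 of 3 span
print(is_k_fault_tolerant({1, 2}, ground, rank, k=1))     # False
```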
Authors:Meng Yu, Te Cui, Qitong Chu, Wenjie Song, Yi Yang, Yufeng Yue
Abstract:
Reliable semantic segmentation of open environments is essential for intelligent systems, yet significant problems remain: 1) Existing RGB-T semantic segmentation models mainly rely on low-level visual features and lack high-level textual information, and thus struggle with accurate segmentation when categories share similar visual characteristics. 2) While SAM excels in instance-level segmentation, integrating it with thermal images and text is hindered by modality heterogeneity and computational inefficiency. To address these, we propose TASeg, a text-aware RGB-T segmentation framework that uses Low-Rank Adaptation (LoRA) fine-tuning to adapt vision foundation models. Specifically, we propose a Dynamic Feature Fusion Module (DFFM) in the image encoder, which effectively merges features from multiple visual modalities while freezing SAM's original transformer blocks. Additionally, we incorporate CLIP-generated text embeddings in the mask decoder to enable semantic alignment, which further rectifies classification errors and improves semantic understanding accuracy. Experimental results across diverse datasets demonstrate that our method achieves superior performance in challenging scenarios with fewer trainable parameters.
中文: TASeg提出了一种文本感知的RGB-T分割框架,通过LoRA微调技术和动态特征融合模块,有效整合多模态视觉特征与CLIP文本嵌入,在复杂场景中以更少参数实现卓越性能。
English: TASeg introduces a text-aware RGB-T segmentation framework that leverages LoRA fine-tuning and a Dynamic Feature Fusion Module to effectively integrate multi-modal visual features and CLIP text embeddings, achieving superior performance with fewer parameters in challenging scenarios.
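Since LoRA fine-tuning recurs throughout this collection, here is a minimal sketch of a LoRA-adapted linear layer: the frozen weight W is augmented with a trainable low-rank update (alpha / r) * B @ A. Shapes and names are illustrative, not TASeg's code.

```python
import numpy as np

# LoRA: keep the pretrained weight frozen and learn only a low-rank residual.
class LoRALinear:
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                               # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, r))            # trainable up-projection (zero init)
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus scaled low-rank residual; only A and B are trained.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(W=np.eye(8))
print(layer(np.ones((1, 8))).shape)  # (1, 8)
```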
Authors:Junze Chen, Cheng Yang, Shujie Li, Zhiqiang Zhang, Yawen Li, Junping Du, Chuan Shi
Abstract:
Large language models (LLMs) have demonstrated their strong capabilities in various domains, and have recently been integrated for graph analysis as graph language models (GLMs). With LLMs as the predictor, some GLMs can interpret unseen tasks described by natural language, and learn from a few examples in the prompts without parameter tuning, known as in-context learning (ICL). Another subset of GLMs utilizes abundant training labels to enhance model performance, known as instruction tuning. However, we argue that ICL on graphs has effectiveness issues due to fixed parameters and efficiency issues due to long context. Meanwhile, the large amount of labeled data required for instruction tuning can be difficult to obtain in real-world scenarios. To this end, we aim to introduce an extra parameter adaptation stage that can efficiently tailor GLMs to an unseen graph and task with only a few labeled examples, in exchange for better prediction accuracy and faster inference speed. For implementation, in this paper we propose the GraphLAMA method, with its model backbone and learning schemes specialized for efficient tuning and inference. Specifically, for the model backbone, we use a graph neural network (GNN) with several well-designed components to transform nodes into the representation space of LLM tokens. Task instructions can then be represented as a mixture of node and language tokens. In the pre-training stage, model parameters except the LLM will be trained with different tasks to capture general knowledge. In the adaptation stage, only a few pre-trained parameters will be updated based on few-shot examples. Extensive experiments on few/zero-shot node classification and summary generation show that our proposed GraphLAMA achieves state-of-the-art performance with a 4.91% absolute improvement in accuracy. Compared with ICL, our inference speed can be 10 times faster under the 5-shot setting.
中文: 大语言模型正被应用于图分析领域,但面临上下文学习效果与效率不足以及指令调优数据需求大的问题,因此提出GraphLAMA方法,通过参数自适应实现在少量标注样本下获得更高预测精度和更快推理速度。
English: Large language models are being adapted for graph analysis as graph language models, but face challenges with in-context learning's effectiveness and efficiency, and instruction tuning's data requirements, leading to the proposed GraphLAMA method that introduces parameter adaptation for better accuracy and faster inference with few labeled examples.
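The core mechanism here, projecting graph nodes into an LLM's token space, can be sketched as one round of mean-aggregation message passing followed by a learned linear map to the token-embedding dimension. All weights below are random placeholders and the architecture is a simplification, not the paper's backbone.

```python
import numpy as np

# Nodes -> pseudo-tokens: aggregate neighbor features, transform, then map
# into the LLM's token-embedding dimension.
def gnn_to_token_space(X, adj, W_msg, W_proj):
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    H = np.tanh((adj @ X) / deg @ W_msg)   # mean-aggregate neighbors, transform
    return H @ W_proj                      # project into LLM token dimension

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))               # 5 nodes, 16-dim features
adj = (rng.random((5, 5)) < 0.4).astype(float)
tokens = gnn_to_token_space(X, adj,
                            W_msg=rng.normal(size=(16, 16)) * 0.1,
                            W_proj=rng.normal(size=(16, 4096)) * 0.02)
print(tokens.shape)  # (5, 4096): one pseudo-token per node
```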
Authors:Fedor V. Fomin, Petr A. Golovach, Danil Sagunov, Kirill Simonov
Abstract:
Covering and partitioning the edges of a graph into cliques are classical problems at the intersection of combinatorial optimization and graph theory, having been studied through a range of algorithmic and complexity-theoretic lenses. Despite the well-known fixed-parameter tractability of these problems when parameterized by the total number of cliques, such a parameterization often fails to be meaningful for sparse graphs. In many real-world instances, on the other hand, the minimum number of cliques in an edge cover or partition can be very close to the size of a maximum independent set α(G).
Motivated by this observation, we investigate above-α parameterizations of the edge clique cover and partition problems. Concretely, we introduce and study Edge Clique Cover Above Independent Set (ECC/α) and Edge Clique Partition Above Independent Set (ECP/α), where the goal is to cover or partition all edges of a graph using at most α(G) + k cliques, and k is the parameter. Our main results reveal a distinct complexity landscape for the two variants. We show that ECP/α is fixed-parameter tractable, whereas ECC/α is NP-complete for all k \geq 2, yet can be solved in polynomial time for k \in {0,1}. These findings highlight intriguing differences between the two problems when viewed through the lens of parameterization above a natural lower bound.
Finally, we demonstrate that ECC/α becomes fixed-parameter tractable when parameterized by k + ω(G), where ω(G) is the size of a maximum clique of the graph G. This result is particularly relevant for sparse graphs, in which ω is typically small. For H-minor-free graphs, we design a subexponential algorithm with running time f(H)^{\sqrt{k}} n^{O(1)}.
中文摘要:本文研究了基于最大独立集规模参数化的边团覆盖与划分问题,发现边团划分具有固定参数可解性,而边团覆盖在多数情况下是NP完全问题,但结合最大团规模参数时可实现固定参数可解。
English Summary: This paper investigates parameterized complexity of edge clique cover and partition problems above the maximum independent set size, revealing that edge clique partition is fixed-parameter tractable while edge clique cover is generally NP-complete but becomes tractable with additional parameters.
Authors:Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, AndyPian Wu, Chaoyang Wang, Chengjie Wang, Taisong Jin, SevenShu, Yunsheng Wu, Yongge Liu, Rongrong Ji
Abstract:
As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.
English Summary: This paper introduces OracleFusion, a two-stage semantic typography framework that uses a multimodal large language model with enhanced spatial reasoning and structural vector fusion to generate visually enriched, semantically accurate vector fonts for deciphering Oracle Bone Script, outperforming existing models in both functionality and aesthetics.
Authors:Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji
Abstract:
Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.
中文: 提出的DALR方法通过双重对齐策略解决多模态句子表征中的跨模态错位和模态内语义偏差问题,结合软化负样本的一致性学习和排序蒸馏技术,在多项基准任务中展现出优越性能。
English: The proposed DALR method addresses cross-modal misalignment and intra-modal semantic divergence in multimodal sentence representation by introducing a dual-level alignment approach that combines softened negative sample consistency learning with ranking distillation, achieving superior results on benchmark tasks.
Authors:Tianxing Zhou, Zhirui Wang, Haojia Ao, Guangyan Chen, Boyang Xing, Jingwen Cheng, Yi Yang, Yufeng Yue
Abstract:
The ability to perform reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators often results in low success rates due to their limited reasoning ability for long-horizon embodied tasks. To address this, we propose the STEP framework, in which we construct a subgoal tree through a pair of closed-loop models: a subgoal decomposition model and a leaf-node termination model. Within this framework, we develop a hierarchical tree structure that spans from coarse to fine resolutions. The subgoal decomposition model leverages a foundation LLM to break down complex goals into manageable subgoals, thereby spanning the subgoal tree. The leaf-node termination model provides real-time feedback based on environmental states, determining when to terminate the tree spanning and ensuring each leaf node can be directly converted into a primitive action. Experiments conducted on both the VirtualHome WAH-NL benchmark and real robots demonstrate that STEP achieves long-horizon embodied task completion with success rates of up to 34% (WAH-NL) and 25% (real robot), outperforming SOTA methods.
Chinese: STEP框架通过子目标分解与终止模型构建分层任务树,显著提升机器人在复杂长期任务中的完成成功率,优于现有先进方法。
English: The STEP framework enhances robotic task planning by using a hierarchical subgoal tree with decomposition and termination models, significantly improving long-horizon task success rates over existing methods.
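The subgoal-tree construction can be sketched as a recursion driven by two models: expand a goal until the termination model declares it primitive. Both `llm_decompose` and `is_primitive` below are toy stubs standing in for the paper's LLM-based models.

```python
# Build a subgoal tree: decompose goals until each leaf maps to a primitive action.
def llm_decompose(goal):
    """Stub for the LLM subgoal decomposition model."""
    table = {"serve coffee": ["make coffee", "bring cup to table"],
             "make coffee": ["grab mug", "pour coffee"]}
    return table.get(goal, [])

def is_primitive(goal, env_state):
    """Stub for the environment-conditioned leaf-node termination model."""
    return goal not in {"serve coffee", "make coffee"}

def build_subgoal_tree(goal, env_state, depth=0, max_depth=5):
    """Recursively expand goals from coarse to fine resolution."""
    if depth >= max_depth or is_primitive(goal, env_state):
        return {"goal": goal, "children": []}
    children = [build_subgoal_tree(g, env_state, depth + 1, max_depth)
                for g in llm_decompose(goal)]
    return {"goal": goal, "children": children}

print(build_subgoal_tree("serve coffee", env_state={}))
```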
Authors:Abdul Basit, Minghao Shao, Muhammad Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique
Abstract:
Recent advances in Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their effectiveness in quantum computing remains underexplored. This paper benchmarks LLMs for PennyLane-based quantum code generation using real-world challenges from the Quantum Hackathon (QHack). We introduce QHackBench, a novel benchmark dataset derived from QHack competitions, and evaluate model performance under vanilla prompting and Retrieval-Augmented Generation (RAG). Our structured evaluation framework assesses functional correctness, syntactic validity, and execution success across varying challenge difficulties. Results indicate that RAG-enhanced models, supplemented with an augmented PennyLane dataset, generate results roughly comparable to standard prompting, particularly on complex quantum algorithms. Additionally, we introduce a multi-agent evaluation pipeline that iteratively refines incorrect solutions, further enhancing execution success rates. To foster further research, we commit to publicly releasing QHackBench, along with our evaluation framework and experimental results, enabling continued advancements in AI-assisted quantum programming.
中文: 本文基于量子黑客马拉松的实际挑战对大型语言模型进行PennyLane量子代码生成的基准测试,发现检索增强生成与标准提示效果相当,同时提出了能提高解决方案成功率的多智能体评估流程。
English: This paper benchmarks large language models for generating PennyLane quantum code using real-world challenges from QHack, finding that retrieval-augmented generation performs comparably to standard prompting while introducing an evaluation pipeline that improves solution success rates.
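The RAG setup being benchmarked can be sketched as: embed the challenge, rank documentation snippets by cosine similarity, and prepend the top-k snippets to the prompt. The toy `embed` function below is a deterministic stand-in for a real sentence-embedding model, not the QHackBench pipeline.

```python
import numpy as np

# Toy deterministic "encoder" so the sketch runs without a real model.
def embed(text, dim=64):
    rng = np.random.default_rng(sum(map(ord, text)))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def build_rag_prompt(question, docs, k=2):
    """Rank docs by cosine similarity to the question; prepend the top-k."""
    q = embed(question)
    scores = [float(q @ embed(d)) for d in docs]
    top = sorted(range(len(docs)), key=lambda i: -scores[i])[:k]
    context = "\n".join(docs[i] for i in top)
    return f"Context:\n{context}\n\nTask:\n{question}"

docs = ["qml.RX applies a rotation about the X axis.",
        "qml.expval returns an expectation value.",
        "PennyLane devices are created with qml.device."]
print(build_rag_prompt("Write a circuit that rotates a qubit.", docs))
```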
Authors:Shulan Ruan, Rongwei Wang, Xuchen Shen, Huijie Liu, Baihui Xiao, Jun Shi, Kun Zhang, Zhenya Huang, Yu Liu, Enhong Chen, You He
Abstract:
Multi-sensor fusion perception (MSFP) is a key technology for embodied AI, which can serve a variety of downstream tasks (e.g., 3D object detection and semantic segmentation) and application scenarios (e.g., autonomous driving and swarm robotics). Recently, impressive achievements on AI-based MSFP methods have been reviewed in relevant surveys. However, we observe that the existing surveys have some limitations after a rigorous and detailed investigation. For one thing, most surveys are oriented to a single task or research field, such as 3D object detection or autonomous driving. Therefore, researchers in other related tasks often find it difficult to benefit directly. For another, most surveys only introduce MSFP from a single perspective of multi-modal fusion, while lacking consideration of the diversity of MSFP methods, such as multi-view fusion and time-series fusion. To this end, in this paper, we hope to organize MSFP research from a task-agnostic perspective, where methods are reported from various technical views. Specifically, we first introduce the background of MSFP. Next, we review multi-modal and multi-agent fusion methods. A step further, time-series fusion methods are analyzed. In the era of LLM, we also investigate multimodal LLM fusion methods. Finally, we discuss open challenges and future directions for MSFP. We hope this survey can help researchers understand the important progress in MSFP and provide possible insights for future research.
中文摘要:多传感器融合感知作为具身智能的关键技术应用广泛,但现有综述因局限于单一任务领域和技术视角而存在不足,本文由此从任务无关角度系统梳理了多模态、多智能体、时序及大语言模型融合等多维技术方法。
English Summary: Multi-sensor fusion perception is a crucial embodied AI technology with broad applications, yet existing surveys are limited by task-specific focus and narrow technical perspectives, prompting this comprehensive review that organizes methods across multiple dimensions including multimodal, multi-agent, temporal, and LLM-based fusion.
Authors:Yue Zhou, Yuan Bi, Wenjuan Tong, Wei Wang, Nassir Navab, Zhongliang Jiang
Abstract:
Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.
中文:UltraAD是一种基于视觉语言模型的方法,利用少量超声样本通过融合图像与文本特征提升异常定位能力,并借助存储文本和图像嵌入的记忆库实现细粒度分类,在乳腺超声数据集的异常定位和分类任务中均优于现有先进方法。
English: UltraAD is a vision-language model that uses few-shot ultrasound examples to improve anomaly localization through fused image-text features and enables fine-grained classification via a memory bank of text and image embeddings, outperforming existing methods on breast ultrasound datasets.
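The memory-bank idea for few-shot fine-grained classification can be sketched as nearest-prototype matching: average the few-shot features of each class into a prototype and classify a query by cosine similarity. Features below are placeholders for encoder outputs; this is not the paper's pipeline.

```python
import numpy as np

def build_memory_bank(features, labels):
    """Average the few-shot features of each class into one unit-norm prototype."""
    bank = {}
    for c in set(labels):
        idx = [i for i, l in enumerate(labels) if l == c]
        proto = np.mean([features[i] for i in idx], axis=0)
        bank[c] = proto / np.linalg.norm(proto)
    return bank

def classify(query, bank):
    """Return the class whose prototype is most similar to the query."""
    q = query / np.linalg.norm(query)
    return max(bank, key=lambda c: float(q @ bank[c]))

feats = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([0.1, 1.0])]
bank = build_memory_bank(feats, ["benign", "benign", "malignant"])
print(classify(np.array([0.2, 0.9]), bank))  # malignant
```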
Authors:Xuesong Li, Dianye Huang, Yameng Zhang, Nassir Navab, Zhongliang Jiang
Abstract:
Understanding medical ultrasound imaging remains a long-standing challenge due to significant visual variability caused by differences in imaging and acquisition parameters. Recent advancements in large language models (LLMs) have been used to automatically generate terminology-rich summaries oriented toward clinicians with sufficient physiological knowledge. Nevertheless, the increasing demand for improved ultrasound interpretability and basic scanning guidance among non-expert users, e.g., in point-of-care settings, has not yet been explored. In this study, we first introduce the scene graph (SG) for ultrasound images to explain image content to ordinary users and provide guidance for ultrasound scanning. The ultrasound SG is first computed using a transformer-based one-stage method, eliminating the need for explicit object detection. To generate a graspable image explanation for ordinary users, the user query is then used to further refine the abstract SG representation through LLMs. Additionally, the predicted SG is explored for its potential in guiding ultrasound scanning toward missing anatomies within the current imaging view, assisting ordinary users in achieving more standardized and complete anatomical exploration. The effectiveness of this SG-based image explanation and scanning guidance has been validated on images from the left and right neck regions, including the carotid and thyroid, across five volunteers. The results demonstrate the potential of the method to maximally democratize ultrasound by enhancing its interpretability and usability for ordinary users.
中文: 本研究提出了一种基于变压器和大型语言模型的场景图方法,旨在提高非专业用户对超声图像的理解能力并提供扫描指导,通过在颈部区域的验证展示了该方法在普及超声应用方面的潜力。
English: This study introduces a scene graph method using transformers and large language models to enhance ultrasound image interpretability and provide scanning guidance for non-experts, validated on neck region images to democratize ultrasound usage.
Authors:Zeyu Li, Chuanfu Xiao, Yang Wang, Xiang Liu, Zhenheng Tang, Baotong Lu, Mao Yang, Xinyu Chen, Xiaowen Chu
Abstract:
Quantization has emerged as an effective and lightweight solution to reduce the memory footprint of the KV cache in Large Language Models (LLMs). Nevertheless, minimizing the performance degradation caused by ultra-low-bit KV cache quantization remains a significant challenge. We observe that quantizing the KV cache of different tokens has varying impacts on the quality of attention outputs. To systematically investigate this phenomenon, we perform forward error propagation analysis on attention and propose the Anchor Score (AnS) that quantifies the sensitivity of each token's KV cache to quantization-induced error. Our analysis reveals significant disparities in AnS across tokens, suggesting that preserving a small subset of high-AnS tokens in full precision (FP16) can greatly mitigate accuracy loss in aggressive quantization scenarios. Based on this insight, we introduce AnTKV, a novel framework that leverages Anchor Token-aware Vector Quantization to compress the KV cache. Furthermore, to support efficient deployment, we design and develop a Triton kernel that is fully compatible with FlashAttention, enabling fast online Anchor Token selection. AnTKV enables LLaMA-3-8B to handle context lengths up to 840K tokens on a single 80GB A100 GPU, while achieving up to 3.5x higher decoding throughput compared to the FP16 baseline. Our experimental results demonstrate that AnTKV matches or outperforms prior works such as KIVI, SKVQ, KVQuant, and CQ under 4-bit settings. More importantly, AnTKV achieves significantly lower perplexity under ultra-low-bit quantization on Mistral-7B, with only 6.32 at 1-bit and 8.87 at 0.375-bit, compared to the FP16 baseline of 4.73.
中文: 量化虽能有效减少大语言模型中KV缓存的存储占用,但如何降低其导致的性能损失仍是挑战,为此提出的AnTKV框架通过选择性保留高敏感度令牌的全精度,在实现显著内存压缩和更高吞吐量的同时保持了模型精度。
English: Quantization effectively reduces KV cache memory in LLMs, but minimizing performance loss remains challenging, leading to the development of AnTKV, a framework that selectively preserves high-sensitivity tokens in full precision to maintain accuracy while enabling significant memory savings and higher throughput.
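The anchor-token idea can be sketched as: score each cached token, keep the highest-scoring fraction in full precision, and quantize the rest to low-bit integers. The scoring rule below (an arbitrary per-token score) and the uniform 4-bit quantizer are stand-ins for the paper's Anchor Score and vector quantizer.

```python
import numpy as np

def quantize_kv(kv, scores, keep_frac=0.05, bits=4):
    """Keep the top-scoring tokens in full precision; quantize the rest."""
    n = kv.shape[0]
    n_keep = max(1, int(n * keep_frac))
    anchors = np.argsort(scores)[-n_keep:]          # high-sensitivity tokens
    out = np.empty_like(kv)
    out[anchors] = kv[anchors]                      # preserved exactly
    rest = np.setdiff1d(np.arange(n), anchors)
    lo, hi = kv[rest].min(), kv[rest].max()
    levels = 2 ** bits - 1
    q = np.round((kv[rest] - lo) / (hi - lo + 1e-8) * levels)
    out[rest] = q / levels * (hi - lo) + lo         # dequantized low-bit values
    return out

kv = np.random.default_rng(0).normal(size=(16, 8))
scores = np.random.default_rng(1).random(16)
print(np.abs(quantize_kv(kv, scores) - kv).mean())  # small reconstruction error
```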
Authors:Yuchang Zhu, Huazhen Zhong, Qunshu Lin, Haotong Wei, Xiaolong Sun, Zixuan Yu, Minghao Liu, Zibin Zheng, Liang Chen
Abstract:
With the remarkable generative capabilities of large language models (LLMs), using LLM-generated data to train downstream models has emerged as a promising approach to mitigate data scarcity in specific domains and reduce time-consuming annotations. However, recent studies have highlighted a critical issue: iterative training on self-generated data results in model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect the importance of data diversity, a key factor in data quality. In this work, we aim to understand the implications of the diversity of LLM-generated data on downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data affect downstream model performance. Additionally, we investigate the performance of models trained on data that mixes different proportions of LLM-generated data, which we refer to as synthetic data. Our experimental results show that, with minimal distribution shift, moderately diverse LLM-generated data can enhance model performance in scenarios with insufficient labeled data, whereas highly diverse generated data has a negative impact. We hope our empirical findings will offer valuable guidance for future studies on LLMs as data generators.
中文摘要:在标注数据不足时,适度多样的大语言模型生成数据能提升下游模型性能,但过高多样性会产生负面影响,这为LLM作为数据生成器的应用提供了关键指导。
English Summary: Using moderately diverse LLM-generated data can improve model performance when labeled data is scarce, but excessive diversity negatively impacts results, highlighting the importance of balanced data composition for effective training.
Authors:Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, Mohit Bansal, Joyce Chai, Hao Tan
Abstract:
Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimization-based, geometry-based, or generative, that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in one forward pass in less than 1.5 seconds on a single A100 GPU.
中文: 4D-LRM是首个大规模四维重建模型,能从任意视角和时间点输入,通过统一时空表征渲染出任意新视角和时间的组合,实现高效高质量的四维重建。
English: 4D-LRM is the first large-scale model that reconstructs objects from sparse views and timestamps to render any view-time combination, achieving fast and high-quality 4D reconstruction through unified space-time representation.
Authors:Xinyao Li, Jingjing Li, Fengling Li, Lei Zhu, Yang Yang, Heng Tao Shen
Abstract:
Recently, vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities, resulting in powerful vision-language models (VLMs). Leveraging web-scale pretraining data, these models exhibit strong zero-shot capabilities. However, their performance often deteriorates when confronted with domain-specific or specialized generalization tasks. To address this, a growing body of research focuses on transferring or generalizing the rich knowledge embedded in VLMs to various downstream applications. This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarking, and results in the VLM literature. Delving into the typical VLM structures, the current literature is categorized into prompt-based, parameter-based, and feature-based methods according to the transferred modules. The differences and characteristics of each category are further summarized and discussed by revisiting the typical transfer learning (TL) settings, providing novel interpretations for TL in the era of VLMs. Popular benchmarks for VLM generalization are further introduced with thorough performance comparisons among the reviewed methods. Following the advances in large-scale generalizable pretraining, this survey also discusses the relations and differences between VLMs and up-to-date multimodal large language models (MLLMs), e.g., DeepSeek-VL. By systematically reviewing the surging literature in vision-language research from a novel and practical generalization perspective, this survey contributes a clear landscape of current and future multimodal research.
中文: 视觉语言预训练技术虽在多模态模型中展现出强大的零样本能力,但在专业任务中表现欠佳,因此研究聚焦于泛化方法及下游应用的基准测试与性能优化。
English: Vision-language pretraining has advanced multimodal models with strong zero-shot abilities, but their performance declines in specialized tasks, prompting research into generalization methods and benchmarking for downstream applications.
Authors:Jinjie Wei, Jiyao Liu, Lihao Liu, Ming Hu, Junzhi Ning, Mingcheng Li, Weijie Yin, Junjun He, Xiao Liang, Chao Feng, Dingkang Yang
Abstract:
Graphical User Interface (GUI) agents have made significant progress in automating digital tasks through the utilization of computer vision and language models. Nevertheless, existing agent systems encounter notable limitations. Firstly, they predominantly depend on trial-and-error decision making rather than progressive reasoning, thereby lacking the capability to learn and adapt from interactive encounters. Secondly, these systems are assessed using overly simplistic single-step accuracy metrics, which do not adequately reflect the intricate nature of real-world GUI interactions. In this paper, we present CogniGUI, a cognitive framework developed to overcome these limitations by enabling adaptive learning for GUI automation resembling human-like behavior. Inspired by Kahneman's Dual Process Theory, our approach combines two main components: (1) an omni-parser engine that conducts immediate hierarchical parsing of GUI elements through quick visual-semantic analysis to identify actionable components, and (2) a Group-based Relative Policy Optimization (GRPO) grounding agent that assesses multiple interaction paths using a unique relative reward system, promoting minimal and efficient operational routes. This dual-system design facilitates iterative "exploration-learning-mastery" cycles, enabling the agent to enhance its strategies over time based on accumulated experience. Moreover, to assess the generalization and adaptability of agent systems, we introduce ScreenSeek, a comprehensive benchmark that includes multi-application navigation, dynamic state transitions, and cross-interface coherence, which are often overlooked challenges in current benchmarks. Experimental results demonstrate that CogniGUI surpasses state-of-the-art methods on both current GUI grounding benchmarks and our newly proposed benchmark.
中文: 摘要介绍了CogniGUI认知框架,它通过模拟人类推理的双组件设计实现自适应学习以克服GUI代理的局限,并提出了ScreenSeek新基准来评估系统的泛化与适应能力。
English: The abstract introduces CogniGUI, a cognitive framework that overcomes limitations in GUI agents by enabling adaptive learning through dual components inspired by human reasoning, and presents ScreenSeek, a new benchmark for evaluating generalization and adaptability.
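At the heart of GRPO-style training is group-relative reward normalization: candidate interaction paths are scored as a group, and each path's advantage is its reward standardized against the group. The rewards below are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Standardize each candidate's reward against its group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four candidate GUI interaction paths for the same subtask; paths with
# higher reward receive positive advantage, steering the policy toward them.
print(group_relative_advantages([1.0, 0.2, 0.7, 0.1]))
```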
Authors:Dalong Zhang, Jun Xu, Jun Zhou, Lei Liang, Lin Yuan, Ling Zhong, Mengshu Sun, Peilong Zhao, QiWei Wang, Xiaorui Wang, Xinkai Du, YangYang Hou, Yu Ao, ZhaoYang Wang, Zhengke Gui, ZhiYing Yi, Zhongpu Bo, Haofen Wang, Huajun Chen
Abstract:
In this paper, we introduce KAG-Thinker, which upgrades KAG to a multi-turn interactive thinking and deep reasoning framework powered by a dedicated parameter-light large language model (LLM). Our approach constructs a structured thinking process for solving complex problems, enhancing the logical coherence and contextual consistency of the reasoning process in question-answering (Q&A) tasks on domain-specific knowledge bases (KBs) within LLMs. Following the \textbf{Logical Form} guided retrieval and reasoning technology route of KAG, this framework first decomposes complex questions into independently solvable sub-problems (also referred to as logical forms) through \textbf{breadth decomposition}. Each such logical form is represented in two equivalent forms, natural language and logical function, and subsequently classified as either a Knowledge Retrieval or Reasoning Analysis task. Dependencies and parameter passing between these tasks are explicitly modeled via logical function interfaces. In the solving process, the Retrieval function performs retrieval tasks, retrieving one-hop structured and unstructured information of a specified knowledge unit, while the Math and Deduce functions perform reasoning analysis tasks. Secondly, it is worth noting that, in the Knowledge Retrieval sub-problem tasks, LLMs and external knowledge sources are regarded as equivalent KBs. We use the \textbf{knowledge boundary} module to determine the optimal source using self-regulatory mechanisms such as confidence calibration and reflective reasoning, and use the \textbf{depth solving} module to enhance the comprehensiveness of knowledge acquisition...
中文: 本文提出KAG-Thinker框架,通过将复杂问题分解为结构化子问题,并利用轻量化大语言模型进行多轮交互式思考与深度推理,提升领域知识问答中的逻辑连贯性。
English: This paper presents KAG-Thinker, a framework that enhances multi-turn interactive thinking and deep reasoning by decomposing complex questions into structured sub-problems and leveraging parameter-light LLMs for logical coherence in domain-specific Q&A tasks.
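To make the logical-function interface concrete, here is a toy sketch of dispatching decomposed sub-problems to Retrieval, Math, and Deduce handlers with explicit parameter passing between them. The handlers, the tiny key-value KB, and the example question are all illustrative stubs, not the system's actual interfaces.

```python
# Toy handlers for the three logical-function types.
def retrieval(entity, relation, kb):
    """One-hop lookup of a knowledge unit in a key-value KB."""
    return kb.get((entity, relation))

def math_fn(expr, bindings):
    """Toy arithmetic evaluator over named bindings."""
    return eval(expr, {}, bindings)

def deduce(premise, value):
    return f"Since {premise} = {value}, the answer is {value}."

kb = {("France", "capital"): "Paris", ("Paris", "population_millions"): 2.1}

# Logical-form plan for: "Twice the population (in millions) of France's capital?"
city = retrieval("France", "capital", kb)              # Knowledge Retrieval
pop = retrieval(city, "population_millions", kb)       # Knowledge Retrieval
doubled = math_fn("2 * pop", {"pop": pop})             # Math
print(deduce("2 * population of " + city, doubled))    # Deduce
```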
Authors:Yuzhe Ding, Kang He, Bobo Li, Li Zheng, Haijun He, Fei Li, Chong Teng, Donghong Ji
Abstract:
Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the increasing number of online debates among social media users, conversational stance detection has become a crucial research area. However, existing conversational stance detection datasets are restricted to a limited set of specific targets, which constrains the effectiveness of stance detection models when encountering a large number of unseen targets in real-world applications. To bridge this gap, we manually curate a large-scale, high-quality zero-shot conversational stance detection dataset, named ZS-CSD, comprising 280 targets across two distinct target types. Leveraging the ZS-CSD dataset, we propose SITPCL, a speaker interaction and target-aware prototypical contrastive learning model, and establish the benchmark performance in the zero-shot setting. Experimental results demonstrate that our proposed SITPCL model achieves state-of-the-art performance in zero-shot conversational stance detection. Notably, the SITPCL model attains only an F1-macro score of 43.81%, highlighting the persistent challenges in zero-shot conversational stance detection.
中文摘要:本研究提出了大规模零样本对话立场检测数据集ZS-CSD和SITPCL模型,该模型取得了最先进性能,但其43.81%的F1宏平均值仍凸显了该领域存在的持续挑战。
English Summary: This study introduces ZS-CSD, a large-scale zero-shot conversational stance detection dataset, and proposes the SITPCL model which achieves state-of-the-art performance while highlighting remaining challenges with its 43.81% F1-macro score.
Authors:Junze Chen, Xinjie Yang, Cheng Yang, Junfei Bao, Zeyuan Guo, Yawen Li, Chuan Shi
Abstract:
Recommender systems (RSs) are designed to retrieve candidate items a user might be interested in from a large pool. A common approach is using graph neural networks (GNNs) to capture high-order interaction relationships. As large language models (LLMs) have shown strong capabilities across domains, researchers are exploring their use to enhance recommendation. However, prior work limits LLMs to re-ranking results or dataset augmentation, failing to utilize their power during candidate filtering, which may lead to suboptimal performance. Instead, we propose to leverage LLMs' reasoning abilities during the candidate filtering process, and introduce Chain Of Retrieval ON grAphs (CORONA) to progressively narrow down the range of candidate items on interaction graphs with the help of LLMs: (1) First, the LLM performs preference reasoning based on user profiles, with the response serving as a query to extract relevant users and items from the interaction graph as preference-assisted retrieval; (2) Then, using the information retrieved in the previous step along with the purchase history of the target user, the LLM conducts intent reasoning to help refine an even smaller interaction subgraph as intent-assisted retrieval; (3) Finally, we employ a GNN to capture high-order collaborative filtering information from the extracted subgraph, performing GNN-enhanced retrieval to generate the final recommendation results. The proposed framework leverages the reasoning capabilities of LLMs during the retrieval process, while seamlessly integrating GNNs to enhance overall recommendation performance. Extensive experiments on various datasets and settings demonstrate that our proposed CORONA achieves state-of-the-art performance with an 18.6% relative improvement in recall and an 18.4% relative improvement in NDCG on average.
中文: 本文提出CORONA框架,通过在大语言模型指导下进行偏好推理和意图推理来逐步缩小交互图中的候选物品范围,然后利用图神经网络进行增强检索,最终实现推荐性能的显著提升,在召回率和NDCG指标上平均分别相对提高18.6%和18.4%。
English: This paper introduces CORONA, a novel framework that leverages large language models' reasoning abilities during candidate filtering by progressively narrowing down items on interaction graphs through preference and intent reasoning, then using GNNs for enhanced retrieval to achieve state-of-the-art recommendation performance with significant improvements in recall and NDCG.
Authors:Matthias Bentert, Fedor V. Fomin, Petr A. Golovach, Laure Morelle
Abstract:
In the problem Fault-Tolerant Path (FTP), we are given an edge-weighted directed graph G = (V, E), a subset U \subseteq E of vulnerable edges, two vertices s, t \in V, and integers k and \ell. The task is to decide whether there exists a subgraph H of G with total cost at most \ell such that, after the removal of any k vulnerable edges, H still contains an s-t-path. We study whether Fault-Tolerant Path is fixed-parameter tractable (FPT) and whether it admits a polynomial kernel under various parameterizations. Our choices of parameters include: the number of vulnerable edges in the input graph, the number of safe (i.e., invulnerable) edges in the input graph, the budget \ell, the minimum number of safe edges in any optimal solution, the minimum number of vulnerable edges in any optimal solution, the required redundancy k, and natural above- and below-guarantee parameterizations. We provide an almost complete description of the complexity landscape of FTP for these parameters.
中文: 容错路径问题研究在移除最多k条易损边后仍能保持两点间连通性的低成本子图是否存在,并通过多种参数化方法全面分析了该问题的固定参数可解性及核化复杂性。
English: The Fault-Tolerant Path problem examines whether a subgraph exists with limited cost that maintains connectivity between two vertices even after removing up to k vulnerable edges, and this study comprehensively analyzes its fixed-parameter tractability and kernelization complexity across multiple parameterizations.
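The fault-tolerance property itself is easy to check by brute force: after deleting any k vulnerable edges, the subgraph H must still contain an s-t path. The sketch below is exponential in the number of vulnerable edges and only illustrates the definition, not an algorithm from the paper.

```python
from itertools import combinations

def reaches(edges, s, t):
    """Directed reachability by iterative graph search over an edge list."""
    frontier, seen = [s], {s}
    while frontier:
        u = frontier.pop()
        if u == t:
            return True
        for (a, b) in edges:
            if a == u and b not in seen:
                seen.add(b)
                frontier.append(b)
    return False

def is_fault_tolerant(H, vulnerable, s, t, k):
    """Check that every k-subset of vulnerable edges can fail safely."""
    vuln_in_H = [e for e in H if e in vulnerable]
    for removed in combinations(vuln_in_H, min(k, len(vuln_in_H))):
        if not reaches([e for e in H if e not in set(removed)], s, t):
            return False
    return True

# Two vulnerable parallel routes s->a->t and s->b->t survive any single failure.
H = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
print(is_fault_tolerant(H, set(H), "s", "t", k=1))  # True
```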
Authors:Giuseppe Lando, Rosario Forte, Giovanni Maria Farinella, Antonino Furnari
Abstract:
We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline converts a streaming egocentric video into a lightweight textual memory, only a few kilobytes per minute, via an MLLM descriptor module, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with 3.6 kB per minute storage, matching the performance of dedicated state-of-the-art systems while being 10**4/10**5 times more memory-efficient. Extensive ablations provides insights into the role of each component and design choice, and highlight directions of improvement for future research.
Chinese: 研究表明,现成的多模态大语言模型无需额外训练即可通过将流式视频转换为轻量级文本记忆并利用大语言模型推理,实现在线情景记忆视频问答,在保持竞争力的准确率同时,其存储效率比专用系统高出数个数量级。
English: This study demonstrates that off-the-shelf multimodal large language models can effectively perform online episodic-memory video question answering by converting streaming videos into compact textual memories and leveraging LLM reasoning, achieving competitive accuracy with drastically superior memory efficiency compared to specialized systems.
Authors:Chenyi Zhou, Zhengyan Shi, Yuan Yao, Lei Liang, Huajun Chen, Qiang Zhang
Abstract:
Recent advancements in large language models (LLMs) have highlighted their potential across a variety of tasks, but their performance still heavily relies on the design of effective prompts. Existing methods for automatic prompt optimization face two challenges: a lack of diversity, which limits the exploration of valuable and innovative directions, and semantic drift, where optimizations for one task can degrade performance on others. To address these issues, we propose Residual Optimization Tree (RiOT), a novel framework for automatic prompt optimization. RiOT iteratively refines prompts through text gradients, generating multiple semantically diverse candidates at each step, and selects the best prompt using perplexity. Additionally, RiOT incorporates a text residual connection to mitigate semantic drift by selectively retaining beneficial content across optimization iterations. A tree structure efficiently manages the optimization process, ensuring scalability and flexibility. Extensive experiments across five benchmarks, covering commonsense, mathematical, logical, temporal, and semantic reasoning, demonstrate that RiOT outperforms both previous prompt optimization methods and manual prompting.
中文: 提出的残差优化树(RiOT)框架通过生成多样化候选提示并采用文本残差连接缓解语义漂移,在多项推理基准测试中超越了现有提示优化方法。
English: The proposed Residual Optimization Tree (RiOT) framework addresses limitations in automatic prompt optimization by generating diverse candidates and mitigating semantic drift through text residual connections, outperforming existing methods across multiple reasoning benchmarks.
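The perplexity-based selection step can be sketched as: among several refined candidates, keep the one whose tokens have the lowest perplexity, where perplexity = exp(mean negative log-likelihood). The `score_fn` stub below stands in for a language-model scoring call; it is not RiOT's implementation.

```python
import math

def perplexity(token_logprobs):
    """perplexity = exp(-mean token log-likelihood)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_best(candidates, score_fn):
    """Keep the candidate prompt with the lowest perplexity."""
    return min(candidates, key=lambda c: perplexity(score_fn(c)))

# Stub scorer: pretend shorter prompts get higher per-token log-likelihood.
score_fn = lambda c: [-0.1 * len(c.split())] * max(1, len(c.split()))
print(select_best(["Answer the question step by step.",
                   "Answer carefully."], score_fn))
```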
Authors:Xu Zhao, Chen Zhao, Xiantao Hu, Hongliang Zhang, Ying Tai, Jian Yang
Abstract:
Recent advancements in multi-scale architectures have demonstrated exceptional performance in image denoising tasks. However, existing architectures mainly depend on a fixed single-input single-output U-Net architecture, ignoring pixel-level multi-scale representations. In addition, previous methods treat the frequency domain uniformly, ignoring the different characteristics of high-frequency and low-frequency noise. In this paper, we propose a novel multi-scale adaptive dual-domain network (MADNet) for image denoising. We use image pyramid inputs to restore noise-free results from low-resolution images. To realize the interaction of high-frequency and low-frequency information, we design an adaptive spatial-frequency learning unit (ASFU), where a learnable mask separates the information into high-frequency and low-frequency components. In the skip connections, we design a global feature fusion block to enhance the features at different scales. Extensive experiments on both synthetic and real noisy image datasets verify the effectiveness of MADNet compared with current state-of-the-art denoising approaches.
中文: 本文提出MADNet多尺度自适应双域网络,通过采用图像金字塔输入、设计自适应空频学习单元分离高低频噪声,并结合全局特征融合模块,有效克服了现有去噪方法的局限,在合成与真实噪声数据集上均取得了领先性能。
English: This paper introduces MADNet, a multi-scale adaptive dual-domain network that addresses limitations in existing denoising architectures by incorporating image pyramid inputs, an adaptive spatial-frequency learning unit to separate high and low-frequency noise, and a global feature fusion block, achieving superior performance on synthetic and real noisy datasets.
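The frequency-separation idea can be illustrated with a fixed radial mask in the Fourier domain, the non-learned analogue of the ASFU's learnable mask; the cutoff value is a free parameter of this sketch.

```python
import numpy as np

def frequency_split(img, cutoff=0.1):
    """Split an image into low- and high-frequency parts with a radial mask."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
    low_mask = (radius <= cutoff).astype(float)
    low = np.fft.ifft2(np.fft.ifftshift(F * low_mask)).real
    high = img - low          # residual carries the high frequencies
    return low, high

img = np.random.default_rng(0).random((64, 64))
low, high = frequency_split(img)
print(np.allclose(low + high, img))  # True: the split is exact
```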
Authors:Zhenxuan Zhang, Lipei Zhang, Yanqi Cheng, Zi Wang, Fanwen Wang, Haosen Zhang, Yue Yang, Yinzhe Wu, Jiahao Huang, Angelica I Aviles-Rivero, Zhifan Gao, Guang Yang, Peter J. Lally
Abstract:
In motion-robust magnetic resonance imaging (MRI), slice-to-volume reconstruction is critical for recovering anatomically consistent 3D brain volumes from 2D slices, especially under accelerated acquisitions or patient motion. However, this task remains challenging due to hierarchical structural disruptions. It includes local detail loss from k-space undersampling, global structural aliasing caused by motion, and volumetric anisotropy. Therefore, we propose a progressive refinement implicit neural representation (PR-INR) framework. Our PR-INR unifies motion correction, structural refinement, and volumetric synthesis within a geometry-aware coordinate space. Specifically, a motion-aware diffusion module is first employed to generate coarse volumetric reconstructions that suppress motion artifacts and preserve global anatomical structures. Then, we introduce an implicit detail restoration module that performs residual refinement by aligning spatial coordinates with visual features. It corrects local structures and enhances boundary precision. Further, a voxel continuous-aware representation module represents the image as a continuous function over 3D coordinates. It enables accurate inter-slice completion and high-frequency detail recovery. We evaluate PR-INR on five public MRI datasets under various motion conditions (3% and 5% displacement), undersampling rates (4x and 8x) and slice resolutions (scale = 5). Experimental results demonstrate that PR-INR outperforms state-of-the-art methods in both quantitative reconstruction metrics and visual quality. It further shows generalization and robustness across diverse unseen domains.
中文: 提出的PR-INR框架通过整合运动校正、结构优化和体积合成,逐步改进运动鲁棒性磁共振成像重建,有效克服欠采样和运动导致的层次结构破坏,在多种数据集上展现出卓越性能。
English: The proposed PR-INR framework progressively refines motion-robust MRI reconstruction by integrating motion correction, structural refinement, and volumetric synthesis to overcome hierarchical disruptions from undersampling and motion, achieving superior performance across diverse datasets.
Authors:Yao Zhang, Chenyang Lin, Shijie Tang, Haokun Chen, Shijie Zhou, Yunpu Ma, Volker Tresp
Abstract:
The rapid progress of Large Language Models has advanced agentic systems in decision-making, coordination, and task execution. Yet, existing agentic system generation frameworks lack full autonomy, missing from-scratch agent generation, self-optimizing agent functionality, and collaboration, which limits adaptability and scalability. We propose SwarmAgentic, a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. To enable efficient search over system-level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback-guided updates, drawing inspiration from Particle Swarm Optimization (PSO). We evaluate our method on six real-world, open-ended, and exploratory tasks involving high-level planning, system-level coordination, and creative reasoning. Given only a task description and an objective function, SwarmAgentic outperforms all baselines, achieving a +261.8% relative improvement over ADAS on the TravelPlanner benchmark, highlighting the effectiveness of full automation in structurally unconstrained tasks. This framework marks a significant step toward scalable and autonomous agentic system design, bridging swarm intelligence with fully automated multi-agent system generation. Our code is publicly released at https://yaoz720.github.io/SwarmAgentic/.
中文: SwarmAgentic是一个全自动框架,通过语言驱动探索和群体智能原理从头构建并优化智能体系统,在复杂任务中表现出卓越性能。
English: SwarmAgentic is a fully automated framework that generates and optimizes agentic systems from scratch using language-driven exploration and swarm intelligence principles, achieving superior performance in complex tasks.
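For reference, here is the standard numeric PSO loop that inspires SwarmAgentic's population updates, shown on a toy objective. SwarmAgentic itself evolves textual system descriptions via LLM feedback rather than numeric vectors.

```python
import numpy as np

def pso(objective, dim=2, n=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Classic Particle Swarm Optimization on a numeric objective."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n, dim))        # particle positions
    v = np.zeros((n, dim))                  # particle velocities
    pbest, pbest_val = x.copy(), np.array([objective(p) for p in x])
    g = pbest[np.argmin(pbest_val)]         # global best
    for _ in range(iters):
        r1, r2 = rng.random((n, dim)), rng.random((n, dim))
        # Inertia + attraction toward personal and global bests.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([objective(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[np.argmin(pbest_val)]
    return g

print(pso(lambda p: float(np.sum(p ** 2))))  # converges near the origin
```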
Authors:Li Zheng, Sihang Wang, Hao Fei, Zuquan Peng, Fei Li, Jianming Fu, Chong Teng, Donghong Ji
Abstract:
Text-based hyperbole and metaphor detection are of great significance for natural language processing (NLP) tasks. However, due to their semantic obscurity and expressive diversity, it is rather challenging to identify them. Existing methods mostly focus on superficial text features, ignoring the associations of hyperbole and metaphor as well as the effect of implicit emotion on perceiving these rhetorical devices. To implement these hypotheses, we propose an emotion-guided hyperbole and metaphor detection framework based on bidirectional dynamic interaction (EmoBi). Firstly, the emotion analysis module deeply mines the emotion connotations behind hyperbole and metaphor. Next, the emotion-based domain mapping module identifies the target and source domains to gain a deeper understanding of the implicit meanings of hyperbole and metaphor. Finally, the bidirectional dynamic interaction module enables the mutual promotion between hyperbole and metaphor. Meanwhile, a verification mechanism is designed to ensure detection accuracy and reliability. Experiments show that EmoBi outperforms all baseline methods on four datasets. Specifically, compared to the current SoTA, the F1 score increased by 28.1% for hyperbole detection on the TroFi dataset and 23.1% for metaphor detection on the HYPO-L dataset. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing hyperbole and metaphor detection.
Chinese: 提出的EmoBi框架通过整合情感分析和双向动态交互,有效提升了夸张和隐喻的检测能力,在多个数据集上相比现有方法实现了显著性能提升。
English: The proposed EmoBi framework enhances hyperbole and metaphor detection by integrating emotion analysis and bidirectional dynamic interaction, achieving significant performance improvements over existing methods on multiple datasets.
Authors:Miaoxin Pan, Jinnan Li, Yaowen Zhang, Yi Yang, Yufeng Yue
Abstract:
Object-level SLAM offers structured and semantically meaningful environment representations, making it more interpretable and suitable for high-level robotic tasks. However, most existing approaches rely on RGB-D sensors or monocular views, which suffer from narrow fields of view, occlusion sensitivity, and limited depth perception-especially in large-scale or outdoor environments. These limitations often restrict the system to observing only partial views of objects from limited perspectives, leading to inaccurate object modeling and unreliable data association. In this work, we propose MCOO-SLAM, a novel Multi-Camera Omnidirectional Object SLAM system that fully leverages surround-view camera configurations to achieve robust, consistent, and semantically enriched mapping in complex outdoor scenarios. Our approach integrates point features and object-level landmarks enhanced with open-vocabulary semantics. A semantic-geometric-temporal fusion strategy is introduced for robust object association across multiple views, leading to improved consistency and accurate object modeling, and an omnidirectional loop closure module is designed to enable viewpoint-invariant place recognition using scene-level descriptors. Furthermore, the constructed map is abstracted into a hierarchical 3D scene graph to support downstream reasoning tasks. Extensive experiments in real-world demonstrate that MCOO-SLAM achieves accurate localization and scalable object-level mapping with improved robustness to occlusion, pose variation, and environmental complexity.
中文摘要:MCOO-SLAM提出了一种多相机全景系统,通过融合环视相机与语义几何特征,在复杂户外场景中实现了抗遮挡的精确对象建模与视点无关的地图构建,有效解决了传统SLAM的视角受限问题。
English Summary: MCOO-SLAM introduces a multi-camera omnidirectional system that overcomes limitations of conventional SLAM by integrating surround-view cameras with semantic-geometric fusion for robust object modeling and viewpoint-invariant mapping in complex outdoor environments.
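As a concrete (if highly simplified) illustration of the semantic-geometric-temporal fusion idea, the Python sketch below scores a candidate object association by combining three similarity cues. The function name, the weights w, and the kernel widths sigma_d and tau are illustrative assumptions, not values from the paper.

```python
import numpy as np

def association_score(sem_a, sem_b, centroid_a, centroid_b, t_a, t_b,
                      w=(0.5, 0.3, 0.2), sigma_d=2.0, tau=5.0):
    """Fuse semantic, geometric, and temporal similarity into one score."""
    # Semantic: cosine similarity between open-vocabulary embeddings.
    sem = float(sem_a @ sem_b / (np.linalg.norm(sem_a) * np.linalg.norm(sem_b)))
    # Geometric: Gaussian kernel on the distance between 3D centroids.
    geo = float(np.exp(-np.linalg.norm(centroid_a - centroid_b) ** 2 / (2 * sigma_d ** 2)))
    # Temporal: observations close in time are more likely the same object.
    tem = float(np.exp(-abs(t_a - t_b) / tau))
    return w[0] * sem + w[1] * geo + w[2] * tem

rng = np.random.default_rng(0)
score = association_score(rng.normal(size=512), rng.normal(size=512),
                          np.array([1.0, 2.0, 0.5]), np.array([1.2, 2.1, 0.4]),
                          t_a=10.0, t_b=10.5)
print("associate" if score > 0.6 else "new landmark", round(score, 3))
```

A real system would associate observations across all surround-view cameras and accumulate evidence over time rather than thresholding a single pairwise score.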
Authors:Jiahao You, Ziye Jia, Chao Dong, Qihui Wu, Zhu Han
Abstract:
Computation demands from the maritime Internet of Things (MIoT) have increased rapidly in recent years, and multi-access edge computing (MEC) based on unmanned aerial vehicles (UAVs) and vessels can fulfill these requirements. However, uncertain maritime tasks pose significant challenges for efficient computation offloading and resource allocation. In this paper, we focus on maritime computation offloading and resource allocation through the cooperation of UAVs and vessels, with consideration of uncertain tasks. Specifically, we propose a cooperative MEC framework for computation offloading and resource allocation, including MIoT devices, UAVs, and vessels. Then, we formulate the optimization problem to minimize the total execution time. As for the uncertain MIoT tasks, we leverage Lyapunov optimization to tackle the unpredictable task arrivals and varying computational resource availability.
By converting the long-term constraints into short-term constraints, we obtain a set of small-scale optimization problems. Further, considering the heterogeneity of actions and resources of UAVs and vessels, we reformulate the small-scale optimization problem into a Markov game (MG). Moreover, a heterogeneous-agent soft actor-critic is proposed to sequentially update various neural networks and effectively solve the MG problem.
Finally, simulations are conducted to verify the effectiveness of the proposed algorithms in addressing computation offloading and resource allocation.
中文摘要:本文提出了一种基于无人机与船舶协同的海事边缘计算框架,通过李雅普诺夫优化处理任务不确定性,并采用异构智能体强化学习方法有效解决海事计算卸载与资源分配问题。
English Summary: This paper proposes a cooperative multi-access edge computing framework using UAVs and vessels to optimize computation offloading and resource allocation for maritime IoT tasks, addressing uncertainty through Lyapunov optimization and solving the problem via a heterogeneous-agent reinforcement learning approach.
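The Lyapunov step in the abstract, converting a long-term constraint into per-slot decisions, follows the standard drift-plus-penalty pattern. The sketch below is a minimal single-agent toy: a virtual queue tracks cumulative violation of an assumed average resource budget, and each slot picks the offloading option minimizing a weighted sum of execution time and queue-scaled resource use. All constants and the random task model are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
V = 50.0       # trade-off weight between execution time and constraint drift
budget = 2.0   # assumed long-term average resource budget per slot
Q = 0.0        # virtual queue tracking constraint violation

for t in range(5):
    # Uncertain task: candidate (execution_time, resource_use) pairs,
    # e.g. offload locally, to a UAV, or to a vessel.
    options = [(rng.uniform(1, 4), rng.uniform(0.5, 3.0)) for _ in range(3)]
    # Drift-plus-penalty: the short-term problem obtained from the
    # long-term constraint is solved greedily each slot.
    time_cost, res_use = min(options, key=lambda o: V * o[0] + Q * o[1])
    # Virtual queue update keeps average resource use near the budget.
    Q = max(Q + res_use - budget, 0.0)
    print(f"slot {t}: time={time_cost:.2f} resource={res_use:.2f} Q={Q:.2f}")
```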
Authors:Yao Zhang, Hewei Gao, Haokun Chen, Weiguo Li, Yunpu Ma, Volker Tresp
Abstract:
Multimodal Large Language Models (MLLMs) excel in tasks like multimodal reasoning and cross-modal retrieval but face deployment challenges in real-world scenarios due to distributed multimodal data and strict privacy requirements. Federated Learning (FL) offers a solution by enabling collaborative model training without centralizing data. However, realizing FL for MLLMs presents significant challenges, including high computational demands, limited client capacity, substantial communication costs, and heterogeneous client data. Existing FL methods assume client-side deployment of full models, an assumption that breaks down for large-scale MLLMs due to their massive size and communication demands. To address these limitations, we propose FedNano, the first FL framework that centralizes the LLM on the server while introducing NanoEdge, a lightweight module for client-specific adaptation. NanoEdge employs modality-specific encoders, connectors, and trainable NanoAdapters with low-rank adaptation. This design eliminates the need to deploy the LLM on clients, reducing client-side storage by 95% and limiting communication overhead to only 0.01% of the model parameters. By transmitting only compact NanoAdapter updates, FedNano handles heterogeneous client data and resource constraints while preserving privacy. Experiments demonstrate that FedNano outperforms prior FL baselines, bridging the gap between MLLM scale and FL feasibility, and enabling scalable, decentralized multimodal AI systems.
中文: FedNano是一种联邦学习框架,将大语言模型集中在服务器端,客户端使用轻量级NanoEdge模块处理多模态数据,显著降低存储和通信开销,同时保持隐私和性能。
English: FedNano is a federated learning framework that centralizes the large language model on the server and uses lightweight NanoEdge modules on clients to handle multimodal data, significantly reducing storage and communication costs while maintaining privacy and performance.
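The "low-rank adaptation" mechanism behind NanoAdapters can be sketched generically: a frozen weight is augmented with a trainable low-rank residual, and only the two small factors travel between client and server. The class name, dimensions, and scaling below are assumptions; the actual NanoEdge module is richer (modality-specific encoders and connectors).

```python
import numpy as np

class LowRankAdapterSketch:
    """Frozen weight W plus a trainable low-rank residual A @ B."""

    def __init__(self, d=768, r=8, scale=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d, d)) / np.sqrt(d)  # frozen (server-side) weight
        self.A = rng.normal(size=(d, r)) * 0.01        # trainable, uploaded by client
        self.B = np.zeros((r, d))                      # trainable, uploaded by client
        self.scale = scale

    def forward(self, x):
        return x @ self.W + self.scale * (x @ self.A @ self.B)

    def communicated_fraction(self):
        return (self.A.size + self.B.size) / self.W.size

m = LowRankAdapterSketch()
_ = m.forward(np.ones(768))
print(f"client transmits {m.communicated_fraction():.2%} of this layer's parameters")
```

With r much smaller than d, the uploaded update is a small fraction of the frozen layer; the paper's 0.01% figure refers to the full model, not a single layer as in this toy.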
Authors:Kunyuan Deng, Yi Wang, Lap-Pui Chau
Abstract:
Egocentric human-object interaction (Ego-HOI) detection is crucial for intelligent agents to understand and assist human activities from a first-person perspective. However, progress has been hindered by the lack of benchmarks and methods tailored to egocentric challenges such as severe hand-object occlusion. In this paper, we introduce the real-world Ego-HOI detection task and the accompanying Ego-HOIBench, a new dataset with over 27K egocentric images and explicit, fine-grained hand-verb-object triplet annotations across 123 categories. Ego-HOIBench covers diverse daily scenarios, object types, and both single- and two-hand interactions, offering a comprehensive testbed for Ego-HOI research. Benchmarking existing third-person HOI detectors on Ego-HOIBench reveals significant performance gaps, highlighting the need for egocentric-specific solutions. To this end, we propose Hand Geometry and Interactivity Refinement (HGIR), a lightweight, plug-and-play scheme that leverages hand pose and geometric cues to enhance interaction representations. Specifically, HGIR explicitly extracts global hand geometric features from the estimated hand pose proposals, and further refines interaction features through pose-interaction attention, enabling the model to focus on subtle hand-object relationship differences even under severe occlusion. HGIR significantly improves Ego-HOI detection performance across multiple baselines, achieving new state-of-the-art results on Ego-HOIBench. Our dataset and method establish a solid foundation for future research in egocentric vision and human-object interaction understanding. Project page: https://dengkunyuan.github.io/EgoHOIBench/
中文: 本文提出了针对自我中心视角下人与物体交互检测的新数据集Ego-HOIBench,并开发了HGIR方法,通过手部几何特征增强交互表征,在严重遮挡情况下仍能实现最优性能。
English: This paper introduces Ego-HOIBench, a comprehensive dataset for egocentric human-object interaction detection addressing challenges like occlusion, and proposes HGIR, a plug-and-play method that enhances interaction features using hand geometry to achieve state-of-the-art performance.
Authors:Yixian Xu, Shengjie Luo, Liwei Wang, Di He, Chang Liu
Abstract:
Diffusion models have achieved remarkable success in generative modeling. Despite more stable training, the loss of diffusion models is not indicative of absolute data-fitting quality, since its optimal value is typically not zero but unknown, leading to confusion between large optimal loss and insufficient model capacity. In this work, we advocate the need to estimate the optimal loss value for diagnosing and improving diffusion models. We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing the training quality of mainstream diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models.
Chinese Summary: 扩散模型的实际损失值无法直接反映数据拟合质量,因其最优值非零且未知,因此需要估计最优损失以诊断训练效果并提升模型性能。
English Summary: Diffusion models require estimating their optimal loss value to accurately diagnose training quality and enhance performance, as the actual loss does not directly reflect data-fitting effectiveness due to its non-zero optimum.
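The key fact behind a closed-form optimal loss is that the MSE-optimal denoiser is the posterior mean E[x_0 | x_t]; for a finite dataset treated as an empirical distribution, this mean is a softmax-weighted average of the data points. The Monte Carlo sketch below estimates the resulting irreducible loss at one noise level; the forward model x_t = alpha*x_0 + sigma*eps and all values are simplifying assumptions rather than the paper's unified formulation.

```python
import numpy as np

def optimal_denoising_loss(data, alpha, sigma, n_mc=2000, seed=0):
    """Monte Carlo estimate of the irreducible denoising loss at one noise level."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    losses = []
    for _ in range(n_mc):
        x0 = data[rng.integers(n)]
        xt = alpha * x0 + sigma * rng.normal(size=d)
        # Posterior over data points: w_i proportional to
        # exp(-||xt - alpha*x_i||^2 / (2 sigma^2)).
        logw = -np.sum((xt - alpha * data) ** 2, axis=1) / (2 * sigma ** 2)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        x0_hat = w @ data            # posterior mean = optimal denoiser
        losses.append(np.sum((x0 - x0_hat) ** 2))
    return float(np.mean(losses))

toy = np.random.default_rng(1).normal(size=(64, 2))
print(optimal_denoising_loss(toy, alpha=0.8, sigma=0.6))  # nonzero even for the best model
```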
Authors:Guorui Zhou, Jiaxin Deng, Jinghao Zhang, Kuo Cai, Lejian Ren, Qiang Luo, Qianqian Wang, Qigen Hu, Rui Huang, Shiyao Wang, Weifeng Ding, Wuchao Li, Xinchen Luo, Xingmei Wang, Zexuan Cheng, Zixing Zhang, Bin Zhang, Boxuan Wang, Chaoyi Ma, Chengru Song, Chenhui Wang, Di Wang, Dongxue Meng, Fan Yang, Fangyu Zhang, Feng Jiang, Fuxing Zhang, Gang Wang, Guowang Zhang, Han Li, Hengrui Hu, Hezheng Lin, Hongtao Cheng, Hongyang Cao, Huanjie Wang, Jiaming Huang, Jiapeng Chen, Jiaqiang Liu, Jinghui Jia, Kun Gai, Lantao Hu, Liang Zeng, Liao Yu, Qiang Wang, Qidong Zhou, Shengzhe Wang, Shihui He, Shuang Yang, Shujie Yang, Sui Huang, Tao Wu, Tiantian He, Tingting Gao, Wei Yuan, Xiao Liang, Xiaoxiao Xu, Xugang Liu, Yan Wang, Yi Wang, Yiwu Liu, Yue Song, Yufei Zhang, Yunfan Wu, Yunfeng Zhao, Zhanyu Liu
Abstract:
Recommender systems have been widely used in various large-scale user-oriented platforms for many years. However, compared to the rapid developments in the AI community, recommendation systems have not achieved a breakthrough in recent years. For instance, they still rely on a multi-stage cascaded architecture rather than an end-to-end approach, leading to computational fragmentation and optimization inconsistencies, and hindering the effective application of key breakthrough technologies from the AI community in recommendation scenarios. To address these issues, we propose OneRec, which reshapes the recommendation system through an end-to-end generative approach and achieves promising results. Firstly, we have enhanced the computational FLOPs of the current recommendation model by 10× and have identified the scaling laws for recommendations within certain boundaries. Secondly, reinforcement learning techniques, previously difficult to apply for optimizing recommendations, show significant potential in this framework. Lastly, through infrastructure optimizations, we have achieved 23.7% and 28.8% Model FLOPs Utilization (MFU) on flagship GPUs during training and inference, respectively, aligning closely with the LLM community. This architecture significantly reduces communication and storage overhead, resulting in operating expenses that are only 10.6% of traditional recommendation pipelines. Deployed in the Kuaishou/Kuaishou Lite apps, it handles 25% of total queries per second, enhancing overall App Stay Time by 0.54% and 1.24%, respectively. Additionally, we have observed significant increases in metrics such as 7-day Lifetime, which is a crucial indicator of recommendation experience. We also provide practical lessons and insights derived from developing, optimizing, and maintaining a production-scale recommendation system with significant real-world impact.
中文: OneRec系统通过端到端的生成式方法革新了推荐系统架构,有效解决了传统多级流水线的计算碎片化问题,在实际部署中显著提升了计算效率、强化学习应用效果及用户参与度指标。
English: The proposed OneRec system introduces an end-to-end generative approach to overcome the limitations of traditional multi-stage recommender systems, achieving significant performance improvements, including enhanced computational efficiency, better application of reinforcement learning, and notable gains in user engagement metrics when deployed in real-world platforms.
Authors:Alsharif Abuadbba, Chris Hicks, Kristen Moore, Vasilios Mavroudis, Burak Hasircioglu, Diksha Goel, Piers Jennings
Abstract:
Large Language Models (LLMs) are set to reshape cybersecurity by augmenting red and blue team operations. Red teams can exploit LLMs to plan attacks, craft phishing content, simulate adversaries, and generate exploit code. Conversely, blue teams may deploy them for threat intelligence synthesis, root cause analysis, and streamlined documentation. This dual capability introduces both transformative potential and serious risks.
This position paper maps LLM applications across cybersecurity frameworks such as MITRE ATT&CK and the NIST Cybersecurity Framework (CSF), offering a structured view of their current utility and limitations. While LLMs demonstrate fluency and versatility across various tasks, they remain fragile in high-stakes, context-heavy environments. Key limitations include hallucinations, limited context retention, poor reasoning, and sensitivity to prompts, which undermine their reliability in operational settings.
Moreover, real-world integration raises concerns around dual-use risks, adversarial misuse, and diminished human oversight. Malicious actors could exploit LLMs to automate reconnaissance, obscure attack vectors, and lower the technical threshold for executing sophisticated attacks.
To ensure safer adoption, we recommend maintaining human-in-the-loop oversight, enhancing model explainability, integrating privacy-preserving mechanisms, and building systems robust to adversarial exploitation. As organizations increasingly adopt AI-driven cybersecurity, a nuanced understanding of LLMs' risks and operational impacts is critical to securing their defensive value while mitigating unintended consequences.
中文: 大型语言模型能够增强网络安全的攻防能力,但也存在误判和双重用途风险,需通过人工监督和防护机制确保安全应用。
English: Large Language Models (LLMs) can enhance both offensive and defensive cybersecurity operations but pose risks like hallucinations and dual-use threats, requiring human oversight and robust safeguards for secure adoption.
Authors:Wanlong Liu, Junxiao Xu, Fei Yu, Yukang Lin, Ke Ji, Wenyu Chen, Yan Xu, Yasheng Wang, Lifeng Shang, Benyou Wang
Abstract:
Recent advancements in Long Chain-of-Thought (CoT) reasoning models have improved performance on complex tasks, but they suffer from overthinking, which generates redundant reasoning steps, especially for simple questions. This paper revisits the reasoning patterns of Long and Short CoT models, observing that the Short CoT patterns offer concise reasoning efficiently, while the Long CoT patterns excel in challenging scenarios where the Short CoT patterns struggle. To enable models to leverage both patterns, we propose Question-Free Fine-Tuning (QFFT), a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. This approach enables the model to adaptively employ both reasoning patterns: it prioritizes the Short CoT patterns and activates the Long CoT patterns only when necessary. Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50%, while achieving performance comparable to Supervised Fine-Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.
中文摘要:本文提出的无问题微调(QFFT)方法使模型能自适应地结合简短和长链思维推理模式,在多种数学任务中实现响应长度减少超50%的同时保持与监督微调相当的性能,并在噪声环境、跨领域及低资源场景中表现更优。
English Summary: This paper introduces Question-Free Fine-Tuning (QFFT), a method that enables models to adaptively use both concise Short CoT reasoning and detailed Long CoT reasoning, significantly reducing response length by over 50% while maintaining performance comparable to supervised fine-tuning across various mathematical tasks.
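The core data transformation in QFFT is simple enough to state in a few lines: a standard SFT example conditions the response on the question, whereas a QFFT example drops the question entirely and supervises only the Long CoT response. The sketch below assumes the common convention that a label of -100 is ignored by the loss; token ids are toy values.

```python
def build_sft_example(question_ids, response_ids):
    input_ids = question_ids + response_ids
    # Loss only on the response, conditioned on the question.
    labels = [-100] * len(question_ids) + response_ids
    return input_ids, labels

def build_qfft_example(question_ids, response_ids):
    # Question-Free Fine-Tuning: the input question is removed entirely,
    # and the model learns exclusively from the Long CoT response.
    return response_ids, list(response_ids)

q = [101, 102, 103]        # toy token ids for the question
r = [201, 202, 203, 204]   # toy token ids for the Long CoT response
print(build_sft_example(q, r))
print(build_qfft_example(q, r))
```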
Authors:Boran Wang, Ziye Jia, Can Cui, Qihui Wu
Abstract:
With the development of low Earth orbit (LEO) satellites and unmanned aerial vehicles (UAVs), the space-air-ground integrated network (SAGIN) becomes a major trend in the next-generation networks. However, due to the instability of heterogeneous communication and time-varying characteristics of SAGIN, it is challenging to meet the remote Internet of Things (IoT) demands for data collection and offloading. In this paper, we investigate a two-phase hierarchical data uplink model in SAGIN. Specifically, UAVs optimize trajectories to enable efficient data collection from IoT devices, and then they transmit the data to LEO satellites with computing capabilities for further processing. The problem is formulated to minimize the total energy consumption for IoT devices, UAVs, and LEO satellites. Since the problem is in the form of mixed-integer nonlinear programming and intractable to solve directly, we decompose it into two phases. In the IoT-UAV phase, we design the algorithm to jointly optimize the IoT pairing, power allocation, and UAV trajectories. Considering the highly dynamic characteristics of LEO satellites, a real-time LEO satellite selection mechanism combined with the Satellite Tool Kit is proposed in the UAV-LEO phase. Finally, simulation results show the effectiveness of the proposed algorithms, with about 10% less energy consumption compared with the benchmark algorithm.
中文: 本文提出了一种空天地一体化网络中的两阶段分层数据上行模型,通过优化无人机轨迹和卫星选择来最小化物联网数据采集与卸载的能耗,相比基准算法实现了约10%的能耗降低。
English: This paper proposes a two-phase hierarchical data uplink model in the space-air-ground integrated network (SAGIN), optimizing UAV trajectories and satellite selection to minimize energy consumption for IoT data collection and offloading, achieving about 10% energy savings compared to benchmarks.
Authors:Xiaotian Zhang, Yuan Wang, Zhaopeng Feng, Ruizhe Chen, Zhijie Zhou, Yan Zhang, Hongxia Xu, Jian Wu, Zuozhu Liu
Abstract:
Medical Question-Answering (QA) encompasses a broad spectrum of tasks, including multiple choice questions (MCQ), open-ended text generation, and complex computational reasoning. Despite this variety, a unified framework for delivering high-quality medical QA has yet to emerge. Although recent progress in reasoning-augmented large language models (LLMs) has shown promise, their ability to achieve comprehensive medical understanding is still largely unexplored. In this paper, we present Med-U1, a unified framework for robust reasoning across medical QA tasks with diverse output formats, ranging from MCQs to complex generation and computation tasks. Med-U1 employs pure large-scale reinforcement learning with mixed rule-based binary reward functions, incorporating a length penalty to manage output verbosity. With multi-objective reward optimization, Med-U1 directs LLMs to produce concise and verifiable reasoning chains. Empirical results reveal that Med-U1 significantly improves performance across multiple challenging Med-QA benchmarks, surpassing even larger specialized and proprietary models. Furthermore, Med-U1 demonstrates robust generalization to out-of-distribution (OOD) tasks. Extensive analysis presents insights into training strategies, reasoning chain length control, and reward design for medical LLMs. Our code is available here.
中文:Med-U1是一个采用强化学习和规则奖励的统一框架,能提升大语言模型在各类医学问答任务中的推理能力,在多项基准测试中表现卓越并具备强大的泛化能力。
English: Med-U1 is a unified framework using reinforcement learning with rule-based rewards to enhance large language models' reasoning across diverse medical question-answering tasks, achieving superior performance on benchmarks and robust generalization.
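A mixed rule-based binary reward with a length penalty, as the abstract describes, might look like the sketch below. The penalty shape, budget, and coefficient are illustrative assumptions; Med-U1's actual multi-objective reward covers several output formats (MCQ letters, generated text, computed values).

```python
def rule_based_reward(pred, gold, n_tokens, max_tokens=1024, penalty=0.2):
    """Binary correctness reward minus a penalty for exceeding a length budget."""
    correct = float(pred.strip().lower() == gold.strip().lower())  # rule-based check
    overflow = max(0.0, (n_tokens - max_tokens) / max_tokens)      # relative excess
    return correct - penalty * overflow

print(rule_based_reward("B", "b", 700))    # 1.0: correct, within budget
print(rule_based_reward("B", "b", 1536))   # 0.9: correct but verbose
print(rule_based_reward("C", "b", 700))    # 0.0: wrong answer
```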
Authors:Tao Zhong, Mengzhe Geng, Shujie Hu, Guinan Li, Xunying Liu
Abstract:
Accurate recognition of dysarthric and elderly speech remains challenging to date. While privacy concerns have driven a shift from centralized approaches to federated learning (FL) to ensure data confidentiality, this further exacerbates the challenges of data scarcity, imbalanced data distribution and speaker heterogeneity. To this end, this paper conducts a systematic investigation of regularized FL techniques for privacy-preserving dysarthric and elderly speech recognition, addressing different levels of the FL process by 1) parameter-based, 2) embedding-based and 3) novel loss-based regularization. Experiments on the benchmark UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that regularized FL systems consistently outperform the baseline FedAvg system by statistically significant WER reductions of up to 0.55% absolute (2.13% relative). Further increasing communication frequency to one exchange per batch approaches centralized training performance.
中文: 本文系统研究了正则化联邦学习技术,通过参数、嵌入和新型损失正则化方法提升隐私保护的构音障碍与老年语音识别性能,相比基线系统取得了显著改进。
English: This paper systematically investigates regularized federated learning techniques to enhance privacy-preserving dysarthric and elderly speech recognition, demonstrating significant improvements over baseline systems through parameter, embedding, and novel loss-based regularization methods.
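Parameter-based regularization in FL commonly takes the form of a proximal term that penalizes local drift from the global model, as in FedProx; the sketch below shows that pattern on a toy quadratic objective. Whether the paper's parameter-based variant is exactly proximal is not stated in the abstract, so treat this purely as an illustration of the idea.

```python
import numpy as np

def local_update(w_global, grad_fn, steps=50, lr=0.1, mu=0.1):
    """Client-side training with a proximal penalty (mu/2) * ||w - w_global||^2.

    The extra gradient term mu * (w - w_global) keeps heterogeneous clients
    (few, imbalanced speakers) from drifting far from the global model.
    """
    w = w_global.copy()
    for _ in range(steps):
        w -= lr * (grad_fn(w) + mu * (w - w_global))
    return w

rng = np.random.default_rng(0)
target = rng.normal(size=4)                 # toy stand-in for a client's optimum
w_local = local_update(np.zeros(4), grad_fn=lambda w: w - target)
print(np.round(w_local, 3))                 # pulled toward target, shrunk toward global
```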
Authors:Yi Zhang, Yi Wang, Yawen Cui, Lap-Pui Chau
Abstract:
This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection approach that effectively handles single- and multi-view RGB images in indoor and outdoor environments, showcasing its general-purpose applicability. The key challenge for image-based 3D object detection tasks is the lack of 3D geometric cues, which leads to ambiguity in establishing correspondences between images and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D geometric representations in both explicit and implicit manners based on predicted depth information. Specifically, we utilize the predicted depth to learn voxel occupancy and optimize the voxelized 3D feature volume explicitly through the proposed voxel occupancy attention. To further enhance 3D awareness, the feature volume is integrated with an implicit 3D representation, the truncated signed distance function (TSDF). Without requiring supervision from 3D signals, we significantly improve the model's comprehension of 3D geometry by leveraging intermediate 3D representations and achieve end-to-end training. Our approach surpasses the performance of state-of-the-art image-based methods on both single- and multi-view benchmark datasets across diverse environments, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D dataset, a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset, and a 0.19 AP3D@0.7 improvement on the KITTI dataset. The project page is available at: https://cindy0725.github.io/3DGeoDet/.
中文: 本文提出3DGeoDet,一种几何感知的3D物体检测方法,通过预测深度生成显式和隐式3D表征来增强几何理解,在多个室内外数据集上实现了最优性能。
English: This paper introduces 3DGeoDet, a geometry-aware 3D object detection method that enhances 3D understanding through explicit and implicit representations using predicted depth, achieving state-of-the-art results across diverse indoor and outdoor datasets.
Authors:Xiaowen Ma, Chenyang Lin, Yao Zhang, Volker Tresp, Yunpu Ma
Abstract:
Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative "team" focused on a specific subtask. The Agentic Neural Network follows a two-phase optimization strategy: (1) Forward Phase: drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase: mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables ANN to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across four benchmark datasets, ANN surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements. Our findings indicate that ANN provides a scalable, data-driven framework for multi-agent systems, combining the collaborative capabilities of LLMs with the efficiency and flexibility of neural network principles. We plan to open-source the entire framework.
中文摘要:Agentic Neural Network (ANN) 提出了一种动态的神经符号框架,将多智能体协作建模为分层神经网络,通过前向任务分解和后向迭代优化提升适应性与准确性,在多个基准测试中超越了现有最优多智能体系统。
English Summary: The Agentic Neural Network (ANN) introduces a dynamic, neuro-symbolic framework that models multi-agent collaboration as a layered neural network, employing forward task decomposition and backward iterative refinement to enhance adaptability and accuracy, outperforming existing multi-agent systems across benchmarks.
Authors:Yian Zhu, Ziye Jia, Lei Zhang, Yao Wu, Qiuming Zhu, Qihui Wu
Abstract:
The remote identification (Remote ID) broadcast capability allows unmanned aerial vehicles (UAVs) to exchange messages, which is a pivotal technology for inter-UAV communications. Although this capability enhances the operational visibility, low delay in Remote ID-based communications is critical for ensuring the efficiency and timeliness of multi-UAV operations in dynamic environments. To address this challenge, we first establish delay models for Remote ID communications by considering packet reception and collisions across both BLE 4 and Wi-Fi protocols. Building upon these models, we formulate an optimization problem to minimize the long-term communication delay through adaptive protocol selection. Since the delay performance varies with the UAV density, we propose an adaptive BLE/Wi-Fi switching algorithm based on the multi-agent deep Q-network approach. Experimental results demonstrate that in dynamic-density scenarios, our strategy achieves 32.1% and 37.7% lower latency compared to static BLE 4 and Wi-Fi modes respectively.
中文: 本研究提出基于多智能体深度Q网络的自适应蓝牙/Wi-Fi切换算法,在动态无人机密度场景中将远程识别通信延迟降低了超过30%。
English: The study develops adaptive BLE/Wi-Fi switching using multi-agent deep Q-networks to reduce Remote ID communication delays by over 30% in dynamic UAV environments.
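Stripping the method to its core, each UAV learns a mapping from observed density to a protocol choice that minimizes expected delay. The sketch below replaces the paper's multi-agent deep Q-network with a single-agent tabular Q-learner over a discretized density state, and the delay model is entirely made up; it only demonstrates the adaptive-switching logic.

```python
import numpy as np

rng = np.random.default_rng(0)
densities = ["low", "mid", "high"]     # discretized UAV density states
protocols = ["BLE4", "WiFi"]
Q = np.zeros((len(densities), len(protocols)))

def simulate_delay(d, p):
    # Toy delay model: BLE 4 cheap at low density, Wi-Fi better under congestion.
    base = {"BLE4": [2.0, 5.0, 9.0], "WiFi": [4.0, 4.5, 5.0]}[p][d]
    return base + rng.normal(scale=0.3)

alpha, gamma, eps = 0.1, 0.9, 0.2
d = rng.integers(3)
for _ in range(5000):
    a = rng.integers(2) if rng.random() < eps else int(Q[d].argmin())
    delay = simulate_delay(d, protocols[a])
    d_next = rng.integers(3)           # density drifts as the mission evolves
    # Q-learning on delay as a *cost*, so the bootstrap uses the minimum.
    Q[d, a] += alpha * (delay + gamma * Q[d_next].min() - Q[d, a])
    d = d_next

for i, name in enumerate(densities):
    print(name, "->", protocols[int(Q[i].argmin())])
```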
Authors:Jintao Tong, Ran Ma, Yixiong Zou, Guangyao Chen, Yuhua Li, Ruixuan Li
Abstract:
Cross-domain few-shot segmentation (CD-FSS) is proposed to pre-train the model on a source-domain dataset with sufficient samples, and then transfer the model to target-domain datasets where only a few samples are available for efficient fine-tuning. There are mainly two challenges in this task: (1) the domain gap and (2) fine-tuning with scarce data. To solve these challenges, we revisit the adapter-based methods, and discover an intriguing insight not explored in previous works: the adapter not only helps the fine-tuning of downstream tasks but also naturally serves as a domain information decoupler. Then, we delve into this finding for an interpretation, and find the model's inherent structure could lead to a natural decoupling of domain information. Building upon this insight, we propose the Domain Feature Navigator (DFN), which is a structure-based decoupler instead of a loss-based one like current works, to capture domain-specific information, thereby directing the model's attention towards domain-agnostic knowledge. Moreover, to prevent potential excessive overfitting of the DFN during source-domain training, we further design the SAM-SVN method to constrain the DFN from learning sample-specific knowledge. On target domains, we freeze the model and fine-tune the DFN to learn target-specific knowledge. Extensive experiments demonstrate that our model surpasses the state-of-the-art CD-FSS method significantly, by 2.69% and 4.68% MIoU in 1-shot and 5-shot scenarios, respectively.
中文: 跨域少样本分割通过提出基于结构的域特征导航器来解耦域信息,专注于领域无关知识,有效解决了域差异和数据稀缺问题,并在实验中显著超越了现有最优方法。
English: Cross-domain few-shot segmentation tackles domain gaps and limited data by introducing the Domain Feature Navigator, a structure-based decoupler that captures domain-specific information to focus on domain-agnostic knowledge, achieving state-of-the-art performance.
Authors:Xuanjun Chen, I-Ming Lin, Lin Zhang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Abstract:
Recent attempts at source tracing for codec-based deepfake speech (CodecFake), generated by neural audio codec-based speech generation (CoSG) models, have exhibited suboptimal performance. However, how to train source tracing models using simulated CoSG data while maintaining strong performance on real CoSG-generated audio remains an open challenge. In this paper, we show that models trained solely on codec-resynthesized data tend to overfit to non-speech regions and struggle to generalize to unseen content. To mitigate these challenges, we introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
Chinese: 针对编解码器生成的深度伪造语音,现有溯源方法在模拟与真实数据间泛化能力不足,为此提出的SASTNet融合语义与声学特征,在CodecFake+数据集上实现了最优性能。
English: Recent advances in source tracing for codec-based deepfake speech face challenges in generalizing from simulated to real data, prompting the development of SASTNet, which integrates semantic and acoustic features to achieve state-of-the-art performance on the CodecFake+ dataset.
Authors:Yuji Zhang, Qingyun Wang, Cheng Qian, Jiateng Liu, Chenkai Sun, Denghui Zhang, Tarek Abdelzaher, Chengxiang Zhai, Preslav Nakov, Heng Ji
Abstract:
Scientific texts often convey authority due to their technical language and complex data. However, this complexity can sometimes lead to the spread of misinformation. Non-experts are particularly susceptible to misleading claims based on scientific tables due to their high information density and perceived credibility. Existing table claim verification models, including state-of-the-art large language models (LLMs), often struggle with precise fine-grained reasoning, resulting in errors and a lack of precision in verifying scientific claims. Inspired by Cognitive Load Theory, we propose that enhancing a model's ability to interpret table-based claims involves reducing cognitive load by developing modular, reusable reasoning components (i.e., atomic skills). We introduce a skill-chaining schema that dynamically composes these skills to facilitate more accurate and generalizable reasoning with a reduced cognitive load. To evaluate this, we create SciAtomicBench, a cross-domain benchmark with fine-grained reasoning annotations. With only 350 fine-tuning examples, our model trained with atomic reasoning outperforms GPT-4o's chain-of-thought method, achieving state-of-the-art results with far less training data.
Chinese: 科学文本的权威性可能传播错误信息,尤其当非专业人士误解高密度数据表格时,而我们采用模块化推理组件的新方法,仅需少量训练数据就显著提升了验证准确性。
English: Scientific texts' authoritative tone can spread misinformation, especially when non-experts misinterpret data-dense tables, but our new method using modular reasoning components significantly improves verification accuracy with minimal training data.
Authors:Junzhe Wang, Bichen Wang, Xing Fu, Yixin Sun, Yanyan Zhao, Bing Qin
Abstract:
In recent years, Large Language Models (LLMs) have made significant progress in automated psychological counseling. However, current research focuses on single-session counseling, which does not represent real-world scenarios. In practice, psychological counseling is a process, not a one-time event, requiring sustained, multi-session engagement to progressively address clients' issues. To overcome this limitation, we introduce the Multi-Session Psychological Counseling Conversation Dataset (MusPsy-Dataset). Our MusPsy-Dataset is constructed using real client profiles from publicly available psychological case reports. It captures the dynamic arc of counseling, encompassing multiple progressive counseling conversations from the same client across different sessions. Leveraging our dataset, we also developed our MusPsy-Model, which aims to track client progress and adapt its counseling direction over time. Experiments show that our model performs better than baseline models across multiple sessions.
中文: 本研究基于真实来访者档案构建了多轮心理咨询数据集MusPsy-Dataset,并开发了能够追踪来访者进展、动态调整咨询策略的MusPsy-Model,该模型在多轮会话中的表现优于基线模型。
English: This study introduces the MusPsy-Dataset, a multi-session psychological counseling dataset based on real client profiles, and the MusPsy-Model, which effectively tracks client progress and adapts counseling strategies over multiple sessions, outperforming baseline models.
Authors:Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Yifu Lu, Mengdi Wang, Dinesh Manocha, Furong Huang, Mohammad Ghavamzadeh, Amrit Singh Bedi
Abstract:
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking". To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance-creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from "more thinking" are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to utilize the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy compared to extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.
中文总结:推理模型的延长思考虽能短期提升表现,但会导致过度思考而降低精度;相比之下,通过多数投票选择多个独立推理路径的并行思考方法,能在相同计算成本下实现更高的准确率。
English Summary: Extended thinking in reasoning models initially boosts performance but leads to overthinking and declining accuracy, while parallel thinking with majority voting achieves significantly higher accuracy by generating multiple independent reasoning paths.
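Parallel thinking is operationally just Best-of-N sampling with a majority vote over parsed final answers, which takes only a few lines to express. The fake_llm stand-in below simulates an LLM call; in practice, generate would sample independent reasoning paths at nonzero temperature and extract the final answer from each.

```python
import random
from collections import Counter

def parallel_thinking(generate, prompt, n=8):
    """Sample n independent reasoning paths; return the majority-vote answer."""
    answers = [generate(prompt) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n

random.seed(0)
fake_llm = lambda prompt: random.choice(["42", "42", "42", "41", "43"])
print(parallel_thinking(fake_llm, "What is 6 * 7?"))  # e.g. ('42', 0.625)
```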
Authors:Jinghan Jia, Hadi Reisizadeh, Chongyu Fan, Nathalie Baracaldo, Mingyi Hong, Sijia Liu
Abstract:
Large language models (LLMs) have shown remarkable reasoning capabilities when trained with chain-of-thought (CoT) supervision. However, the long and verbose CoT traces, especially those distilled from large reasoning models (LRMs) such as DeepSeek-R1, significantly increase training costs during the distillation process, where a non-reasoning base model is taught to replicate the reasoning behavior of an LRM. In this work, we study the problem of CoT condensation for resource-efficient reasoning training, aimed at pruning intermediate reasoning steps (i.e., thoughts) in CoT traces, enabling supervised model training on length-reduced CoT data while preserving both answer accuracy and the model's ability to generate coherent reasoning. Our rationale is that CoT traces typically follow a three-stage structure: problem understanding, exploration, and solution convergence. Through empirical analysis, we find that retaining the structure of the reasoning trace, especially the early stage of problem understanding (rich in reflective cues) and the final stage of solution convergence, is sufficient to achieve lossless reasoning supervision. To this end, we propose an Edge-Preserving Condensation method, EPiC, which selectively retains only the initial and final segments of each CoT trace while discarding the middle portion. This design draws an analogy to preserving the "edge" of a reasoning trajectory, capturing both the initial problem framing and the final answer synthesis, to maintain logical continuity. Experiments across multiple model families (Qwen and LLaMA) and benchmarks show that EPiC reduces training time by over 34% while achieving lossless reasoning accuracy on MATH500, comparable to full CoT supervision. To the best of our knowledge, this is the first study to explore thought-level CoT condensation for efficient reasoning model distillation.
中文摘要:本研究提出EPiC方法,通过仅保留思维链的起始和结尾推理片段来压缩训练数据,在保持与完整监督相当推理准确性的同时,将训练时间减少超过34%。
English Summary: The study introduces EPiC, a method that condenses chain-of-thought traces by preserving only the initial and final reasoning segments, reducing training time by over 34% while maintaining reasoning accuracy comparable to full supervision.
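Since EPiC operates at the level of whole thoughts, its condensation step reduces to keeping the leading and trailing segments of a split CoT trace. The fractions below are illustrative assumptions, not the paper's tuned values.

```python
def epic_condense(thoughts, keep_head=0.3, keep_tail=0.2):
    """Edge-preserving condensation: keep early problem-understanding steps
    and final solution-convergence steps, drop the exploratory middle."""
    n = len(thoughts)
    head = thoughts[: max(1, int(n * keep_head))]
    tail = thoughts[n - max(1, int(n * keep_tail)):]
    return head + tail

trace = [f"step {i}" for i in range(10)]
print(epic_condense(trace))  # ['step 0', 'step 1', 'step 2', 'step 8', 'step 9']
```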
Authors:Yinlong Xu, Yanzhao Zheng, Shuoshuo Sun, Shuaihan Huang, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Hongxia Xu, Jian Wu
Abstract:
It has been demonstrated that carefully designed reasoning paradigms, like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), can enhance the reasoning capabilities of small language models through detailed thinking and extensive thought searching. However, unbounded branching factors in the search space incur prohibitive reasoning costs, and these methods fall into the trap of local-optimum reasoning, meaning the model lacks a global perspective while solving problems. We propose a novel reasoning paradigm called Reason from Future (RFF), which generates reasoning paths by bidirectional reasoning that combines top-down planning with bottom-up reasoning accumulation. The essence of RFF lies in its reverse reasoning mechanism, which prioritizes core logical relationships and imposes goal-oriented constraints on intermediate steps, thereby reducing the search space and mitigating the error accumulation inherent in sequential forward reasoning. Empirical evaluations across diverse experiments demonstrate that RFF outperforms conventional paradigms with higher accuracy and a smaller search space on complex tasks.
中文: 提出的“未来推理”(RFF)范式通过结合自上而下规划和自下而上累积的双向推理,减少搜索空间和错误累积,从而在复杂任务中实现更高的准确性和效率。
English: The proposed Reason from Future (RFF) paradigm enhances reasoning by employing bidirectional reasoning with top-down planning and bottom-up accumulation, which reduces search space and error accumulation for improved accuracy and efficiency in complex tasks.
Authors:Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal
Abstract:
Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) over various video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervisions for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationale tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework. Each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses on comparing different CoT annotation pipelines and learned skills over multiple video domains.
中文:提出的Video-SKoT框架通过自动构建技能感知的思维链监督数据并采用专业化专家学习机制,有效提升了跨领域视频推理能力,在多个基准测试中均表现出优越性能。
English: The proposed Video-SKoT framework enhances domain-adaptive video reasoning by automatically generating skill-aware Chain-of-Thought supervisions and employing a specialized expert learning system, achieving superior performance across multiple benchmarks.
Authors:Ruibo Wang, Mustafa A. Kishk, Howard H. Yang, Mohamed-Slim Alouini
Abstract:
With the increase in global positioning service demands and the requirement for more precise positioning, assisting existing medium and high orbit satellite-enabled positioning systems with low Earth orbit (LEO) satellites has garnered widespread attention. However, providing low-computational-complexity performance analysis for hybrid LEO/MEO massive satellite constellations remains a challenge. In this article, we introduce for the first time the application of the stochastic geometry (SG) framework to satellite-enabled positioning performance analysis and provide analytical expressions for the K-availability probability and K-localizability probability under bidirectional beam alignment transmissions. The K-localizability probability, defined as the probability that at least K satellites can participate in the positioning process, serves as a prerequisite for positioning. Since the modeling of MEO satellite constellations within the SG framework has not yet been studied, we integrate the advantages of Cox point processes and binomial point processes, proposing a doubly stochastic binomial point process for accurate modeling of MEO satellite constellations. Finally, we investigate the impact of constellation configurations and antenna patterns on the localizability performance of LEO, MEO, and hybrid MEO/LEO constellations. We also demonstrate the network performance gains brought to MEO positioning systems by incorporating assistance from LEO satellites.
中文摘要:本文首次将随机几何框架应用于卫星定位性能分析,提出了中轨卫星星座的新型建模方法,并验证了低轨卫星辅助中轨定位系统带来的网络性能提升。
English Summary: This article introduces a stochastic geometry framework to analyze the positioning performance of hybrid LEO/MEO satellite constellations, proposing a novel modeling approach and demonstrating performance gains from LEO satellite assistance.
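For intuition about the K-localizability probability, the toy Monte Carlo sketch below counts how often at least K satellites from a binomial point process clear an elevation mask. The spherical-cap visibility fraction is a crude geometric proxy and ignores beam alignment, so this illustrates the quantity being analyzed, not the paper's analytical derivation.

```python
import numpy as np

def k_localizability_mc(n_sats, k, min_elev_deg=25.0, trials=20000, seed=0):
    """P(at least k of n_sats i.i.d.-placed satellites are visible)."""
    rng = np.random.default_rng(seed)
    # Toy proxy: fraction of the shell 'visible' above the elevation mask.
    p_vis = (1 - np.sin(np.radians(min_elev_deg))) / 2
    visible = rng.binomial(n_sats, p_vis, size=trials)
    return float(np.mean(visible >= k))

print(k_localizability_mc(n_sats=12, k=4))   # small constellation: well below 1
print(k_localizability_mc(n_sats=120, k=4))  # massive constellation: near 1
```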
Authors:Dingwei Chen, Ziqiang Liu, Feiteng Fang, Chak Tou Leong, Shiwen Ni, Ahmadreza Argha, Hamid Alinejad-Rokny, Min Yang, Chengming Li
Abstract:
Large Language Models (LLMs) demonstrate remarkable capabilities in text understanding and generation. However, their tendency to produce factually inconsistent outputs, commonly referred to as "hallucinations", remains a critical challenge. Existing approaches, such as retrieval-based and inference-time correction methods, primarily address this issue at the input or output level, often overlooking the intrinsic information refinement process and the role of premature layers. Meanwhile, alignment- and fine-tuning-based methods are resource-intensive. In this paper, we propose PLI (Premature Layers Interpolation), a novel, training-free, and plug-and-play intervention designed to enhance factuality. PLI mitigates hallucinations by inserting premature layers formed through mathematical interpolation with adjacent layers. Inspired by stable diffusion and sampling steps, PLI extends the depth of information processing and transmission in LLMs, improving factual coherence. Experiments on four publicly available datasets demonstrate that PLI effectively reduces hallucinations while outperforming existing baselines in most cases. Further analysis suggests that the success of layer interpolation is closely linked to LLMs' internal mechanisms. To promote reproducibility, we will release our code and data upon acceptance.
中文: 提出的PLI方法通过插值处理早期层级来减少大语言模型中的幻觉现象,无需额外训练即可提升事实准确性,在多个数据集上优于现有基线方法。
English: The proposed PLI method reduces hallucinations in Large Language Models by interpolating premature layers, enhancing factuality without additional training and outperforming existing approaches across multiple datasets.
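The mechanical core of PLI, forming a new layer by interpolating two adjacent ones, can be shown on plain weight matrices. Real transformer blocks carry multiple weight tensors and the paper chooses where to insert; the uniform alpha and single-matrix layers below are simplifying assumptions.

```python
import numpy as np

def insert_interpolated_layers(layers, alpha=0.5):
    """Between each adjacent pair of layers, insert a 'premature' layer
    whose weights are a convex interpolation of the two neighbors."""
    out = []
    for i, W in enumerate(layers):
        out.append(W)
        if i + 1 < len(layers):
            out.append(alpha * W + (1 - alpha) * layers[i + 1])
    return out

rng = np.random.default_rng(0)
blocks = [rng.normal(size=(4, 4)) for _ in range(3)]
print(len(blocks), "->", len(insert_interpolated_layers(blocks)), "layers")  # 3 -> 5
```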
Authors:Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, Yunpu Ma
Abstract:
Recent advances in Video Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high computational demands and the redundancy present in the visual data. In this work, we propose METok, a training-free, Multi-stage Event-based Token compression framework designed to accelerate VLLMs' inference while preserving accuracy. METok progressively eliminates redundant visual tokens across three critical stages: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in the prefilling stage based on semantic alignment and event importance, and (3) a decoding-stage KV Cache optimization that further reduces memory consumption. Our experiments on diverse video benchmarks demonstrate that METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings, all while maintaining comparable or even superior accuracy.
中文:METok是一种无需训练的多阶段事件驱动令牌压缩框架,通过动态去除冗余视觉令牌来加速视频大语言模型,在保持精度的同时显著提升效率。
English: METok is a training-free, multi-stage token compression framework that accelerates Video Large Language Models by dynamically eliminating redundant visual tokens, achieving significant efficiency gains while preserving accuracy.
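Stage (2) of METok, pruning prefill tokens by semantic alignment, can be approximated by scoring visual tokens against a pooled text feature and keeping the top fraction. The keep ratio and dot-product scoring below are assumptions; METok additionally weights by event importance and operates hierarchically.

```python
import numpy as np

def prune_visual_tokens(tokens, text_query, keep_ratio=0.3):
    """Keep the fraction of visual tokens best aligned with the text query."""
    scores = tokens @ text_query                # semantic alignment score per token
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]              # indices of the top-k tokens
    return tokens[np.sort(keep)]                # preserve temporal order

rng = np.random.default_rng(0)
vis = rng.normal(size=(1000, 64))               # e.g. tokens pooled over many frames
txt = rng.normal(size=64)
print(vis.shape, "->", prune_visual_tokens(vis, txt).shape)  # (1000, 64) -> (300, 64)
```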
Authors:Stefano Fiorini, Hakan Aktas, Iulia Duta, Stefano Coniglio, Pietro Morerio, Alessio Del Bue, Pietro Liò
Abstract:
Sheaf Neural Networks (SNNs) represent a powerful generalization of Graph Neural Networks (GNNs) that significantly improve our ability to model complex relational data. While directionality has been shown to substantially boost performance in graph learning tasks and is key to many real-world applications, existing SNNs fall short in representing it. To address this limitation, we introduce the Directed Cellular Sheaf, a special type of cellular sheaf designed to explicitly account for edge orientation. Building on this structure, we define a new sheaf Laplacian, the Directed Sheaf Laplacian, which captures both the graph's topology and its directional information. This operator serves as the backbone of the Directed Sheaf Neural Network (DSNN), the first SNN model to embed a directional bias into its architecture. Extensive experiments on nine real-world benchmarks show that DSNN consistently outperforms baseline methods.
中文: 定向层神经网络(DSNN)作为首个融入方向性偏置的层神经网络,利用新型定向层拉普拉斯算子提升复杂关系数据建模性能,在九个真实基准测试中持续优于基线方法。
English: The Directed Sheaf Neural Network (DSNN) is introduced as the first sheaf neural network to incorporate directional bias, leveraging a novel Directed Sheaf Laplacian to enhance performance in modeling complex relational data, consistently outperforming baselines across nine benchmarks.
Authors:Jintao Tong, Yixiong Zou, Guangyao Chen, Yuhua Li, Ruixuan Li
Abstract:
Cross-Domain Few-Shot Segmentation (CD-FSS) aims to transfer knowledge from a source-domain dataset to unseen target-domain datasets with limited annotations. Current methods typically compare the distance between training and testing samples for mask prediction. However, we find an entanglement problem exists in this widely adopted method, which tends to bind source-domain patterns together and make each of them hard to transfer. In this paper, we aim to address this problem for the CD-FSS task. We first find a natural decomposition of the ViT structure, based on which we delve into the entanglement problem for an interpretation. We find the decomposed ViT components are crossly compared between images in distance calculation, where the rational comparisons are entangled with meaningless ones by their equal importance, leading to the entanglement problem. Based on this interpretation, we further propose to address the entanglement problem by learning to weight all comparisons of ViT components, which learns disentangled features and re-composes them for the CD-FSS task, benefiting both generalization and finetuning. Experiments show that our model outperforms the state-of-the-art CD-FSS method by 1.92% and 1.88% in average accuracy under 1-shot and 5-shot settings, respectively.
Chinese: 本文针对跨域少样本分割中的纠缠问题,提出通过加权分解ViT组件比较的方法,在1-shot和5-shot设置下分别实现了1.92%和1.88%的平均准确率提升,达到了最先进性能。
English: This paper addresses the entanglement problem in Cross-Domain Few-Shot Segmentation by proposing a method that learns to weigh comparisons of decomposed ViT components, achieving state-of-the-art performance improvements of 1.92% and 1.88% in 1-shot and 5-shot settings respectively.
Authors:Fan Gao, Dongyuan Li, Ding Xia, Fei Mi, Yasheng Wang, Lifeng Shang, Baojun Wang
Abstract:
Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of Large Language Models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. To address this gap, we propose \benchName, a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into the Open-Ended and Constrained sets to capture diverse writing scenarios. To reliably evaluate generated essays, we develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores. We further validate our evaluation protocol through a comprehensive human agreement study. Finally, we benchmark 15 large-sized LLMs, analyzing their strengths and limitations across genres and instruction types. With \benchName, we aim to advance LLM-based Chinese essay evaluation and inspire future research on improving essay generation in educational settings.
中文摘要:本研究针对现有中文作文评估方法的不足,提出了一个多体裁的基准测试,通过细粒度评分框架对15个大型语言模型在四种主要文体中的表现进行了系统评估。
English Summary: This study introduces a multi-genre benchmark for Chinese essay writing to address the limitations of existing evaluation methods, proposing a fine-grained scoring framework and benchmarking 15 large language models across four major genres.
Authors:Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, Fei Mi
Abstract:
Recent advances in large language model (LLM) post-training have leveraged two distinct paradigms to enhance reasoning capabilities: reinforcement learning (RL) and knowledge distillation (KD). While RL enables the emergence of complex reasoning behaviors, it often suffers from low sample efficiency when the initial policy struggles to explore high-reward trajectories. Conversely, KD improves learning efficiency via mimicking the teacher model but tends to generalize poorly to out-of-domain scenarios. In this work, we present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). Specifically, KDRL leverages policy gradient optimization to simultaneously minimize the reverse Kullback-Leibler divergence (RKL) between the student and teacher distributions while maximizing the expected rule-based rewards. We first formulate a unified objective that integrates GRPO and KD, and systematically explore how different KL approximations, KL coefficients, and reward-guided KD strategies affect the overall post-training dynamics and performance. Empirical results on multiple reasoning benchmarks demonstrate that KDRL outperforms GRPO and various KD baselines while achieving a favorable balance between performance and reasoning token efficiency. These findings indicate that integrating KD and RL serves as an effective and efficient strategy to train reasoning LLMs.
Chinese: KDRL框架将知识蒸馏与强化学习相结合,通过教师监督与自主探索的统一优化,在多推理基准测试中实现了更优的性能与效率平衡。
English: The KDRL framework unifies knowledge distillation and reinforcement learning to enhance large language models' reasoning by combining teacher supervision with self-exploration, achieving superior performance and efficiency across benchmarks.
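The unified objective, maximize rule-based reward while minimizing the reverse KL to the teacher, can be written down for a single token position as below. The softmax/RKL computation is standard; the beta value and the scalar advantage baseline are assumptions standing in for GRPO's group-relative advantage.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kdrl_objective(student_logits, teacher_logits, reward, baseline=0.0, beta=0.1):
    """Per-token sketch: (reward advantage) - beta * KL(student || teacher)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    rkl = float(np.sum(p_s * (np.log(p_s) - np.log(p_t))))  # reverse KL
    return (reward - baseline) - beta * rkl

s = np.array([2.0, 0.5, -1.0])   # student logits at one position
t = np.array([1.8, 0.7, -0.9])   # teacher logits at the same position
print(kdrl_objective(s, t, reward=1.0, baseline=0.4))
```

Maximizing this pushes the student toward high-reward trajectories while keeping its distribution close, in reverse KL, to the teacher's.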
Authors:Marcos V. Conde, Radu Timofte, Zihao Lu, Xiangyu Kong, Xiaoxia Xing, Fan Wang, Suejin Han, MinKyu Park, Tianyu Zhang, Xin Luo, Yeda Chen, Dong Liu, Li Pang, Yuhang Yang, Hongzhong Wang, Xiangyong Cao, Ruixuan Jiang, Senyan Xu, Siyuan Jiang, Xueyang Fu, Zheng-Jun Zha, Tianyu Hao, Yuhong He, Ruoqi Li, Yueqi Yang, Xiang Yu, Guanlan Hong, Minmin Yi, Yuanjia Chen, Liwen Zhang, Zijie Jin, Cheng Li, Lian Liu, Wei Song, Heng Sun, Yubo Wang, Jinghua Wang, Jiajie Lu, Watchara Ruangsan
Abstract:
This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW restoration and super-resolution could be essential in modern Image Signal Processing (ISP) pipelines; however, this problem is not as well explored as in the RGB domain. The goal of this challenge is twofold: (i) restore RAW images with blur and noise degradations, and (ii) upscale RAW Bayer images by 2x, considering unknown noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during the challenge period. This report presents the current state-of-the-art in RAW restoration.
中文摘要:本文综述了NTIRE 2025 RAW图像复原与超分辨率挑战赛,展示了来自230名参赛者的创新解决方案和成果,旨在推动图像信号处理技术的发展。
English Summary: This paper reviews the NTIRE 2025 challenge focusing on RAW image restoration and super-resolution, presenting innovative solutions and results from 230 participants to advance Image Signal Processing pipelines.
Authors:Juncheng Wu, Sheng Liu, Haoqin Tu, Hang Yu, Xiaoke Huang, James Zou, Cihang Xie, Yuyin Zhou
Abstract:
Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond the final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges: (1) the correctness of knowledge used (measured by Knowledge Index (KI)) and (2) the quality of reasoning (measured by Information Gain (InfoGain)). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities in R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models; In the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.
Chinese: 本研究提出了一个细粒度评估框架,用于分析推理增强大语言模型的知识正确性和推理质量,发现监督微调虽提高最终答案准确率却常损害推理质量,而强化学习能通过剔除不准确知识来优化医学推理。
English: This study introduces a fine-grained evaluation framework to assess the knowledge correctness and reasoning quality of reasoning-enhanced Large Language Models, revealing that supervised fine-tuning boosts final-answer accuracy but often degrades reasoning, while reinforcement learning improves medical reasoning by filtering out inaccurate knowledge.
Authors:Marcos V. Conde, Radu Timofte, Radu Berdan, Beril Besbinar, Daisuke Iso, Pengzhou Ji, Xiong Dun, Zeying Fan, Chen Wu, Zhansheng Wang, Pengbo Zhang, Jiazi Huang, Qinglin Liu, Wei Yu, Shengping Zhang, Xiangyang Ji, Kyungsik Kim, Minkyung Kim, Hwalmin Lee, Hekun Ma, Huan Zheng, Yanyan Wei, Zhao Zhang, Jing Fang, Meilin Gao, Xiang Yu, Shangbin Xie, Mengyuan Sun, Huanjing Yue, Jingyu Yang Huize Cheng, Shaomeng Zhang, Zhaoyang Zhang, Haoxiang Liang
Abstract:
Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW Reconstruction from sRGB (Reverse ISP). We aim to recover RAW sensor images from smartphones given the corresponding sRGB images without metadata and, by doing this, "reverse" the ISP transformation. Over 150 participants joined this NTIRE 2025 challenge and submitted efficient models. The proposed methods and benchmark establish the state-of-the-art for generating realistic RAW data.
中文: 本文介绍了NTIRE 2025挑战赛中从无元数据的sRGB图像重建RAW图像的任务,超过150名参与者开发了先进模型来逆转ISP转换并生成逼真的RAW传感器数据。
English: This paper presents the NTIRE 2025 challenge on reconstructing RAW images from sRGB data without metadata, where over 150 participants developed state-of-the-art models to reverse the ISP transformation and generate realistic RAW sensor data.
Authors:Zekun Wang, Minghua Ma, Zexin Wang, Rongchuan Mu, Liping Shan, Ming Liu, Bing Qin
Abstract:
Large Vision-Language Models (LVLMs) have achieved remarkable success, yet their significant computational demands hinder practical deployment. While efforts to improve LVLM efficiency are growing, existing methods lack comprehensive evaluation across diverse backbones, benchmarks, and metrics. In this work, we systematically evaluate mainstream acceleration techniques for LVLMs, categorized into token and parameter compression. We introduce EffiVLM-Bench, a unified framework for assessing not only absolute performance but also generalization and loyalty, while exploring Pareto-optimal trade-offs. Our extensive experiments and in-depth analyses offer insights into optimal strategies for accelerating LVLMs. We open-source code and recipes for EffiVLM-Bench to foster future research.
中文: 本研究系统评估了大型视觉语言模型的主流加速技术,推出EffiVLM-Bench统一框架以评估性能、泛化性和忠实度,同时探索帕累托最优权衡,并开源代码推动后续研究。
English: This study systematically evaluates mainstream acceleration techniques for large vision-language models, introducing EffiVLM-Bench as a unified framework to assess performance, generalization, and loyalty while exploring optimal trade-offs, with open-sourced code to advance future research.
Authors:Xiaoyang Li, Linwei Tao, Haohui Lu, Minjing Dong, Junbin Gao, Chang Xu
Abstract:
Graph Neural Networks (GNNs) have demonstrated strong predictive performance on relational data; however, their confidence estimates often misalign with actual predictive correctness, posing significant limitations for deployment in safety-critical settings. While existing graph-aware calibration methods seek to mitigate this limitation, they primarily depend on coarse one-hop statistics, such as neighbor-predicted confidence, or latent node embeddings, thereby neglecting the fine-grained structural heterogeneity inherent in graph topology. In this work, we propose Wavelet-Aware Temperature Scaling (WATS), a post-hoc calibration framework that assigns node-specific temperatures based on tunable heat-kernel graph wavelet features. Specifically, WATS harnesses the scalability and topology sensitivity of graph wavelets to refine confidence estimates, all without necessitating model retraining or access to neighboring logits or predictions. Extensive evaluations across seven benchmark datasets with varying graph structures and two GNN backbones demonstrate that WATS achieves the lowest Expected Calibration Error (ECE) among all compared methods, outperforming both classical and graph-specific baselines by up to 42.3% in ECE and reducing calibration variance by 17.24% on average compared with graph-specific methods. Moreover, WATS remains computationally efficient, scaling well across graphs of diverse sizes and densities. Code will be released upon publication.
中文摘要:图神经网络的置信度估计常与实际准确性不符,因此我们提出WATS这一无需重新训练的后处理校准框架,通过图小波特征优化置信度,在多个基准测试中实现了最低的校准误差。
English Summary: Graph Neural Networks often have misaligned confidence estimates, so we propose WATS, a post-hoc calibration method using graph wavelets to improve accuracy without retraining, achieving the lowest calibration error across multiple benchmarks.
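The core computation above is compact enough to sketch. Below is a minimal illustration of the stated idea: heat-kernel wavelet energies serve as per-node features, and an affine map turns them into node-specific temperatures. The scale grid and the parameters `w`, `b` (which would be fitted on a validation split) are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def heat_kernel_wavelet_features(adj, scales=(0.5, 1.0, 2.0)):
    """Per-node heat-kernel wavelet energies at several scales (adj: dense, symmetric)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    # symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigval, eigvec = np.linalg.eigh(lap)
    feats = []
    for s in scales:
        # diagonal of exp(-s * L): how much heat each node retains at scale s
        heat_diag = np.diag((eigvec * np.exp(-s * eigval)) @ eigvec.T)
        feats.append(heat_diag)
    return np.stack(feats, axis=1)  # (num_nodes, num_scales)

def wats_calibrate(logits, wavelet_feats, w, b):
    # node-specific temperature; softplus keeps T strictly positive
    temps = np.log1p(np.exp(wavelet_feats @ w + b))
    return logits / temps[:, None]
```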
Authors:Sergio Mazzola, Gabriele Ara, Thomas Benz, Björn Forsberg, Tommaso Cucinotta, Luca Benini
Abstract:
Energy-centric design is paramount in the current embedded computing era: use cases require increasingly high performance at an affordable power budget, often under real-time constraints. Hardware heterogeneity and parallelism help address the efficiency challenge, but greatly complicate online power consumption assessments, which are essential for dynamic hardware and software stack adaptations. We introduce a novel power modeling methodology with state-of-the-art accuracy, low overhead, and high responsiveness, whose implementation does not rely on microarchitectural details. Our methodology identifies the Performance Monitoring Counters (PMCs) with the highest linear correlation to the power consumption of each hardware sub-system, for each Dynamic Voltage and Frequency Scaling (DVFS) state. The individual, simple models are composed into a complete model that effectively describes the power consumption of the whole system, achieving high accuracy and low overhead. Our evaluation reports an average estimation error of 7.5% for power consumption and 1.3% for energy. We integrate these models in the Linux kernel with Runmeter, an open-source, PMC-based monitoring framework. Runmeter manages PMC sampling and processing, enabling the execution of our power models at runtime. With a worst-case time overhead of only 0.7%, Runmeter provides responsive and accurate power measurements directly in the kernel. This information can be employed for actuation policies in workload-aware DVFS and power-aware, closed-loop task scheduling.
中文: 本文提出了一种新型功耗建模方法,通过识别与功耗高度相关的性能监控计数器实现高精度低开销,并集成到Linux内核中以支持动态系统调节。
English: This paper presents a novel power modeling methodology that achieves high accuracy and low overhead by identifying key performance monitoring counters correlated with power consumption, enabling dynamic system adaptations through integration in the Linux kernel.
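The per-rail modeling recipe described above sketches naturally: for each hardware sub-system and DVFS state, keep the PMCs with the highest linear correlation to measured power and fit a least-squares model. A hedged sketch with hypothetical names:

```python
import numpy as np

def select_pmcs(pmc_samples, power, k=2):
    # keep the k counters with the highest |Pearson correlation| to rail power
    corr = [abs(np.corrcoef(pmc_samples[:, i], power)[0, 1])
            for i in range(pmc_samples.shape[1])]
    return np.argsort(corr)[-k:]

def fit_rail_model(pmc_samples, power):
    # pmc_samples: (T, n_counters) samples; power: (T,) measured rail power
    idx = select_pmcs(pmc_samples, power)
    X = np.column_stack([pmc_samples[:, idx], np.ones(len(power))])
    coef, *_ = np.linalg.lstsq(X, power, rcond=None)
    return idx, coef  # runtime estimate: X @ coef

# One (idx, coef) pair would be fitted per sub-system and DVFS state; the
# whole-system estimate is then the sum of the per-rail predictions.
```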
Authors:Haolan Guo, Linwei Tao, Haoyang Luo, Minjing Dong, Chang Xu
Abstract:
Recent advances in deep learning have significantly improved predictive accuracy. However, modern neural networks remain systematically overconfident, posing risks for deployment in safety-critical scenarios. Current post-hoc calibration methods face a fundamental dilemma: global approaches like Temperature Scaling apply uniform adjustments across all samples, introducing high bias despite computational efficiency, while more expressive methods that operate on full logit distributions suffer from high variance due to noisy high-dimensional inputs and insufficient validation data. To address these challenges, we propose Sample Margin-Aware Recalibration of Temperature (SMART), a lightweight, data-efficient recalibration method that precisely scales logits based on the margin between the top two logits -- termed the logit gap. Specifically, the logit gap serves as a denoised, scalar signal directly tied to decision boundary uncertainty, providing a robust indicator that avoids the noise inherent in high-dimensional logit spaces while preserving model prediction invariance. Meanwhile, SMART employs a novel soft-binned Expected Calibration Error (SoftECE) objective that balances model bias and variance through adaptive binning, enabling stable parameter updates even with extremely limited calibration data. Extensive evaluations across diverse datasets and architectures demonstrate that SMART achieves state-of-the-art calibration performance even with substantially fewer parameters compared to existing parametric methods, offering a principled, robust, and highly efficient solution for practical uncertainty quantification in neural network predictions. The source code is available at: https://anonymous.4open.science/r/SMART-8B11.
中文摘要:本文提出的SMART方法通过基于最大两个预测值之间的间隔自适应调整逻辑值,并采用新型校准目标,以极少参数在多种数据集上实现了最优的校准性能。
English Summary: The proposed SMART method addresses neural network overconfidence by adaptively scaling logits based on the margin between top predictions and employing a novel calibration objective, achieving superior performance with minimal parameters across diverse datasets.
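A hedged sketch of the two ingredients named above: the logit-gap temperature map and a soft-binned ECE objective. The functional form of the temperature map and the Gaussian binning kernel are assumptions; the scalars `a`, `b` would be optimized on the calibration set by minimizing `soft_ece` of the rescaled logits.

```python
import torch

def smart_temperature(logits, a, b):
    # logit gap: margin between the top-2 logits, a denoised scalar uncertainty signal
    top2 = logits.topk(2, dim=-1).values
    gap = top2[:, 0] - top2[:, 1]
    return torch.nn.functional.softplus(a * gap + b)  # per-sample T > 0

def soft_ece(probs, labels, n_bins=15, bandwidth=0.05):
    # soft-binned ECE: every sample contributes to every bin via a Gaussian weight,
    # trading the bias of hard binning against variance under scarce data
    conf, pred = probs.max(dim=-1)
    acc = (pred == labels).float()
    centers = torch.linspace(0.0, 1.0, n_bins)
    w = torch.exp(-0.5 * ((conf[:, None] - centers[None, :]) / bandwidth) ** 2)
    mass = w.sum(dim=0).clamp_min(1e-8)
    gap = ((w * conf[:, None]).sum(0) - (w * acc[:, None]).sum(0)).abs() / mass
    return (mass / mass.sum() * gap).sum()
```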
Authors:Tianxing Chen, Kaixuan Wang, Zhaohui Yang, Yuhao Zhang, Zanxin Chen, Baijun Chen, Wanxi Dong, Ziyuan Liu, Dong Chen, Tianshuo Yang, Haibao Yu, Xiaokang Yang, Yusen Qin, Zhiqiang Xie, Yao Mu, Ping Luo, Tian Nian, Weiliang Deng, Yiheng Ge, Yibin Liu, Zixuan Li, Dehui Wang, Zhixuan Liang, Haohui Xie, Rijie Zeng, Yunfei Ge, Peiqing Cong, Guannan He, Zhaoming Han, Ruocheng Yin, Jingxiang Guo, Lunkai Lin, Tianling Xu, Hongzhe Bi, Xuewu Lin, Tianwei Lin, Shujie Luo, Keyu Li, Ziyan Zhao, Ke Fan, Heyang Xu, Bo Peng, Wenlong Gao, Dongjiang Li, Feng Jin, Hui Shen, Jinming Li, Chaowei Cui, Yu Chen, Yaxin Peng, Lingdong Zeng, Wenlong Dong, Tengfei Li, Weijie Ke, Jun Chen, Erdemt Bao, Tian Lan, Tenglong Liu, Jin Yang, Huiping Zhuang, Baozhi Jia, Shuai Zhang, Zhengfeng Zou, Fangheng Guan, Tianyi Jia, Ke Zhou, Hongjiu Zhang, Yating Han, Cheng Fang, Yixian Zou, Chongyang Xu, Qinglun Zhang, Shen Cheng, Xiaohe Wang, Ping Tan, Haoqiang Fan, Shuaicheng Liu, Jiaheng Chen, Chuxuan Huang, Chengliang Lin, Kaijun Luo, Boyu Yue, Yi Liu, Jinyu Chen, Zichang Tan, Liming Deng, Shuo Xu, Zijian Cai, Shilong Yin, Hao Wang, Hongshan Liu, Tianyang Li, Long Shi, Ran Xu, Huilin Xu, Zhengquan Zhang, Congsheng Xu, Jinchang Yang, Feng Xu
Abstract:
Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To advance this goal, we launched the RoboTwin Dual-Arm Collaboration Challenge at the 2nd MEIS Workshop, CVPR 2025. Built on the RoboTwin Simulation platform (1.0 and 2.0) and the AgileX COBOT-Magic Robot platform, the competition consisted of three stages: Simulation Round 1, Simulation Round 2, and a final Real-World Round. Participants tackled a total of 17 dual-arm manipulation tasks, covering rigid, deformable, and tactile-based scenarios. The challenge attracted 64 global teams and over 400 participants, producing top-performing solutions like SEM and AnchorDP3 and generating valuable insights into generalizable bimanual policy learning. This report outlines the competition setup, task design, evaluation methodology, key findings, and future directions, aiming to support future research on robust and generalizable bimanual manipulation policies. The Challenge Webpage is available at https://robotwin-benchmark.github.io/cvpr-2025-challenge/.
中文: CVPR 2025的RoboTwin双机械臂协作挑战赛通过仿真和现实阶段,让全球团队完成17项双臂操作任务,为可泛化的双手策略学习提供了重要见解,推动了具身人工智能的发展。
English: The RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 advanced Embodied AI by engaging global teams in 17 dual-arm manipulation tasks, yielding key insights for generalizable bimanual policies through simulation and real-world stages.
Authors:Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap
Abstract:
Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model's ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.
中文: SoMi-ToM基准通过多模态数据评估具身多智能体社交互动中的心理理论能力,结果显示大型视觉语言模型在第一人称和第三人称评估中均显著落后于人类表现。
English: The SoMi-ToM benchmark is introduced to assess Theory of Mind in embodied multi-agent social interactions using multimodal data, revealing that large vision-language models significantly underperform humans in both first-person and third-person evaluations.
Authors:Rui Xu, Yunke Wang, Yong Luo, Bo Du
Abstract:
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing approaches, despite requiring no additional training or complex modifications. Notably, when integrated with LLaVA-NeXT-7B, VisionDrop achieves a 2.7x reduction in inference latency and 6x in FLOPs, while retaining 95.71% of the original performance.
中文摘要:VisionDrop提出了一种无需训练的视觉令牌剪枝框架,通过模态内注意力机制减少大型视觉语言模型的计算负担,在保持性能的同时显著提升效率,且不依赖文本信号。
English Summary: VisionDrop introduces a training-free visual token pruning framework that uses intra-modal attention to reduce computational overhead in Large Vision-Language Models while preserving performance, achieving significant efficiency gains without relying on textual guidance.
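A rough sketch of the visual-only pruning step as described: rank tokens by the intra-modal attention they receive, keep the dominant ones, and fold the rest into kept tokens. The merge rule shown (argmax attention assignment plus averaging) is an assumption for illustration; the paper's exact contextual-merging rule may differ.

```python
import torch

def visiondrop_prune(tokens, attn, keep_ratio=0.25):
    # tokens: (N, D) visual tokens; attn: (N, N) visual-to-visual attention weights
    scores = attn.mean(dim=0)                      # attention each token receives
    k = max(1, int(keep_ratio * tokens.size(0)))
    keep = scores.topk(k).indices                  # dominant token selection
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[keep] = False
    drop = mask.nonzero(as_tuple=True)[0]
    kept = tokens[keep]
    if drop.numel() > 0:
        # lightweight contextual merging: average each dropped token into the
        # kept token it attends to most (assumed merge rule)
        assign = attn[drop][:, keep].argmax(dim=1)
        counts = torch.ones(k).index_add(0, assign, torch.ones(drop.numel()))
        kept = kept.index_add(0, assign, tokens[drop]) / counts[:, None]
    return kept
```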
Authors:Hakan Çapuk, Andrew Bond, Muhammed Burak Kızıl, Emir Göçen, Erkut Erdem, Aykut Erdem
Abstract:
Recent advances in image generation have led to remarkable improvements in synthesizing perspective images. However, these models still struggle with panoramic image generation due to unique challenges, including varying levels of geometric distortion and the requirement for seamless loop-consistency. To address these issues while leveraging the strengths of the existing models, we introduce TanDiT, a method that synthesizes panoramic scenes by generating grids of tangent-plane images covering the entire 360° view. Unlike previous methods relying on multiple diffusion branches, TanDiT utilizes a unified diffusion model trained to produce these tangent-plane images simultaneously within a single denoising iteration. Furthermore, we propose a model-agnostic post-processing step specifically designed to enhance global coherence across the generated panoramas. To accurately assess panoramic image quality, we also present two specialized metrics, TangentIS and TangentFID, and provide a comprehensive benchmark comprising captioned panoramic datasets and standardized evaluation scripts. Extensive experiments demonstrate that our method generalizes effectively beyond its training data, robustly interprets detailed and complex text prompts, and seamlessly integrates with various generative models to yield high-quality, diverse panoramic images.
中文摘要:TanDiT通过生成覆盖360度视角的切平面图像网格,采用统一扩散模型和后处理技术解决全景图像生成中的几何扭曲和循环一致性问题,并提出了专用评估指标来提升生成质量。
English Summary: TanDiT introduces a unified diffusion model and post-processing technique to generate high-quality panoramic images by synthesizing tangent-plane image grids, addressing geometric distortion and loop-consistency challenges with specialized evaluation metrics.
Authors:Jiajia Guo, Yiming Cui, Shi Jin
Abstract:
Artificial intelligence (AI) substantially enhances channel state information (CSI) acquisition performance but is limited by its reliance on single-modality information and deployment challenges, particularly in dataset collection. This paper investigates the use of semantic-aware digital twin (DT) to enhance AI-based CSI acquisition. We first briefly introduce the motivation and recent advancements in AI-driven CSI acquisition and semantic-aware DT employment for air interfaces. Then, we thoroughly explore how semantic-aware DT can bolster AI-based CSI acquisition. We categorize semantic-aware DT for AI-based CSI acquisition into two classes: enhancing AI-based CSI acquisition through integration with DT and using DT to aid AI-based CSI deployment. Potential integration frameworks are introduced in detail. Finally, we conclude by outlining potential research directions within semantic-aware DT-assisted AI-based CSI acquisition.
中文: 本文探讨了语义感知数字孪生如何通过解决单模态数据和部署困难等局限性来改进基于人工智能的信道状态信息获取,并提出了集成框架和未来研究方向。
English: This paper explores how semantic-aware digital twins can improve AI-based channel state information acquisition by addressing its limitations in single-modality data and deployment difficulties, while also proposing integration frameworks and future research directions.
Authors:Ben Kang, Xin Chen, Jie Zhao, Chunjuan Bo, Dong Wang, Huchuan Lu
Abstract:
Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT lies in its Bridge Module, which connects lightweight transformers to the tracking framework, enhancing feature representation quality. Additionally, we introduce a dual-image position encoding approach to effectively encode spatial information. HiT achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers. Building on HiT, we propose DyHiT, an efficient dynamic tracker that flexibly adapts to scene complexity by selecting routes with varying computational requirements. DyHiT uses search area features extracted by the backbone network and inputs them into an efficient dynamic router to classify tracking scenarios. Based on the classification, DyHiT applies a divide-and-conquer strategy, selecting appropriate routes to achieve a superior trade-off between accuracy and speed. The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while maintaining an AUC of 62.4% on LaSOT. Furthermore, we introduce a training-free acceleration method based on the dynamic routing architecture of DyHiT. This method significantly improves the execution speed of various high-performance trackers without sacrificing accuracy. For instance, our acceleration method enables the state-of-the-art tracker SeqTrack-B256 to achieve a 2.68 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of 69.9% on LaSOT.
中文: HiT跟踪模型系列通过桥接模块和双图像位置编码实现跨设备高效运行,而DyHiT则采用动态路由机制根据场景复杂度自适应调整计算量,进一步优化了性能与速度的平衡。
English: The HiT tracking model family achieves high efficiency and speed across devices through its Bridge Module and dual-image position encoding, while DyHiT further enhances performance with dynamic routing that adapts computational load to scene complexity.
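The dynamic-routing idea sketches simply: a tiny router classifies the scene from search-area features and dispatches to a cheap or a full route. Layer sizes, the two-way split, and the batch-of-one dispatch below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DynamicRouter(nn.Module):
    # small classifier over pooled search-area features that picks a route
    # per frame (hypothetical layer sizes)
    def __init__(self, feat_dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, search_feats):               # (B, N, feat_dim)
        logits = self.head(search_feats.mean(dim=1))
        return logits.argmax(dim=-1)               # 0 = easy scene, 1 = hard scene

def track_frame(router, fast_route, full_route, search_feats):
    # divide and conquer: easy scenes take the cheap path (assumes batch size 1)
    route = router(search_feats)
    return fast_route(search_feats) if route.item() == 0 else full_route(search_feats)
```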
Authors:Dong Liu, Sander Timmerman, Yu Xiang, Peter Palensky, Pedro P. Vergara
Abstract:
This paper introduces a data-driven topology identification and correction approach for low-voltage distribution networks (LVDNs) combined with a time-based smart meter data selection strategy, aiming to correct outdated recordings and identify missing recordings. The proposed approach relies solely on voltage magnitude measurements, alleviating privacy concerns and measurement burdens. It enables distribution system operators to identify switch states through supervised learning algorithms, as well as determine user-feeder connections and phase labels of customers by a modified Hierarchical Clustering algorithm. To address the similarity among smart meter (SM) data caused by distributed photovoltaic (PV) systems, a time-based SM data selection strategy is combined with the proposed correlation analysis. The feasibility and robustness of the proposed approach are validated using modified real-world LVDNs and multiple incomplete SM datasets collected from customers in the Netherlands. The results demonstrate that the time-based SM data selection strategy effectively mitigates the impact of distributed PV systems on phase identification, and the corrected topology not only improves network observability but also supports network operators in load balancing and PV consumption.
中文摘要:本文提出一种基于智能电表电压数据和时序选择策略的低压配电网拓扑识别与校正方法,能够准确识别开关状态和用户相位连接,有效缓解光伏系统影响并提升电网可观测性。
English Summary: This paper presents a data-driven method using smart meter voltage data and time-based selection to correct low-voltage network topologies, enabling accurate switch state identification and customer-phase mapping while addressing photovoltaic system impacts.
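A minimal sketch of the phase-identification step under the stated design: restrict to time windows where PV injection is negligible, then hierarchically cluster customers by the correlation of their voltage-magnitude profiles. The `night_mask` selection rule and the clustering settings are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def phase_identification(voltages, night_mask, n_phases=3):
    # time-based selection: keep only samples where PV output is negligible
    # (e.g. night hours), restoring correlation contrast between phases
    v = voltages[night_mask]                       # (T_night, n_customers)
    corr = np.corrcoef(v.T)
    dist = 1.0 - corr                              # correlation distance
    condensed = dist[np.triu_indices_from(dist, k=1)]
    z = linkage(condensed, method="average")       # modified variants are possible
    return fcluster(z, t=n_phases, criterion="maxclust")  # phase label per customer
```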
Authors:Zhiyuan Wang, Jinhao Duan, Qingni Wang, Xiaofeng Zhu, Tianlong Chen, Xiaoshuang Shi, Kaidi Xu
Abstract:
Uncertainty quantification (UQ) for foundation models is essential to identify and mitigate potential hallucinations in automatically generated text. However, heuristic UQ approaches lack formal guarantees for key metrics such as the false discovery rate (FDR) in selective prediction. Previous work adopts the split conformal prediction (SCP) framework to ensure desired coverage of admissible answers by constructing prediction sets, but these sets often contain incorrect candidates, limiting their practical utility. To address this, we propose COIN, an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question under user-specified FDR constraints. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods such as Clopper-Pearson to establish a high-probability upper bound on the true error rate (i.e., FDR). This enables the selection of the largest uncertainty threshold that ensures FDR control on test data while significantly increasing sample retention. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data across both general and multimodal text generation tasks. Furthermore, we show that employing alternative upper bound constructions and UQ strategies can further boost COIN's power performance, which underscores its extensibility and adaptability to diverse application scenarios.
中文: COIN框架通过统计校准不确定性阈值,在用户指定的错误发现率约束下筛选生成文本,在多种生成任务中展现出可靠的风险控制和更高的答案保留率。
English: The proposed COIN framework statistically calibrates uncertainty thresholds to filter generated text under user-specified false discovery rate constraints, demonstrating robust risk control and improved answer retention across diverse generation tasks.
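The threshold calibration is concrete enough for a short sketch: estimate the error rate among answers retained at each candidate threshold, upper-bound the true FDR with a Clopper-Pearson interval, and keep the largest threshold whose bound stays below the target. Variable names are hypothetical.

```python
from scipy.stats import beta

def clopper_pearson_upper(errors, n, delta=0.05):
    # one-sided (1 - delta) upper confidence bound on a binomial error rate
    return 1.0 if errors == n else beta.ppf(1 - delta, errors + 1, n - errors)

def coin_threshold(uncertainty, admissible, alpha=0.1, delta=0.05):
    """Largest uncertainty threshold whose FDR upper bound stays below alpha.

    uncertainty / admissible: per-question scores and correctness flags
    measured on a held-out calibration set.
    """
    best = None
    for t in sorted(set(uncertainty)):
        sel = [ok for u, ok in zip(uncertainty, admissible) if u <= t]
        if not sel:
            continue
        errors = len(sel) - sum(sel)
        if clopper_pearson_upper(errors, len(sel), delta) <= alpha:
            best = t            # keep growing t while the guarantee holds
    return best
```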
Authors:Yuan Wang, Jiaxiang Liu, Shujian Gao, Bin Feng, Zhihang Tang, Xiaotang Gai, Jian Wu, Zuozhu Liu
Abstract:
Recent advances in multimodal techniques have led to significant progress in Medical Visual Question Answering (Med-VQA). However, most existing models focus on global image features rather than localizing disease-specific regions crucial for diagnosis. Additionally, current research tends to emphasize answer accuracy at the expense of the reasoning pathway, yet both are crucial for clinical decision-making. To address these challenges, we propose From Vision to Text Chain-of-Thought (V2T-CoT), a novel approach that automates the localization of preference areas within biomedical images and incorporates this localization into region-level pixel attention as knowledge for Vision CoT. By fine-tuning the vision language model on constructed R-Med 39K dataset, V2T-CoT provides definitive medical reasoning paths. V2T-CoT integrates visual grounding with textual rationale generation to establish precise and explainable diagnostic results. Experimental results across four Med-VQA benchmarks demonstrate state-of-the-art performance, achieving substantial improvements in both performance and interpretability.
中文:提出的V2T-CoT方法通过自动定位病灶区域并将其融入视觉推理链,在多个医疗问答基准测试中实现了卓越性能与可解释性的双重突破。
English: The proposed V2T-CoT method enhances medical visual question answering by automatically localizing disease-specific regions and integrating them into visual reasoning chains, achieving superior performance and interpretability across benchmarks.
Authors:Lyuye Zhang, Jian Zhang, Kaixuan Li, Chong Wang, Chengwei Liu, Jiahui Wu, Sen Chen, Yaowen Zheng, Yang Liu
Abstract:
Software Composition Analysis (SCA) has become pivotal in addressing vulnerabilities inherent in software project dependencies. In particular, reachability analysis is increasingly used in Open-Source Software (OSS) projects to identify reachable vulnerabilities (e.g., CVEs) through call graphs, enabling a focus on exploitable risks. Performing reachability analysis typically requires the vulnerable function (VF) to track the call chains from downstream applications. However, such crucial information is usually unavailable in modern vulnerability databases like NVD. While directly extracting VF from modified functions in vulnerability patches is intuitive, patches are not always available. Moreover, our preliminary study shows that over 26% of VF do not exist in the modified functions. Meanwhile, simply ignoring patches to search vulnerable functions suffers from overwhelming noise and lexical gaps between descriptions and source code. Given that almost half of the vulnerabilities are equipped with patches, a holistic solution that handles both scenarios with and without patches is required. To meet real-world needs and automatically localize VF, we present VFArchae, a dual-mode approach designed for disclosed vulnerabilities, applicable in scenarios with or without available patch links. The experimental results of VFArchae on our constructed benchmark dataset demonstrate significant efficacy regarding three metrics, achieving 1.3x and 1.9x Mean Reciprocal Rank over the best baselines for Patch-present and Patch-absent modes, respectively. Moreover, VFArchae has proven its applicability in real-world scenarios by successfully locating VF for 43 out of 50 latest vulnerabilities with reasonable efforts and reducing the false positives of SCA tools by 78-89%.
中文摘要:VFArchae是一种双模式方法,可自动定位软件依赖中的易受攻击函数,有效处理有补丁和无补丁两种场景,显著提升软件成分分析的准确性并大幅减少误报。
English Summary: VFArchae is a dual-mode approach that automatically localizes vulnerable functions in software dependencies, effectively handling both patch-present and patch-absent scenarios while significantly improving accuracy and reducing false positives in software composition analysis.
Authors:Wang Lingxiang, Quanzhi Fu, Wenjia Song, Gelei Deng, Yi Liu, Dan Williams, Ying Zhang
Abstract:
The integration of open-source third-party library dependencies in Java development introduces significant security risks when these libraries contain known vulnerabilities. Existing Software Composition Analysis (SCA) tools struggle to effectively detect vulnerable API usage from these libraries due to limitations in understanding API usage semantics and computational challenges in analyzing complex codebases, leading to inaccurate vulnerability alerts that burden development teams and delay critical security fixes.
To address these challenges, we propose SAVANT by leveraging two insights: proof-of-vulnerability test cases demonstrate how vulnerabilities can be triggered in specific contexts, and Large Language Models (LLMs) can understand code semantics. SAVANT combines semantic preprocessing with LLM-powered context analysis for accurate vulnerability detection. SAVANT first segments source code into meaningful blocks while preserving semantic relationships, then leverages LLM-based reflection to analyze API usage context and determine actual vulnerability impacts. Our evaluation on 55 real-world applications shows that SAVANT achieves 83.8% precision, 73.8% recall, 69.0% accuracy, and 78.5% F1-score, outperforming state-of-the-art SCA tools.
中文摘要:Java开发中第三方库的安全漏洞带来显著风险,现有SCA工具难以有效检测API漏洞;提出的SAVANT方法结合语义预处理和大型语言模型分析,在真实应用评估中实现了83.8%的精确度和78.5%的F1分数,性能优于现有工具。
English Summary: Java development faces security risks from vulnerable third-party libraries, and current SCA tools often fail to detect API vulnerabilities accurately; the proposed SAVANT method uses semantic preprocessing and LLM analysis to significantly improve detection performance, achieving over 83% precision in evaluations.
Authors:Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan
Abstract:
We present Assembler, a scalable and generalizable framework for 3D part assembly that reconstructs complete objects from input part meshes and a reference image. Unlike prior approaches that mostly rely on deterministic part pose prediction and category-specific training, Assembler is designed to handle diverse, in-the-wild objects with varying part counts, geometries, and structures. It addresses the core challenges of scaling to general 3D part assembly through innovations in task formulation, representation, and data. First, Assembler casts part assembly as a generative problem and employs diffusion models to sample plausible configurations, effectively capturing ambiguities arising from symmetry, repeated parts, and multiple valid assemblies. Second, we introduce a novel shape-centric representation based on sparse anchor point clouds, enabling scalable generation in Euclidean space rather than SE(3) pose prediction. Third, we construct a large-scale dataset of over 320K diverse part-object assemblies using a synthesis and filtering pipeline built on existing 3D shape repositories. Assembler achieves state-of-the-art performance on PartNet and is the first to demonstrate high-quality assembly for complex, real-world objects. Based on Assembler, we further introduce an interesting part-aware 3D modeling system that generates high-resolution, editable objects from images, demonstrating potential for interactive and compositional design. Project page: https://assembler3d.github.io
中文: Assembler是一个可扩展的三维零件组装框架,通过扩散模型和新型形状表示,能够从零件网格和参考图像重建完整物体,在多样化数据集上实现了最先进的性能。
English: Assembler is a scalable 3D part assembly framework that uses diffusion models and a novel shape representation to reconstruct complete objects from part meshes and reference images, achieving state-of-the-art performance on diverse datasets.
Authors:Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu
Abstract:
In modern speech synthesis, paralinguistic information--such as a speaker's vocal timbre, emotional state, and dynamic prosody--plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or inserting a speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. We introduce three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, including English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess their instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.
中文摘要:本文提出InstructTTSEval基准,旨在评估语音合成系统执行复杂自然语言指令以控制副语言特征的能力,发现现有模型仍有较大改进空间。
English Summary: This paper introduces InstructTTSEval, a benchmark designed to evaluate how well text-to-speech systems follow complex natural language instructions for controlling paralinguistic features, revealing significant room for improvement in current models.
Authors:Sameer Khurana, Dominik Klement, Antoine Laurent, Dominik Bobos, Juraj Novosad, Peter Gazdik, Ellen Zhang, Zili Huang, Amir Hussein, Ricard Marxer, Yoshiki Masuyama, Ryo Aihara, Chiori Hori, Francois G. Germain, Gordon Wichern, Jonathan Le Roux
Abstract:
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels -- acoustic, phonetic, and lexical -- within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
中文: HAC是一种统一的神经语音编解码器,通过知识蒸馏将语音分层分解为声学、音素和词汇三个层面,在解缠和重建质量上表现优异,提供了可解释的语言表征。
English: HAC is a unified neural speech codec that hierarchically factorizes speech into acoustic, phonetic, and lexical levels using knowledge distillation, achieving superior disentanglement and reconstruction quality for interpretable linguistic representations.
Authors:Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto
Abstract:
Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity -- how counterfactuals alter the likelihood of a note; and (2) downstream reasoning -- how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.
中文: 本研究提出DeVisE行为测试框架,发现大语言模型在临床决策中常混淆真实医学推理与表面模式,零样本模型展现出更连贯的反事实推理能力,而微调模型虽更稳定但对关键临床变化的响应较弱。
English: The study introduces DeVisE, a behavioral testing framework that reveals large language models (LLMs) often fail to distinguish genuine clinical reasoning from superficial patterns, with zero-shot models showing more coherent counterfactual reasoning while fine-tuned ones are more stable but less responsive to meaningful clinical changes.
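One way to picture the input-level sensitivity probe described above: score each note by its average token log-likelihood under the model and measure how a single-variable counterfactual shifts that score. The sketch assumes a Hugging Face-style causal LM; `model` and `tokenizer` are placeholders.

```python
import torch

@torch.no_grad()
def avg_log_likelihood(model, tokenizer, note):
    # mean token log-likelihood of a discharge note under a causal LM
    ids = tokenizer(note, return_tensors="pt").input_ids
    return -model(ids, labels=ids).loss.item()  # higher = more likely

def counterfactual_shift(model, tokenizer, note, cf_note):
    # how much a single-variable edit (e.g. age 45 -> 85) moves the likelihood;
    # coherent models should respond to clinically meaningful edits
    return (avg_log_likelihood(model, tokenizer, cf_note)
            - avg_log_likelihood(model, tokenizer, note))
```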
Authors:Matteo Zecchin, Tomer Raviv, Dileep Kalathil, Krishna Narayanan, Nir Shlezinger, Osvaldo Simeone
Abstract:
In recent years, deep learning has facilitated the creation of wireless receivers capable of functioning effectively in conditions that challenge traditional model-based designs. Leveraging programmable hardware architectures, deep learning-based receivers offer the potential to dynamically adapt to varying channel environments. However, current adaptation strategies, including joint training, hypernetwork-based methods, and meta-learning, either demonstrate limited flexibility or necessitate explicit optimization through gradient descent. This paper presents gradient-free adaptation techniques rooted in the emerging paradigm of in-context learning (ICL). We review architectural frameworks for ICL based on Transformer models and structured state-space models (SSMs), alongside theoretical insights into how sequence models effectively learn adaptation from contextual information. Further, we explore the application of ICL to cell-free massive MIMO networks, providing both theoretical analyses and empirical evidence. Our findings indicate that ICL represents a principled and efficient approach to real-time receiver adaptation using pilot signals and auxiliary contextual information, without requiring online retraining.
中文摘要:本文提出基于上下文学习的无梯度自适应技术,通过理论分析和实证研究证明其在无在线重训练条件下,能利用导频信号实现无线接收机的实时高效自适应。
English Summary: This paper introduces gradient-free adaptation techniques using in-context learning for wireless receivers, enabling real-time adaptation without online retraining through theoretical frameworks and empirical validation in cell-free massive MIMO networks.
Authors:Shengjia Zhang, Jiawei Chen, Changdong Li, Sheng Zhou, Qihao Shi, Yan Feng, Chun Chen, Can Wang
Abstract:
Loss functions play a pivotal role in optimizing recommendation models. Among various loss functions, Softmax Loss (SL) and Cosine Contrastive Loss (CCL) are particularly effective. Their theoretical connections and differences warrant in-depth exploration. This work conducts comprehensive analyses of these losses, yielding significant insights: 1) Common strengths -- both can be viewed as augmentations of traditional losses with Distributional Robust Optimization (DRO), enhancing robustness to distributional shifts; 2) Respective limitations -- stemming from their use of different distribution distance metrics in DRO optimization, SL exhibits high sensitivity to false negative instances, whereas CCL suffers from low data utilization. To address these limitations, this work proposes a new loss function, DrRL, which generalizes SL and CCL by leveraging Rényi-divergence in DRO optimization. DrRL incorporates the advantageous structures of both SL and CCL, and can be demonstrated to effectively mitigate their limitations. Extensive experiments have been conducted to validate the superiority of DrRL on both recommendation accuracy and robustness.
中文: 本研究分析了推荐系统中的Softmax损失和余弦对比损失,揭示了它们通过分布鲁棒优化共有的鲁棒性及各自局限性,并提出DrRL这一利用Rényi散度的新型损失函数,在提升推荐精度与鲁棒性的同时有效克服了原有缺陷。
English: This study analyzes Softmax Loss and Cosine Contrastive Loss in recommendation systems, revealing their shared robustness through Distributional Robust Optimization but distinct limitations, and introduces DrRL, a novel loss function using Rényi-divergence to overcome these drawbacks while enhancing accuracy and robustness.
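To make the DRO framing tangible, here is an illustrative power-mean penalty over negatives whose parameter interpolates how strongly hard negatives are emphasized. This is a hedged sketch for intuition only, not the paper's exact DrRL objective; `alpha` and `tau` are assumed hyperparameters.

```python
import math
import torch

def renyi_style_negative_term(neg, alpha=2.0, tau=0.1):
    # power-mean penalty: (1/alpha) * log mean_j exp(alpha * s_j / tau).
    # alpha -> 1 recovers the log-mean-exp term of a Softmax-Loss-style
    # (KL-type DRO) objective; larger alpha up-weights hard negatives.
    n = neg.size(1)
    return (torch.logsumexp(alpha * neg / tau, dim=1) - math.log(n)) / alpha

def drrl_style_loss(pos, neg, alpha=2.0, tau=0.1):
    # pos: (B,) positive similarities; neg: (B, N) sampled negative similarities
    return (renyi_style_negative_term(neg, alpha, tau) - pos / tau).mean()
```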
Authors:Matteo Nerini, Bruno Clerckx
Abstract:
To meet the demands of future wireless networks, antenna arrays must scale from massive multiple-input multiple-output (MIMO) to gigantic MIMO, involving even larger numbers of antennas. To address the hardware and computational cost of gigantic MIMO, several strategies are available that shift processing from the digital to the analog domain. Among them, microwave linear analog computers (MiLACs) offer a compelling solution by enabling fully analog beamforming through reconfigurable microwave networks. Prior work has focused on fully-connected MiLACs, whose ports are all interconnected to each other via tunable impedance components. Although such MiLACs are capacity-achieving, their circuit complexity, given by the number of required impedance components, scales quadratically with the number of antennas, limiting their practicality. To solve this issue, in this paper, we propose a graph-theoretical model of MiLACs facilitating the systematic design of lower-complexity MiLAC architectures. Leveraging this model, we propose stem-connected MiLACs as a family of MiLAC architectures maintaining capacity-achieving performance while drastically reducing the circuit complexity. In addition, we optimize stem-connected MiLACs with a closed-form capacity-achieving solution. Our theoretical analysis, confirmed by numerical simulations, shows that stem-connected MiLACs are capacity-achieving, but with circuit complexity that scales linearly with the number of antennas, enabling high-performance, scalable, gigantic MIMO.
Chinese: 本文提出了一种基于图论的MiLAC设计模型,通过构建茎连接架构在保持容量最优性能的同时,将电路复杂度从天线数量的二次方降低至线性增长,为实现可扩展的高性能巨型MIMO系统提供了解决方案。
English: This paper introduces a graph-theoretical model for designing stem-connected microwave linear analog computers (MiLACs), which achieve optimal capacity performance while reducing circuit complexity from quadratic to linear scaling with the number of antennas, enabling scalable and high-performance gigantic MIMO systems.
Authors:Lingteng Qiu, Peihao Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Siyu Zhu, Xiaoguang Han, Guanying Chen, Zilong Dong
Abstract:
Reconstructing an animatable 3D human from casually captured images of an articulated subject without camera or human pose information is a practical yet challenging task due to view misalignment, occlusions, and the absence of structural priors. While optimization-based methods can produce high-fidelity results from monocular or multi-view videos, they require accurate pose estimation and slow iterative optimization, limiting scalability in unconstrained scenarios. Recent feed-forward approaches enable efficient single-image reconstruction but struggle to effectively leverage multiple input images to reduce ambiguity and improve reconstruction accuracy. To address these challenges, we propose PF-LHM, a large human reconstruction model that generates high-quality 3D avatars in seconds from one or multiple casually captured pose-free images. Our approach introduces an efficient Encoder-Decoder Point-Image Transformer architecture, which fuses hierarchical geometric point features and multi-view image features through multimodal attention. The fused features are decoded to recover detailed geometry and appearance, represented using 3D Gaussian splats. Extensive experiments on both real and synthetic datasets demonstrate that our method unifies single- and multi-image 3D human reconstruction, achieving high-fidelity and animatable 3D human avatars without requiring camera and human pose annotations. Code and models will be released to the public.
中文: PF-LHM是一种创新模型,能够从单张或多张随意拍摄的无相机与姿态标注图像中高效重建高质量、可动画的3D人体化身,通过多模态注意力架构融合点云与图像特征,实现精细的几何与外观还原。
English: PF-LHM is a novel model that efficiently reconstructs high-quality, animatable 3D human avatars from single or multiple casually captured images without camera or pose data, using a multimodal attention-based architecture to fuse point and image features for detailed geometry and appearance recovery.
Authors:Songtao Jiang, Yuan Wang, Ruizhe Chen, Yan Zhang, Ruilin Luo, Bohan Lei, Sibo Song, Yang Feng, Jimeng Sun, Jian Wu, Zuozhu Liu
Abstract:
In medical visual question answering (Med-VQA), achieving accurate responses relies on three critical steps: precise perception of medical imaging data, logical reasoning grounded in visual input and textual questions, and coherent answer derivation from the reasoning process. Recent advances in general vision-language models (VLMs) show that large-scale reinforcement learning (RL) could significantly enhance both reasoning capabilities and overall model performance. However, their application in medical domains is hindered by two fundamental challenges: 1) misalignment between perceptual understanding and reasoning stages, and 2) inconsistency between reasoning pathways and answer generation, both compounded by the scarcity of high-quality medical datasets for effective large-scale RL. In this paper, we first introduce Med-Zero-17K, a curated dataset for pure RL-based training, encompassing over 30 medical image modalities and 24 clinical tasks. Moreover, we propose a novel large-scale RL framework for Med-VLMs, Consistency-Aware Preference Optimization (CAPO), which integrates rewards to ensure fidelity between perception and reasoning, consistency in reasoning-to-answer derivation, and rule-based accuracy for final responses. Extensive experiments on both in-domain and out-of-domain scenarios demonstrate the superiority of our method over strong VLM baselines, showcasing strong generalization capability to 3D Med-VQA benchmarks and R1-like training paradigms.
中文摘要:本文针对医学视觉问答中的挑战,提出了专门数据集Med-Zero-17K和新型强化学习框架CAPO,通过确保感知、推理与答案生成间的一致性,在多种医疗场景中展现出优越性能。
English Summary: This paper addresses challenges in medical visual question answering by introducing a curated dataset, Med-Zero-17K, and proposing a novel reinforcement learning framework called CAPO that ensures consistency between perception, reasoning, and answer generation, demonstrating superior performance across various medical scenarios.
Authors:Yuan Zang, Hao Tan, Seunghyun Yoon, Franck Dernoncourt, Jiuxiang Gu, Kushal Kafle, Chen Sun, Trung Bui
Abstract:
We study multi-modal summarization for instructional videos, whose goal is to provide users an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization, and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. We propose a novel benchmark for user interface (UI) instructional video summarization to fill the gap. We collect a dataset of 2,413 UI instructional videos, which spans over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enable the comprehensive evaluations for concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, and highlight the importance of new methods for UI instructional video summarization.
Chinese: 本研究针对用户界面教学视频提出了一种新的多模态摘要基准,通过手动标注的2,413个视频数据集提供逐步执行指令和关键帧,填补了现有方法的不足,并强调了开发专门技术的必要性。
English: This research introduces a new benchmark for multi-modal summarization of user interface instructional videos, addressing the limitations of existing methods by providing step-by-step instructions and key frames through a manually annotated dataset of 2,413 videos, which reveals the inadequacy of current approaches and underscores the need for specialized techniques.
Authors:Xingzhong Hou, Jie Wu, Boxiao Liu, Yi Zhang, Guanglu Song, Yunpeng Liu, Yu Liu, Haihang You
Abstract:
Image inpainting is the task of reconstructing missing or damaged parts of an image in a way that seamlessly blends with the surrounding content. With the advent of advanced generative models, especially diffusion models and generative adversarial networks, inpainting has achieved remarkable improvements in visual quality and coherence. However, achieving seamless continuity remains a significant challenge. In this work, we propose two novel methods to address discrepancy issues in diffusion-based inpainting models. First, we introduce a modified Variational Autoencoder that corrects color imbalances, ensuring that the final inpainted results are free of color mismatches. Second, we propose a two-step training strategy that improves the blending of generated and existing image content during the diffusion process. Through extensive experiments, we demonstrate that our methods effectively reduce discontinuity and produce high-quality inpainting results that are coherent and visually appealing.
中文: 本研究提出了两种新技术——改进的变分自编码器用于色彩校正和两步训练策略,通过减少不连续性和提升视觉一致性来增强基于扩散模型的图像修复效果。
English: This study introduces two novel techniques—a modified Variational Autoencoder for color correction and a two-step training strategy—to enhance diffusion-based image inpainting by reducing discontinuities and improving visual coherence.
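As a rough stand-in for the color-correction idea above, the sketch below matches the per-channel statistics of the generated region to the known content. The paper builds this correction into a modified VAE, so treat this purely as an illustration of the goal, with hypothetical names.

```python
import torch

def match_color_stats(inpainted, original, mask, eps=1e-6):
    """Align per-channel mean/std of the generated region to the known content.

    inpainted / original: (C, H, W) tensors; mask: (H, W) bool, True in the hole.
    """
    out = inpainted.clone()
    for c in range(inpainted.size(0)):
        gen = inpainted[c][mask]        # generated pixels inside the hole
        ref = original[c][~mask]        # trusted pixels outside the hole
        out[c][mask] = (gen - gen.mean()) / (gen.std() + eps) * ref.std() + ref.mean()
    return out
```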
Authors:Qirui Zhou, Shaohui Peng, Weiqiang Xiong, Haixin Chen, Yuanbo Wen, Haochen Li, Ling Li, Qi Guo, Yongwei Zhao, Ke Gao, Ruizhi Chen, Yanjun Wu, Chen Zhao, Yunji Chen
Abstract:
The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly for long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it requires time-consuming and hardware-specific manual implementation, limiting adaptability across GPU architectures. Existing LLMs have shown a lot of promise in code generation tasks, but struggle to generate high-performance attention code. The key challenge is that they cannot comprehend the complex data flow and computation process of the attention operator, nor utilize low-level primitives to exploit GPU performance.
To address the above challenge, we propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic and low-level implementation on GPU, and enhance LLMs' understanding of the attention operator. Along with a 2-stage reasoning workflow, TL-Code generation and translation, the LLMs can automatically generate FlashAttention implementations on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, the performance of our methods significantly outshines that of vanilla LLMs, achieving a speed-up of up to 35.16x. Moreover, our method not only surpasses human-optimized libraries (cuDNN and official library) in most scenarios but also extends support to unsupported hardware and data types, reducing development time from months to minutes compared with human experts.
中文总结:本文提出一种面向大语言模型的思维语言(LLM-TL),通过两阶段推理流程使大模型能自动生成适配不同GPU的高性能FlashAttention实现,在保持35.16倍加速的同时将开发时间从数月缩短至分钟级,并扩展了对未支持硬件和数据类型的兼容性。
English Summary: This paper introduces an LLM-friendly Thinking Language (LLM-TL) that enables large language models to automatically generate high-performance GPU implementations of FlashAttention, achieving up to 35.16x speedup while supporting diverse hardware architectures and reducing development time from months to minutes.
Authors:Yuntao Shou, Jun Yao, Tao Meng, Wei Ai, Cen Chen, Keqin Li
Abstract:
Multimodal emotion recognition in conversations (MERC) aims to infer the speaker's emotional state by analyzing utterance information from multiple sources (i.e., video, audio, and text). Compared with unimodality, a more robust utterance representation can be obtained by fusing complementary semantic information from different modalities. However, the modality missing problem severely limits the performance of MERC in practical scenarios. Recent work has achieved impressive performance on modality completion using graph neural networks and diffusion models separately. This inspires us to combine these two dimensions through the graph diffusion model to obtain more powerful modal recovery capabilities. Unfortunately, existing graph diffusion models may destroy the connectivity and local structure of the graph by directly adding Gaussian noise to the adjacency matrix, resulting in the generated graph data being unable to retain the semantic and topological information of the original graph. To this end, we propose a novel Graph Spectral Diffusion Network (GSDNet), which maps Gaussian noise to the graph spectral space of missing modalities and recovers the missing data according to its original distribution. Compared with previous graph diffusion methods, GSDNet only affects the eigenvalues of the adjacency matrix instead of destroying the adjacency matrix directly, which can maintain the global topological information and important spectral features during the diffusion process. Extensive experiments have demonstrated that GSDNet achieves state-of-the-art emotion recognition performance in various modality loss scenarios.
Chinese: 提出的图谱扩散网络(GSDNet)通过将高斯噪声映射到图谱空间来解决多模态情感识别中的模态缺失问题,有效保留全局拓扑结构和谱特征,实现了最优的情感识别性能。
English: The proposed Graph Spectral Diffusion Network (GSDNet) addresses modality loss in multimodal emotion recognition by mapping Gaussian noise to the graph spectral space, preserving global topology and spectral features to achieve state-of-the-art performance.
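A minimal sketch of the core idea, diffusing in the spectral domain rather than on the adjacency matrix itself (the schedule and function names below are illustrative assumptions, not the authors' code):

    import numpy as np

    def spectral_forward_diffusion(adj, alpha_bar_t, rng=None):
        """Noise only the eigenvalues of a symmetric adjacency matrix,
        leaving the eigenvectors (and hence the graph's global structure)
        untouched."""
        rng = rng or np.random.default_rng(0)
        eigvals, eigvecs = np.linalg.eigh(adj)   # A = U diag(lam) U^T
        noise = rng.standard_normal(eigvals.shape)
        noisy = np.sqrt(alpha_bar_t) * eigvals + np.sqrt(1 - alpha_bar_t) * noise
        return eigvecs @ np.diag(noisy) @ eigvecs.T

    A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
    A_t = spectral_forward_diffusion(A, alpha_bar_t=0.9)

Because only the eigenvalues are perturbed, the reverse process can denoise in the same spectral basis, which is what lets the method preserve topology that adjacency-space noising destroys.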
Authors:Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum
Abstract:
We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench and outperforms the previous SoTA by 3.9 F1 points, showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward-ranked fine-tuning, which outperforms baselines by a significant margin on five benchmarks.
Chinese: Athena-PRM是一种多模态过程奖励模型,通过利用强弱完成者之间的预测一致性来高效评估推理步骤,仅需少量数据即可在多个基准测试中实现最优性能,并采用有效策略提升效果。
English: Athena-PRM is a multimodal process reward model that efficiently evaluates reasoning steps using prediction consistency between weak and strong completers, achieving state-of-the-art performance across multiple benchmarks with minimal data and enhanced strategies.
Authors:Xinyu Peng, Ziyang Zheng, Yaoming Wang, Han Li, Nuowen Kan, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
Abstract:
We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefits of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.
中文摘要:NCVSD是一种创新方法,通过将无条件评分函数特性融入变分评分蒸馏框架,将预训练扩散模型提炼为生成式去噪器,既能实现快速单步生成,又能通过多步采样提升样本质量。
English Summary: NCVSD is a novel method that distills pretrained diffusion models into generative denoisers by leveraging insights about unconditional score functions within the VSD framework, enabling fast one-step generation and improved sample quality through scalable multi-step sampling.
Authors:Wenbing Tang, Mingfei Cheng, Renzhi Wang, Yuan Zhou, Chengwei Liu, Yang Liu, Zuohua Ding
Abstract:
Simulation-based testing is essential for evaluating the safety of Autonomous Driving Systems (ADSs). Comprehensive evaluation requires testing across diverse scenarios that can trigger various types of violations under different conditions. While existing methods typically focus on individual diversity metrics, such as input scenarios, ADS-generated motion commands, and system violations, they often fail to capture the complex interrelationships among these elements. This oversight leads to gaps in testing coverage, potentially missing critical issues in the ADS under evaluation. However, quantifying these interrelationships presents a significant challenge. In this paper, we propose a novel causality-aware fuzzing technique, Causal-Fuzzer, to enable efficient and comprehensive testing of ADSs by exploring causally diverse scenarios. The core of Causal-Fuzzer is constructing a causal graph to model the interrelationships among the diversities of input scenarios, ADS motion commands, and system violations. This causal graph then guides the process of critical scenario generation. Specifically, Causal-Fuzzer proposes (1) a causality-based feedback mechanism that quantifies the combined diversity of test scenarios by assessing whether they activate new causal relationships, and (2) a causality-driven mutation strategy that prioritizes mutations on input scenario elements with higher causal impact on ego action changes and violation occurrence, rather than treating all elements equally. We evaluated Causal-Fuzzer on the industry-grade ADS Apollo with a high-fidelity simulator. Our empirical results demonstrate that Causal-Fuzzer significantly outperforms existing methods in (1) identifying a greater diversity of violations, (2) providing enhanced testing sufficiency with improved coverage of causal relationships, and (3) achieving greater efficiency in detecting the first critical scenarios.
中文: 本文提出Causal-Fuzzer这一因果感知模糊测试技术,通过建模输入场景、运动指令和系统违规之间的因果关系来生成关键场景,相较于现有方法显著提升了自动驾驶系统测试的多样性、充分性和效率。
English: This paper introduces Causal-Fuzzer, a causality-aware fuzzing technique that models the interrelationships among input scenarios, motion commands, and system violations to generate critical scenarios, significantly improving the diversity, sufficiency, and efficiency of autonomous driving system testing compared to existing methods.
Authors:Jianhui Wei, Zikai Xiao, Danyu Sun, Luqi Gong, Zongxin Yang, Zuozhu Liu, Jian Wu
Abstract:
Surgical video understanding is pivotal for enabling automated intraoperative decision-making, skill assessment, and postoperative quality improvement. However, progress in developing surgical video foundation models (FMs) remains hindered by the scarcity of large-scale, diverse datasets for pretraining and systematic evaluation. In this paper, we introduce SurgBench, a unified surgical video benchmarking framework comprising a pretraining dataset, SurgBench-P, and an evaluation benchmark, SurgBench-E. SurgBench offers extensive coverage of diverse surgical scenarios, with SurgBench-P encompassing 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E providing robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks. Extensive experiments reveal that existing video FMs struggle to generalize across varied surgical video analysis tasks, whereas pretraining on SurgBench-P yields substantial performance improvements and superior cross-domain generalization to unseen procedures and modalities. Our dataset and code are available upon request.
Chinese: 手术视频理解对于自动化决策和技能评估至关重要,但进展受限于大规模数据集的缺乏,而SurgBench通过全面的预训练数据集和评估基准解决了这一问题,显著提升了模型性能和泛化能力。
English: Surgical video understanding is crucial for automated decision-making and skill assessment, but progress is limited by the lack of large-scale datasets, which SurgBench addresses with a comprehensive pretraining dataset and evaluation benchmark to enhance model performance and generalization.
Authors:Tan Chen, Jintao Yan, Yuxuan Sun, Sheng Zhou, Zhisheng Niu
Abstract:
Federated learning (FL) is a promising paradigm for multiple devices to cooperatively train a model. When applied in wireless networks, two issues consistently affect the performance of FL, i.e., data heterogeneity of devices and limited bandwidth. Many papers have investigated device scheduling strategies considering the two issues. However, most of them recognize data heterogeneity as a property of individual devices. In this paper, we prove that the convergence speed of FL is affected by the sum of device-level and sample-level collective gradient divergence (CGD). The device-level CGD refers to the gradient divergence of the scheduled device group, instead of the sum of the individual device divergences. The sample-level CGD is statistically upper bounded by sampling variance, which is inversely proportional to the total number of samples scheduled for local update. To derive a tractable form of the device-level CGD, we further consider a classification problem and transform it into the weighted earth moving distance (WEMD) between the group distribution and the global distribution. Then we propose the FedCGD algorithm to minimize the sum of multi-level CGDs by balancing WEMD and sampling variance, within polynomial time. Simulation shows that the proposed strategy increases classification accuracy on the CIFAR-10 dataset by up to 4.2% while scheduling 41.8% fewer devices, and flexibly switches between reducing WEMD and reducing sampling variance.
中文摘要:本文提出FedCGD算法,通过最小化设备级和样本级的集体梯度差异来优化联邦学习收敛性能,在减少调度设备的同时显著提升分类准确率。
English Summary: This paper introduces FedCGD, a federated learning algorithm that enhances model convergence by minimizing both device-level and sample-level collective gradient divergence, achieving higher accuracy with fewer devices.
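To make the trade-off concrete, here is a hedged sketch of the two quantities FedCGD balances; the paper's WEMD weights classes, so the uniform-weight L1 form below is a simplifying assumption for illustration only:

    import numpy as np

    def group_label_divergence(label_counts, scheduled, global_dist):
        """Device-level term: divergence between the SCHEDULED GROUP's pooled
        label distribution and the global one (a group property, not a sum of
        per-device gaps). label_counts: (num_devices, num_classes)."""
        pooled = label_counts[scheduled].sum(axis=0)
        group_dist = pooled / pooled.sum()
        return 0.5 * np.abs(group_dist - global_dist).sum()

    def sampling_variance_bound(label_counts, scheduled):
        """Sample-level term: upper-bounded by sampling variance, which decays
        as 1 / (total number of scheduled samples)."""
        return 1.0 / label_counts[scheduled].sum()

Scheduling more devices shrinks the second term but may worsen the first, which is the tension the algorithm resolves.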
Authors:Jintao Yan, Tan Chen, Yuxuan Sun, Zhaojun Nan, Sheng Zhou, Zhisheng Niu
Abstract:
Asynchronous Federated Learning (AFL) enables distributed model training across multiple mobile devices, allowing each device to independently update its local model without waiting for others. However, device mobility introduces intermittent connectivity, which necessitates gradient sparsification and leads to model staleness, jointly affecting AFL convergence. This paper develops a theoretical model to characterize the interplay among sparsification, model staleness and mobility-induced contact patterns, and their joint impact on AFL convergence. Based on the analysis, we propose a mobility-aware dynamic sparsification (MADS) algorithm that optimizes the sparsification degree based on contact time and model staleness. Closed-form solutions are derived, showing that under low-speed conditions, MADS increases the sparsification degree to enhance convergence, while under high-speed conditions, it reduces the sparsification degree to guarantee reliable uploads within limited contact time. Experimental results validate the theoretical findings. Compared with the state-of-the-art benchmarks, the MADS algorithm increases the image classification accuracy on the CIFAR-10 dataset by 8.76% and reduces the average displacement error in the Argoverse trajectory prediction dataset by 9.46%.
中文摘要:本文提出了一种移动感知动态稀疏化(MADS)算法,通过根据设备接触时间和模型陈旧度优化梯度稀疏化程度,有效解决了异步联邦学习中因设备移动性导致的收敛问题,在图像分类和轨迹预测任务中均实现了显著性能提升。
English Summary: This paper introduces a mobility-aware dynamic sparsification (MADS) algorithm that optimizes gradient sparsification in asynchronous federated learning to counteract model staleness caused by device mobility, achieving significant performance improvements in both image classification and trajectory prediction tasks.
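A toy sketch of the mobility-aware rule (the paper derives closed-form sparsification degrees; the budget heuristic below is an illustrative stand-in): shorter contact windows force a smaller top-k so the upload still completes reliably.

    import torch

    def mads_sparsify(grad, contact_time_s, rate_bps, bits_per_entry=64):
        """Keep only as many top-magnitude gradient entries as the contact
        window can carry; bits_per_entry covers one value plus one index."""
        budget = int(contact_time_s * rate_bps / bits_per_entry)
        k = max(1, min(grad.numel(), budget))
        flat = grad.flatten()
        keep = torch.topk(flat.abs(), k).indices
        out = torch.zeros_like(flat)
        out[keep] = flat[keep]
        return out.view_as(grad), k

    g = torch.randn(10_000)
    g_fast, k_fast = mads_sparsify(g, contact_time_s=0.2, rate_bps=100_000)  # high speed: small k
    g_slow, k_slow = mads_sparsify(g, contact_time_s=2.0, rate_bps=100_000)  # low speed: larger k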
Authors:Hardik Parwana, Taekyung Kim, Kehan Long, Bardh Hoxha, Hideki Okamoto, Georgios Fainekos, Dimitra Panagou
Abstract:
The Model Predictive Path Integral (MPPI) controller is used to solve unconstrained optimal control problems, and the Control Barrier Function (CBF) is a tool to impose strict inequality constraints, a.k.a. barrier constraints. In this work, we propose an integration of these two methods that employs CBF-like conditions to guide the control sampling procedure of MPPI. CBFs provide an inequality constraint restricting the rate of change of barrier functions by a class-K function of the barrier itself. We instead impose the CBF condition as an equality constraint by choosing a parametric linear class-K function and treating this parameter as a state in an augmented system. The time derivative of this parameter acts as an additional control input that is designed by MPPI. A cost function is further designed to reignite Nagumo's theorem at the boundary of the safe set by promoting specific values of the class-K parameter to enforce safety. Our problem formulation results in an MPPI subject to multiple state- and control-dependent equality constraints, which are non-trivial to satisfy with randomly sampled control inputs. We therefore also introduce state transformations and control projection operations, inspired by the literature on path planning for manifolds, to resolve this issue. We show empirically through simulations and experiments on a quadrotor that our proposed algorithm exhibits better sample efficiency and an enhanced capability to operate closer to the safe set boundary over vanilla MPPI.
中文摘要:本研究将模型预测路径积分(MPPI)控制器与控制屏障函数(CBF)相结合,通过构建增广系统将CBF不等式约束转化为等式约束,并采用状态变换和控制投影技术,在保证安全性的同时显著提升了边界区域的采样效率和控制性能。
English Summary: This paper integrates Model Predictive Path Integral (MPPI) control with Control Barrier Functions (CBFs) by transforming CBF inequality constraints into equality constraints through an augmented system, employing state transformations and control projections to enhance sampling efficiency and safety near set boundaries.
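In equation form, the reformulation described above reads roughly as follows (notation assumed for illustration, not the authors'):

    % Usual CBF inequality with a linear class-K function of slope theta > 0:
    \dot{h}(x) \ge -\theta\, h(x)
    % Imposed instead as an equality, with theta promoted to a state of the
    % augmented system and its rate treated as an extra MPPI-designed input:
    \dot{h}(x) = -\theta\, h(x), \qquad \dot{\theta} = u_\theta
    % MPPI then samples (u, u_theta) jointly; the added cost steers theta so
    % that, on the boundary h(x) = 0, the vector field satisfies Nagumo's
    % condition and trajectories remain in the safe set {x : h(x) >= 0}.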
Authors:Chen Xiong, Zihao Wang, Rui Zhu, Tsung-Yi Ho, Pin-Yu Chen, Jingwei Xiong, Haixu Tang, Lucila Ohno-Machado
Abstract:
Large Language Models (LLMs) have revolutionized Natural Language Processing by excelling at interpreting, reasoning about, and generating human language. However, their reliance on large-scale, often proprietary datasets poses a critical challenge: unauthorized usage of such data can lead to copyright infringement and significant financial harm. Existing dataset-inference methods typically depend on log probabilities to detect suspicious training material, yet many leading LLMs have begun withholding or obfuscating these signals. This reality underscores the pressing need for label-only approaches capable of identifying dataset membership without relying on internal model logits.
We address this gap by introducing CatShift, a label-only dataset-inference framework that capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data. If a suspicious dataset was previously seen by the model, fine-tuning on a portion of it triggers a pronounced post-tuning shift in the model's outputs; conversely, truly novel data elicits more modest changes. By comparing the model's output shifts for a suspicious dataset against those for a known non-member validation set, we statistically determine whether the suspicious set is likely to have been part of the model's original training corpus. Extensive experiments on both open-source and API-based LLMs validate CatShift's effectiveness in logit-inaccessible settings, offering a robust and practical solution for safeguarding proprietary data.
中文:CatShift是一种仅依赖标签的框架,通过分析微调后模型输出的变化来检测未经授权的数据集使用,无需内部模型数据即可提供可靠解决方案。
English: CatShift is a label-only framework that detects unauthorized dataset usage in LLMs by analyzing output shifts after fine-tuning, providing a robust solution without needing internal model data.
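A label-only sketch of the decision rule (the token-overlap shift metric and the rank test are assumptions; the paper's statistics may differ):

    from scipy.stats import mannwhitneyu

    def output_shift(before: str, after: str) -> float:
        """Any logit-free divergence works; Jaccard distance over tokens is
        one simple choice."""
        b, a = set(before.split()), set(after.split())
        return 1.0 - len(b & a) / max(1, len(b | a))

    def likely_member(suspect_pairs, nonmember_pairs, alpha=0.05) -> bool:
        """Each list holds (output_before_finetune, output_after_finetune)
        strings for held-out prompts. Data the model has already seen should
        shift MORE after fine-tuning (catastrophic forgetting)."""
        s = [output_shift(b, a) for b, a in suspect_pairs]
        c = [output_shift(b, a) for b, a in nonmember_pairs]
        _, p = mannwhitneyu(s, c, alternative="greater")
        return p < alpha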
Authors:Chaoyi Zhu, Zaitang Li, Renyi Yang, Robert Birke, Pin-Yu Chen, Tsung-Yi Ho, Lydia Y. Chen
Abstract:
Watermarking has become one of the pivotal solutions for tracing and verifying the origin of synthetic images generated by artificial intelligence models, but it is not free of risks. Recent studies demonstrate the capability to forge watermarks from a target image onto cover images via adversarial optimization without knowledge of the target generative model and watermark schemes. In this paper, we uncover a greater risk: an optimization-free and universal watermark forgery that harnesses existing regenerative diffusion models. Our proposed forgery attack, PnP (Plug-and-Plant), seamlessly extracts and integrates the target watermark via regenerating the image, without needing any additional optimization routine. It allows for universal watermark forgery that works independently of the target image's origin or the watermarking model used. We explore the watermarked latent extracted from the target image and the visual-textual context of cover images as priors to guide sampling of the regenerative process. Extensive evaluation on 24 scenarios of model-data-watermark combinations demonstrates that PnP can successfully forge the watermark (up to 100% detectability and user attribution) while maintaining the best visual perception. By bypassing model retraining and enabling adaptability to any image, our approach significantly broadens the scope of forgery attacks, presenting a greater challenge to the security of current watermarking techniques for diffusion models and the authority of watermarking schemes in synthetic data generation and governance.
Chinese: 本文提出PnP,一种无需优化的通用水印伪造攻击,利用再生扩散模型从目标图像中提取并植入水印到载体图像上,绕过了安全措施,对现有水印技术的有效性构成挑战。
English: This paper introduces PnP, an optimization-free universal watermark forgery attack that leverages regenerative diffusion models to extract and implant watermarks from target images onto cover images, bypassing security measures and challenging current watermarking techniques' effectiveness.
Authors:Matteo Nerini, Bruno Clerckx
Abstract:
Future wireless systems, known as gigantic multiple-input multiple-output (MIMO), are expected to enhance performance by significantly increasing the number of antennas, e.g., to a few thousand. To enable gigantic MIMO to overcome the scalability limitations of digital architectures, microwave linear analog computers (MiLACs) have recently emerged. A MiLAC is a multiport microwave network that processes input microwave signals entirely in the analog domain, thereby reducing the hardware costs and computational complexity of gigantic MIMO architectures. In this paper, we investigate the fundamental limits on the rate achievable in MiLAC-aided MIMO systems. We model a MIMO system employing MiLAC-aided beamforming at the transmitter and receiver, and formulate the rate maximization problem to optimize the microwave networks of the MiLACs, which are assumed lossless and reciprocal for practical reasons. Under the lossless and reciprocal constraints, we derive a global optimal solution for the microwave networks of the MiLACs in closed form. In addition, we also characterize in closed form the capacity of MIMO systems operating MiLAC-aided beamforming. Our theoretical analysis, confirmed by numerical simulations, reveals that MiLAC-aided beamforming achieves the same capacity as digital beamforming, while significantly reducing the number of radio frequency (RF) chains, analog-to-digital converter (ADC)/digital-to-analog converter (DAC) resolution requirements, and computational complexity.
中文: 在巨型MIMO系统中,采用微波线性模拟计算机辅助的波束成形能在保持数字波束成形容量的同时,大幅降低硬件需求和计算复杂度。
English: MiLAC-aided beamforming in gigantic MIMO systems achieves digital beamforming capacity while drastically reducing hardware requirements and computational complexity.
Authors:Xiucheng Wang, Honggang Jia, Nan Cheng
Abstract:
In this paper, a novel semantic communication framework empowered by generative artificial intelligence (GAI) is proposed to enhance robustness against both channel noise and shifts in the transmission data distribution. A theoretical foundation is established using stochastic differential equations (SDEs), from which a closed-form mapping between any signal-to-noise ratio (SNR) and the optimal denoising timestep is derived. Moreover, to address distribution mismatch, a mathematical scaling method is introduced to align received semantic features with the training distribution of the GAI. Built on this theoretical foundation, a latent diffusion model (LDM)-based semantic communication framework is proposed that combines a variational autoencoder for semantic feature extraction with a pretrained diffusion model for denoising. The proposed system is a training-free framework that supports zero-shot generalization and achieves superior performance under low-SNR and out-of-distribution conditions, offering a scalable and robust solution for future 6G semantic communication systems. Experimental results demonstrate that the proposed semantic communication framework achieves state-of-the-art performance in both pixel-level accuracy and semantic perceptual quality, consistently outperforming baselines across a wide range of SNRs and data distributions without any fine-tuning or post-training.
Chinese: 本文提出了一种生成式人工智能赋能的语义通信框架,利用潜在扩散模型在信道噪声和数据分布偏移下实现无需训练的鲁棒性能,为6G系统展现了卓越的零样本泛化能力。
English: This paper introduces a generative AI-powered semantic communication framework that uses a latent diffusion model to achieve robust, training-free performance against channel noise and data distribution shifts, demonstrating superior zero-shot generalization for 6G systems.
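The SNR-to-timestep idea can be pictured with a discrete noise schedule (the paper derives a closed form from the SDE view; the lookup below is an illustrative stand-in under an assumed DDPM-style schedule):

    import numpy as np

    def snr_to_timestep(channel_snr_db, alpha_bars):
        """Pick the denoising timestep whose latent SNR,
        alpha_bar / (1 - alpha_bar), matches the channel SNR, so the
        pretrained diffusion model treats channel noise as schedule noise."""
        snr = 10.0 ** (channel_snr_db / 10.0)
        latent_snr = alpha_bars / (1.0 - alpha_bars)
        return int(np.argmin(np.abs(latent_snr - snr)))

    betas = np.linspace(1e-4, 0.02, 1000)        # assumed noise schedule
    alpha_bars = np.cumprod(1.0 - betas)
    t_start = snr_to_timestep(10.0, alpha_bars)  # begin denoising from t_start

Noisier channels (lower SNR) map to later timesteps, so the receiver simply runs more of the pretrained reverse process with no retraining.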
Authors:Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang
Abstract:
Recent advancements in large language models (LLMs) have underscored their vulnerability to safety-alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces the harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in building durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.
中文摘要:本研究发现上游安全对齐数据集与下游微调任务间的高表征相似性会显著削弱模型安全护栏,而低相似性则能提升模型鲁棒性并使有害性评分降低达10.33%,凸显了上游数据集设计对防范越狱攻击的关键作用。
English Summary: This study reveals that high representation similarity between upstream safety-alignment datasets and downstream fine-tuning tasks significantly weakens model safety guardrails, while low similarity enhances robustness and reduces harmfulness scores by up to 10.33%, emphasizing the critical role of upstream dataset design in preventing jailbreaks.
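A sketch of the screening step a fine-tuning provider could apply (the exact similarity measure is an assumption; mean-embedding cosine similarity is one simple instance of representation similarity):

    import numpy as np

    def dataset_similarity(upstream_embs, downstream_embs):
        """Embeddings from the same encoder, shape (num_examples, dim).
        Per the findings above, LOWER similarity between the alignment data
        and the fine-tuning task predicts more durable guardrails."""
        u = upstream_embs.mean(axis=0)
        d = downstream_embs.mean(axis=0)
        return float(u @ d / (np.linalg.norm(u) * np.linalg.norm(d)))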
Authors:Yangyang Zhong, Ji Qi, Yuan Yao, Pengxin Luo, Yunfeng Yan, Donglian Qi, Zhiyuan Liu, Tat-Seng Chua
Abstract:
Despite recent progress on the short-video Text-Visual Question Answering (ViteVQA) task - largely driven by benchmarks such as M4-ViteVQA - existing datasets still suffer from limited video duration and narrow evaluation scopes, making it difficult to adequately assess the growing capabilities of powerful multimodal large language models (MLLMs). To address these limitations, we introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: 1) Cross-domain long-video coverage: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. 2) A three-stage evaluation framework: "Text Needle-in-Haystack -> Temporal Grounding -> Text Dynamics Captioning". 3) High-quality fine-grained annotations: Containing over 5,000 question-answer pairs with detailed semantic labeling. Furthermore, we propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on video-text data. Extensive experiments on multiple public datasets as well as TextVidBench demonstrate that our new benchmark presents significant challenges to existing models, while our proposed method offers valuable insights into improving long-video scene text understanding capabilities.
中文: 针对现有短视频文本问答数据集时长和评估范围有限的问题,我们推出了首个长视频文本问答基准TextVidBench,它包含跨领域视频、三阶段评估框架和精细标注,并提出提升模型性能的高效方法。
English: Existing short-video Text-VQA datasets are limited in duration and scope, so we introduce TextVidBench, the first long-video text QA benchmark with cross-domain videos, a three-stage evaluation framework, and fine-grained annotations, along with an efficient method to enhance model performance.
Authors:Tianxu Wang, Zhuofan Zhang, Ziyu Zhu, Yue Fan, Jing Xiong, Pengxiang Li, Xiaojian Ma, Qing Li
Abstract:
3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require more comprehensive spatial reasoning, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.57% accuracy on space-level tasks and 33.94% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.
中文: 该研究提出了Anywhere3D-Bench这一综合性3D视觉定位基准,发现现有模型在空间层级和部件层级的任务上表现欠佳,主要受限于空间推理能力和细粒度感知能力的不足。
English: The study introduces Anywhere3D-Bench, a comprehensive 3D visual grounding benchmark, revealing that current models struggle significantly with space-level and part-level tasks due to insufficient spatial reasoning and fine-grained perception capabilities.
Authors:Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Yanjie Liang, Zuming Huang, Haozhe Wang, Jun Huang, Ling Chen, Wei Chu, Yuan Qi
Abstract:
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
中文摘要:layoutRL框架通过端到端的强化学习方法,结合复合奖励机制有效解决了传统文档解析的局限性,其Infinity-Parser在多项基准测试中实现了最先进的准确性和结构保真度。
English Summary: The layoutRL framework introduces an end-to-end reinforcement learning approach with a composite reward system to overcome traditional document parsing limitations, achieving state-of-the-art accuracy and structural fidelity across multiple benchmarks through its Infinity-Parser implementation.
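A hedged sketch of such a composite reward (the weights, the difflib approximation of normalized edit distance, and the reading-order term are illustrative assumptions, not the released implementation):

    from difflib import SequenceMatcher

    def edit_similarity(pred: str, gold: str) -> float:
        # Stand-in for 1 - normalized edit distance.
        return SequenceMatcher(None, pred, gold).ratio()

    def paragraph_count_acc(pred: str, gold: str) -> float:
        p = pred.count("\n\n") + 1
        g = gold.count("\n\n") + 1
        return 1.0 - abs(p - g) / max(p, g)

    def layout_reward(pred, gold, reading_order_score, w=(0.5, 0.25, 0.25)):
        """reading_order_score in [0, 1]: e.g., the fraction of block pairs
        emitted in the correct order."""
        return (w[0] * edit_similarity(pred, gold)
                + w[1] * paragraph_count_acc(pred, gold)
                + w[2] * reading_order_score)

Combining the three terms rewards parses that are simultaneously textually accurate, structurally segmented, and correctly ordered, which is the layout-awareness the framework optimizes for.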
Authors:Kunyu Wang, Xueyang Fu, Chengzhi Cao, Chengjie Ge, Wei Zhai, Zheng-Jun Zha
Abstract:
Current image de-raining methods primarily learn from a limited dataset, leading to inadequate performance in varied real-world rainy conditions. To tackle this, we introduce a new framework that enables networks to progressively expand their de-raining knowledge base by tapping into a growing pool of datasets, significantly boosting their adaptability. Drawing inspiration from the human brain's ability to continuously absorb and generalize from ongoing experiences, our approach borrows the mechanism of the complementary learning system. Specifically, we first deploy Generative Adversarial Networks (GANs) to capture and retain the unique features of new data, mirroring the hippocampus's role in learning and memory. Then, the de-raining network is trained with both existing and GAN-synthesized data, mimicking the process of hippocampal replay and interleaved learning. Furthermore, we employ knowledge distillation with the replayed data to replicate the synergy between the neocortex's activity patterns triggered by hippocampal replays and the pre-existing neocortical knowledge. This comprehensive framework empowers the de-raining network to amass knowledge from various datasets, continually enhancing its performance on previously unseen rainy scenes. Our testing on three benchmark de-raining networks confirms the framework's effectiveness. It not only facilitates continuous knowledge accumulation across six datasets but also surpasses state-of-the-art methods in generalizing to new real-world scenarios.
Chinese: 本研究提出了一种新颖的图像去雨框架,通过模拟大脑互补学习系统,从多个数据集中逐步学习,显著提升了网络在不同真实雨景中的适应性和去雨效果。
English: This study introduces a novel framework that enhances image de-raining by progressively learning from multiple datasets, mimicking the brain's complementary learning system to improve adaptability and performance in diverse real-world conditions.
Authors:Kunyu Wang, Xueyang Fu, Xin Lu, Chengjie Ge, Chengzhi Cao, Wei Zhai, Zheng-Jun Zha
Abstract:
Continual test-time adaptive object detection (CTTA-OD) aims to online adapt a source pre-trained detector to ever-changing environments during inference under continuous domain shifts. Most existing CTTA-OD methods prioritize effectiveness while overlooking computational efficiency, which is crucial for resource-constrained scenarios. In this paper, we propose an efficient CTTA-OD method via pruning. Our motivation stems from the observation that not all learned source features are beneficial; certain domain-sensitive feature channels can adversely affect target domain performance. Inspired by this, we introduce a sensitivity-guided channel pruning strategy that quantifies each channel based on its sensitivity to domain discrepancies at both image and instance levels. We apply weighted sparsity regularization to selectively suppress and prune these sensitive channels, focusing adaptation efforts on invariant ones. Additionally, we introduce a stochastic channel reactivation mechanism to restore pruned channels, enabling recovery of potentially useful features and mitigating the risks of early pruning. Extensive experiments on three benchmarks show that our method achieves superior adaptation performance while reducing computational overhead by 12% in FLOPs compared to the recent SOTA method.
Chinese: 本文提出了一种高效的持续测试时自适应目标检测方法,通过敏感度引导的通道剪枝和随机激活机制,在保持优异跨域适应性能的同时,将计算开销降低了12%。
English: This paper introduces an efficient continual test-time adaptive object detection method that uses sensitivity-guided channel pruning and stochastic reactivation to reduce computational costs by 12% while maintaining superior performance across domain shifts.
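A minimal PyTorch sketch of the sensitivity-guided penalty (the sensitivity statistic and coefficient are assumptions; the paper additionally scores instance-level discrepancy and adds stochastic reactivation, omitted here):

    import torch

    def channel_sensitivity(src_feats, tgt_feats):
        """Per-channel domain gap from (N, C, H, W) activations; channels
        whose mean activation drifts most across domains are 'sensitive'."""
        return (src_feats.mean(dim=(0, 2, 3)) - tgt_feats.mean(dim=(0, 2, 3))).abs()

    def weighted_sparsity_penalty(bn_gamma, sensitivity, lam=1e-4):
        """L1 on BatchNorm scales, weighted so sensitive channels are pushed
        toward zero scale and can then be pruned."""
        return lam * (sensitivity.detach() * bn_gamma.abs()).sum()

Adding this penalty to the adaptation loss concentrates updates on domain-invariant channels while the sensitive ones shrink away, which is where the FLOPs savings come from.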
Authors:Kunyu Wang, Xueyang Fu, Yuanfei Bao, Chengjie Ge, Chengzhi Cao, Wei Zhai, Zheng-Jun Zha
Abstract:
Continual Test-Time Adaptation (CTTA) aims to online adapt a pre-trained model to changing environments during inference. Most existing methods focus on exploiting target data, while overlooking another crucial source of information, the pre-trained weights, which encode underutilized domain-invariant priors. This paper takes the geometric attributes of pre-trained weights as a starting point, systematically analyzing three key components: magnitude, absolute angle, and pairwise angular structure. We find that the pairwise angular structure remains stable across diverse corrupted domains and encodes domain-invariant semantic information, suggesting it should be preserved during adaptation. Based on this insight, we propose PAID (Pairwise Angular-Invariant Decomposition), a prior-driven CTTA method that decomposes weight into magnitude and direction, and introduces a learnable orthogonal matrix via Householder reflections to globally rotate direction while preserving the pairwise angular structure. During adaptation, only the magnitudes and the orthogonal matrices are updated. PAID achieves consistent improvements over recent SOTA methods on four widely used CTTA benchmarks, demonstrating that preserving pairwise angular structure offers a simple yet effective principle for CTTA.
中文: 本文提出PAID方法,通过保持预训练权重的成对角度结构来维护领域不变的语义信息,在持续测试时适应过程中仅更新权重幅度和正交矩阵,在多个基准测试中实现了最先进的性能。
English: This paper introduces PAID, a method for Continual Test-Time Adaptation that preserves the pairwise angular structure of pre-trained weights to maintain domain-invariant semantic information, achieving state-of-the-art performance across multiple benchmarks by updating only magnitudes and orthogonal matrices during adaptation.
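A compact sketch of the reparameterization (shapes and the number of reflections are illustrative assumptions): each weight row splits into magnitude and direction, and one shared orthogonal matrix built from Householder reflections rotates all directions together, which leaves every pairwise angle unchanged since (Q d_i) . (Q d_j) = d_i . d_j for orthogonal Q.

    import torch

    def householder_orthogonal(vs):
        """Product of Householder reflections H_v = I - 2 v v^T / ||v||^2,
        which is always an orthogonal matrix. vs: (k, d) learnable vectors."""
        Q = torch.eye(vs.shape[1], dtype=vs.dtype)
        for v in vs:
            v = v / v.norm()
            Q = Q - 2.0 * torch.outer(v, v @ Q)
        return Q

    def paid_weight(W0, log_mag, vs):
        """W0: frozen pre-trained weight (out, in). Only log_mag and vs are
        updated during adaptation; the shared rotation preserves the pairwise
        angular structure of the rows."""
        dirs = W0 / W0.norm(dim=1, keepdim=True)
        return log_mag.exp().unsqueeze(1) * (dirs @ householder_orthogonal(vs))

    W0 = torch.randn(8, 16)
    log_mag = W0.norm(dim=1).log().requires_grad_()
    vs = torch.randn(4, 16, requires_grad=True)
    W = paid_weight(W0, log_mag, vs)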
Authors:Yijun Yang, Zhao-Yang Wang, Qiuping Liu, Shuwen Sun, Kang Wang, Rama Chellappa, Zongwei Zhou, Alan Yuille, Lei Zhu, Yu-Dong Zhang, Jieneng Chen
Abstract:
Providing effective treatment and making informed clinical decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that visually predicts future disease states based on clinical decisions. MeWM comprises (i) vision-language models to serve as policy models, and (ii) tumor generative models as dynamics models. The policy model generates action plans, such as clinical treatments, while the dynamics model simulates tumor progression or regression under given treatment conditions. Building on this, we propose the inverse dynamics model that applies survival analysis to the simulated post-treatment tumor, enabling the evaluation of treatment efficacy and the selection of the optimal clinical action plan. As a result, the proposed MeWM simulates disease dynamics by synthesizing post-treatment tumors, with state-of-the-art specificity in Turing tests evaluated by radiologists. Simultaneously, its inverse dynamics model outperforms medical-specialized GPTs in optimizing individualized treatment protocols across all metrics. Notably, MeWM improves clinical decision-making for interventional physicians, boosting the F1-score in selecting the optimal TACE protocol by 13%, paving the way for future integration of medical world models as second readers.
中文摘要:医学世界模型(MeWM)通过生成模型可视化模拟未来疾病状态和治疗效果,其逆向动态模型在优化个性化治疗方案方面表现卓越,显著提升了介入医生的临床决策准确率。
English Summary: The Medical World Model (MeWM) visually simulates future disease states and treatment outcomes using generative models, enhancing clinical decision-making by optimizing treatment protocols and demonstrating superior performance in medical evaluations.
Authors:Yifei Zhou, Sergey Levine, Jason Weston, Xian Li, Sainbayar Sukhbaatar
Abstract:
Large language models are quickly becoming the foundation for intelligent agents that are capable of using tools. However, training such agents is challenging because it requires human creation and annotation of a diverse set of tasks, tools, and evaluation criteria. In this paper, we propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The agent first plays the role of challenger and generates a task after interacting with the given tools. The tasks take the form of a novel general class of problems termed Code-as-Task, which are defined by an instruction, a verification function, and solution and failure cases that serve as tests, allowing only high-quality tasks to be retained. The agent then takes an executor role and trains on those tasks with reinforcement learning, using the evaluation feedback as a reward. Evaluation on two existing multi-turn tool-use agent benchmarks, M3ToolEval and TauBench, shows the Self-Challenging framework achieves more than a two-fold improvement for Llama-3.1-8B-Instruct, despite using only self-generated training data.
中文: 自我挑战框架让大语言模型自主生成高质量任务并进行训练,仅使用自生成数据即可实现性能的显著提升。
English: The Self-Challenging framework enables large language models to autonomously generate and train on high-quality tasks, achieving significant performance improvements without human-annotated data.
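A sketch of the Code-as-Task structure described above (field names are assumptions): a generated task is kept only if its verifier accepts the reference solution and rejects every failure case.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class CodeAsTask:
        instruction: str
        verify: Callable[[str], bool]   # verification function over an answer
        solution: str                   # known-good answer
        failure_cases: List[str]        # answers the verifier must reject

    def is_high_quality(task: CodeAsTask) -> bool:
        return task.verify(task.solution) and not any(
            task.verify(f) for f in task.failure_cases)

    task = CodeAsTask(
        instruction="Return the weekday for 2024-01-01 using the date tool.",
        verify=lambda ans: ans.strip().lower() == "monday",
        solution="Monday",
        failure_cases=["Sunday", "I don't know"],
    )
    assert is_high_quality(task)

This self-test filter is what lets the challenger produce training data without human annotation: degenerate verifiers (accepting everything or nothing) are discarded automatically.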
Authors:Amir Hussein, Sameer Khurana, Gordon Wichern, Francois G. Germain, Jonathan Le Roux
Abstract:
Effective speech representations for spoken language models must balance semantic relevance with acoustic fidelity for high-quality reconstruction. However, existing approaches struggle to achieve both simultaneously. To address this, we introduce Hierarchical Acoustic and Semantic Representation Disentanglement (HASRD, pronounced 'hazard'), a framework that factorizes self-supervised learning representations into discrete semantic and acoustic tokens. HASRD assigns the semantic representation to the first codebook, while encoding acoustic residuals in subsequent codebooks. This preserves ASR performance while achieving high-quality reconstruction. Additionally, we enhance HASRD's encoder efficiency, improving ASR performance without compromising reconstruction quality. Compared to SpeechTokenizer, HASRD achieves a 44% relative WER improvement, superior reconstruction quality, and 2x lower bitrate, demonstrating its effectiveness in disentangling acoustic and semantic information.
中文: 提出的HASRD框架成功分离了语音表征中的语义和声学信息,相比现有方法在语音识别性能、重建质量和比特率方面均实现了显著提升。
English: The proposed HASRD framework effectively disentangles semantic and acoustic information in speech representations, achieving superior ASR performance, enhanced reconstruction quality, and reduced bitrate compared to existing methods.
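A hedged sketch of the hierarchical tokenization (codebook sizes are assumptions, and the training objectives that make codebook 0 semantic are not shown): the first codebook quantizes the features, and each later codebook quantizes what the previous ones left over.

    import torch

    def quantize(x, codebook):
        """Nearest-neighbor assignment: x (T, D), codebook (K, D)."""
        idx = torch.cdist(x, codebook).argmin(dim=1)
        return codebook[idx], idx

    def hasrd_encode(feats, codebooks):
        """feats: (T, D) SSL features; codebooks[0] carries the semantic
        tokens, codebooks[1:] encode acoustic residuals."""
        tokens, residual = [], feats
        for cb in codebooks:
            q, idx = quantize(residual, cb)
            tokens.append(idx)
            residual = residual - q
        return tokens  # list of (T,) token streams, one per codebook

    codebooks = [torch.randn(256, 32) for _ in range(4)]
    tokens = hasrd_encode(torch.randn(100, 32), codebooks)

An ASR head can then read only tokens[0], while the full stack of streams feeds the decoder for reconstruction, which is how the two goals stop competing.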
Authors:Songtao Jiang, Yan Zhang, Yeying Jin, Zhihang Tang, Yangyang Wu, Yang Feng, Jian Wu, Zuozhu Liu
Abstract:
Medical Vision-Language Models (Med-VLMs) have achieved success across various tasks, yet most existing methods overlook the modality misalignment issue that can lead to untrustworthy responses in clinical settings. In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: 1) Cost-effective generation of high-quality preference data; 2) Capturing nuanced and context-aware preferences for improved alignment. HSCR first leverages the inherent capability of Med-VLMs to generate dispreferred responses with higher sampling probability. By analyzing output logit shifts after visual token dropout, we identify modality-coupled tokens that induce misalignment and derive an implicit alignment reward function. This function guides token replacement with hallucinated ones during decoding, producing high-quality dispreferred data. Furthermore, HSCR introduces a multi-level preference optimization strategy, which extends beyond traditional adjacent-level optimization by incorporating nuanced implicit preferences, leveraging relative quality in dispreferred data to capture subtle alignment cues for more precise and context-aware optimization. Extensive experiments across multiple medical tasks, including Med-VQA, medical image captioning and instruction following, demonstrate that HSCR not only enhances zero-shot performance but also significantly improves modality alignment and trustworthiness with just 2,000 training entries.
中文: 本文提出的分层自对比奖励方法通过视觉标记丢弃生成高质量偏好数据,并采用多级偏好优化策略,有效解决了医学视觉语言模型中的模态失准问题,仅用少量训练数据即可显著提升模型的对齐性和可信度。
English: This paper introduces Hierarchical Self-Contrastive Rewarding (HSCR), a novel method that addresses modality misalignment in Medical Vision-Language Models by generating high-quality preference data through token replacement and implementing multi-level preference optimization, significantly improving alignment and trustworthiness with minimal training data.
Authors:Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
Abstract:
Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This paper proposes an agentic workflow to automate and scale the red-teaming process of LLMs through the Composition-of-Principles (CoP) framework, where human users provide a set of red-teaming principles as instructions to an AI agent to automatically orchestrate effective red-teaming strategies and generate jailbreak prompts. Distinct from existing red-teaming methods, our CoP framework provides a unified and extensible framework to encompass and orchestrate human-provided red-teaming principles to enable the automated discovery of new red-teaming strategies. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.
中文: 本文提出“原则组合”(CoP)框架,通过智能体工作流将人工设定的红队测试原则自动化编排,生成高效越狱提示,在主流大语言模型中暴露出前所未有的安全风险,并将单轮攻击成功率提升高达19倍。
English: This paper introduces the Composition-of-Principles (CoP) framework, an agentic workflow that automates red-teaming for Large Language Models by orchestrating human-provided principles to generate effective jailbreak prompts, revealing unprecedented safety risks and significantly boosting attack success rates.
Authors:Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou
Abstract:
Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.
中文摘要:提出的MoCa框架通过模态感知持续预训练和异构对比微调,将预训练视觉语言模型转化为双向多模态嵌入模型,在解决注意力机制、数据可扩展性和训练多样性限制的同时,在多个基准测试中实现了最先进的性能。
English Summary: The proposed MoCa framework transforms pre-trained Vision Language Models into bidirectional multimodal embedding models through modality-aware continual pre-training and heterogeneous contrastive fine-tuning, achieving state-of-the-art performance on benchmarks while addressing limitations in attention mechanisms, data scalability, and training diversity.
Authors:Zhiyu Zhao, Haoxuan Li, Haifeng Zhang, Jun Wang, Francesco Faccio, Jürgen Schmidhuber, Mengyue Yang
Abstract:
When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This creates a problem: when building a world model, even subtle shifts in policy or environment state can alter the very causal mechanisms that are observed. In this work, we introduce the Meta-Causal Graph as a world model, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta-Causal Graph is composed of multiple causal subgraphs, each triggered by a meta state in the latent state space. Building on this representation, we introduce a Causality-Seeking Agent whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships via a curiosity-driven intervention policy, and (3) iteratively refine the Meta-Causal Graph through ongoing curiosity-driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts.
中文: 本文提出的元因果图作为世界模型,通过潜在元状态编码因果结构的动态变化,使因果寻求智能体能够通过好奇心驱动的探索识别并优化这些结构,实验验证了其鲁棒性和泛化能力。
English: The Meta-Causal Graph is introduced as a world model that represents shifting causal structures through latent meta states, enabling a Causality-Seeking Agent to dynamically identify and refine these structures via curiosity-driven exploration, with experiments confirming its robustness and generalization.
Authors:Shengcai Liu, Hui Ou-yang, Zhiyuan Wang, Cheng Chen, Qijun Cai, Yew-Soon Ong, Ke Tang
Abstract:
Learning the structure of Bayesian networks (BNs) from data is challenging, especially for datasets involving a large number of variables. The recently proposed divide-and-conquer (D&D) strategies present a promising approach for learning large BNs. However, they still face a main issue of unstable learning accuracy across subproblems. In this work, we introduce the idea of employing a structure learning ensemble (SLE), which combines multiple BN structure learning algorithms, to consistently achieve high learning accuracy. We further propose an automatic approach called Auto-SLE for learning near-optimal SLEs, addressing the challenge of manually designing high-quality SLEs. The learned SLE is then integrated into a D&D method. Extensive experiments firmly show the superiority of our method over D&D methods with a single BN structure learning algorithm in learning large BNs, achieving accuracy improvements usually of 30%-225% on datasets involving 10,000 variables. Furthermore, our method generalizes well to datasets with many more (e.g., 30,000) variables and different network characteristics than those present in the training data for learning the SLE. These results indicate the significant potential of employing (automatic learning of) SLEs for scalable BN structure learning.
Chinese: 本研究提出了一种结构学习集成(SLE)方法,通过自动化的Auto-SLE技术显著提升大规模贝叶斯网络学习的准确性和稳定性,在包含上万个变量的数据集上比现有方法精度提高30%至225%,并能良好泛化至更大规模数据。
English: This work introduces a structure learning ensemble (SLE) method to enhance the accuracy and stability of learning large Bayesian networks, with an automated approach called Auto-SLE that significantly outperforms existing methods by 30% to 225% in accuracy on datasets with up to 30,000 variables.
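An illustrative sketch of a structure-learning ensemble: run several BN learners on the same subproblem and keep edges that a weighted majority agrees on. The voting rule is an assumption; Auto-SLE learns the ensemble configuration automatically rather than using a fixed threshold.

    from collections import Counter

    def ensemble_edges(edge_sets, weights, threshold=0.5):
        """edge_sets: list of sets of directed edges (u, v), one per learner."""
        votes = Counter()
        for edges, w in zip(edge_sets, weights):
            for e in edges:
                votes[e] += w
        total = sum(weights)
        return {e for e, v in votes.items() if v / total >= threshold}

    learners = [{("A", "B"), ("B", "C")}, {("A", "B")}, {("A", "B"), ("C", "B")}]
    consensus = ensemble_edges(learners, weights=[1.0, 1.0, 1.0])  # {("A", "B")}

Aggregating across learners smooths out the per-subproblem instability that a single algorithm exhibits inside the D&D pipeline.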
Authors:Cheng Zou, Senlin Cheng, Bolei Xu, Dandan Zheng, Xiaobo Li, Jingdong Chen, Ming Yang
Abstract:
Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be preserved well in all frames. Naively applying image-based try-on methods frame by frame yields poor results due to severe inconsistency. The few existing diffusion-based video try-on methods converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it to the video try-on task, which brings improvements but leaves inconsistency problems. In this paper, we propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task, unlike previous methods. In this way, we start from a video generation problem instead of an image-based try-on problem, which has better spatial-temporal consistency from the beginning. Specifically, we first build a video inpainting framework based on a Diffusion Transformer with full 3D spatial-temporal attention, and then progressively adapt it for video garment inpainting with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt, with good spatial-temporal consistency. Finally, as in other try-on methods, a garment condition is added to the model to ensure the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.
中文: 视频虚拟试穿被构建为条件视频修复任务,采用具有全三维时空注意力的扩散变换器,相比现有方法在时空一致性和服装细节保持方面表现更优。
English: Video virtual try-on is formulated as a conditional video inpainting task using a Diffusion Transformer with full 3D spatial-temporal attention, achieving superior spatial-temporal consistency and garment detail preservation compared to previous methods.
Authors:Jing Bi, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu
Abstract:
This paper introduces ACTLLM (Action Consistency Tuned Large Language Model), a novel approach for robot manipulation in dynamic environments. Traditional vision-based systems often struggle to learn visual representations that excel in both task execution and spatial reasoning, thereby limiting their adaptability in dynamic environments. ACTLLM addresses these challenges by harnessing language to craft structured scene descriptors, providing a uniform interface for both spatial understanding and task performance through flexible language instructions. Moreover, we introduce a novel action consistency constraint that aligns visual perception with corresponding actions, thereby enhancing the learning of actionable visual representations. Additionally, we have reformulated the Markov decision process for manipulation tasks into a multi-turn visual dialogue framework. This approach enables the modeling of long-term task execution with enhanced contextual relevance derived from the history of task execution. During our evaluation, ACTLLM excels in diverse scenarios, proving its effectiveness on challenging vision-based robot manipulation tasks.
中文: 本文提出的ACTLLM方法通过语言构建结构化场景描述符,引入动作一致性约束对齐视觉感知与动作,并采用多轮视觉对话框架,有效提升了机器人在动态环境中的操作能力。
English: This paper presents ACTLLM, a method that uses language-based scene descriptors and an action consistency constraint to improve robot manipulation in dynamic environments by aligning visual perception with actions and employing a multi-turn visual dialogue framework for enhanced task execution.
Authors:Lingling Cai, Kang Zhao, Hangjie Yuan, Xiang Wang, Yingya Zhang, Kejie Huang
Abstract:
The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or finetuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. To be more specific, we observe that editing and sampling can be unified under the continuous flow perspective. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) -- a theoretically unbiased estimation of DFV -- and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85% memory reduction on Video DiTs compared to attention-engineering-based editing methods. Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.
中文摘要:DFVEdit是一种专为视频扩散变换器设计的高效零样本视频编辑方法,通过直接流变换免除了注意力修改和微调需求,在实现20倍推理加速和85%内存节省的同时,保持了最先进的编辑性能。
English Summary: DFVEdit is an efficient zero-shot video editing method for Video DiTs that eliminates attention modification and fine-tuning through direct flow transformation, achieving 20x faster inference and 85% memory reduction while maintaining state-of-the-art performance.
Authors:Chaojun Ni, Jie Li, Haoyun Li, Hengyu Liu, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Boyuan Wang, Chenxin Li, Guan Huang, Wenjun Mei
Abstract:
Interactive 3D scene generation from a single image has gained significant attention due to its potential to create immersive virtual worlds. However, a key challenge in current 3D generation methods is their limited explorability: they cannot render high-quality images during larger maneuvers beyond the original viewpoint, particularly when attempting to move forward into unseen areas. To address this challenge, we propose WonderFree, the first model that enables users to interactively generate 3D worlds with the freedom to explore from arbitrary angles and directions. Specifically, we decouple this challenge into two key subproblems: novel view quality, which addresses visual artifacts and floating issues in novel views, and cross-view consistency, which ensures spatial consistency across different viewpoints. To enhance rendering quality in novel views, we introduce WorldRestorer, a data-driven video restoration model designed to eliminate floaters and artifacts. In addition, a data collection pipeline is presented to automatically gather training data for WorldRestorer, ensuring it can handle scenes with the varying styles needed for 3D scene generation. Furthermore, to improve cross-view consistency, we propose ConsistView, a multi-view joint restoration mechanism that simultaneously restores multiple perspectives while maintaining spatiotemporal coherence. Experimental results demonstrate that WonderFree not only enhances rendering quality across diverse viewpoints but also significantly improves global coherence and consistency. These improvements are confirmed by CLIP-based metrics and a user study showing a 77.20% preference for WonderFree over WonderWorld, enabling a seamless and immersive 3D exploration experience. The code, model, and data will be publicly available.
中文:WonderFree作为首个支持任意角度交互探索的3D场景生成模型,通过WorldRestorer消除新视角伪影提升渲染质量,并采用ConsistView机制保持多视角时空一致性,显著提升了沉浸式探索体验。
English: WonderFree is a pioneering model that enables interactive 3D scene exploration from any angle by addressing novel view quality with WorldRestorer to eliminate artifacts and ensuring cross-view consistency through ConsistView for coherent multi-view restoration.
Authors:Renyi Zhong, Yintong Huo, Wenwei Gu, Jinxi Kuang, Zhihan Jiang, Guangba Yu, Yichen Li, David Lo, Michael R. Lyu
Abstract:
Comments within code serve as a crucial foundation for software documentation, facilitating developers to communicate and understand the code effectively. However, code-comment inconsistency (CCI) can negatively affect software development, testing, and maintenance. Recent efforts to mitigate this issue have emerged, but existing studies often suffer from inaccurate datasets and inadequate solutions, weakening their practical effectiveness. In this study, we first conduct a quantitative analysis of existing datasets, revealing that a substantial portion of the sampled data is mislabeled. To address these data limitations, we introduce CCIBench, a refined dataset comprising high-quality data, to support the training and evaluation of method-level CCI methods. Furthermore, we present an innovative end-to-end LLM-based framework, CCISolver, designed to improve code quality by identifying and rectifying CCIs. Comprehensive evaluations demonstrate CCISolver's superior performance. For detection, it establishes a new state-of-the-art with an F1-score of 89.54%. In the fixing task, it achieves a remarkable 18.84% relative improvement in GLEU score over the strongest baseline. This superiority is confirmed by human evaluation, where CCISolver's fixing success rate of 0.6533 significantly surpasses existing methods. Critically, in a practical end-to-end setting, CCISolver's innovative architecture is approximately 36% faster for inference than the baseline model, underscoring its scalability and real-world applicability.
中文: 本研究提出了高质量数据集CCIBench和创新性LLM框架CCISolver,在检测和修复代码注释不一致方面显著优于现有方法,同时展现出更快的速度和更强的实际应用价值。
English: This study introduces CCIBench, a high-quality dataset, and CCISolver, an innovative LLM-based framework that significantly outperforms existing methods in detecting and fixing code-comment inconsistencies while demonstrating superior speed and practical applicability.
Authors:Jihao Gu, Qihang Ai, Yingyao Wang, Pi Bu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Ziming Wang, Yingxiu Zhao, Ming-Liang Zhang, Jun Song, Yuning Jiang, Bo Zheng
Abstract:
Vision-language model-based mobile agents have gained the ability to not only understand complex instructions and mobile screenshots, but also optimize their action outputs via thinking and reasoning, benefiting from reinforcement learning, such as Group Relative Policy Optimization (GRPO). However, existing research centers on offline reinforcement learning training or online optimization using action-level rewards, which limits the agent's dynamic interaction with the environment. This often results in agents settling into local optima, thereby weakening their ability for exploration and error action correction. To address these challenges, we introduce an approach called Mobile-R1, which employs interactive multi-turn reinforcement learning with task-level rewards for mobile agents. Our training framework consists of three stages: initial format finetuning, single-step online training via action-level reward, followed by online training via task-level reward based on multi-turn trajectories. This strategy is designed to enhance the exploration and error correction capabilities of Mobile-R1, leading to significant performance improvements. Moreover, we have collected a dataset covering 28 Chinese applications with 24,521 high-quality manual annotations and established a new benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weight, and codes: https://mobile-r1.github.io/Mobile-R1/.
中文: Mobile-R1采用基于任务级奖励的三阶段交互式强化学习框架,显著提升移动智能体的探索与纠错能力,并通过开源数据集和基准测试验证其优越性能。
English: Mobile-R1 introduces a three-stage interactive reinforcement learning framework using task-level rewards to enhance mobile agents' exploration and error correction, outperforming existing methods and supported by a new open-source dataset and benchmark.
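For readers unfamiliar with GRPO, the group-relative advantage at its core is easy to state. The sketch below assumes scalar task-level rewards, one per rollout of the same task; the multi-turn rollout and policy-update machinery are omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages as used in GRPO-style training.

    rewards: (G,) task-level rewards for G rollouts of the same task.
    Each rollout's advantage is its reward standardized within the group,
    so no learned value critic is needed.
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)

# Example: four multi-turn trajectories attempted for one task
adv = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.5]))
```

The task-level variant described in the abstract differs from action-level training only in what the reward scores: the whole multi-turn trajectory rather than each individual action.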
Authors:Zhengxiang Huang, Chaoyue Niu, Zhaode Wang, Jiarui Xue, Hanming Zhang, Yugang Wang, Zewei Xin, Xiaotang Jiang, Chengfei Lv, Fan Wu, Guihai Chen
Abstract:
As the demand for on-device Large Language Model (LLM) inference grows, energy efficiency has become a major concern, especially for battery-limited mobile devices. Our analysis shows that the memory-bound LLM decode phase dominates energy use, and yet most existing works focus on accelerating the prefill phase, neglecting energy concerns. We introduce Adaptive Energy-Centric Core Selection (AECS) and integrate it into MNN to create the energy-efficient version, MNN-AECS, the first engine-level system solution without requiring root access or OS modifications for energy-efficient LLM decoding. MNN-AECS is designed to reduce LLM decoding energy while keeping decode speed within an acceptable slowdown threshold by dynamically selecting low-power CPU cores. MNN-AECS is evaluated across 5 Android and 2 iOS devices on 5 popular LLMs of various sizes. Compared to the original MNN, MNN-AECS cuts energy use by 23% on average across all 7 devices and 4 datasets, with no slowdown. Against other engines, including llama.cpp, executorch, mllm, and MediaPipe, MNN-AECS delivers 39% to 78% energy savings and 12% to 363% speedups on average.
中文摘要:MNN-AECS系统通过动态选择低功耗CPU核心,在不影响解码速度的前提下将LLM解码能耗降低23%,相比其他引擎实现了39%-78%的节能效果和12%-363%的速度提升。
English Summary: MNN-AECS is an energy-efficient system that reduces LLM decoding energy by 23% without performance loss by dynamically selecting low-power CPU cores, outperforming other engines with significant energy savings and speed improvements.
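A hypothetical sketch of the core-selection idea, not MNN's actual engine code: probe progressively lower-power core sets and keep the first one whose measured decode speed stays within the slowdown threshold. The `run_decode` callable, the ordering of candidate sets, and the threshold value are all assumptions.

```python
def pick_decode_cores(candidate_sets, run_decode, baseline_tps,
                      max_slowdown=0.1):
    """Hypothetical adaptive energy-centric core selection.

    candidate_sets: core-affinity sets ordered from low-power to high-power
    run_decode:     callable(core_set) -> measured decode tokens/second
    baseline_tps:   decode speed with the default (all-core) configuration
    Returns the lowest-power core set whose speed stays within the
    acceptable slowdown threshold relative to the baseline.
    """
    for cores in candidate_sets:
        tps = run_decode(cores)
        if tps >= (1.0 - max_slowdown) * baseline_tps:
            return cores  # lowest-power cores that are still fast enough
    return candidate_sets[-1]  # fall back to the fastest configuration
```

The point of operating at the engine level is that such affinity choices need no root access or OS changes, matching the abstract's deployment claim.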
Authors:Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno
Abstract:
Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications like smart glasses and IoT devices. We introduce PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. It builds on a depthwise separable U-Net, with knowledge distillation and fixed-point prompt encoding to learn from the Segment Anything Model 2 (SAM2). On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22MB) runs at 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment. Distillation boosts LVIS performance by +3.5% mIoU and +5.1% mAP. These results demonstrate that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud or host processing.
中文: PicoSAM2是一款专为实时设备端执行优化的轻量级可提示分割模型,在索尼IMX500等边缘设备上实现高效性能,并通过消除云端处理有效保护隐私。
English: PicoSAM2 is a lightweight, promptable segmentation model optimized for real-time, on-device execution, achieving efficient performance on edge devices like the Sony IMX500 while preserving privacy by eliminating cloud processing.
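The distillation step can be pictured as mixing ground-truth supervision with a soft target from the teacher. The loss below is a generic sketch of that recipe; the MSE-on-logits choice and the mixing weight are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, gt_mask, alpha=0.5):
    """Minimal mask-distillation sketch (assumed form, not PicoSAM2's exact loss).

    student_logits / teacher_logits: (B, 1, H, W) mask logits
    gt_mask: (B, 1, H, W) float ground-truth masks in {0, 1}
    Combines hard supervision from ground truth with a soft target from
    the teacher's logits (e.g., SAM2 guiding a small student).
    """
    hard = F.binary_cross_entropy_with_logits(student_logits, gt_mask)
    soft = F.mse_loss(student_logits, teacher_logits)
    return alpha * hard + (1 - alpha) * soft
```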
Authors:Luoyang Sun, Cheng Deng, Jiwen Jiang, Xinjian Wu, Haifeng Zhang, Lei Chen, Lionel Ni, Jun Wang
Abstract:
Attention mechanisms underpin the success of large language models (LLMs), yet their substantial computational and memory overhead poses challenges for optimizing efficiency and performance. A critical bottleneck arises as the KV cache and attention computations scale rapidly with text length, challenging deployment on hardware with limited computational and memory resources. We observe that attention mechanisms exhibit substantial redundancy: the KV cache can be significantly compressed, and attention maps across heads display high similarity, revealing that much of the computation and storage is unnecessary. Leveraging these insights, we propose Grouped-Head Latent Attention (GTA), a novel attention mechanism that reduces memory usage and computational complexity while maintaining performance. GTA comprises two components: (1) a shared attention map mechanism that reuses attention scores across multiple heads, decreasing the key cache size; and (2) a nonlinear value decoder with learned projections that compresses the value cache into a latent space, further cutting memory needs. GTA cuts attention computation FLOPs by up to 62.5% versus Grouped-Query Attention and shrinks the KV cache by up to 70%, all while avoiding the extra overhead of Multi-Head Latent Attention to improve LLM deployment efficiency. Consequently, GTA models achieve a 2x increase in end-to-end inference speed, with prefill benefiting from the reduced computational cost and decoding benefiting from the smaller cache footprint.
中文摘要:GTA是一种新型注意力机制,通过压缩KV缓存和在多个注意力头间复用注意力分数,显著降低大语言模型的计算复杂度和内存占用,在保持性能的同时实现高达2倍的推理加速。
English Summary: GTA is a novel attention mechanism that reduces computational complexity and memory usage in large language models by compressing the KV cache and reusing attention scores across heads, achieving up to 2x faster inference speed while maintaining performance.
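A toy sketch of the shared-attention-map component: scores are computed once per head group and broadcast to every head in the group. The nonlinear value decoder and any causal masking are omitted, and the tensor layout is an assumption.

```python
import torch

def grouped_head_attention(q, k, v, heads_per_group=4):
    """Sketch of sharing one attention map across a group of heads.

    q, k: (B, G, T, d) -- one query/key per head *group* (G groups)
    v:    (B, G * heads_per_group, T, d) -- per-head values
    The softmax map is computed once per group and reused by every head
    in that group, shrinking the key cache and attention-score FLOPs.
    """
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    # Broadcast the per-group map to all heads in its group: (B, H, T, T)
    attn = attn.repeat_interleave(heads_per_group, dim=1)
    return attn @ v  # (B, H, T, d)
```

In this reading, only G query/key projections are cached instead of H, which is where the key-cache saving would come from; compressing the value cache into a latent space is the separate second component.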
Authors:Haina Qin, Wenyang Luo, Libin Wang, Dandan Zheng, Jingdong Chen, Ming Yang, Bing Li, Weiming Hu
Abstract:
Image restoration aims to recover high-quality (HQ) images from degraded low-quality (LQ) ones by reversing the effects of degradation. Existing generative models for image restoration, including diffusion and score-based models, often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. In this work, we propose ResFlow, a novel image restoration framework that models the degradation process as a deterministic path using continuous normalizing flows. ResFlow augments the degradation process with an auxiliary process that disambiguates the uncertainty in HQ prediction to enable reversible modeling of the degradation process. ResFlow adopts entropy-preserving flow paths and learns the augmented degradation flow by matching the velocity field. ResFlow significantly improves the performance and speed of image restoration, completing the task in fewer than four sampling steps. Extensive experiments demonstrate that ResFlow achieves state-of-the-art results across various image restoration benchmarks, offering a practical and efficient solution for real-world applications.
Chinese: ResFlow是一种新颖的图像恢复框架,通过连续归一化流将退化过程建模为确定性路径,在少于四个采样步骤内高效实现了最先进的性能。
English: ResFlow is a novel image restoration framework that models degradation as a deterministic path using continuous normalizing flows, achieving state-of-the-art results with high efficiency in fewer than four sampling steps.
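A minimal flow-matching sketch consistent with the abstract's deterministic-path view: interpolate linearly from the LQ image to the HQ image and regress the network onto the constant path velocity. The linear path, the `aux` argument standing in for the auxiliary disambiguating process, and the loss form are assumptions.

```python
import torch

def restoration_flow_loss(v_theta, hq, lq, aux):
    """Flow-matching sketch for restoration along a deterministic path.

    hq, lq: (B, C, H, W) paired high-/low-quality images
    v_theta: callable v_theta(x_t, t, aux) -> predicted velocity
    We place x_t on the straight path from LQ (t=0) to HQ (t=1) and
    match the network's velocity to the constant path velocity hq - lq.
    """
    t = torch.rand(hq.shape[0], 1, 1, 1, device=hq.device)
    x_t = (1 - t) * lq + t * hq
    target_v = hq - lq
    pred_v = v_theta(x_t, t.squeeze(), aux)
    return ((pred_v - target_v) ** 2).mean()
```

At inference, a handful of Euler steps from the LQ input along the learned velocity field would suffice, which is consistent with the abstract's claim of fewer than four sampling steps.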
Authors:Liang Qin, Weiwei Wan, Jun Takahashi, Ryo Negishi, Masaki Matsushita, Kensuke Harada
Abstract:
This work proposes a learning method to accelerate robotic pick-and-place planning by predicting shared grasps. Shared grasps are defined as grasp poses feasible to both the initial and goal object configurations in a pick-and-place task. Traditional analytical methods for solving shared grasps evaluate grasp candidates separately, leading to substantial computational overhead as the candidate set grows. To overcome the limitation, we introduce an Energy-Based Model (EBM) that predicts shared grasps by combining the energies of feasible grasps at both object poses. This formulation enables early identification of promising candidates and significantly reduces the search space. Experiments show that our method improves grasp selection performance, offers higher data efficiency, and generalizes well to unseen grasps and similarly shaped objects.
中文: 本研究提出一种基于能量的学习方法,通过预测共享抓取来加速机器人抓放规划,有效缩小搜索空间,实验表明该方法提升了抓取性能、数据效率并具有良好的泛化能力。
English: This study introduces an energy-based learning method that predicts shared grasps to accelerate robotic pick-and-place planning by efficiently narrowing the search space, demonstrating improved performance, data efficiency, and generalization in experiments.
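The combination rule the abstract describes, summing a grasp's energies at the initial and goal object poses, can be sketched directly; the `energy_net` interface and top-k selection are assumptions.

```python
import torch

def rank_shared_grasps(energy_net, grasps, pose_init, pose_goal, top_k=10):
    """Sketch: score grasp candidates by summed energies at both poses.

    energy_net(grasp, object_pose) -> scalar energy (lower = more feasible).
    A grasp feasible at both the initial and goal pose should have low
    combined energy, so we keep the top-k lowest-energy candidates and
    only run expensive feasibility checks on those.
    """
    e = torch.stack([energy_net(g, pose_init) + energy_net(g, pose_goal)
                     for g in grasps])
    idx = torch.argsort(e)[:top_k]
    return [grasps[i] for i in idx], e[idx]
```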
Authors:Yuqing Lan, Chenyang Zhu, Zhirui Gao, Jiazhao Zhang, Yihan Cao, Renjiao Yi, Yijie Wang, Kai Xu
Abstract:
Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module to establish multi-view correspondences and an optimization module to fuse the 3D bounding boxes of the same instance predicted across views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on the ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
中文: 本文提出了一种免重构的在线开放词汇3D物体检测框架,通过融合预训练视觉基础模型与多视角优化技术,在保持实时性的同时实现了大规模场景下的最优检测性能。
English: This paper introduces a reconstruction-free online framework for open-vocabulary 3D object detection that leverages pre-trained visual foundation models and multi-view fusion techniques to achieve state-of-the-art performance while enabling real-time operation in large-scale environments.
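Of the pieces in the association module, 3D NMS is the most self-contained. Below is a standard greedy variant over axis-aligned boxes, which simplifies away oriented boxes and the IoU-guided particle-filter optimization described in the abstract.

```python
import numpy as np

def iou_3d(a, b):
    """Axis-aligned 3D IoU for boxes given as (x1, y1, z1, x2, y2, z2)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter + 1e-9)

def nms_3d(boxes, scores, thresh=0.25):
    """Greedy 3D NMS: keep high-score boxes, drop overlapping duplicates."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep
```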
Authors:David Dembinsky, Adriano Lucieri, Stanislav Frolov, Hiba Najjar, Ko Watanabe, Andreas Dengel
Abstract:
Modern AI systems frequently rely on opaque black-box models, most notably Deep Neural Networks, whose performance stems from complex architectures with millions of learned parameters. While powerful, their complexity poses a major challenge to trustworthiness, particularly due to a lack of transparency. Explainable AI (XAI) addresses this issue by providing human-understandable explanations of model behavior. However, to ensure their usefulness and trustworthiness, such explanations must be rigorously evaluated. Despite the growing number of XAI methods, the field lacks standardized evaluation protocols and consensus on appropriate metrics. To address this gap, we conduct a systematic literature review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and introduce a unified framework for the eValuation of XAI (VXAI). We identify 362 relevant publications and aggregate their contributions into 41 functionally similar metric groups. In addition, we propose a three-dimensional categorization scheme spanning explanation type, evaluation contextuality, and explanation quality desiderata. Our framework provides the most comprehensive and structured overview of VXAI to date. It supports systematic metric selection, promotes comparability across methods, and offers a flexible foundation for future extensions.
中文: 现代AI系统常依赖不透明的黑盒模型,如深度神经网络,其复杂性导致可信度问题,可解释AI(XAI)通过提供人类可理解的解释来解决此问题,但缺乏标准化评估;本文通过系统文献综述引入统一的VXAI框架,对指标进行分类以提升可比性。
English: Modern AI systems often use opaque black-box models like Deep Neural Networks, which lack transparency and challenge trustworthiness, prompting the development of Explainable AI (XAI) to provide understandable explanations, yet standardized evaluation is lacking; this paper introduces a unified VXAI framework through a systematic review to categorize metrics and enhance comparability.
Authors:Tianxiang Zhan, Ming Jin, Yuanpeng He, Yuxuan Liang, Yong Deng, Shirui Pan
Abstract:
Recurring concept drift, a type of concept drift in which previously observed data patterns reappear after some time, is one of the most prevalent types of concept drift in time series. As time progresses, concept drift occurs and previously encountered concepts are forgotten, leading to a decline in the accuracy of online predictions. Existing solutions employ parameter updating techniques to delay forgetting; however, this may result in the loss of some previously learned knowledge while neglecting the exploration of knowledge retention mechanisms. To retain all conceptual knowledge and fully utilize it when concepts recur, we propose the Continuous Evolution Pool (CEP), a pooling mechanism that stores different instances of forecasters for different concepts. Our method first selects the forecaster nearest to the test sample and then learns the features from its neighboring samples, a process we refer to as retrieval. If there are insufficient neighboring samples, this indicates that a new concept has emerged, and a new forecaster is evolved from the current nearest one and added to the pool to store the knowledge of that concept. Simultaneously, an elimination mechanism clears outdated knowledge to maintain the prediction quality of the forecasters. Experiments on different architectural models and eight real datasets demonstrate that CEP effectively retains the knowledge of different concepts. In the scenario of online forecasting with recurring concepts, CEP significantly enhances the prediction results.
中文: 针对时间序列中反复出现的概念漂移问题,本文提出的持续演化池机制通过存储不同概念的预测器、动态检索或演化模型并清除过时知识,显著提升了在线预测的准确性。
English: The proposed Continuous Evolution Pool (CEP) mechanism addresses recurring concept drift in time series by storing specialized forecasters for different concepts, dynamically retrieving or evolving models based on data patterns while clearing outdated knowledge to significantly enhance online prediction accuracy.
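A toy rendering of the retrieve-or-evolve logic, with the distance metric, the threshold, and the deep-copy evolution step all assumed:

```python
import copy
import numpy as np

class ContinuousEvolutionPool:
    """Toy sketch of CEP's retrieve-or-evolve mechanism (details assumed)."""

    def __init__(self, base_forecaster, tau):
        self.keys = []                       # one feature key per concept
        self.forecasters = [base_forecaster]
        self.tau = tau                       # distance beyond which a new concept is declared

    def select(self, feat):
        if self.keys:
            d = [np.linalg.norm(feat - k) for k in self.keys]
            i = int(np.argmin(d))
            if d[i] < self.tau:
                return self.forecasters[i + 1]  # concept recurs: retrieve
        # New concept: evolve a copy of the nearest (or base) forecaster
        self.keys.append(feat.copy())
        self.forecasters.append(copy.deepcopy(self.forecasters[-1]))
        return self.forecasters[-1]
```

An elimination rule (e.g., dropping the least-recently-retrieved forecaster once the pool exceeds a budget) would complete the picture but is omitted here.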
Authors:Chengrui Zhang, Maizhen Ning, Zihao Zhou, Jie Sun, Kaizhu Huang, Qiufeng Wang
Abstract:
Plane Geometry Diagram Synthesis has been a crucial task in computer graphics, with applications ranging from educational tools to AI-driven mathematical reasoning. Traditionally, we rely on computer tools (e.g., Matplotlib and GeoGebra) to manually generate precise diagrams, but this usually requires substantial, complicated calculation effort. Recently, researchers have started to work on learning-based methods (e.g., Stable Diffusion and GPT-4) to automatically generate diagrams, saving operational cost but usually suffering from limited realism and insufficient accuracy. In this paper, we propose a novel framework, GeoSDF, to automatically generate diagrams efficiently and accurately with the Signed Distance Field (SDF). Specifically, we first represent geometric elements in the SDF, then construct a series of constraint functions to represent geometric relationships, next we optimize such constraint functions to get an optimized field of both elements and constraints, and finally, by rendering the optimized field, we obtain the synthesized diagram. In our GeoSDF, we define a symbolic language to easily represent geometric elements and their constraints, and our synthesized geometry diagrams can be self-verified in the SDF, ensuring both mathematical accuracy and visual plausibility. In experiments, our GeoSDF synthesizes both standard high-school-level and IMO-level geometry diagrams. Through both qualitative and quantitative analysis, we can see that the synthesized diagrams are realistic and accurate, and our synthesis process is simple and efficient. Furthermore, we obtain a very high accuracy in solving geometry problems (over 95%, while the current SOTA accuracy is around 75%) by leveraging our self-verification property. All of these demonstrate the advantage of GeoSDF, paving the way for more sophisticated, accurate, and flexible generation of geometric diagrams for a wide array of applications.
Chinese: 本文提出GeoSDF框架,利用符号距离场自动高效生成逼真且精确的几何图形,并通过自验证特性在几何问题求解中实现高准确率。
English: The paper introduces GeoSDF, a framework using Signed Distance Fields to automatically generate realistic and accurate geometry diagrams efficiently, achieving high problem-solving accuracy through self-verification.
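The optimize-constraints-then-render recipe can be illustrated without the SDF renderer: treat elements as parameters, geometric relations as differentiable residuals, and read self-verification as the residual norm vanishing. The equilateral-triangle constraint set and optimizer settings below are illustrative, not the paper's symbolic language.

```python
import torch

# Sketch of GeoSDF-style synthesis: elements are parameters, geometric
# relations are differentiable residuals, and we optimize until all
# residuals vanish (self-verification = residual norm near zero).
pts = torch.randn(3, 2, requires_grad=True)        # triangle vertices A, B, C

def residuals(p):
    a, b, c = p[0], p[1], p[2]
    side = 3.0                                      # target side length
    return torch.stack([(b - a).norm() - side,      # |AB| = 3
                        (c - b).norm() - side,      # |BC| = 3
                        (a - c).norm() - side])     # |CA| = 3 (equilateral)

opt = torch.optim.Adam([pts], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = residuals(pts).pow(2).sum()
    loss.backward()
    opt.step()

# Self-verification check: residuals should be near zero when solved
print("max residual:", residuals(pts).abs().max().item())
```

Rendering would then evaluate the SDF of each optimized element (segments, circles) on a pixel grid; that step is omitted here.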
Authors:Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti, Christian Kroer
Abstract:
We study online decision making problems under resource constraints, where both reward and cost functions are drawn from distributions that may change adversarially over time. We focus on two canonical settings: (i) online resource allocation, where rewards and costs are observed before action selection, and (ii) online learning with resource constraints, where they are observed after action selection, under either full feedback or bandit feedback. It is well known that achieving sublinear regret in these settings is impossible when reward and cost distributions may change arbitrarily over time. To address this challenge, we analyze a framework in which the learner is guided by a spending plan: a sequence prescribing expected resource usage across rounds. We design general (primal-)dual methods that achieve sublinear regret with respect to baselines that follow the spending plan. Crucially, the performance of our algorithms improves when the spending plan ensures a well-balanced distribution of the budget across rounds. We additionally provide a robust variant of our methods to handle worst-case scenarios where the spending plan is highly imbalanced. To conclude, we study the regret of our algorithms when competing against benchmarks that deviate from the prescribed spending plan.
中文: 本研究针对对抗性资源约束下的在线决策问题,通过引入资源使用计划指导资源分配,采用原始-对偶方法实现次线性遗憾,在预算均衡时表现最优,并包含应对最坏情况的鲁棒变体。
English: This research addresses online decision-making under adversarial resource constraints by introducing spending plans to guide resource allocation, enabling sublinear regret through primal-dual methods that perform best with balanced budgets and include robust variants for worst-case scenarios.
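The dual half of such primal-dual methods usually reduces to pricing deviations from the plan; the sketch below follows that standard pattern, with the step size and dual cap as illustrative constants rather than the paper's tuned values.

```python
import numpy as np

def dual_update(lmbda, cost, plan_t, eta=0.05, lmbda_max=10.0):
    """Sketch of the dual step in a spending-plan-guided primal-dual method.

    The dual variable prices resource usage: it grows when the realized
    cost overshoots the plan's prescribed spend for this round and
    shrinks when we under-spend, steering the learner back to the plan.
    """
    return float(np.clip(lmbda + eta * (cost - plan_t), 0.0, lmbda_max))

# Primal step (assumed form, bandit case): choose the action maximizing
# the Lagrangian estimate  reward_hat[a] - lmbda * cost_hat[a].
```

This also makes the balance claim intuitive: a well-balanced plan keeps `cost - plan_t` small on every round, so the dual variable, and hence the price distortion, stays small.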
Authors:Weiji Xie, Jinrui Han, Jiakun Zheng, Huanyu Li, Xinzhe Liu, Jiyuan Shi, Weinan Zhang, Chenjia Bai, Xuelong Li
Abstract:
Humanoid robots are promising to acquire various skills by imitating human behaviors. However, existing algorithms are only capable of tracking smooth, low-speed human motions, even with delicate reward and curriculum design. This paper presents a physics-based humanoid control framework, aiming to master highly-dynamic human behaviors such as Kungfu and dancing through multi-step motion processing and adaptive motion tracking. For motion processing, we design a pipeline to extract, filter out, correct, and retarget motions, while ensuring compliance with physical constraints to the maximum extent. For motion imitation, we formulate a bi-level optimization problem to dynamically adjust the tracking accuracy tolerance based on the current tracking error, creating an adaptive curriculum mechanism. We further construct an asymmetric actor-critic framework for policy training. In experiments, we train whole-body control policies to imitate a set of highly-dynamic motions. Our method achieves significantly lower tracking errors than existing approaches and is successfully deployed on the Unitree G1 robot, demonstrating stable and expressive behaviors. The project page is https://kungfu-bot.github.io.
中文摘要:本文提出了一种基于物理的人形机器人控制框架,通过多步骤运动处理和自适应运动跟踪来掌握功夫等高度动态的人类行为,在Unitree G1机器人上实现了卓越性能。
English Summary: This paper introduces a physics-based humanoid control framework that masters highly-dynamic human motions like Kungfu through multi-step motion processing and adaptive tracking, achieving superior performance on the Unitree G1 robot.
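The adaptive-curriculum idea, tightening the tolerance when tracking is good and loosening it when it is not, can be sketched in a few lines. All constants are assumptions, and the actual method solves a bi-level optimization rather than this heuristic.

```python
def update_tolerance(tol, tracking_err, target_ratio=0.9,
                     shrink=0.995, grow=1.01, tol_min=0.05, tol_max=0.5):
    """Sketch of an adaptive tracking-accuracy curriculum (constants assumed).

    If the policy tracks well within the current tolerance, tighten it;
    if it struggles, relax it, so difficulty follows current skill.
    """
    if tracking_err < target_ratio * tol:
        tol *= shrink   # doing well: demand more accurate tracking
    else:
        tol *= grow     # struggling: temporarily loosen the requirement
    return min(max(tol, tol_min), tol_max)
```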
Authors:Xuqian Xue, Yiming Lei, Qi Cai, Hongming Shan, Junping Zhang
Abstract:
While contrastive multi-view clustering has achieved remarkable success, it implicitly assumes a balanced class distribution. However, real-world multi-view data primarily exhibits imbalanced class distributions. Consequently, existing methods suffer performance degradation due to their inability to perceive and model such imbalance. To address this challenge, we present the first systematic study of imbalanced multi-view clustering, focusing on two fundamental problems: (i) perceiving the imbalanced class distribution, and (ii) mitigating the representation degradation of minority samples. We propose PROTOCOL, a novel PaRtial Optimal TranspOrt-enhanced COntrastive Learning framework for imbalanced multi-view clustering. First, for class imbalance perception, we map multi-view features into a consensus space and reformulate imbalanced clustering as a partial optimal transport (POT) problem, augmented with progressive mass constraints and weighted KL divergence for class distributions. Second, we develop POT-enhanced class-rebalanced contrastive learning at both the feature and class levels, incorporating logit adjustment and class-sensitive learning to enhance minority sample representations. Extensive experiments demonstrate that PROTOCOL significantly improves clustering performance on imbalanced multi-view data, filling a critical research gap in this field.
中文摘要:本研究提出PROTOCOL框架,通过部分最优传输和重平衡对比学习解决多视图聚类中的类别不平衡问题,有效提升少数类样本的表征能力并显著改善聚类性能。
English Summary: This study introduces PROTOCOL, a novel framework addressing class imbalance in multi-view clustering through partial optimal transport and rebalanced contrastive learning to enhance minority sample representations and improve clustering performance.
Authors:Qirui Mi, Qipeng Yang, Zijun Fan, Wentian Fan, Heyang Ma, Chengdong Ma, Siyu Xia, Bo An, Jun Wang, Haifeng Zhang
Abstract:
Artificial intelligence (AI) has become a powerful tool for economic research, enabling large-scale simulation and policy optimization. However, applying AI effectively requires simulation platforms for scalable training and evaluation, yet existing environments remain limited to simplified, narrowly scoped tasks, falling short of capturing complex economic challenges such as demographic shifts, multi-government coordination, and large-scale agent interactions. To address this gap, we introduce EconGym, a scalable and modular testbed that connects diverse economic tasks with AI algorithms. Grounded in rigorous economic modeling, EconGym implements 11 heterogeneous role types (e.g., households, firms, banks, governments), their interaction mechanisms, and agent models with well-defined observations, actions, and rewards. Users can flexibly compose economic roles with diverse agent algorithms to simulate rich multi-agent trajectories across 25+ economic tasks for AI-driven policy learning and analysis. Experiments show that EconGym supports diverse and cross-domain tasks, such as coordinating fiscal, pension, and monetary policies, and enables benchmarking across AI methods, economic methods, and hybrids. Results indicate that richer task composition and algorithm diversity expand the policy space, while AI agents guided by classical economic methods perform best in complex settings. EconGym also scales to 10k agents with high realism and efficiency.
中文:EconGym作为一个可扩展的模块化测试平台,将AI算法与复杂经济任务相连接,通过多智能体模拟支持跨领域政策学习与分析。
English: EconGym is a scalable, modular testbed that bridges AI algorithms with complex economic tasks, enabling multi-agent simulations across diverse scenarios to enhance policy learning and analysis.
Authors:Tianrui Zhu, Houyuan Chen, Ruihao Gong, Michele Magno, Haotong Qin, Kai Zhang
Abstract:
Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its nascent stages for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block-reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, reducing the error of existing PTQ methods on video matting tasks by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model's ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves state-of-the-art accuracy across different bit-widths compared to existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to its full-precision counterpart while enjoying 8x FLOP savings.
中文摘要:本文提出了一种新颖的视频抠图后训练量化框架,通过结合块重建优化、全局仿射校准和光流辅助技术,在显著降低计算成本的同时实现了接近全精度的性能表现。
English Summary: This paper introduces a novel post-training quantization framework for video matting that combines block-reconstruction optimization, global affine calibration, and optical flow assistance to achieve near full-precision performance while significantly reducing computational costs.
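One plausible reading of the Global Affine Calibration step is per-channel moment matching between full-precision and quantized activations. The closed-form fit below is that reading, offered as a sketch rather than the paper's exact procedure.

```python
import torch

def global_affine_calibration(fp_acts, q_acts):
    """Sketch of statistically-driven affine calibration (our reading of GAC).

    Fits a per-channel scale a and shift b so the quantized network's
    activation statistics match the full-precision ones in closed form:
        a = std(fp) / std(q),   b = mean(fp) - a * mean(q)
    fp_acts, q_acts: (N, C) calibration activations per channel.
    """
    a = fp_acts.std(0) / (q_acts.std(0) + 1e-8)
    b = fp_acts.mean(0) - a * q_acts.mean(0)
    return a, b  # apply as  y = a * q_out + b  after the quantized layer
```

Such a correction is cheap (two vectors per layer) yet can absorb cumulative drift, e.g., from BN effects folded away during quantization, which matches the abstract's motivation.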
Authors:Xiaoyi Bao, Jindi Lv, Xiaofeng Wang, Zheng Zhu, Xinze Chen, YuKun Zhou, Jiancheng Lv, Xingang Wang, Guan Huang
Abstract:
Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
中文: GigaVideo-1是一种高效的微调框架,通过自动反馈和奖励引导策略,无需人工监督即可提升视频生成质量,在多项维度上取得显著改进且计算资源消耗极低。
English: GigaVideo-1 is an efficient fine-tuning framework that enhances video generation quality by leveraging automatic feedback and a reward-guided strategy without human supervision, achieving significant improvements across multiple dimensions with minimal computational resources.
Authors:Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
Abstract:
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that the current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained from SFE will facilitate further developments in AI-enhanced scientific discoveries.
中文: 科学家初试(SFE)基准通过三个相互关联的层面评估多模态大语言模型的科学认知能力,实验表明当前顶尖模型表现有限,凸显了在科学领域提升的巨大空间。
English: The Scientists' First Exam (SFE) benchmark is introduced to assess multimodal large language models' scientific cognitive abilities across perception, understanding, and reasoning, revealing current models' limited performance and the need for improvement in scientific applications.
Authors:Fangwen Mu, Junjie Wang, Lin Shi, Song Wang, Shoubin Li, Qing Wang
Abstract:
Automatically repairing software issues remains a fundamental challenge at the intersection of software engineering and AI. Although recent advancements in Large Language Models (LLMs) have demonstrated potential for repository-level repair tasks, current methodologies exhibit two notable limitations: (1) they often address issues in isolation, neglecting to incorporate insights from previously resolved issues, and (2) they rely on static and rigid prompting strategies, which constrain their ability to generalize across diverse and evolving issue scenarios. Inspired by the dual memory systems of human cognition, where episodic and semantic memories work synergistically to support human reasoning and decision-making, we propose ExpeRepair, a novel LLM-based approach that continuously learns from historical repair experiences through dual-channel knowledge accumulation. ExpeRepair organizes historical repair experiences into two complementary memories: an episodic memory that stores concrete repair demonstrations, and a semantic memory that encodes abstract reflective insights. At inference time, ExpeRepair activates both memory systems by retrieving relevant demonstrations from episodic memory and recalling high-level repair insights from semantic memory. It further enhances adaptability through dynamic prompt composition, synergistically integrating both memory types to replace static prompts with context-aware, experience-driven prompts. Experiments on the SWE-bench Lite benchmark demonstrate that ExpeRepair achieves a pass@1 score of 49.3% with Claude 3.7 Sonnet, outperforming all state-of-the-art open-source methods.
Chinese: ExpeRepair提出了一种新颖的基于大语言模型的软件修复方法,通过双通道记忆系统——情景记忆存储具体修复案例,语义记忆编码抽象洞见——动态整合历史修复经验,在SWE-bench Lite基准测试中以49.3%的pass@1分数实现最优性能。
English: ExpeRepair introduces a novel LLM-based software repair approach that leverages dual-channel memory systems—episodic for concrete demonstrations and semantic for abstract insights—to dynamically integrate historical repair experiences, achieving state-of-the-art performance with a 49.3% pass@1 score on SWE-bench Lite.
Authors:Ali Vosoughi, Jing Bi, Pinxin Liu, Yunlong Tang, Chenliang Xu
Abstract:
What happens when we push audio-visual alignment to its absolute limits? To systematically investigate this question, we needed datasets with granular alignment quality annotations, but existing datasets treat alignment as binary: either synchronized or not. To address this limitation, we developed a comprehensive dataset featuring detailed alignment scores that reveal the hidden spectrum of audio-visual perceptual correspondence. Using these precise scores, we create "superaligned" representations by training exclusively on the most perfectly matched audio-visual pairs, then conduct our systematic investigation into how this extreme alignment transforms perceptual model behavior across retrieval and generation tasks. The encoders under study fall into two main groups: image-centric encoders, pretrained using visual modalities as intermediary hubs for connecting modalities, and text-centric encoders, pretrained with direct audio-language alignment. We first measure the baseline performance of these encoders on two key tasks, namely cross-modal retrieval and text description generation in vision-language models. Subsequently, we realign all encoders with the CLIP space using highly coherent audio-visual data and observe the performance changes. Our findings reveal that the initial architectural type of an encoder determines how it responds to the alignment process. Image-centric encoders, which are inherently designed for alignment, demonstrate exceptional performance in cross-modal retrieval, but this intensive alignment compresses unique linguistic information and reduces the quality of their text description generation in vision-language models. In contrast, text-centric encoders, which possess stronger linguistic authenticity, are able to maintain a better balance between the two objectives.
中文摘要:本研究通过使用完美匹配的音频-视觉对创建“超对齐”表征,探究了极端对齐的影响,发现以图像为中心的编码器在跨模态检索中表现出色但会降低文本生成质量,而以文本为中心的编码器则能更好地平衡这两项任务。
English Summary: This study investigates the effects of extreme audio-visual alignment by creating "superaligned" representations using perfectly matched pairs, revealing that image-centric encoders excel in cross-modal retrieval but compromise text generation quality, while text-centric encoders maintain better balance between these tasks.
Authors:Yuquan Xie, Zaijing Li, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Dongmei Jiang, Liqiang Nie
Abstract:
Recent efforts to leverage the Multi-modal Large Language Model (MLLM) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/
中文摘要:本文提出Mirage-1多模态GUI代理,通过分层技能模块和搜索算法解决在线长程任务中的知识不足与领域差距问题,在多个基准测试中展现出显著性能优势。
English Summary: This paper introduces Mirage-1, a multimodal GUI agent enhanced with a hierarchical skill module and search algorithm to overcome knowledge limitations and domain gaps in long-horizon online tasks, demonstrating superior performance across multiple benchmarks.
Authors:Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Liqiang Nie
Abstract:
Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging due to insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity in open-world settings. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability for visual diversity in Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: https://cybertronagent.github.io/Optimus-3.github.io/
中文: 本文提出通用 Minecraft 智能体 Optimus-3,通过知识增强数据生成、专家混合架构和多模态推理增强强化学习解决领域数据不足与任务干扰等难题,在多种任务中超越现有最优模型。
English: This paper introduces Optimus-3, a general-purpose Minecraft agent that overcomes challenges like data scarcity and task interference through a knowledge-enhanced data pipeline, Mixture-of-Experts architecture, and multimodal reasoning-augmented reinforcement learning, outperforming existing models in diverse tasks.
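Task-level routing differs from the usual token-level MoE in that the sample's task id, not a learned gate over tokens, picks the expert. A minimal sketch follows, with sizes and expert shape assumed:

```python
import torch
import torch.nn as nn

class TaskRoutedMoE(nn.Module):
    """Sketch of task-level routing: one expert per task family.

    Unlike token-level routing, every token of a sample is sent to the
    expert chosen by the sample's task id, isolating heterogeneous tasks
    (e.g., planning vs. grounding) from one another.
    """

    def __init__(self, d_model=256, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x, task_id: int):
        return self.experts[task_id](x)   # hard routing by task
```

Hard routing by task means gradients from one task family never touch another family's expert, which is one simple way to read the abstract's interference-mitigation claim.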
Authors:Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, Xingang Wang
Abstract:
Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
Chinese: Motion-R1通过思维链机制将文本指令分解为结构化动作路径,结合语义引导和强化学习优化运动生成,在复杂场景中实现了卓越性能。
English: Motion-R1 introduces a Chain-of-Thought framework that decomposes text instructions into structured action paths, enhancing motion generation through semantic guidance and reinforcement learning, achieving superior performance in complex scenarios.
Authors:Nicola Farronato, Florian Scheidegger, Mattia Rigotti, Cristiano Malossi, Michele Magno, Haotong Qin
Abstract:
The Segment Anything Model 2 (SAM2) has gained significant attention as a foundational approach for promptable image and video segmentation. However, its expensive computational and memory consumption poses a severe challenge for its application in resource-constrained scenarios. In this paper, we propose an accurate low-bit quantization method for efficient SAM2, termed Q-SAM2. To address the performance degradation caused by the singularities in weight and activation distributions during quantization, Q-SAM2 introduces two novel technical contributions. We first introduce a linear layer calibration method for low-bit initialization of SAM2, which minimizes the Frobenius norm over a small image batch to reposition weight distributions for improved quantization. We then propose a Quantization-Aware Training (QAT) pipeline that applies clipping to suppress outliers and allows the network to adapt to quantization thresholds during training. Our comprehensive experiments demonstrate that Q-SAM2 allows for highly accurate inference while substantially improving efficiency. Both quantitative and visual results show that our Q-SAM2 surpasses existing state-of-the-art general quantization schemes, especially for ultra-low 2-bit quantization. While designed for quantization-aware training, our proposed calibration technique also proves effective in post-training quantization, achieving up to a 66% mIoU accuracy improvement over non-calibrated models.
Chinese: 针对Segment Anything Model 2 (SAM2)的计算效率问题,本文提出Q-SAM2量化方法,通过创新的校准和训练技术,在显著提升效率的同时保持了高精度性能。
English: The Segment Anything Model 2 (SAM2) faces computational challenges, so this paper introduces Q-SAM2, an efficient quantization method that enhances accuracy and performance through innovative calibration and training techniques.
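A hedged sketch of the calibration idea: search a weight scaling that minimizes the Frobenius norm between full-precision and quantized outputs on a small calibration batch. The grid search and the single per-layer scale (rather than per-channel) are simplifications, not Q-SAM2's exact procedure.

```python
import torch

def calibrate_linear(W, X, quantize):
    """Sketch of linear-layer calibration before low-bit quantization.

    W: (out, in) full-precision weights
    X: (n, in) small calibration batch of activations
    quantize: fake-quantization callable on a weight tensor (assumed)
    Searches a scaling s of the weights that minimizes the Frobenius
    norm between full-precision and quantized layer outputs, thereby
    repositioning the weight distribution for quantization.
    """
    best_s, best_err = 1.0, float("inf")
    for s in torch.linspace(0.5, 1.5, 21):
        Wq = quantize(W * s) / s
        err = torch.norm(X @ W.T - X @ Wq.T, p="fro")
        if err < best_err:
            best_s, best_err = s.item(), err.item()
    return best_s
```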
Authors:Dewei Wang, Xinmiao Wang, Xinzhe Liu, Jiyuan Shi, Yingnan Zhao, Chenjia Bai, Xuelong Li
Abstract:
Humanoid robots have demonstrated robust locomotion capabilities using Reinforcement Learning (RL)-based approaches. Further, to obtain human-like behaviors, existing methods integrate human motion tracking or motion priors in the RL framework. However, these methods are limited to flat terrains and proprioception only, restricting their ability to traverse challenging terrains with human-like gaits. In this work, we propose a novel framework using a mixture of latent residual experts with multi-discriminators to train an RL policy, which is capable of traversing complex terrains in controllable, lifelike gaits with exteroception. Our two-stage training pipeline first teaches the policy to traverse complex terrains using a depth camera, and then enables gait-commanded switching between human-like gait patterns. We also design gait rewards to adjust human-like behaviors such as robot base height. Simulation and real-world experiments demonstrate that our framework exhibits exceptional performance in traversing complex terrains, and achieves seamless transitions between multiple human-like gait patterns.
中文摘要:本研究提出了一种新颖的强化学习框架,通过潜在残差专家混合模型和多判别器结构,结合外部深度感知和两阶段训练方法,使人形机器人能够在复杂地形上实现可控的拟人步态行走与平滑切换。
English Summary: This study introduces a novel reinforcement learning framework using latent residual experts and multi-discriminators that enables humanoid robots to traverse complex terrains with lifelike, controllable gaits through two-stage training incorporating exteroceptive depth sensing.
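Gait rewards of the kind mentioned, e.g., for robot base height, are typically bell-shaped around a commanded value; a one-line sketch with assumed constants:

```python
import numpy as np

def base_height_reward(base_height, target_height, sigma=0.05):
    """Sketch of a gait-shaping reward term (constants assumed).

    Peaks at 1 when the robot base tracks the height commanded for the
    active gait and decays smoothly as the error grows.
    """
    return float(np.exp(-((base_height - target_height) / sigma) ** 2))
```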
Authors:Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan
Abstract:
Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show the proposed method is able to learn 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all competing baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it achieves a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project Website: https://bridgevla.github.io/
中文: BridgeVLA是一种创新的三维视觉语言动作模型,通过将三维数据投影为二维图像并利用热图进行动作预测,显著提升了机器人操作的效率与性能,在仿真和实际任务中均表现卓越。
English: BridgeVLA is a novel 3D vision-language-action model that enhances robot manipulation by projecting 3D data into 2D images and using heatmaps for action prediction, achieving superior efficiency and performance across simulation and real-world tasks.
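Decoding an action from a 2D heatmap is the part of this design that fits in a few lines. The soft-argmax below is a common choice and an assumption here; lifting the pixel to a 3D end-effector target (via depth and camera calibration) is out of scope.

```python
import torch

def heatmap_to_pixel(heatmap):
    """Sketch: decode a 2D action heatmap into pixel coordinates.

    heatmap: (H, W) unnormalized scores. A soft-argmax keeps decoding
    differentiable, returning the probability-weighted mean location.
    """
    H, W = heatmap.shape
    probs = torch.softmax(heatmap.reshape(-1), dim=0).reshape(H, W)
    ys = torch.arange(H, dtype=probs.dtype)
    xs = torch.arange(W, dtype=probs.dtype)
    y = (probs.sum(dim=1) * ys).sum()   # expected row index
    x = (probs.sum(dim=0) * xs).sum()   # expected column index
    return x, y
```

Keeping both inputs (projected views) and outputs (heatmaps) in the same 2D image space is what lets the VLM backbone's pre-trained spatial priors transfer, which is the abstract's stated rationale for the design.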
Authors:Haotong Qin, Cheng Hu, Michele Magno
Abstract:
Large Language Model (LLM)-based Vision-Language Models (VLMs) have substantially extended the boundaries of visual understanding capabilities. However, their high computational demands hinder deployment on resource-constrained edge devices. A key source of inefficiency stems from the VLM's need to process dense and redundant visual information. Visual inputs contain significant regions irrelevant to text semantics, rendering the associated computations ineffective for inference. This paper introduces a novel Event-Priori-Based Vision-Language Model, termed EP-VLM. Its core contribution is a novel mechanism leveraging motion priors derived from dynamic event vision to enhance VLM efficiency. Inspired by human visual cognition, EP-VLM first employs event data to guide the patch-wise sparsification of RGB visual inputs, progressively concentrating VLM computation on salient regions of the visual input. Subsequently, we construct a position-preserving tokenization strategy for the visual encoder within the VLM architecture. This strategy processes the event-guided, unstructured, sparse visual input while accurately preserving positional understanding within the visual input. Experimental results demonstrate that EP-VLM achieves significant efficiency improvements while maintaining nearly lossless accuracy compared to baseline models from the Qwen2-VL series. For instance, against the original Qwen2-VL-2B, EP-VLM achieves 50% FLOPs savings while retaining 98% of the original accuracy on the RealWorldQA dataset. This work demonstrates the potential of event-based vision priors for improving VLM inference efficiency, paving the way for creating more efficient and deployable VLMs for sustainable visual understanding at the edge.
中文: EP-VLM提出了一种基于事件先验的机制,利用运动数据对视觉输入进行稀疏化处理,在保持98%准确率的同时,相比基线模型实现了50%的计算量节省。
English: EP-VLM introduces an event-prior-based mechanism that uses motion data to sparsify visual inputs, achieving 50% computational savings while maintaining 98% accuracy compared to baseline models.
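In the simplest reading, the event-guided sparsification keeps the patches with the most event activity while remembering their positions so the tokenizer can stay position-preserving. The top-k rule and keep ratio below are assumptions.

```python
import torch

def event_guided_patch_keep(rgb_patches, event_counts, keep_ratio=0.5):
    """Sketch: keep the patches with the most event activity.

    rgb_patches:  (N, D) patch embeddings of the RGB frame
    event_counts: (N,) events accumulated per patch (the motion prior)
    Returns kept patch embeddings and their original indices, so a
    position-preserving tokenizer can retain spatial layout downstream.
    """
    k = max(1, int(keep_ratio * rgb_patches.shape[0]))
    idx = torch.topk(event_counts, k).indices.sort().values
    return rgb_patches[idx], idx
```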
Authors:Yun Hua, Haosheng Chen, Shiqin Wang, Wenhao Li, Xiangfeng Wang, Jun Luo
Abstract:
Large Language Models (LLMs) show strong collaborative performance in multi-agent systems with predefined roles and workflows. However, in open-ended environments lacking coordination rules, agents tend to act in self-interested ways. The central challenge in achieving coordination lies in credit assignment: fairly evaluating each agent's contribution and designing pricing mechanisms that align their heterogeneous goals. This problem is critical as LLMs increasingly participate in complex human-AI collaborations, where fair compensation and accountability rely on effective pricing mechanisms. Inspired by how human societies address similar coordination challenges (e.g., through temporary collaborations such as employment or subcontracting), we propose a cooperative workflow, Shapley-Coop. Shapley-Coop integrates Shapley Chain-of-Thought, which leverages marginal contributions as a principled basis for pricing, with structured negotiation protocols for effective price matching, enabling LLM agents to coordinate through rational task-time pricing and post-task reward redistribution. This approach aligns agent incentives, fosters cooperation, and maintains autonomy. We evaluate Shapley-Coop across two multi-agent games and a software engineering simulation, demonstrating that it consistently enhances LLM agent collaboration and facilitates equitable credit assignment. These results highlight the effectiveness of Shapley-Coop's pricing mechanisms in accurately reflecting individual contributions during task execution.
中文: Shapley-Coop是一种协作工作流程,通过理性定价和任务后奖励再分配,使LLM智能体在开放环境中有效协调,提升合作并确保公平的贡献评估。
English: Shapley-Coop is a cooperative workflow that enables LLM agents to coordinate through rational pricing and reward redistribution, enhancing collaboration and ensuring fair credit assignment in open-ended environments.
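The pricing basis, Shapley values, can be computed exactly for small agent sets. The sketch below is the textbook formula with a toy coalition-value function, not Shapley-Coop's LLM-driven estimation.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, coalition_value):
    """Exact Shapley values for a small set of agents.

    coalition_value: callable(frozenset) -> total reward of a coalition.
    Each agent is paid its average marginal contribution over all join
    orders -- the pricing basis Shapley-Coop builds on.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for S in combinations(others, r):
                S = frozenset(S)
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[p] += w * (coalition_value(S | {p}) - coalition_value(S))
    return phi

# Toy example: two agents whose joint reward exceeds the sum of parts
v = lambda S: {frozenset(): 0, frozenset({"a"}): 1,
               frozenset({"b"}): 2, frozenset({"a", "b"}): 5}[frozenset(S)]
print(shapley_values(["a", "b"], v))   # {'a': 2.0, 'b': 3.0}
```

Exact computation is exponential in the number of agents, which is presumably why a Chain-of-Thought estimate of marginal contributions is used instead at scale.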
Authors:Keyi Zhu, Kyle Lammers, Kaixiang Zhang, Chaaran Arunachalam, Siddhartha Bhattacharya, Jiajia Li, Renfu Lu, Zhaojian Li
Abstract:
Apples are among the most widely consumed fruits worldwide. Currently, apple harvesting fully relies on manual labor, which is costly, drudging, and hazardous to workers. Hence, robotic harvesting has attracted increasing attention in recent years. However, existing systems still fall short in terms of performance, effectiveness, and reliability for complex orchard environments. In this work, we present the development and evaluation of a dual-arm harvesting robot. The system integrates a ToF camera, two 4DOF robotic arms, a centralized vacuum system, and a post-harvest handling module. During harvesting, suction force is dynamically assigned to either arm via the vacuum system, enabling efficient apple detachment while reducing power consumption and noise. Compared to our previous design, we incorporated a platform movement mechanism that enables both in-out and up-down adjustments, enhancing the robot's dexterity and adaptability to varying canopy structures. On the algorithmic side, we developed a robust apple localization pipeline that combines a foundation-model-based detector, segmentation, and clustering-based depth estimation, which improves performance in orchards. Additionally, pressure sensors were integrated into the system, and a novel dual-arm coordination strategy was introduced to respond to harvest failures based on sensor feedback, further improving picking efficiency. Field demos were conducted in two commercial orchards in MI, USA, with different canopy structures. The system achieved success rates of 0.807 and 0.797, with an average picking cycle time of 5.97s. The proposed strategy reduced harvest time by 28% compared to a single-arm baseline. The dual-arm harvesting robot enhances the reliability and efficiency of apple picking. With further advancements, the system holds strong potential for autonomous operation and commercialization for the apple industry.
中文: 本研究开发了一种集成先进视觉系统与动态真空吸附的双臂苹果采摘机器人,田间试验表明其采摘成功率高且周期显著缩短。
English: This study introduces a dual-arm apple harvesting robot that integrates advanced vision systems and dynamic vacuum suction, achieving high success rates and reduced picking times in field tests.
Authors:Xinyu Cui, Boai Sun, Yi Zhu, Ning Yang, Haifeng Zhang, Weicheng Cui, Dixia Fan, Jun Wang
Abstract:
Aquatic organisms are known for their ability to generate efficient propulsion with low energy expenditure. While existing research has sought to leverage bio-inspired structures to reduce energy costs in underwater robotics, the crucial role of control policies in enhancing efficiency has often been overlooked. In this study, we optimize the motion of a bio-mimetic robotic fish using deep reinforcement learning (DRL) to maximize propulsion efficiency and minimize energy consumption. Our novel DRL approach incorporates extended pressure perception, a transformer model processing sequences of observations, and a policy transfer scheme. Notably, significantly improved training stability and speed within our approach allow for end-to-end training of the robotic fish. This enables agiler responses to hydrodynamic environments and possesses greater optimization potential compared to pre-defined motion pattern controls. Our experiments are conducted on a serially connected rigid robotic fish in a free stream with a Reynolds number of 6000 using computational fluid dynamics (CFD) simulations. The DRL-trained policies yield impressive results, demonstrating both high efficiency and propulsion. The policies also showcase the agent's embodiment, skillfully utilizing its body structure and engaging with surrounding fluid dynamics, as revealed through flow analysis. This study provides valuable insights into the bio-mimetic underwater robots optimization through DRL training, capitalizing on their structural advantages, and ultimately contributing to more efficient underwater propulsion systems.
中文: 本研究采用结合扩展压力感知和Transformer模型的深度强化学习方法,通过端到端训练优化仿生机器鱼运动,实现了高推进效率和对流体环境的敏捷响应。
English: This study employs deep reinforcement learning with extended pressure perception and a transformer model to optimize a robotic fish's motion, achieving high propulsion efficiency and agile responses in hydrodynamic environments through end-to-end training.
Authors:Wenwei Gu, Renyi Zhong, Guangba Yu, Xinying Sun, Jinyang Liu, Yintong Huo, Zhuangbin Chen, Jianping Zhang, Jiazhen Gu, Yongqiang Yang, Michael R. Lyu
Abstract:
To ensure the reliability of cloud systems, their performance is monitored using KPIs (key performance indicators). When issues arise, root cause localization identifies KPIs responsible for service degradation, aiding in quick diagnosis and resolution. Traditional methods rely on similarity calculations, which can be ineffective in complex, interdependent cloud environments. While deep learning-based approaches model these dependencies better, they often face challenges such as high computational demands and lack of interpretability.
To address these issues, KPIRoot is proposed as an efficient method combining similarity and causality analysis. It uses symbolic aggregate approximation (SAX) for compact KPI representation, improving analysis efficiency. However, deployment in Cloud H revealed two drawbacks: 1) threshold-based anomaly detection misses some performance anomalies, and 2) the SAX representation fails to capture intricate variation trends. KPIRoot+ addresses these limitations, outperforming eight state-of-the-art baselines by 2.9% to 35.7%, while reducing time cost by 34.7%. We also share our experience deploying KPIRoot in a large-scale cloud provider's production environment.
Chinese: 为提高云系统可靠性,KPIRoot+作为一种改进方法被提出,结合相似性和因果分析,解决了异常检测和表征的不足,在准确性和效率上均优于现有方法。
English: To enhance cloud system reliability, KPIRoot+ is introduced as an improved method that combines similarity and causality analysis, addressing limitations in anomaly detection and representation to outperform existing approaches in accuracy and efficiency.
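For readers unfamiliar with symbolic aggregate approximation (SAX), the representation KPIRoot builds on, here is a minimal illustrative sketch; the segment count, alphabet size, and breakpoints are common defaults assumed for illustration, not the paper's settings.

```python
import numpy as np

def sax(series, n_segments=8, alphabet="abcd"):
    """Minimal SAX: z-normalize, piecewise-aggregate, then discretize.

    Breakpoints below are the quartiles of a standard normal, matching a
    4-symbol alphabet; real deployments tune both parameters.
    """
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)          # z-normalize
    paa = x[: len(x) // n_segments * n_segments]   # trim to a multiple
    paa = paa.reshape(n_segments, -1).mean(axis=1) # piecewise aggregate
    breakpoints = [-0.67, 0.0, 0.67]               # N(0,1) quartiles
    symbols = np.digitize(paa, breakpoints)        # indices 0..3
    return "".join(alphabet[s] for s in symbols)

print(sax(np.sin(np.linspace(0, 6.28, 64))))       # an 8-symbol word
```

Two KPI windows can then be compared by the distance between their short symbol strings instead of their raw samples, which is where the efficiency gain comes from.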
Authors:Anders Enqvist, Özlem Tuğfe Demir, Cicek Cavdar, Emil Björnson
Abstract:
Reconfigurable intelligent surfaces (RISs) can greatly improve the signal quality of future communication systems by reflecting transmitted signals toward the receiver. However, even when the base station (BS) has perfect channel knowledge and can compute the optimal RIS phase-shift configuration, implementing this configuration requires feedback signaling over a control channel from the BS to the RIS. This feedback must be kept minimal, as it is transmitted wirelessly every time the channel changes. In this paper, we examine how the feedback load, measured in bits, affects the performance of an RIS-aided system. Specifically, we investigate the trade-offs between codebook-based and element-wise feedback schemes, and how these influence the signal-to-noise ratio (SNR). We propose a novel quantization codebook tailored for line-of-sight (LoS) channels that guarantees a minimal SNR loss using a number of feedback bits that scales logarithmically with the number of RIS elements. We demonstrate the codebook's usefulness over Rician fading channels and how to extend it to handle a non-zero static path. Numerical simulations and analysis are performed to quantify the performance degradation that results from a reduced feedback load, shedding light on how efficiently RIS configurations can be fed back in practical systems.
中文: 本文研究了可重构智能表面辅助通信系统中反馈负载与性能之间的权衡,提出了一种新型量化码本,能以对数级反馈实现最小信噪比损失,并通过仿真和分析验证了其有效性。
English: This paper explores the trade-off between feedback load and performance in RIS-aided communication systems, proposing a novel quantization codebook that minimizes SNR loss with logarithmic feedback scaling and validating its effectiveness through simulations and analysis.
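To see why per-element phase feedback costs SNR, here is a small Monte-Carlo sketch of the element-wise baseline (uniform b-bit quantization of each optimal phase); the element count and setup are assumptions, and the paper's proposed codebook is designed to need far fewer bits than this baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 128                        # number of RIS elements (assumed)

def snr_loss_db(bits, trials=2000):
    """Empirical SNR loss of element-wise b-bit phase quantization
    versus ideal continuous phase control, for a coherent LoS cascade."""
    levels = 2 ** bits
    step = 2 * np.pi / levels
    losses = []
    for _ in range(trials):
        ideal = rng.uniform(0, 2 * np.pi, N)       # optimal phases
        quantized = np.round(ideal / step) * step  # nearest level
        # Coherent sum with the residual phase error per element:
        gain = np.abs(np.sum(np.exp(1j * (quantized - ideal)))) ** 2 / N**2
        losses.append(gain)
    return -10 * np.log10(np.mean(losses))

for b in (1, 2, 3):
    print(f"{b} bit(s)/element: ~{snr_loss_db(b):.2f} dB SNR loss")
```

The 1-bit case loses roughly 4 dB (the classic sinc^2(pi/2) factor), while 2-3 bits already shrink the loss below 1 dB, which is the kind of trade-off the feedback-bit analysis quantifies.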
Authors:Hao Li, Qi Lv, Rui Shao, Xiang Deng, Yinchuan Li, Jianye Hao, Liqiang Nie
Abstract:
Transforming complex actions into discrete skill abstractions has demonstrated strong potential for robotic manipulation. Existing approaches mainly leverage latent variable models, e.g., VQ-VAE, to learn skill abstractions through learned vectors (codebooks), but they suffer from codebook collapse and struggle to model the causal relationships between learned skills. To address these limitations, we present Skill Training with Augmented Rotation (STAR), a framework that advances both skill learning and composition to complete complex behaviors. Specifically, to prevent codebook collapse, we devise rotation-augmented residual skill quantization (RaRSQ), which encodes relative angles between encoder outputs into the gradient flow via a rotation-based gradient mechanism. Points within the same skill code are forced to be either pushed apart or pulled closer together depending on gradient directions. Further, to capture the causal relationships between skills, we present the causal skill transformer (CST), which explicitly models dependencies between skill representations through an autoregressive mechanism for coherent action generation. Extensive experiments demonstrate the superiority of STAR on both the LIBERO benchmark and real-world tasks, with around 12% improvement over the baselines.
中文:STAR框架通过旋转增强量化和因果变换器解决了代码本坍塌问题并建模技能间因果关系,在机器人操作基准上实现12%的性能提升。
English: The proposed STAR framework overcomes codebook collapse and models causal skill relationships through rotation-augmented quantization and causal transformers, achieving 12% improvement on robotic manipulation benchmarks.
Authors:Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan
Abstract:
Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data, UniWorld-V1 achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld-V1 framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.
中文: 现有统一模型在视觉语言理解和文本生成图像方面表现优异,但在图像感知与操控能力上存在不足;为此我们开发了开源框架UniWorld-V1,通过采用多模态语义特征,仅用270万训练数据即可在图像理解、生成、操控等多项任务中实现卓越性能。
English: While current unified models excel in vision-language understanding and text-to-image generation, they fall short in advanced image perception and manipulation, prompting the development of UniWorld-V1—an open-source framework that leverages semantic features from multimodal models to achieve superior performance across diverse tasks with minimal training data.
Authors:Yusuke Sakai, Takumi Goto, Taro Watanabe
Abstract:
We propose IMPARA-GED, a novel reference-free automatic grammatical error correction (GEC) evaluation method with grammatical error detection (GED) capabilities. We focus on the quality estimator of IMPARA, an existing automatic GEC evaluation method, and construct that of IMPARA-GED using a pre-trained language model with enhanced GED capabilities. Experimental results on SEEDA, a meta-evaluation dataset for automatic GEC evaluation methods, demonstrate that IMPARA-GED achieves the highest correlation with human sentence-level evaluations.
中文:我们提出了IMPARA-GED,这是一种无需参考文本的自动语法纠错评估新方法,具备语法错误检测功能,实验证明其在句子级别评估中与人工评价的相关性最高。
English: We introduce IMPARA-GED, a new reference-free automatic grammatical error correction evaluation method that incorporates grammatical error detection, achieving the highest correlation with human assessments in experiments.
Authors:Xidong Yang, Wenhao Li, Junjie Sheng, Chuyun Shen, Yun Hua, Xiangfeng Wang
Abstract:
Reinforcement learning (RL) has driven breakthroughs in AI, from game-play to scientific discovery and AI alignment. However, its broader applicability remains limited by challenges such as low data efficiency and poor generalizability. Recent advances suggest that large language models, with their rich world knowledge and reasoning capabilities, could complement RL by enabling semantic state modeling and task-agnostic planning. In this work, we propose Agentic Episodic Control (AEC), a novel architecture that integrates RL with LLMs to enhance decision-making. AEC leverages a large language model (LLM) to map observations into language-grounded embeddings, which can then be stored in an episodic memory for rapid retrieval of high-value experiences. Simultaneously, a World-Graph working memory module is utilized to capture structured environmental dynamics and enhance relational reasoning. Furthermore, a lightweight critical state detector dynamically arbitrates between episodic memory recall and world-model-guided exploration. On the whole, by combining the trial-and-error learning scheme with LLM-derived semantic priors, the proposed AEC can improve both data efficiency and generalizability in reinforcement learning. In experiments on BabyAI-Text benchmark tasks, AEC demonstrates substantial improvements over existing baselines, especially on complex and generalization tasks like FindObj, where it outperforms the best baseline by up to 76%. The proposed AEC framework bridges the strengths of numeric reinforcement learning and symbolic reasoning, providing a pathway toward more adaptable and sample-efficient agents.
Chinese: 代理情景控制(AEC)框架通过将强化学习与大语言模型结合,利用语义嵌入和情景记忆优化决策,显著提升了数据效率和泛化能力,在BabyAI-Text基准测试中复杂任务上的表现最高优于基线76%。
English: The Agentic Episodic Control (AEC) framework integrates reinforcement learning with large language models to enhance decision-making by leveraging semantic embeddings and episodic memory, significantly improving data efficiency and generalizability, as demonstrated by up to 76% performance gains in BabyAI-Text benchmark tasks.
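A toy sketch of the episodic-memory component described above: store language-grounded embeddings alongside observed returns and recall a similarity-weighted value estimate. The embedding inputs here stand in for an LLM encoder, which is an assumption of this sketch.

```python
import numpy as np

class EpisodicMemory:
    """Toy episodic memory in the spirit of AEC: write (embedding, return)
    pairs, then recall a value estimate from the nearest stored experiences."""

    def __init__(self):
        self.keys, self.values = [], []

    def write(self, embedding, ret):
        self.keys.append(embedding / (np.linalg.norm(embedding) + 1e-8))
        self.values.append(ret)

    def recall(self, query, k=3):
        if not self.keys:
            return None
        q = query / (np.linalg.norm(query) + 1e-8)
        sims = np.stack(self.keys) @ q                  # cosine similarity
        top = np.argsort(sims)[-k:]
        # Value estimate: similarity-weighted average of stored returns
        w = np.exp(sims[top]) / np.exp(sims[top]).sum()
        return float(w @ np.asarray(self.values)[top])

mem = EpisodicMemory()
rng = np.random.default_rng(1)
for _ in range(100):
    mem.write(rng.normal(size=16), ret=rng.uniform())
print(mem.recall(rng.normal(size=16)))
```

In the full architecture, the critical state detector would decide per step whether to trust this recalled value or to fall back on world-model-guided exploration.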
Authors:Xingyu Wu, Kui Yu, Jibin Wu, Kay Chen Tan
Abstract:
This paper critically re-evaluates LLMs' role in causal discovery and argues against their direct involvement in determining causal relationships. We demonstrate that LLMs' autoregressive, correlation-driven modeling inherently lacks the theoretical grounding for causal reasoning and introduces unreliability when used as priors in causal discovery algorithms. Through empirical studies, we expose the limitations of existing LLM-based methods and reveal that deliberate prompt engineering (e.g., injecting ground-truth knowledge) can overstate their performance, helping to explain the consistently favorable results reported in much of the current literature. Based on these findings, we strictly confine LLMs' role to a non-decisional auxiliary capacity: LLMs should not participate in determining the existence or directionality of causal relationships, but can assist the search process for causal graphs (e.g., LLM-based heuristic search). Experiments across various settings confirm that, by strictly isolating LLMs from causal decision-making, LLM-guided heuristic search can accelerate convergence and outperform both traditional and LLM-based methods in causal structure learning. We conclude with a call for the community to shift focus from naively applying LLMs to developing specialized models and training methods that respect the core principles of causal discovery.
Chinese: 本文主张由于大语言模型缺乏因果推理的理论基础,不应直接用于因果决策,但可作为辅助工具参与启发式搜索以改进因果结构学习。
English: This paper argues that LLMs should not be directly used for causal decision-making due to their inherent lack of theoretical grounding for causal reasoning, but can serve as auxiliary tools in heuristic searches to enhance causal structure learning.
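A skeleton of the role separation the paper argues for: an LLM may order which candidate edges the search tries first, but only a data-driven score decides whether an edge enters the graph. Both callables below are placeholders, not a real scorer or LLM.

```python
import itertools
import random

def llm_propose_order(edges):
    """Stand-in for an LLM prior: ranks candidate edges to try first.
    Crucially, it only orders the search; it never accepts an edge."""
    random.shuffle(edges)          # placeholder for an LLM-derived ranking
    return edges

def score(graph, data):
    """Placeholder for a data-driven score such as BIC; this score alone
    decides whether an edge is accepted into the causal graph."""
    return -len(graph)             # toy score: prefers sparse graphs

def heuristic_search(variables, data):
    graph, best = set(), score(set(), data)
    candidates = list(itertools.permutations(variables, 2))
    for edge in llm_propose_order(candidates):
        trial = graph | {edge}
        s = score(trial, data)
        if s > best:               # data-driven acceptance only
            graph, best = trial, s
    return graph

random.seed(0)
print(heuristic_search(["X", "Y", "Z"], data=None))
```

With the toy score nothing is ever accepted; the point of the skeleton is the control flow: the LLM influences only the visiting order, never the accept/reject decision.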
Authors:Jingyi Yang, Shuai Shao, Dongrui Liu, Jing Shao
Abstract:
With the rapid development of multimodal large language models (MLLMs), they are increasingly deployed as autonomous computer-use agents capable of accomplishing complex computer tasks. However, a pressing issue arises: Can the safety risk principles designed and aligned for general MLLMs in dialogue scenarios be effectively transferred to real-world computer-use scenarios? Existing research on evaluating the safety risks of MLLM-based computer-use agents suffers from several limitations: it either lacks realistic interactive environments, or narrowly focuses on one or a few specific risk types. These limitations ignore the complexity, variability, and diversity of real-world environments, thereby restricting comprehensive risk evaluation for computer-use agents. To this end, we introduce RiOSWorld, a benchmark designed to evaluate the potential risks of MLLM-based agents during real-world computer manipulations. Our benchmark includes 492 risky tasks spanning various computer applications, involving the web, social media, multimedia, OS, email, and office software. We categorize these risks into two major classes based on their risk source: (i) user-originated risks and (ii) environmental risks. We evaluate safety risks from two perspectives: (i) risk goal intention and (ii) risk goal completion. Extensive experiments with multimodal agents on RiOSWorld demonstrate that current computer-use agents confront significant safety risks in real-world scenarios. Our findings highlight the necessity and urgency of safety alignment for computer-use agents in real-world computer manipulation, providing valuable insights for developing trustworthy computer-use agents. Our benchmark is publicly available at https://yjyddq.github.io/RiOSWorld.github.io/.
中文摘要:RiOSWorld基准旨在评估多模态智能体在真实计算机操作中的安全风险,发现当前系统存在显著隐患,强调了加强安全防护的紧迫性。
English Summary: The RiOSWorld benchmark is introduced to assess safety risks of multimodal agents in real-world computer tasks, revealing significant vulnerabilities and emphasizing the need for improved safety measures.
Authors:Shaofeng Zhang, Shengcai Liu, Ning Lu, Jiahao Wu, Ji Liu, Yew-Soon Ong, Ke Tang
Abstract:
Combinatorial optimization problems are widely encountered in real-world applications. Designing high-quality heuristic algorithms that efficiently approximate optimal solutions within reasonable time is a critical research challenge. In recent years, many works have explored integrating Large Language Models (LLMs) with Evolutionary Algorithms to automate heuristic algorithm design through prompt engineering. However, these approaches generally adopt a problem-specific paradigm, applying a single algorithm across all problem instances, failing to account for the heterogeneity across instances. In this paper, we propose InstSpecHH, a novel framework that introduces the concept of instance-specific heuristic generation. InstSpecHH partitions the overall problem class into sub-classes based on instance features and performs differentiated, automated heuristic design for each problem subclass. By tailoring heuristics to the unique features of different sub-classes, InstSpecHH achieves better performance at the problem class level while avoiding redundant heuristic generation for similar instances, thus reducing computational overhead. This approach effectively balances the trade-off between the cost of automatic heuristic design and the quality of the obtained solutions. To evaluate the performance of InstSpecHH, we conduct experiments on 4,500 subclasses of the Online Bin Packing Problem (OBPP) and 365 subclasses of the Capacitated Vehicle Routing Problem (CVRP). Experimental results show that InstSpecHH demonstrates strong intra-subclass and inter-subclass generalization capabilities. Compared to previous problem-specific methods, InstSpecHH reduces the average optimality gap by more than 5.6% for OBPP and 0.9% for CVRP. These results highlight the potential of instance-aware automatic heuristic design to further enhance solution quality.
中文: 本文提出InstSpecHH框架,通过基于特征将问题类别划分为子类并生成实例特定启发式算法,相比先前问题特定方法在提升求解质量的同时有效降低了计算开销。
English: This paper introduces InstSpecHH, a framework that generates instance-specific heuristics by partitioning problem classes into subclasses based on features, achieving improved performance and reduced computational costs compared to previous problem-specific methods.
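A minimal sketch of the caching pattern behind instance-specific generation: bucket instances into subclasses by coarse features and design one heuristic per subclass, reusing it for similar instances. The feature choices and the OBPP instance encoding are assumptions for illustration only.

```python
def subclass_key(instance):
    """Bucket an Online Bin Packing instance by coarse features
    (the features here are illustrative, not the paper's)."""
    sizes = instance["item_sizes"]
    mean_fill_ratio = sum(sizes) / (len(sizes) * instance["capacity"])
    return (round(mean_fill_ratio, 1), len(sizes) // 100)

heuristic_cache = {}

def get_heuristic(instance, design_heuristic):
    """Design one heuristic per subclass and reuse it across similar
    instances -- the cost-saving idea behind instance-specific design."""
    key = subclass_key(instance)
    if key not in heuristic_cache:
        heuristic_cache[key] = design_heuristic(key)  # e.g., an LLM+EA loop
    return heuristic_cache[key]

inst = {"item_sizes": [30, 55, 70] * 50, "capacity": 100}
h = get_heuristic(inst, design_heuristic=lambda key: f"heuristic-for-{key}")
print(h)
```

The expensive design step (in the paper, an LLM-driven evolutionary loop) runs once per subclass key rather than once per instance, which is where the reduced overhead comes from.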
Authors:Renzo J. Scholman, Tanja Alderliesten, Peter A. N. Bosman
Abstract:
The Gene-pool Optimal Mixing EA (GOMEA) family of EAs offers a specific means to exploit problem-specific knowledge through linkage learning, i.e., inter-variable dependency detection, expressed using subsets of variables that should undergo joint variation. Such knowledge can be exploited if faster fitness evaluations are possible when only a few variables are changed in a solution, enabling large speed-ups. The most recent version of Real-Valued GOMEA (RV-GOMEA) can learn a conditional linkage model during optimization using fitness-based linkage learning, enabling fine-grained dependency exploitation in learning and sampling a Gaussian distribution. However, while the most efficient Gaussian-based EAs, like NES and CMA-ES, employ incremental learning of the Gaussian distribution rather than performing full re-estimation every generation, the most recent RV-GOMEA version does not employ such incremental learning. In this paper, we therefore study whether incremental distribution estimation can lead to efficiency enhancements of RV-GOMEA. We consider various benchmark problems with varying degrees of overlapping dependencies. We find that, compared to RV-GOMEA and VKD-CMA-ES, the required number of evaluations to reach high-quality solutions can be reduced by a factor of up to 1.5 if population sizes are tuned problem-specifically, while a reduction by a factor of 2-3 can be achieved with generic population-sizing guidelines.
Chinese: 本文研究在实数编码基因池优化混合算法中引入增量分布估计是否能提升效率,发现相比现有方法,在问题特定调优下评估次数最多可减少1.5倍,而采用通用参数准则时能实现2-3倍的减少。
English: This paper investigates whether incorporating incremental distribution estimation into the Real-Valued GOMEA algorithm can improve its efficiency, finding that it reduces the required evaluations by up to a factor of 1.5 with problem-specific tuning and 2-3 with generic guidelines compared to existing methods.
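For readers unfamiliar with the contrast drawn here, this is a minimal sketch of incremental Gaussian estimation of the NES/CMA-ES kind: blend the old distribution with statistics of the selected samples instead of re-estimating it from scratch each generation. The learning rates are illustrative, not RV-GOMEA's settings.

```python
import numpy as np

def incremental_update(mean, cov, samples, eta_mu=0.3, eta_cov=0.1):
    """Blend the previous Gaussian with the empirical statistics of the
    selected samples, instead of full per-generation re-estimation."""
    x = np.asarray(samples)
    new_mean = (1 - eta_mu) * mean + eta_mu * x.mean(axis=0)
    centered = x - new_mean
    emp_cov = centered.T @ centered / len(x)
    new_cov = (1 - eta_cov) * cov + eta_cov * emp_cov
    return new_mean, new_cov

rng = np.random.default_rng(2)
mu, C = np.zeros(2), np.eye(2)
for _ in range(5):
    elite = rng.normal(loc=[1.0, -1.0], scale=0.5, size=(10, 2))
    mu, C = incremental_update(mu, C, elite)
print(mu.round(2))   # drifts toward the elite samples' mean
```

The smoothing makes the distribution robust to small, noisy selection sets, which is why it can reduce the evaluation budget relative to full re-estimation.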
Authors:Taisei Takano, Yuki Okamoto, Yusuke Kanamori, Yuki Saito, Ryotaro Nagase, Hiroshi Saruwatari
Abstract:
Contrastive language-audio pretraining (CLAP) is widely used for audio generation and recognition tasks. For example, CLAPScore, which utilizes the similarity of CLAP embeddings, has been a major metric for evaluating the relevance between audio and text in text-to-audio generation. However, the relationship between CLAPScore and human subjective evaluation scores remains unclear. We show that CLAPScore has a low correlation with human subjective evaluation scores. Additionally, we propose a human-perception-based CLAP, called Human-CLAP, built by training a contrastive language-audio model using the subjective evaluation scores. In our experiments, the results indicate that our Human-CLAP improves the Spearman's rank correlation coefficient (SRCC) between CLAPScore and the subjective evaluation scores by more than 0.25 compared with the conventional CLAP.
Chinese: CLAP评分与人类主观评价相关性较低,但提出的Human-CLAP模型通过使用主观评分训练,将斯皮尔曼等级相关系数提升了0.25以上,显著改善了这种对应关系。
English: CLAPScore shows low correlation with human subjective evaluations, but the proposed Human-CLAP model significantly improves this alignment by over 0.25 in Spearman's rank correlation coefficient.
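As a concrete reminder of the metric being improved, here is a minimal sketch of computing the SRCC between CLAPScore-style similarities and human scores; all numbers are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical values for illustration only: CLAP similarity scores and
# human mean-opinion scores for the same audio-text pairs.
clap_scores = np.array([0.31, 0.55, 0.42, 0.70, 0.28, 0.63])
human_mos   = np.array([2.0,  3.5,  4.0,  4.5,  1.5,  3.0])

srcc, _ = spearmanr(clap_scores, human_mos)
print(f"SRCC = {srcc:.3f}")  # the correlation Human-CLAP is trained to raise
```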
Authors:Jiale Meng, Yiming Li, Zheming Lu, Zewei He, Hao Luo, Tianwei Zhang
Abstract:
Text watermarking schemes have gained considerable attention in recent years, yet still face critical challenges in achieving simultaneous robustness, generalizability, and imperceptibility. This paper introduces a new embedding paradigm, termed CORE, which comprises several consecutively aligned black pixel segments. Its key innovation lies in its inherent noise resistance during transmission and broad applicability across languages and fonts. Building on CORE, we present a text watermarking framework named CoreMark. Specifically, CoreMark first dynamically extracts COREs from characters. Then, the characters with stronger robustness are selected according to the lengths of their COREs. By modifying the thickness of the CORE, the hidden data is embedded into the selected characters without causing significant visual distortions. Moreover, a general plug-and-play embedding strength modulator is proposed, which can adaptively enhance robustness for small font sizes by adjusting the embedding strength according to the font size. Experimental evaluation indicates that CoreMark demonstrates outstanding generalizability across multiple languages and fonts. Compared to existing methods, CoreMark achieves significant improvements in resisting screenshot, print-scan, and print-camera attacks, while maintaining satisfactory imperceptibility.
中文摘要:本文提出名为CoreMark的新型文本水印框架,采用CORE嵌入范式,在保持良好隐蔽性的同时,显著提升了抗截图、打印扫描等攻击的鲁棒性,并具备多语言通用性。
English Summary: The paper introduces CoreMark, a novel text watermarking framework using CORE segments that achieves superior robustness against various attacks while maintaining imperceptibility and cross-language applicability.
Authors:Yiling Xu, Yujie Zhang, Shuting Xia, Kaifa Yang, He Huang, Ziyu Shan, Wenjie Huang, Qi Yang, Le Yang
Abstract:
The rapid growth of 3D point cloud data, driven by applications in autonomous driving, robotics, and immersive environments, has led to a critical demand for efficient compression and quality assessment techniques. Unlike traditional 2D media, point clouds present unique challenges due to their irregular structure, high data volume, and complex attributes. This paper provides a comprehensive survey of recent advances in point cloud compression (PCC) and point cloud quality assessment (PCQA), emphasizing their significance for real-time and perceptually relevant applications. We analyze a wide range of handcrafted and learning-based PCC algorithms, along with objective PCQA metrics. By benchmarking representative methods on emerging datasets, we offer detailed comparisons and practical insights into their strengths and limitations. Despite notable progress, challenges such as enhancing visual fidelity, reducing latency, and supporting multimodal data remain. This survey outlines future directions, including hybrid compression frameworks and advanced feature extraction strategies, to enable more efficient, immersive, and intelligent 3D applications.
中文: 本文综述了点云压缩与质量评估的最新进展,针对其不规则结构和高数据量等挑战,对现有方法进行基准测试,并展望了未来高效三维应用的发展方向。
English: This paper surveys recent advances in point cloud compression and quality assessment, addressing challenges like irregular structure and high data volume while benchmarking methods and outlining future directions for more efficient 3D applications.
Authors:Qilong Xing, Zikai Song, Yuteng Ye, Yuke Chen, Youjia Zhang, Na Feng, Junqing Yu, Wei Yang
Abstract:
Segmentation of brain structures from MRI is crucial for evaluating brain morphology, yet existing CNN and transformer-based methods struggle to delineate complex structures accurately. While current diffusion models have shown promise in image segmentation, they are inadequate when applied directly to brain MRI because they neglect anatomical information. To address this, we propose Collaborative Anatomy Diffusion (CA-Diff), a framework integrating spatial anatomical features to enhance the segmentation accuracy of the diffusion model. Specifically, we introduce the distance field as an auxiliary anatomical condition to provide global spatial context, alongside a collaborative diffusion process to model its joint distribution with anatomical structures, enabling effective utilization of anatomical features for segmentation. Furthermore, we introduce a consistency loss to refine the relationships between the distance field and anatomical structures, and design a time-adapted channel attention module to enhance the U-Net feature fusion procedure. Extensive experiments show that CA-Diff outperforms state-of-the-art (SOTA) methods.
Chinese: 提出的协作解剖扩散(CA-Diff)框架通过引入距离场和协作扩散过程整合空间解剖特征,显著提升了脑部MRI分割精度,超越了现有最先进方法。
English: The proposed Collaborative Anatomy Diffusion (CA-Diff) framework integrates spatial anatomical features through distance fields and collaborative diffusion processes to significantly enhance brain MRI segmentation accuracy, outperforming current state-of-the-art methods.
Authors:Sarah Seifi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille
Abstract:
Rule-based models offer interpretability but struggle with complex data, while deep neural networks excel in performance yet lack transparency. This work investigates a neuro-symbolic rule learning neural network named RL-Net that learns interpretable rule lists through neural optimization, applied for the first time to radar-based hand gesture recognition (HGR). We benchmark RL-Net against a fully transparent rule-based system (MIRA) and an explainable black-box model (XentricAI), evaluating accuracy, interpretability, and user adaptability via transfer learning. Our results show that RL-Net achieves a favorable trade-off, maintaining strong performance (93.03% F1) while significantly reducing rule complexity. We identify optimization challenges specific to rule pruning and hierarchy bias and propose stability-enhancing modifications. Compared to MIRA and XentricAI, RL-Net emerges as a practical middle ground between transparency and performance. This study highlights the real-world feasibility of neuro-symbolic models for interpretable HGR and offers insights for extending explainable AI to edge-deployable sensing systems.
中文摘要:RL-Net是一种神经符号模型,在手势识别中实现了可解释性与性能的平衡,相比透明规则系统和黑箱模型,在保持93.03% F1分数的同时显著降低了规则复杂度。
English Summary: RL-Net is a neuro-symbolic model that balances interpretability and performance in hand gesture recognition, achieving 93.03% F1 score while reducing rule complexity compared to transparent and black-box alternatives.
Authors:Jorge Torres Gómez, Pit Hofmann, Lisa Y. Debus, Osman Tugay BaÅaran, Sebastian Lotter, Roya Khanzadeh, Stefan Angerbauer, Bige Deniz Unluturk, Sergi Abadal, Werner Haselmayr, Frank H. P. Fitzek, Robert Schober, Falko Dressler
Abstract:
Recent developments in the Internet of Bio-Nano Things (IoBNT) are laying the groundwork for innovative applications across the healthcare sector. Nanodevices designed to operate within the body, managed remotely via the internet, are envisioned to promptly detect and actuate on potential diseases. In this vision, an inherent challenge arises due to the limited capabilities of individual nanosensors; specifically, nanosensors must communicate with one another to collaborate as a cluster. Aiming to explore the limits of these clustering capabilities, this survey emphasizes data-driven communication strategies in molecular communication (MC) channels as a means of linking nanosensors. Relying on the flexibility and robustness of machine learning (ML) methods to tackle the dynamic nature of MC channels, the MC research community frequently refers to neural network (NN) architectures. This interdisciplinary research field encompasses various aspects, including the use of NNs to facilitate communication in MC environments, their implementation at the nanoscale, explainable approaches for NNs, and dataset generation for training. Within this survey, we provide a comprehensive analysis of fundamental perspectives on recent trends in NN architectures for MC, the feasibility of their implementation at the nanoscale, applied explainable artificial intelligence (XAI) techniques, and the accessibility of datasets along with best practices for their generation. Additionally, we offer open-source code repositories that illustrate NN-based methods to support reproducible research for key MC scenarios. Finally, we identify emerging research challenges, such as robust NN architectures, biologically integrated NN modules, and scalable training strategies.
中文: 本综述探讨了在生物纳米物联网中,利用数据驱动的机器学习方法特别是神经网络来增强分子通信,以实现纳米传感器集群协作,涵盖了实施可行性、可解释人工智能及数据集生成,并指出了未来的研究方向。
English: This survey explores data-driven machine learning approaches, particularly neural networks, to enhance molecular communication for clustering nanosensors in the Internet of Bio-Nano Things, addressing implementation challenges, explainable AI, and dataset generation while identifying future research directions.
Authors:Shunqi Mao, Wei Guo, Chaoyi Zhang, Jieting Long, Ke Xie, Weidong Cai
Abstract:
Diffusion models have shown strong performance in conditional generation by progressively denoising Gaussian samples toward a target data distribution. This denoising process can be interpreted as a form of hill climbing in a learned latent space, where the model iteratively refines a sample toward regions of higher probability. However, this learned climbing often converges to local optima with plausible but suboptimal generations due to latent space complexity and suboptimal initialization. While prior efforts often strengthen guidance signals or introduce fixed exploration strategies to address this, they exhibit limited capacity to escape steep local maxima. In contrast, we propose Controlled Random Zigzag Sampling (Ctrl-Z Sampling), a novel sampling strategy that adaptively detects and escapes such traps through controlled exploration. In each diffusion step, we first identify potential local maxima using a reward model. Upon such detection, we inject noise and revert to a previous, noisier state to escape the current plateau. The reward model then evaluates candidate trajectories, accepting only those that offer improvement, and otherwise scheduling progressively deeper explorations when nearby alternatives fail. This controlled zigzag process allows dynamic alternation between forward refinement and backward exploration, enhancing both alignment and visual quality in the generated outputs. The proposed method is model-agnostic and compatible with existing diffusion frameworks. Experimental results show that Ctrl-Z Sampling substantially improves generation quality with only around a 6.72x increase in the number of function evaluations.
中文: 提出的受控随机之字形采样(Ctrl-Z采样)通过检测扩散模型中的局部最优解,在识别到平台期时注入噪声并回退至先前状态,再利用奖励模型评估轨迹以提升生成质量。
English: The proposed Controlled Random Zigzag Sampling (Ctrl-Z Sampling) adaptively detects and escapes local optima in diffusion models by injecting noise and reverting to previous states when plateaus are identified, then evaluating trajectories with a reward model to enhance generation quality.
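A schematic of the accept/revert loop described above, with toy stand-ins for the denoiser and reward model; the re-noising schedule and acceptance rule here are simplifications of the paper's method, not its exact algorithm.

```python
import random

def ctrl_z_sample(x_T, denoise, add_noise, reward, T, max_depth=3, budget=200):
    """Schematic Ctrl-Z loop: refine forward, but when the reward drops
    (a likely plateau), re-noise back to an earlier, noisier state and
    retry, exploring progressively deeper on repeated failures."""
    x, t, prev_r, depth = x_T, T, float("-inf"), 0
    for _ in range(budget):              # hard cap on function evaluations
        if t <= 0:
            break
        candidate = denoise(x, t)
        r = reward(candidate)
        if r >= prev_r or depth >= max_depth:
            x, t, prev_r, depth = candidate, t - 1, r, 0  # accept, refine
        else:
            depth += 1                   # reject: revert to a noisier step
            t = min(T, t + depth)
            x = add_noise(x, t)
    return x

# Toy stand-ins so the skeleton runs end to end:
random.seed(0)
denoise = lambda x, t: x + 1.0                      # "refine" the sample
add_noise = lambda x, t: x - 0.5 + random.random()  # revert with noise
reward = lambda x: x                                # higher is better
print(ctrl_z_sample(0.0, denoise, add_noise, reward, T=10))
```

The forced accept after `max_depth` failures mirrors the paper's idea of bounding how deep the backward exploration may go before the forward refinement resumes.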
Authors:Juraj Vladika, Ihsan Soydemir, Florian Matthes
Abstract:
While large language models (LLMs) have shown remarkable capabilities to generate coherent text, they suffer from the issue of hallucinations -- factually inaccurate statements. Among the numerous approaches to tackling hallucinations, self-correcting methods are especially promising. They leverage the multi-turn nature of LLMs to iteratively generate verification questions requesting additional evidence, answer them with internal or external knowledge, and use that to refine the original response with the new corrections. These methods have been explored for encyclopedic generation, but less so for domains like news summarization. In this work, we investigate two state-of-the-art self-correcting systems by applying them to correct hallucinated summaries using evidence from three search engines. We analyze the results and provide insights into the systems' performance, revealing interesting practical findings on the benefits of search engine snippets and few-shot prompts, as well as the high alignment of G-Eval and human evaluation.
中文: 大型语言模型常产生事实错误的幻觉,但利用内部或外部知识迭代验证并修正回答的自我纠正方法展现出潜力,尤其在新闻摘要中结合搜索引擎证据的应用,能有效提升准确性并与人工评估高度一致。
English: Large language models often produce factually inaccurate hallucinations, but self-correcting methods that iteratively verify and refine responses using internal or external knowledge show promise, particularly when applied with search engine evidence in news summarization to enhance accuracy and align with human evaluations.
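A skeleton of the verify-and-refine loop these systems share; the prompts and the llm() and search() callables are illustrative assumptions, with toy stand-ins so the skeleton executes, not any system's actual interface.

```python
def self_correct(summary, llm, search, rounds=2):
    """Schematic self-correction: generate verification questions, gather
    evidence from a search engine, and revise the summary against it."""
    for _ in range(rounds):
        questions = llm(
            f"Write verification questions for factual claims in:\n{summary}"
        )
        evidence = "\n".join(search(q) for q in questions.splitlines() if q)
        summary = llm(
            "Revise the summary so it is consistent with the evidence.\n"
            f"Summary:\n{summary}\nEvidence:\n{evidence}"
        )
    return summary

# Toy stand-ins to make the skeleton executable:
llm = lambda prompt: prompt.splitlines()[-1]
search = lambda q: f"snippet about: {q}"
print(self_correct("The Eiffel Tower is in Berlin.", llm, search, rounds=1))
```

The paper's finding about snippet quality fits naturally here: whatever `search` returns directly bounds how well the revision step can correct the hallucination.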
Authors:Sudesh Bhagat, Raghupathi Kandiboina, Ibne Farabi Shihab, Skylar Knickerbocker, Neal Hawkins, Anuj Sharma
Abstract:
Road traffic crashes are a significant global cause of fatalities, emphasizing the urgent need for accurate crash data to enhance prevention strategies and inform policy development. This study addresses the challenge of alcohol inference mismatch (AIM) by employing database narrative alignment to identify AIM in crash data. A framework was developed to improve data quality in crash management systems and reduce the percentage of AIM crashes. Utilizing the BERT model, the analysis of 371,062 crash records from Iowa (2016-2022) revealed 2,767 AIM incidents, resulting in an overall AIM percentage of 24.03%. Statistical tools, including the Probit Logit model, were used to explore the crash characteristics affecting AIM patterns. The findings indicate that alcohol-related fatal crashes and nighttime incidents have a lower percentage of mismatch, while crashes involving unknown vehicle types and older drivers are more susceptible to mismatch. The geospatial clustering performed in this study can identify regions with an increased need for education and training. These insights highlight the necessity for targeted training programs and data management teams to improve the accuracy of crash reporting and support evidence-based policymaking.
中文摘要:本研究通过开发数据质量改进框架,利用BERT模型识别出24.03%的酒精推断失配事故,揭示了关键事故特征和地理聚集规律,为针对性培训和精准政策制定提供重要依据。
English Summary: This study tackles alcohol inference mismatch in crash data by developing a framework that identified 24.03% AIM incidents through BERT analysis, revealing key crash characteristics and geographic clusters to guide targeted training and policy improvements.
Authors:Shuyin Xia, Guan Wang, Gaojie Xu, Sen Zhao, Guoyin Wang
Abstract:
The objective of graph coarsening is to generate smaller, more manageable graphs while preserving key information of the original graph. Previous work was mainly based on a spectrum-preserving perspective, using predefined coarsening rules to make the eigenvalues of the Laplacian matrices of the original and coarsened graphs match as closely as possible. However, it largely overlooked the fact that the original graph is composed of subregions at different levels of granularity, where highly connected and similar nodes should be more inclined to be aggregated together as nodes in the coarsened graph. By combining the multi-granularity characteristics of the graph structure, we can generate a coarsened graph at the optimal granularity. To this end, inspired by the application of granular-ball computing in multi-granularity settings, we propose a new multi-granularity, efficient, and adaptive coarsening method via granular-balls (GBGC), which significantly improves both coarsening results and efficiency. Specifically, GBGC introduces an adaptive granular-ball graph refinement mechanism, which adaptively splits the original graph from coarse to fine into granular-balls of different sizes and optimal granularity, and constructs the coarsened graph using these granular-balls as supernodes. In addition, compared with other state-of-the-art graph coarsening methods, this method is tens to hundreds of times faster and has lower time complexity. The accuracy of GBGC is almost always higher than that of the original graph due to the good robustness and generalization of granular-ball computing, so it has the potential to become a standard graph data preprocessing method.
中文: 本文提出了一种基于粒球的多粒度图粗化方法(GBGC),通过自适应地将节点聚合成不同粒度的超节点来高效简化图结构,在速度和精度上均显著优于现有方法。
English: This paper introduces a multi-granularity graph coarsening method called GBGC, which uses granular-ball computing to adaptively group nodes into supernodes for efficient and accurate graph simplification, significantly improving speed and performance over existing methods.
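A toy sketch of coarse-to-fine granular-ball splitting on raw points; the stopping rule and the variance-based split are illustrative stand-ins for GBGC's density-consistency measurement, not the paper's procedure.

```python
import numpy as np

def split_into_balls(points, min_size=8, spread=0.5):
    """Recursively split a point set into 'balls' until each ball is small
    or tight enough; ball centers then act as supernodes."""
    center = points.mean(axis=0)
    radius = np.linalg.norm(points - center, axis=1).mean()
    if len(points) <= min_size or radius <= spread:
        return [center]
    # Cheap 2-way split along the direction of largest variance:
    d = points - center
    axis = np.linalg.svd(d, full_matrices=False)[2][0]
    left, right = points[d @ axis < 0], points[d @ axis >= 0]
    if len(left) == 0 or len(right) == 0:
        return [center]
    return (split_into_balls(left, min_size, spread)
            + split_into_balls(right, min_size, spread))

rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(len(split_into_balls(pts)), "supernodes")  # two tight clusters -> 2
```

Because each split only touches the points inside one ball, the procedure stays cheap, which hints at where the reported tens-to-hundreds-fold speed-ups can come from.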
Authors:Shuyin Xia, Yifan Wang, Lifeng Shen, Guoyin Wang
Abstract:
Most existing multi-kernel clustering algorithms, such as multi-kernel K-means, often struggle with computational efficiency and robustness when faced with complex data distributions. These challenges stem from their dependence on point-to-point relationships for optimization, which can make it difficult to accurately capture a data set's inherent structure and diversity. Additionally, the intricate interplay between multiple kernels in such algorithms can further exacerbate these issues, ultimately impairing their ability to cluster data points in high-dimensional spaces. In this paper, we leverage granular-ball computing to improve the multi-kernel clustering framework. The core of granular-ball computing is to adaptively fit the data distribution with balls, from coarse to acceptable levels. Each ball can enclose data points based on a density consistency measurement. Such a ball-based data description improves computational efficiency and robustness to unknown noise. Specifically, based on granular-ball representations, we introduce the granular-ball kernel (GBK) and its corresponding granular-ball multi-kernel K-means framework (GB-MKKM) for efficient clustering. Using granular-ball relationships in multiple kernel spaces, the proposed GB-MKKM framework shows its superiority in efficiency and clustering performance in empirical evaluations on various clustering tasks.
中文摘要:本文提出了一种基于粒球计算的多核K均值聚类框架(GB-MKKM),通过粒球自适应表征多核空间的数据分布,显著提升了聚类效率和对噪声的鲁棒性。
English Summary: This paper introduces a granular-ball multi-kernel K-means framework (GB-MKKM) that enhances computational efficiency and clustering robustness by using granular-ball computing to adaptively represent data distributions in multiple kernel spaces.
Authors:Gengyuan Zhang, Tanveer Hannan, Hermine Kleiner, Beste Aydemir, Xinyu Xie, Jian Lan, Thomas Seidl, Volker Tresp, Jindong Gu
Abstract:
An ideal vision-language agent serves as a bridge between human users and their surrounding physical world in real-world applications like autonomous driving and embodied agents, and proactively provides accurate and timely responses given user intents. An intriguing challenge arises when agents interact with the world as a dynamic data stream while receiving ad-hoc queries from users: supporting knowledge for queries, namely evidence, usually appears asynchronously with the arrival time of queries, and agents need to ground their responses in historical data, present observations, and even future streams. We frame this challenge as Query-Evidence Asynchrony, where user queries and their supporting evidence typically arrive asynchronously in the streaming setting. This setting requires not only strong reasoning capabilities but also the ability to retain past observations and respond to queries with temporal awareness. In this paper, we introduce a diagnostic benchmark that evaluates Multimodal Large Language Models (MLLMs) on their ability to handle interaction with streaming data. Further, we present AViLA, an Asynchronous Video-Language Agent for streaming-data interaction that can handle ad-hoc queries and give time-aware responses. For this purpose, AViLA consists of three key modules: comprehensive memory retention, evidence identification, and an evidence-grounded trigger, designed to maintain a general-purpose memory and respond readily and in a timely manner to queries. Our experiments show that existing models often fail to respond at appropriate times, while AViLA significantly improves both accuracy and temporal awareness. Our code and dataset will be publicly available.
中文: 理想的视觉语言代理需应对查询与证据异步的挑战,在动态流数据环境中为用户提供准确且时间感知的响应;本文提出的AViLA框架通过综合记忆保留、证据识别和证据触发三大模块,显著提升了现有模型的准确性和时间感知能力。
English: An ideal vision-language agent must address the challenge of Query-Evidence Asynchrony by providing accurate, time-aware responses to user queries in dynamic streaming environments, as demonstrated by the proposed AViLA framework which significantly enhances both accuracy and temporal awareness compared to existing models.
Authors:Xiangyuan Peng, Miao Tang, Huawei Sun, Bierzynski Kay, Lorenzo Servadei, Robert Wille
Abstract:
LiDAR and 4D radar are widely used in autonomous driving and robotics. While LiDAR provides rich spatial information, 4D radar offers velocity measurements and remains robust under adverse conditions. As a result, a growing number of studies have focused on 4D radar-LiDAR fusion to enhance perception. However, the misalignment between different modalities is often overlooked. To address this challenge and leverage the strengths of both modalities, we propose a LiDAR detection framework enhanced by 4D radar motion status and cross-modal uncertainty. The object movement information from 4D radar is first captured using a Dynamic Motion-Aware Encoding module during feature extraction to enhance 4D radar predictions. Subsequently, the instance-wise uncertainties of bounding boxes are estimated to mitigate the cross-modal misalignment and refine the final LiDAR predictions. Extensive experiments on the View-of-Delft (VoD) dataset highlight the effectiveness of our method, achieving state-of-the-art performance with an mAP of 74.89% over the entire area and 88.70% within the driving corridor, while maintaining a real-time inference speed of 30.02 FPS.
Chinese: 本研究提出了一种融合4D雷达运动状态与跨模态不确定性的LiDAR检测框架,有效解决了多模态错位问题,在VoD数据集上实现了最优性能并保持了实时处理能力。
English: This study introduces a LiDAR detection framework enhanced by 4D radar motion data and cross-modal uncertainty to address misalignment issues, achieving state-of-the-art performance on the VoD dataset with real-time processing.
Authors:Lukas Brand, Fardad Vakilipoor, Sören Botsch, Timo Jakumeit, Sebastian Lotter, Robert Schober, Maximilian Schäfer
Abstract:
This paper presents a novel physics-based model for signal propagation in closed-loop molecular communication (MC) systems, which are particularly relevant for many envisioned biomedical applications, such as health monitoring or drug delivery within the closed-loop human cardiovascular system (CVS). Compared to open-loop systems, which are mostly considered in MC, closed-loop systems exhibit different characteristic effects influencing signaling molecule (SM) propagation. One key phenomenon are the periodic SM arrivals at the receiver (RX), leading to various types of inter-symbol interference (ISI) inherent to closed-loop system. To capture these characteristic effects, we propose an analytical model for the SM propagation inside closed-loop systems. The model accounts for arbitrary spatio-temporal SM release patterns at the transmitter (TX), and incorporates several environmental effects such as fluid flow, SM diffusion, and SM degradation. Moreover, to capture a wide range of practically relevant degradation and clearance mechanisms, the model includes both local removal (e.g., due to SM absorption into organs) and global removal (e.g., due to chemical degradation) of SMs. The accuracy of the proposed model is validated with three-dimensional (3-D) particle-based simulations (PBSs). Moreover, we utilize the proposed model to develop a rigorous characterization of the various types of ISI encountered in closed-loop MC systems.
中文摘要:本文提出了一种基于物理学的闭环分子通信系统信号传播分析模型,通过三维仿真验证了其准确性,并对系统固有的符号间干扰进行了严格表征。
English Summary: This paper introduces a physics-based analytical model for signal propagation in closed-loop molecular communication systems, incorporating fluid flow, diffusion, and degradation effects, and validates its accuracy through 3-D simulations while characterizing inter-symbol interference.
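To make the lap-level ISI mechanism concrete, here is a heavily idealized illustration (not the paper's model): a single loop with circulation time T_c and a per-lap survival factor alpha that lumps together the local and global SM removal mechanisms.

```latex
% Idealized closed-loop impulse response: every circulation (lap) k adds a
% delayed, attenuated copy of the open-loop response h_0(t):
h(t) = \sum_{k=0}^{\infty} \alpha^{k}\, h_0\!\bigl(t - k T_c\bigr),
\qquad 0 \le \alpha < 1 .

% For a symbol sequence \{s_m\} with symbol interval T_s, the RX signal
r(t) = \sum_{m} s_m\, h(t - m T_s)
% couples two ISI mechanisms: classical symbol-level ISI (index m) and the
% periodic lap-level ISI unique to closed-loop systems (index k).
```

The double sum is the idealized version of the effect the paper characterizes rigorously: even a single transmitted symbol keeps re-arriving at the RX once per circulation until removal damps it out.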
Authors:Wenyang Luo, Haina Qin, Zewen Chen, Libin Wang, Dandan Zheng, Yuming Li, Yufan Liu, Bing Li, Weiming Hu
Abstract:
Image restoration tasks like deblurring, denoising, and dehazing usually need distinct models for each degradation type, restricting their generalization in real-world scenarios with mixed or unknown degradations. In this work, we propose Defusion, a novel all-in-one image restoration framework that utilizes visual instruction-guided degradation diffusion. Unlike existing methods that rely on task-specific models or ambiguous text-based priors, Defusion constructs explicit visual instructions that align with the visual degradation patterns. These instructions are grounded by applying degradations to standardized visual elements, capturing intrinsic degradation features while remaining agnostic to image semantics. Defusion then uses these visual instructions to guide a diffusion-based model that operates directly in the degradation space, where it reconstructs high-quality images by denoising the degradation effects with enhanced stability and generalizability. Comprehensive experiments demonstrate that Defusion outperforms state-of-the-art methods across diverse image restoration tasks, including complex and real-world degradations.
中文: Defusion是一种一体化图像恢复框架,通过视觉指令引导的退化扩散技术有效处理多种图像退化问题,在真实场景中表现优于现有方法。
English: Defusion is an all-in-one image restoration framework that uses visual instruction-guided degradation diffusion to handle various degradations effectively, outperforming existing methods in real-world scenarios.
Authors:Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu
Abstract:
Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or introduce a severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Model (LLM). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden states based on visual input, ensuring precise vision-language alignment while preserving the LLM's linguistic priors and drastically reducing computational costs. Extensive evaluations across 15 image and video benchmarks demonstrate that LaVi not only achieves state-of-the-art multimodal performance but also dramatically enhances efficiency. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half, establishing LaVi as a scalable and practical solution for real-time multimodal reasoning. The code and models will be released soon.
中文: LaVi提出了一种新型大视觉语言模型,通过内部特征调制实现高效的视觉语言融合,在保持顶尖多模态性能的同时大幅降低了计算成本。
English: LaVi introduces a novel LVLM that achieves efficient vision-language fusion through internal feature modulation, significantly reducing computational costs while maintaining state-of-the-art performance across multimodal benchmarks.
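A minimal PyTorch sketch of the mechanism described (vision-conditioned deltas injected into the affine parameters of layer normalization); the module layout, sizes, and delta projection are assumptions for illustration, not LaVi's implementation.

```python
import torch
import torch.nn as nn

class VisionConditionedLayerNorm(nn.Module):
    """LayerNorm whose scale/shift receive token-wise, vision-conditioned
    deltas, so visual context modulates text hidden states without
    appending visual tokens to the sequence."""

    def __init__(self, d_model, d_vision):
        super().__init__()
        self.ln = nn.LayerNorm(d_model, elementwise_affine=False)
        self.gamma = nn.Parameter(torch.ones(d_model))   # base scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # base shift
        # Predict per-token (delta_gamma, delta_beta) from visual context:
        self.to_deltas = nn.Linear(d_vision, 2 * d_model)

    def forward(self, h, v):
        # h: (batch, seq, d_model) text hidden states
        # v: (batch, seq, d_vision) per-token visual context (assumed given)
        dg, db = self.to_deltas(v).chunk(2, dim=-1)
        return (self.gamma + dg) * self.ln(h) + (self.beta + db)

layer = VisionConditionedLayerNorm(d_model=64, d_vision=32)
h = torch.randn(2, 10, 64)
v = torch.randn(2, 10, 32)
print(layer(h, v).shape)   # torch.Size([2, 10, 64])
```

Note the sequence length never grows: the visual signal arrives through the normalization parameters, which is why this style of fusion sidesteps the long-context cost of token concatenation.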
Authors:Samir Khaki, Xiuyu Li, Junxian Guo, Ligeng Zhu, Chenfeng Xu, Konstantinos N. Plataniotis, Amir Yazdanbakhsh, Kurt Keutzer, Song Han, Zhijian Liu
Abstract:
Fine-tuning LLMs is both computationally and memory-intensive. While parameter-efficient fine-tuning methods, such as QLoRA and DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they may even slow down fine-tuning. In this paper, we introduce SparseLoRA, a method that accelerates LLM fine-tuning through contextual sparsity. We propose a lightweight, training-free SVD sparsity estimator that dynamically selects a sparse subset of weights for loss and gradient computation. Also, we systematically analyze and address sensitivity across layers, tokens, and training steps. Our experimental results show that SparseLoRA reduces computational cost by up to 2.2x and achieves a measured speedup of up to 1.6x while maintaining accuracy across various downstream tasks, including commonsense and arithmetic reasoning, code generation, and instruction following.
中文: SparseLoRA通过无训练的稀疏性估计器动态选择稀疏权重来加速大语言模型微调,在保持多种任务准确性的同时,将计算成本降低高达2.2倍,实现1.6倍的加速效果。
English: SparseLoRA accelerates LLM fine-tuning by using a training-free sparsity estimator to dynamically select sparse weights, reducing computational cost by up to 2.2 times and achieving a 1.6 times speedup while maintaining accuracy across multiple tasks.
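A minimal sketch of the general idea of an SVD-based, training-free sparsity estimator: approximate a weight matrix with a low-rank factorization, score each output channel by its energy in that subspace, and compute only the top channels. The scoring rule, rank, and keep ratio are assumptions for illustration, not SparseLoRA's actual estimator.

```python
import torch

def svd_channel_mask(weight, rank=8, keep=0.5):
    """Score output channels by their energy in a rank-r SVD subspace
    and keep the strongest fraction for the sparse compute path."""
    U, S, _ = torch.linalg.svd(weight, full_matrices=False)
    energy = (U[:, :rank] * S[:rank]).pow(2).sum(dim=1)  # per-row energy
    k = max(1, int(keep * weight.shape[0]))
    idx = energy.topk(k).indices
    mask = torch.zeros(weight.shape[0], dtype=torch.bool)
    mask[idx] = True
    return mask

W = torch.randn(256, 512)
mask = svd_channel_mask(W)
# Compute only the selected rows during the sparse forward/backward pass:
x = torch.randn(4, 512)
y_sparse = x @ W[mask].T
print(mask.sum().item(), y_sparse.shape)   # 128 torch.Size([4, 128])
```

Because the mask is derived directly from the frozen weights, no extra training is needed, which is what "training-free" refers to in this context.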
Authors:Sen Wang, Le Wang, Sanping Zhou, Jingyi Tian, Jiayi Li, Haowen Sun, Wei Tang
Abstract:
Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhancing information exploration in 3D environments. In response, we propose FlowRAM, a novel framework that leverages generative models to achieve region-aware perception, enabling efficient multimodal information processing. Specifically, we devise a Dynamic Radius Schedule, which allows adaptive perception, facilitating transitions from global scene comprehension to fine-grained geometric details. Furthermore, we employ state space models to integrate multimodal information while preserving linear computational complexity. In addition, we adopt conditional flow matching to learn action poses by regressing deterministic vector fields, simplifying the learning process while maintaining performance. We verify the effectiveness of FlowRAM on RLBench, an established manipulation benchmark, and achieve state-of-the-art performance. The results demonstrate that FlowRAM achieves a remarkable improvement, particularly in high-precision tasks, where it outperforms previous methods by 12.0% in average success rate. Additionally, FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in fewer than 4 time steps, significantly increasing inference speed.
中文: FlowRAM提出了一种利用生成模型实现区域感知和动态自适应感知的新框架,在机器人操作中取得了顶尖性能,成功率提高12.0%,推理速度显著提升至4步以内。
English: FlowRAM introduces a novel framework using generative models for region-aware perception and dynamic adaptive sensing, achieving state-of-the-art performance in robotic manipulation with a 12.0% higher success rate and significantly faster inference in under 4 steps.
Authors:Guanhua Chen, Yutong Yao, Lidia S. Chao, Xuebo Liu, Derek F. Wong
Abstract:
Recent research in retrieval-augmented generation (RAG) has concentrated on retrieving useful information from candidate documents. However, many methods neglect the calibration capabilities of large language models (LLMs), which build on their strong in-context reasoning. This work illustrates that providing LLMs with specific cues substantially improves their calibration efficacy, especially in multi-round calibration. We present SGIC, a Self-Guided Iterative Calibration framework that employs uncertainty scores as a tool. Initially, the framework calculates uncertainty scores to determine both the relevance of each document to the query and the confidence level in the responses produced by the LLMs. Subsequently, it reevaluates these scores iteratively, combining them with prior responses to refine calibration. Furthermore, we introduce a novel approach for constructing an iterative self-calibration training set, which optimizes LLMs to efficiently harness uncertainty scores for capturing critical information and enhancing response accuracy. Our proposed framework significantly improves performance on both closed-source and open-weight LLMs.
中文: 本研究提出了一种自引导迭代校准框架,通过利用不确定性分数迭代优化文档相关性和回答置信度,显著提升了各类大语言模型的校准效果和性能。
English: This study introduces a Self-Guided Iterative Calibration Framework that leverages uncertainty scores to enhance large language models' calibration by iteratively refining document relevance and response confidence, significantly boosting performance across various models.
Authors:Yizhe Li, Sanping Zhou, Zheng Qin, Le Wang
Abstract:
Dense video captioning is a challenging task that aims to localize and caption multiple events in an untrimmed video. Recent studies mainly follow transformer-based architectures to jointly perform the two sub-tasks, i.e., event localization and caption generation, in an end-to-end manner. Based on the general philosophy of the detection transformer, these methods implicitly learn event locations and event semantics, which requires a large amount of training data and limits performance in practice. In this paper, we propose a novel dense video captioning framework, named PR-DETR, which injects explicit position and relation priors into the detection transformer to simultaneously improve localization accuracy and caption quality. On the one hand, we first generate a set of position-anchored queries to provide scene-specific position and semantic information about potential events as a position prior, which serves as the initial event search regions to eliminate implausible event proposals. On the other hand, we further design an event relation encoder to explicitly calculate the relationships between event boundaries as a relation prior to guide event interaction and improve the semantic coherence of the captions. Extensive ablation studies verify the effectiveness of the position and relation priors. Experimental results also show the competitive performance of our method on the ActivityNet Captions and YouCook2 datasets.
Chinese: 本文提出PR-DETR框架,通过将明确的位置和关系先验知识引入检测变换器,有效提升了密集视频描述中的事件定位精度和字幕质量,在标准数据集上表现出优越性能。
English: The paper introduces PR-DETR, a dense video captioning framework that enhances event localization and caption quality by incorporating explicit position and relation priors into a detection transformer, achieving competitive results on benchmark datasets.
Authors:Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, Song Guo
Abstract:
Large Language Models (LLMs) exhibit enhanced reasoning capabilities by employing Chain-of-Thought (CoT). However, the extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache size, particularly in tasks requiring long reasoning sequences, such as mathematics and programming. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a Token Importance Recurrence phenomenon: a large proportion of tokens receive renewed attention after multiple decoding steps, which existing works fail to capture and which may lead to the unpredictable eviction of such periodically critical tokens. To address this, we propose LazyEviction, a lagged KV eviction framework designed to maintain reasoning performance while reducing KV memory. LazyEviction is an observation-window-based lagged eviction mechanism that retains latent recurring tokens by performing lagged evictions across decoding steps. It contains two key components: (1) Recurrence Interval Tracking, which captures temporal variations in token importance, and (2) a Maximum Recurrence Interval-Centric Eviction Policy that prioritizes eviction based on tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces KV cache size by 50% while maintaining comparable accuracy on mathematics reasoning datasets, outperforming state-of-the-art methods. Our findings highlight the importance of preserving recurring tokens, which are critical for maintaining knowledge continuity in multi-step reasoning tasks.
Chinese: 本文提出LazyEviction延迟淘汰框架,通过追踪令牌重要性复现规律并保留周期性关键令牌,在数学推理任务中将KV缓存压缩50%的同时保持推理精度,优于现有方法。
English: This paper introduces LazyEviction, a lagged KV eviction framework that reduces GPU memory usage by 50% while preserving reasoning accuracy in long-sequence tasks by identifying and retaining periodically important tokens through recurrence interval tracking.
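A toy sketch of the lagged-eviction idea described above, assuming we track when each cached token last received high attention and its largest observed recurrence interval; the names and the ranking rule are illustrative, not the paper's exact policy.

```python
# Illustrative lagged KV eviction: evict a token only after its observed
# recurrence interval (plus a lag window) has safely elapsed.
def lazy_evict(step, last_attended, max_interval, window, budget, cache):
    """Return the surviving KV entries after a lagged eviction pass."""
    if len(cache) <= budget:
        return cache
    survivors = {}
    for tok, kv in cache.items():
        idle = step - last_attended[tok]
        # Keep tokens still within their recurrence interval plus a lag window.
        if idle <= max_interval[tok] + window:
            survivors[tok] = kv
    # If still over budget, keep tokens most likely to recur soon, i.e. those
    # with the smallest idle-time-to-interval ratio.
    if len(survivors) > budget:
        ranked = sorted(
            survivors,
            key=lambda t: (step - last_attended[t]) / (max_interval[t] + 1),
        )
        survivors = {t: survivors[t] for t in ranked[:budget]}
    return survivors
```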
Authors:Xinglin Wang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Abstract:
Test-Time Scaling (TTS) improves the performance of Large Language Models (LLMs) by using additional inference-time computation to explore multiple reasoning paths through search. Yet how to allocate a fixed rollout budget most effectively during search remains underexplored, often resulting in inefficient use of compute at test time. To bridge this gap, we formulate test-time search as a resource allocation problem and derive the optimal allocation strategy that maximizes the probability of obtaining a correct solution under a fixed rollout budget. Within this formulation, we reveal a core limitation of existing search methods: solution-level allocation tends to favor reasoning directions with more candidates, leading to theoretically suboptimal and inefficient use of compute. To address this, we propose Direction-Oriented Resource Allocation (DORA), a provably optimal method that mitigates this bias by decoupling direction quality from candidate count and allocating resources at the direction level. To demonstrate DORA's effectiveness, we conduct extensive experiments on challenging mathematical reasoning benchmarks including MATH500, AIME2024, and AIME2025. The empirical results show that DORA consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art accuracy. We hope our findings contribute to a broader understanding of optimal TTS for LLMs.
中文摘要:测试时缩放(TTS)通过优化推理计算资源分配提升大语言模型性能,而提出的方向导向资源分配(DORA)方法在数学推理基准测试中实现了最优精度,同时更高效地利用了计算资源。
English Summary: Test-Time Scaling (TTS) enhances LLMs by allocating inference-time computation efficiently, and the proposed Direction-Oriented Resource Allocation (DORA) method optimizes this process to achieve state-of-the-art accuracy on challenging benchmarks while using computational resources more effectively.
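A schematic of direction-level budget allocation as the abstract describes it: rollouts are split across reasoning directions by estimated direction quality, independent of how many candidates each direction holds. The proportional rule below is an illustrative stand-in for the paper's derived optimum.

```python
# Direction-level rollout allocation sketch (proportional rule, illustrative).
def allocate_rollouts(direction_quality, budget):
    """Split a fixed rollout budget across reasoning directions in proportion
    to direction quality, ignoring per-direction candidate counts."""
    total = sum(direction_quality)
    raw = [budget * q / total for q in direction_quality]
    alloc = [int(r) for r in raw]
    # Hand out leftover rollouts to the largest fractional remainders.
    leftover = budget - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

print(allocate_rollouts([0.6, 0.3, 0.1], budget=16))  # -> [10, 5, 1]
```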
Authors:Boah Kim, Tejas Sudharshan Mathai, Kimberly Helm, Peter A. Pinto, Ronald M. Summers
Abstract:
Multi-parametric magnetic resonance imaging (mpMRI) exams have various series types acquired with different imaging protocols. The DICOM headers of these series often have incorrect information due to the sheer diversity of protocols and occasional technologist errors. To address this, we present a deep learning-based classification model that classifies 8 different body mpMRI series types so that radiologists can read exams efficiently. Using mpMRI data from various institutions, multiple deep learning-based classifiers (ResNet, EfficientNet, and DenseNet) are trained to classify the 8 MRI series types, and their performance is compared. The best-performing classifier is then identified, and its classification capability under different training data quantities is studied. The model is also evaluated on out-of-training-distribution datasets. Moreover, the model is trained using mpMRI exams obtained from different scanners under two training strategies, and its performance is tested. Experimental results show that the DenseNet-121 model achieves the highest F1-score and accuracy of 0.966 and 0.972 over the other classification models with p-value$<$0.05. The model shows greater than 0.95 accuracy when trained with over 729 studies, and its performance improves as the quantity of training data grows. On the external DLDS and CPTAC-UCEC datasets, the model yields 0.872 and 0.810 accuracy, respectively. These results indicate that on both the internal and external datasets, the DenseNet-121 model attains high accuracy for the task of classifying 8 body MRI series types.
中文: 本研究开发了一种基于DenseNet-121的深度学习模型,能够准确分类八种不同的身体多参数磁共振序列类型,在内部和外部数据集上均表现出高准确率,且性能随训练数据量增加而提升。
English: This study developed a deep learning model using DenseNet-121 to accurately classify eight different body mpMRI series types, achieving high accuracy on both internal and external datasets while demonstrating improved performance with increased training data.
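A minimal fine-tuning sketch for an 8-way series-type classifier built on torchvision's DenseNet-121; data loading and hyperparameters are placeholders, not the paper's protocol.

```python
# Fine-tune DenseNet-121 for 8-way MRI series-type classification (sketch).
import torch
import torch.nn as nn
from torchvision import models

num_series_types = 8
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
# Replace the 1000-way ImageNet head with an 8-way classifier.
model.classifier = nn.Linear(model.classifier.in_features, num_series_types)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    # `images` should be (B, 3, H, W); grayscale MRI slices can be
    # replicated across the three channels.
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```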
Authors:Renjith Prasad, Abhilekh Borah, Hasnat Md Abdullah, Chathurangi Shyalika, Gurpreet Singh, Ritvik Garimella, Rajarshi Roy, Harshul Surana, Nasrin Imanpour, Suranjana Trivedy, Amit Sheth, Amitava Das
Abstract:
Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness. Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems. This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and Rényi divergences for enhanced stability and robustness. We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected. DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability. Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney. Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities. Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). DETONATE and complete code are publicly released.
中文: 本文提出DPO-Kernels创新框架,通过混合损失函数、核化表示和扩展散度选择来增强文本到图像模型的对齐能力,同时建立DETONATE基准测试和Alignment Quality Index指标来评估社会偏见漏洞。
English: This paper introduces DPO-Kernels, a novel extension for text-to-image models that enhances alignment through hybrid loss functions, kernelized representations, and expanded divergence selection, while proposing the DETONATE benchmark and Alignment Quality Index to evaluate social bias vulnerabilities.
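For reference, the RBF and polynomial kernels the abstract names can be written in a few lines; this is generic kernel code, not the authors' DPO-Kernels implementation (the wavelet kernel is omitted since its exact form is not given here).

```python
# Generic RBF and polynomial kernels (illustrative).
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    # k(x, y) = exp(-gamma * ||x - y||^2)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def poly_kernel(x: np.ndarray, y: np.ndarray,
                degree: int = 3, c: float = 1.0) -> float:
    # k(x, y) = (x . y + c)^d
    return float((np.dot(x, y) + c) ** degree)

a, b = np.random.randn(16), np.random.randn(16)
print(rbf_kernel(a, b), poly_kernel(a, b))
```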
Authors:Hyeon Jeon, Hyunwook Lee, Yun-Hsin Kuo, Taehyun Yang, Daniel Archambault, Sungahn Ko, Takanori Fujiwara, Kwan-Liu Ma, Jinwook Seo
Abstract:
Visual analytics using dimensionality reduction (DR) can easily be unreliable for various reasons, e.g., inherent distortions in representing the original data. The literature has thus proposed a wide range of methodologies to make DR-based visual analytics reliable. However, the diversity and extensiveness of the literature can leave novice analysts and researchers uncertain about where to begin and how to proceed. To address this problem, we propose a guide for reading papers for reliable visual analytics with DR. Relying on a previous classification of the relevant literature, our guide helps practitioners both (1) assess their current DR expertise and (2) identify papers that will further enhance their understanding. Interview studies with three experts in DR and data visualization validate the significance, comprehensiveness, and usefulness of our guide.
中文: 该摘要提出了一份指南,帮助从业者评估其降维专业知识并确定相关文献以实现可靠的可视化分析,其有效性已通过专家访谈得到验证。
English: The abstract proposes a guide to help practitioners assess their dimensionality reduction expertise and identify relevant literature for reliable visual analytics, validated through expert interviews.
Authors:Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
Abstract:
The integration of Large Language Models (LLMs) with computer vision is profoundly transforming perception tasks like image segmentation. For intelligent transportation systems (ITS), where accurate scene understanding is critical for safety and efficiency, this new paradigm offers unprecedented capabilities. This survey systematically reviews the emerging field of LLM-augmented image segmentation, focusing on its applications, challenges, and future directions within ITS. We provide a taxonomy of current approaches based on their prompting mechanisms and core architectures, and we highlight how these innovations can enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance. Finally, we identify key challenges, including real-time performance and safety-critical reliability, and outline a perspective centered on explainable, human-centric AI as a prerequisite for the successful deployment of this technology in next-generation transportation systems.
English Summary: The integration of Large Language Models with computer vision is revolutionizing image segmentation for intelligent transportation systems, offering enhanced scene understanding for autonomous driving and traffic monitoring while facing challenges in real-time performance and safety-critical reliability.
Authors:Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Seraphina Zhang, Tianfu Wang, Nicholas Jing Yuan, Xing Xie, Hui Xiong
Abstract:
Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored. In this work, we address this gap by introducing a framework inspired by cognitive psychology and education. Specifically, we decompose general learning ability into three distinct, complementary dimensions: Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs' general learning abilities across three learning cognition dimensions. It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models.
中文摘要:本研究提出一个受认知心理学启发的框架,将大语言模型的学习能力分解为从教师指导、概念理解和经验积累三个维度,通过实证研究发现互动促进学习、概念理解随模型规模涌现等关键规律,并建立统一基准以评估和开发更具适应性的模型。
English Summary: This study introduces a cognitive psychology-inspired framework to evaluate large language models' learning abilities across three dimensions—Learning from Instructor, Concept, and Experience—revealing key insights like interaction-driven improvement and scale-emergent conceptual understanding, leading to a unified benchmark for diagnostic evaluation and development of adaptive models.
Authors:Zhelun Shen, Chenming Wu, Junsheng Zhou, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Yingying Li, Haocheng Feng, Wei He, Jingdong Wang
Abstract:
Digital human video generation is gaining traction in fields like education and e-commerce, driven by advancements in head-body animation and lip-syncing technologies. However, realistic Hand-Object Interaction (HOI), the complex dynamics between human hands and objects, continues to pose challenges. Generating natural and believable HOI reenactments is difficult due to issues such as occlusion between hands and objects, variations in object shapes and orientations, the necessity for precise physical interactions, and, importantly, the need to generalize to unseen humans and objects. This paper presents a novel framework, iDiT-HOI, that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token process method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model. The first stage generates a key frame by inserting the designated object into the hand region, providing a reference for subsequent frames. The second stage ensures temporal coherence and fluidity in hand-object interactions. The key contribution of our method is to reuse the pretrained model's context perception capabilities without introducing additional parameters, enabling strong generalization to unseen objects and scenarios, and our proposed paradigm naturally supports long video generation. Comprehensive evaluations demonstrate that our approach outperforms existing methods, particularly in challenging real-world scenes, offering enhanced realism and more seamless hand-object interactions.
中文: 本文提出iDiT-HOI框架,通过两阶段视频扩散变换器和统一的修复式标记处理方法,无需额外参数即可生成逼真的手物交互视频,在真实场景中展现出优越性能。
English: This paper introduces iDiT-HOI, a novel framework using a two-stage video diffusion transformer and unified inpainting-based token process to generate realistic hand-object interactions in videos, demonstrating superior performance in real-world scenarios without additional parameters.
Authors:Mufan Liu, Cixiao Zhang, Qi Yang, Yujie Cao, Yiling Xu, Yin Xu, Shu Sun, Mingzeng Dai, Yunfeng Guan
Abstract:
Modeling the wireless radiance field (WRF) is fundamental to modern communication systems, enabling key tasks such as localization, sensing, and channel estimation. Traditional approaches, which rely on empirical formulas or physical simulations, often suffer from limited accuracy or require strong scene priors. Recent neural radiance field (NeRF)-based methods improve reconstruction fidelity through differentiable volumetric rendering, but their reliance on computationally expensive multilayer perceptron (MLP) queries hinders real-time deployment. To overcome these challenges, we introduce Gaussian splatting (GS) to the wireless domain, leveraging its efficiency in modeling optical radiance fields to enable compact and accurate WRF reconstruction. Specifically, we propose SwiftWRF, a deformable 2D Gaussian splatting framework that synthesizes WRF spectra at arbitrary positions under single-sided transceiver mobility. SwiftWRF employs CUDA-accelerated rasterization to render spectra at over 100,000 fps and uses a lightweight MLP to model the deformation of 2D Gaussians, effectively capturing mobility-induced WRF variations. Beyond novel spectrum synthesis, the efficacy of SwiftWRF is further underscored by its applications in angle-of-arrival (AoA) and received signal strength indicator (RSSI) prediction. Experiments conducted on both real-world and synthetic indoor scenes demonstrate that SwiftWRF can reconstruct WRF spectra up to 500x faster than existing state-of-the-art methods, while significantly enhancing signal quality. The project page is https://evan-sudo.github.io/swiftwrf/.
中文摘要:SwiftWRF提出了一种可变形二维高斯泼溅框架,通过CUDA加速渲染和轻量级神经网络建模,在实现比现有最优方法快500倍的无线辐射场重建速度的同时,显著提升了信号质量。
English Summary: SwiftWRF introduces a deformable 2D Gaussian splatting framework that achieves 500x faster wireless radiance field reconstruction than state-of-the-art methods while significantly improving signal quality through CUDA-accelerated rendering and lightweight neural modeling.
Authors:Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, Wenwu Zhu
Abstract:
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. Existing VLA acceleration methods primarily focus on structural optimization, overlooking the fact that these models operate in sequential decision-making environments. As a result, temporal redundancy in sequential action generation and spatial redundancy in visual input remain unaddressed. To this end, we propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens. Specifically, we design an action-aware model scheduling mechanism that reduces temporal redundancy by dynamically switching between the VLA model and a lightweight generator. Inspired by the human motion pattern of focusing on key decision points while relying on intuition for other actions, we categorize VLA actions into deliberative and intuitive, assigning the former to the VLA model and the latter to the lightweight generator, enabling frequency-adaptive execution through collaborative model scheduling. To address spatial redundancy, we further develop a spatio-semantic dual-aware token pruning method. Tokens are classified into spatial and semantic types and pruned based on their dual-aware importance to accelerate VLA inference. These two mechanisms work jointly to guide the VLA in focusing on critical actions and salient visual information, achieving effective acceleration while maintaining high accuracy. Experimental results demonstrate that our method achieves up to 1.5$\times$ acceleration with less than 3% drop in accuracy, outperforming existing approaches in multiple tasks.
中文摘要:SP-VLA框架通过动作感知的模型调度机制动态切换VLA模型与轻量生成器,并结合空间语义双感知的令牌剪枝方法,在保持高精度的同时实现了视觉-语言-动作模型的有效加速。
English Summary: The proposed SP-VLA framework accelerates Vision-Language-Action models by dynamically switching between the VLA model and a lightweight generator for different action types while pruning redundant visual tokens, achieving up to 1.5$\times$ acceleration with less than a 3% accuracy drop.
Authors:Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, Wenwu Zhu
Abstract:
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. Existing VLA acceleration methods primarily focus on structural optimization, overlooking the fact that these models operate in sequential decision-making environments. As a result, temporal redundancy in sequential action generation and spatial redundancy in visual input remain unaddressed. To this end, we propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens. Specifically, we design an action-aware model scheduling mechanism that reduces temporal redundancy by dynamically switching between the VLA model and a lightweight generator. Inspired by the human motion pattern of focusing on key decision points while relying on intuition for other actions, we categorize VLA actions into deliberative and intuitive, assigning the former to the VLA model and the latter to the lightweight generator, enabling frequency-adaptive execution through collaborative model scheduling. To address spatial redundancy, we further develop a spatio-semantic dual-aware token pruning method. Tokens are classified into spatial and semantic types and pruned based on their dual-aware importance to accelerate VLA inference. These two mechanisms work jointly to guide the VLA in focusing on critical actions and salient visual information, achieving effective acceleration while maintaining high accuracy. Extensive experiments show that our method achieves 1.5$\times$ lossless acceleration in LIBERO and 2.4$\times$ in SimplerEnv, with up to 6% average performance gain. Inference frequency and latency improve by 2.2$\times$ in SimplerEnv and 1.4$\times$ in LIBERO.
中文摘要:SP-VLA框架通过动作感知的模型调度机制动态切换VLA模型与轻量生成器,并结合空间语义双感知的令牌剪枝方法,在保持高精度的同时实现了视觉-语言-动作模型的有效加速。
English Summary: The proposed SP-VLA framework accelerates Vision-Language-Action models by dynamically switching between the VLA model and a lightweight generator for different action types while pruning redundant visual tokens, achieving 1.5$\times$ lossless acceleration in LIBERO and 2.4$\times$ in SimplerEnv with up to 6% average performance gain.
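A skeleton of the action-aware scheduling idea, assuming a scorer that flags deliberative steps; the `vla_model`, `light_generator`, and `is_deliberative` callables are hypothetical stand-ins for the paper's components, and token pruning is treated as internal to the VLA forward pass.

```python
# Action-aware model scheduling skeleton (illustrative stand-ins only).
def run_episode(obs_stream, vla_model, light_generator, is_deliberative):
    actions = []
    for obs in obs_stream:
        if is_deliberative(obs, actions):
            # Key decision point: pay for the full VLA forward pass
            # (token pruning would happen inside this call).
            action = vla_model(obs)
        else:
            # Routine step: extrapolate cheaply from recent actions.
            action = light_generator(obs, actions)
        actions.append(action)
    return actions
```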
Authors:Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li
Abstract:
Pioneered by the foundational information theory of Claude Shannon and the visionary framework of machine intelligence of Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, the device-edge-cloud framework serves as the foundation, integrating end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
中文: 信息与通信技术的融合推动了大型人工智能模型的兴起,但面临资源消耗和带宽的挑战,AI Flow框架通过设备-边缘-云架构、家族模型和基于连接性的智能涌现来解决这些问题,以提升AI服务的智能性、响应速度和普及性。
English: The convergence of information and communication technologies has driven the rise of large AI models, but faces challenges in resource consumption and bandwidth, which the AI Flow framework addresses through a device-edge-cloud structure, familial models, and connectivity-based emergent intelligence to enhance AI services.
Authors:Dongjie Yang, Chengqiang Lu, Qimeng Wang, Xinbei Ma, Yan Gao, Yao Hu, Hai Zhao
Abstract:
Travel planning is a complex task requiring the integration of diverse real-world information and user preferences. While LLMs show promise, existing methods with long-horizon thinking struggle with handling multifaceted constraints and preferences in the context, leading to suboptimal itineraries. We formulate this as an $L^3$ planning problem, emphasizing long context, long instruction, and long output. To tackle this, we introduce Multiple Aspects of Planning (MAoP), enabling LLMs to conduct wide-horizon thinking to solve complex planning problems. Instead of direct planning, MAoP leverages the strategist to conduct pre-planning from various aspects and provide the planning blueprint for planning models, enabling strong inference-time scalability for better performance. In addition, current benchmarks overlook travel's dynamic nature, where past events impact subsequent journeys, failing to reflect real-world feasibility. To address this, we propose Travel-Sim, an agent-based benchmark assessing plans via real-world travel simulation. This work advances LLM capabilities in complex planning and offers novel insights for evaluating sophisticated scenarios through agent-based simulation.
中文摘要:本研究提出多角度规划方法(MAoP),通过广域思维提升大语言模型在复杂旅行规划中的能力,并开发基于智能体的Travel-Sim基准测试系统以实现真实场景评估。
English Summary: The study introduces Multiple Aspects of Planning (MAoP) to enhance LLMs' ability in complex travel planning by enabling wide-horizon thinking and proposes Travel-Sim, an agent-based benchmark for realistic evaluation.
Authors:Efthymia Amarantidou, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Abstract:
The advent of accessible Generative AI tools enables anyone to create and spread synthetic images on social media, often with the intention to mislead, thus posing a significant threat to online information integrity. Most existing Synthetic Image Detection (SID) solutions struggle on generated images sourced from the Internet, as these are often altered by compression and other operations. To address this, our research enhances SID by exploring data augmentation combinations, leveraging a genetic algorithm for optimal augmentation selection, and introducing a dual-criteria optimization approach. These methods significantly improve model performance under real-world perturbations. Our findings provide valuable insights for developing detection models capable of identifying synthetic images across varying qualities and transformations, with the best-performing model achieving a mean average precision increase of +22.53% compared to models without augmentations. The implementation is available at github.com/efthimia145/sid-composite-data-augmentation.
中文: 生成式AI工具可在社交媒体上制作误导性合成图像,本研究通过数据增强和遗传算法优化合成图像检测方法,显著提升了模型对现实图像扰动的识别能力。
English: Generative AI tools enable the creation of misleading synthetic images on social media, and this research improves Synthetic Image Detection by using data augmentation and a genetic algorithm, significantly enhancing model performance against real-world image alterations.
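A compact genetic-algorithm loop of the kind the abstract suggests for selecting augmentation combinations; the bit-mask encoding, the operators, and the placeholder fitness function are assumptions, since the paper's exact setup is not given here.

```python
# Genetic algorithm over augmentation subsets (illustrative sketch).
import random

AUGS = ["jpeg", "blur", "noise", "resize", "crop", "sharpen"]

def fitness(mask):            # placeholder: train/evaluate a detector with
    return random.random()    # the selected augs, return validation score

def evolve(pop_size=20, generations=30, mut_rate=0.1):
    # Each individual is a bit mask over the augmentation pool.
    pop = [[random.randint(0, 1) for _ in AUGS] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]       # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(AUGS))            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < mut_rate) for g in child]  # mutation
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return [aug for aug, g in zip(AUGS, best) if g]

print(evolve())
```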
Authors:Xia Du, Xiaoyuan Liu, Jizhe Zhou, Zheng Lin, Chi-man Pun, Cong Wu, Tao Li, Zhe Chen, Wei Ni, Jun Luo
Abstract:
Traditional CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on the original image characteristics, resulting in distortions that hinder human interpretation and limit their applicability in scenarios where no initial input images are available. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (DAC), a novel framework that generates high-fidelity adversarial examples guided by attacker-specified semantic information. Leveraging a Large Language Model (LLM), DAC enhances CAPTCHA diversity and enriches the semantic information. To address various application scenarios, we examine the white-box targeted attack scenario and the black-box untargeted attack scenario. For targeted attacks, we introduce two latent noise variables that are alternately guided in the diffusion step to achieve robust inversion. The synergy between gradient guidance and latent variable optimization achieved in this way ensures that the generated adversarial examples not only accurately align with the target conditions but also achieve optimal performance in terms of distributional consistency and attack effectiveness. For untargeted attacks, especially in black-box scenarios, we introduce the bi-path unsourced adversarial CAPTCHA (BP-DAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show that the defensive adversarial CAPTCHAs generated by BP-DAC are able to defend against most unknown models, and the generated CAPTCHAs are indistinguishable to both humans and DNNs.
中文摘要:提出的无源对抗验证码(DAC)框架通过大语言模型的语义引导生成高保真对抗样本,采用双路径优化策略增强验证码对自动化攻击的防御能力,在黑白盒攻击场景下均能有效保护系统安全。
English Summary: The proposed Unsourced Adversarial CAPTCHA (DAC) framework generates high-fidelity adversarial examples using semantic guidance from Large Language Models to enhance CAPTCHA security against automated attacks, with specialized strategies for both white-box targeted and black-box untargeted attack scenarios.
Authors:Fengran Mo, Chuan Meng, Mohammad Aliannejadi, Jian-Yun Nie
Abstract:
Conversational search enables multi-turn interactions between users and systems to fulfill users' complex information needs. During this interaction, the system should understand the users' search intent within the conversational context and then return the relevant information through a flexible, dialogue-based interface. Recent powerful large language models (LLMs), with their capabilities in instruction following, content generation, and reasoning, have attracted significant attention and driven advancements, providing new opportunities and challenges for building intelligent conversational search systems. This tutorial aims to introduce the connection between fundamentals and the emerging topics revolutionized by LLMs in the context of conversational search. It is designed for students, researchers, and practitioners from both academia and industry. Participants will gain a comprehensive understanding of both the core principles and cutting-edge developments driven by LLMs in conversational search, equipping them with the knowledge needed to contribute to the development of next-generation conversational search systems.
中文: 本教程通过融合基础理论与大语言模型催生的前沿进展,探讨对话式搜索的变革,帮助参与者掌握构建下一代系统的核心知识。
English: This tutorial explores how large language models are transforming conversational search by connecting fundamental principles with emerging advancements, equipping participants to develop next-generation systems.
Authors:Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie
Abstract:
Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.
Chinese: Pisces模型通过解耦的视觉编码架构和针对性训练技术,解决了统一多模态模型在图像理解与生成任务中的性能差距,在多项基准测试中均展现出竞争力。
English: The Pisces model introduces a decoupled visual encoding architecture and specialized training techniques to bridge the performance gap between unified multimodal models and specialized ones, achieving competitive results in both image understanding and generation across multiple benchmarks.
Authors:Fuhan Cai, Yong Guo, Jie Li, Wenbo Li, Xiangzhong Fang, Jian Chen
Abstract:
Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance degradation and incur substantial training costs. To address these limitations, we propose FastFLUX, an architecture-level pruning framework designed to enhance the inference efficiency of FLUX. At its core is the Block-wise Replacement with Linear Layers (BRLL) method, which replaces structurally complex residual branches in ResBlocks with lightweight linear layers while preserving the original shortcut connections for stability. Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning strategy that leverages LoRA to supervise neighboring blocks, mitigating performance drops caused by structural replacement. Experiments show that our FastFLUX maintains high image quality under both qualitative and quantitative evaluations, while significantly improving inference speed, even with 20\% of the hierarchy pruned. Our code will be available soon.
中文:FastFLUX是一种架构级剪枝框架,通过用线性层替换复杂残差分支并结合局部微调策略,在保持高质量图像生成的同时显著提升了FLUX模型的推理效率。
English: FastFLUX is an architecture-level pruning framework that enhances FLUX's inference efficiency by replacing complex residual branches with linear layers and using localized fine-tuning, maintaining high image quality while significantly boosting speed.
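A minimal sketch of block-wise replacement with linear layers: the heavy residual branch is swapped for a single `nn.Linear` while the shortcut is kept for stability. The module below is a generic stand-in, not FLUX's actual ResBlock.

```python
# Block-wise Replacement with Linear Layers (BRLL) sketch.
import torch
import torch.nn as nn

class PrunedResBlock(nn.Module):
    def __init__(self, heavy_branch: nn.Module, dim: int):
        super().__init__()
        # Replace the structurally complex branch with one linear layer.
        self.branch = nn.Linear(dim, dim)
        # Keep the original branch around only as a distillation target
        # during localized fine-tuning (cf. the paper's Sandwich Training).
        self.teacher = heavy_branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.branch(x)  # original shortcut preserved for stability

heavy = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = PrunedResBlock(heavy, dim=64)
print(block(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```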
Authors:Peter Vieting, Maximilian Kannen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney
Abstract:
Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short time Fourier transform (STFT)-domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features.
中文: 针对ASR系统中可训练神经前端易过拟合的问题,本研究通过音频扰动和STFT域掩码两种正则化方法,有效缩小了其与传统特征提取方法的性能差距。
English: Neural front-ends for ASR systems, though trainable, often underperform due to overfitting, but this work demonstrates that combining audio perturbations and STFT-domain masking effectively bridges the performance gap with traditional methods.
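A sketch of SpecAugment-style masking applied in the STFT domain using `torch.stft`/`torch.istft`; the mask sizes and hyperparameters are illustrative, not the paper's settings.

```python
# STFT-domain masking before a learnable front-end (illustrative).
import torch

def stft_mask(wave: torch.Tensor, n_fft: int = 512, max_bins: int = 27,
              max_frames: int = 100) -> torch.Tensor:
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft, window=window, return_complex=True)
    freq_bins, frames = spec.shape[-2], spec.shape[-1]
    # Zero out one random frequency band and one random time block.
    f0 = torch.randint(0, freq_bins - max_bins, (1,)).item()
    t0 = torch.randint(0, max(1, frames - max_frames), (1,)).item()
    spec[..., f0 : f0 + torch.randint(1, max_bins, (1,)).item(), :] = 0
    spec[..., :, t0 : t0 + torch.randint(1, max_frames, (1,)).item()] = 0
    # Back to the waveform, so the learnable front-end sees masked audio.
    return torch.istft(spec, n_fft, window=window, length=wave.shape[-1])

masked = stft_mask(torch.randn(16000))  # one second of audio at 16 kHz
```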
Authors:Pengfei Wang, Ziyang Zhang, Wensong Wang, Shuangmin Chen, Lin Lu, Shiqing Xin, Changhe Tu
Abstract:
Extracting high-fidelity mesh surfaces from Signed Distance Fields (SDFs) has become a fundamental operation in geometry processing. Despite significant progress over the past decades, key challenges remain, namely, how to automatically capture the intricate geometric and topological structures encoded in the zero level set of SDFs. In this paper, we present a novel isosurface extraction algorithm that introduces two key innovations: 1. An incrementally constructed power diagram through the addition of sample points, which enables repeated updates to the extracted surface via its dual regular Delaunay tetrahedralization; and 2. An adaptive point insertion strategy that identifies regions exhibiting the greatest discrepancy between the current mesh and the underlying continuous surface. As the teaser figure shows, our framework progressively refines the extracted mesh with minimal computational cost until it sufficiently approximates the underlying surface. Experimental results demonstrate that our approach outperforms state-of-the-art methods, particularly for models with intricate geometric variations and complex topologies.
Chinese: 本文提出了一种新颖的等值面提取算法,通过自适应点插入和功率图构建,逐步优化从符号距离场中提取的网格,在捕捉复杂几何和拓扑结构方面优于现有方法。
English: This paper introduces a novel isosurface extraction algorithm that progressively refines meshes from Signed Distance Fields using adaptive point insertion and power diagram construction, outperforming existing methods in capturing complex geometries and topologies.
Authors:Ye Niu, Sanping Zhou, Yizhe Li, Ye Den, Le Wang
Abstract:
In many complex scenarios, robotic manipulation relies on generative models to estimate the distribution of multiple successful actions. As the diffusion model has better training robustness than other generative models, it performs well in imitation learning through successful robot demonstrations. However, diffusion-based policy methods typically require significant time to iteratively denoise robot actions, which hinders real-time responses in robotic manipulation. Moreover, existing diffusion policies model a time-varying action denoising process, whose temporal complexity increases the difficulty of model training and leads to suboptimal action accuracy. To generate robot actions efficiently and accurately, we present the Time-Unified Diffusion Policy (TUDP), which utilizes action recognition capabilities to build a time-unified denoising process. On the one hand, we build a time-unified velocity field in action space with additional action discrimination information. By unifying all timesteps of action denoising, our velocity field reduces the difficulty of policy learning and speeds up action generation. On the other hand, we propose an action-wise training method, which introduces an action discrimination branch to supply additional action discrimination information. Through action-wise training, TUDP implicitly learns to discern successful actions, improving denoising accuracy. Our method achieves state-of-the-art performance on RLBench with the highest success rate of 82.6% on a multi-view setup and 83.8% on a single-view setup. In particular, when using fewer denoising iterations, TUDP achieves a more significant improvement in success rate. Additionally, TUDP can produce accurate actions for a wide range of real-world tasks.
中文: 时间统一扩散策略(TUDP)通过引入时间统一的去噪过程和动作判别训练,提升了机器人操作的效率与准确性,在减少迭代的同时实现了最优性能。
English: The Time-Unified Diffusion Policy (TUDP) enhances robotic manipulation by introducing a time-unified denoising process and action-wise training, achieving state-of-the-art performance with faster and more accurate action generation.
Authors:Jie Ren, Yue Xing, Yingqian Cui, Charu C. Aggarwal, Hui Liu
Abstract:
Large language model (LLM) unlearning has become a critical topic in machine learning, aiming to eliminate the influence of specific training data or knowledge without retraining the model from scratch. A variety of techniques have been proposed, including Gradient Ascent, model editing, and re-steering hidden representations. While existing surveys often organize these methods by their technical characteristics, such classifications tend to overlook a more fundamental dimension: the underlying intention of unlearning, namely whether it seeks to truly remove internal knowledge or merely suppress its behavioral effects. In this SoK paper, we propose a new taxonomy based on this intention-oriented perspective. Building on this taxonomy, we make three key contributions. First, we revisit recent findings suggesting that many removal methods may functionally behave like suppression, and explore whether true removal is necessary or achievable. Second, we survey existing evaluation strategies, identify limitations in current metrics and benchmarks, and suggest directions for developing more reliable and intention-aligned evaluations. Third, we highlight practical challenges, such as scalability and support for sequential unlearning, that currently hinder the broader deployment of unlearning methods. In summary, this work offers a comprehensive framework for understanding and advancing unlearning in generative AI, aiming to support future research and guide policy decisions around data removal and privacy.
中文: 本文提出了基于遗忘意图的新分类法,探讨大语言模型遗忘方法究竟是真正消除知识还是仅抑制其行为表现,同时分析了现有评估策略的局限性和实际应用中的挑战。
English: This paper proposes a new intention-based taxonomy for large language model unlearning, examining whether methods truly remove knowledge or merely suppress its effects, while also addressing evaluation limitations and practical deployment challenges.
Authors:Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, Jianchao Yang, Runkai Yang, Tao Yang, Yihang Yang, Zilyu Ye, Xuejiao Zeng, Yan Zeng, Heng Zhang, Yang Zhao, Xiaozheng Zheng, Peihao Zhu, Jiaxin Zou, Feilong Zuo
Abstract:
Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundation models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm that natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; and (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out for fast, high-quality video generation with superior spatiotemporal fluidity and structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.
中文: 扩散模型推动了视频生成的快速进步,但现有模型在兼顾提示跟随、运动合理性和视觉质量方面仍面临挑战;Seedance 1.0通过高效架构设计、多源数据优化、精细化训练和加速推理技术,实现了高质量、精准遵循指令且生成速度提升10倍的视频生成。
English: Diffusion models have advanced video generation, but balancing prompt adherence, motion realism, and visual quality remains challenging; Seedance 1.0 addresses this with an efficient architecture, multi-source data, optimized training, and accelerated inference, producing high-quality, prompt-following videos 10 times faster.
Authors:Xinyuan Wang, Haoyue Bai, Nanxu Gong, Wangyang Ying, Sixun Dong, Xiquan Cui, Yanjie Fu
Abstract:
Feature transformation enhances data representation by deriving new features from the original data. Generative AI offers potential for this task, but faces challenges in stable generation (consistent outputs) and valid generation (error-free sequences). Existing methods fail to resolve both: traditional ML approaches suffer from low validity, while LLMs are unstable. We find that LLMs ensure valid syntax, while ML's gradient-steered search stabilizes performance. To bridge this gap, we propose a teaming framework combining LLMs' symbolic generation with ML's gradient optimization. This framework includes four steps: (1) golden examples generation, aiming to prepare high-quality samples with the ground knowledge of the teacher LLM; (2) feature transformation sequence embedding and search, intending to uncover potentially superior embeddings within the latent space; (3) student LLM feature transformation, aiming to distill knowledge from the teacher LLM; and (4) LLM-ML decoder teaming, dedicated to combining ML and student LLM probabilities for valid and stable generation. Experiments on various datasets show that the teaming policy achieves a 5\% improvement in downstream performance while reducing nearly half of the error cases. The results also demonstrate the efficiency and robustness of the teaming policy. Additionally, we report findings on LLMs' capacity to understand the original data.
中文摘要:提出的协同框架结合了大语言模型的符号生成与机器学习的梯度优化,实现了稳定且有效的特征转换,使下游性能提升5%,错误率降低近半。
English Summary: The proposed teaming framework integrates large language models' symbolic generation with machine learning's gradient optimization to achieve stable and valid feature transformations, improving downstream performance by 5% and reducing errors by nearly half.
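A toy rendering of the decoder-teaming step: the student LLM's next-token distribution is blended with an ML-side distribution so decoding favors tokens both components consider valid. The mixing weight and greedy pick are assumed details.

```python
# LLM-ML decoder teaming as a simple probability blend (illustrative).
import numpy as np

def teamed_next_token(llm_probs: np.ndarray, ml_probs: np.ndarray,
                      alpha: float = 0.5) -> int:
    mixed = alpha * llm_probs + (1 - alpha) * ml_probs
    mixed /= mixed.sum()                # renormalize the blend
    return int(np.argmax(mixed))        # greedy pick from the blend

llm_p = np.array([0.70, 0.20, 0.10])   # e.g. tokens: "+", "log", "<eos>"
ml_p  = np.array([0.10, 0.80, 0.10])
print(teamed_next_token(llm_p, ml_p))  # -> 1 ("log")
```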
Authors:Hyeon Jeon, Jeongin Park, Sungbok Shin, Jinwook Seo
Abstract:
Misuses of t-SNE and UMAP in visual analytics have become increasingly common. For example, although t-SNE and UMAP projections often do not faithfully reflect true distances between clusters, practitioners frequently use them to investigate inter-cluster relationships. In this paper, we bring this issue to the surface and comprehensively investigate why such misuse occurs and how to prevent it. We conduct a literature review of 114 papers to verify the prevalence of the misuse and analyze the reasonings behind it. We then execute an interview study to uncover practitioners' implicit motivations for using these techniques -- rationales often undisclosed in the literature. Our findings indicate that misuse of t-SNE and UMAP primarily stems from limited discourse on their appropriate use in visual analytics. We conclude by proposing future directions and concrete action items to promote more reasonable use of DR.
Chinese: 本文通过回顾114篇相关文献并开展访谈研究,发现t-SNE和UMAP的误用主要源于学界对其合理使用的讨论不足,并提出了促进降维技术合理使用的未来方向和具体行动建议。
English: Through a review of 114 papers and interview studies, this paper finds that the misuse of t-SNE and UMAP stems primarily from limited discourse on their appropriate use in visual analytics, and proposes future directions and concrete action items to promote more reasonable use of dimensionality reduction.
Authors:Hyeon Jeon, Jeongin Park, Sungbok Shin, Jinwook Seo
Abstract:
Misuses of t-SNE and UMAP in visual analytics have become increasingly common. For example, although t-SNE and UMAP projections often do not faithfully reflect the original distances between clusters, practitioners frequently use them to investigate inter-cluster relationships. We investigate why this misuse occurs, and discuss methods to prevent it. To that end, we first review 136 papers to verify the prevalence of the misuse. We then interview researchers who have used dimensionality reduction (DR) to understand why such misuse occurs. Finally, we interview DR experts to examine why previous efforts failed to address the misuse. We find that the misuse of t-SNE and UMAP stems primarily from limited DR literacy among practitioners, and that existing attempts to address this issue have been ineffective. Based on these insights, we discuss potential paths forward, including the controversial but pragmatic option of automating the selection of optimal DR projections to prevent misleading analyses.
Chinese: t-SNE和UMAP在可视化分析中的误用普遍存在,主要源于使用者对降维技术理解不足,现有解决方案效果不佳,因此探讨通过自动化选择最佳投影来避免误导性分析。
English: The misuse of t-SNE and UMAP in visual analytics is widespread due to practitioners' limited understanding of dimensionality reduction, and existing solutions have proven ineffective, prompting a discussion on automating optimal projection selection to prevent misleading interpretations.
Authors:Zheng Lin, Zhe Chen, Xianhao Chen, Wei Ni, Yue Gao
Abstract:
Split federated learning (SFL) has emerged as a promising paradigm to democratize machine learning (ML) on edge devices by enabling layer-wise model partitioning. However, existing SFL approaches suffer significantly from the straggler effect due to the heterogeneous capabilities of edge devices. To address the fundamental challenge, we propose adaptively controlling batch sizes (BSs) and model splitting (MS) for edge devices to overcome resource heterogeneity. We first derive a tight convergence bound of SFL that quantifies the impact of varied BSs and MS on learning performance. Based on the convergence bound, we propose HASFL, a heterogeneity-aware SFL framework capable of adaptively controlling BS and MS to balance communication-computing latency and training convergence in heterogeneous edge networks. Extensive experiments with various datasets validate the effectiveness of HASFL and demonstrate its superiority over state-of-the-art benchmarks.
Chinese: 本文提出HASFL框架,通过自适应调节批量大小和模型分割来解决边缘网络中由设备异构性引起的掉队者效应,在通信计算延迟与训练收敛间取得平衡,实验证明其优于现有最优方法。
English: This paper introduces HASFL, a heterogeneity-aware split federated learning framework that adaptively controls batch sizes and model splitting to mitigate the straggler effect in edge networks, achieving superior performance through optimized latency-convergence balance.
Authors:Francesco Marchiori, Denis Donadel, Alessandro Brighente, Mauro Conti
Abstract:
Electric Vehicles (EVs) are rapidly gaining adoption as a sustainable alternative to fuel-powered vehicles, making secure charging infrastructure essential. Despite traditional authentication protocols, recent results showed that attackers may steal energy through tailored relay attacks. One countermeasure is leveraging the EV's fingerprint on the current exchanged during charging. However, existing methods focus on the final charging stage, allowing malicious actors to consume substantial energy before being detected and repudiated. This underscores the need for earlier and more effective authentication methods to prevent unauthorized charging. Meanwhile, profiling raises privacy concerns, as uniquely identifying EVs through charging patterns could enable user tracking.
In this paper, we propose a framework for uniquely identifying EVs using physical measurements from the early charging stages. We hypothesize that voltage behavior early in the process exhibits similar characteristics to current behavior in later stages. By extracting features from early voltage measurements, we demonstrate the feasibility of EV profiling. Our approach improves existing methods by enabling faster and more reliable vehicle identification. We test our solution on a dataset of 7408 usable charges from 49 EVs, achieving up to 0.86 accuracy. Feature importance analysis shows that near-optimal performance is possible with just 10 key features, improving efficiency alongside our lightweight models. This research lays the foundation for a novel authentication factor while exposing potential privacy risks from unauthorized access to charging data.
中文摘要:本文提出了一种利用早期充电阶段电压测量来识别电动汽车的框架,以加强身份验证并防止能源盗用,同时揭示了充电数据剖析可能带来的隐私风险。
English Summary: This paper introduces a framework for early-stage electric vehicle identification using voltage measurements to enhance authentication and prevent energy theft, while highlighting privacy risks from charging data profiling.
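A rough sketch of the profiling pipeline the abstract outlines: summary features from the first part of the voltage trace feed a lightweight classifier. The ten features and the random-forest choice are assumptions, not the paper's exact recipe, and the data below is synthetic.

```python
# Early-stage voltage fingerprinting pipeline (synthetic-data sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def early_voltage_features(voltage: np.ndarray, early_samples: int = 500):
    v = voltage[:early_samples]        # keep only the early charging stage
    dv = np.diff(v)
    return [v.mean(), v.std(), v.min(), v.max(),
            dv.mean(), dv.std(), np.abs(dv).max(),
            np.percentile(v, 25), np.percentile(v, 75), v[-1] - v[0]]

# X: one feature row per charging session, y: EV identity labels (toy data).
rng = np.random.default_rng(0)
X = np.array([early_voltage_features(rng.normal(400, 2, 2000))
              for _ in range(98)])
y = np.repeat(np.arange(49), 2)        # 49 vehicles, 2 sessions each
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```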
Authors:Yikun Ji, Hong Yan, Jun Lan, Huijia Zhu, Weiqiang Wang, Qi Fan, Liqing Zhang, Jianfu Zhang
Abstract:
The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then finetune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.
Chinese: 图像生成技术的快速发展催生了对可解释检测方法的需求,通过多阶段优化策略训练的MLLM模型在识别AI生成图像和定位视觉伪影方面表现卓越,并能提供清晰的解释。
English: The rapid progress in image generation necessitates interpretable detection methods, leading to the development of a multi-stage optimized MLLM that excels in identifying AI-generated images and localizing visual artifacts while providing coherent explanations.
Authors:Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton
Abstract:
Question answering (QA) agents automatically answer questions posed in natural language. In this work, we learn to ask clarifying questions in QA agents. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT) and easily optimized in large language models. Our work stands in stark contrast to recently proposed methods, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize rewards. We compare to these methods empirically and report gains in both optimized rewards and language quality.
中文摘要:本研究提出一种强化学习方法,通过离线强化学习目标实现奖励加权的监督微调,使问答智能体学会提出澄清性问题,相比现有方法能更直接优化奖励并提升语言质量。
English Summary: This study introduces a reinforcement learning approach for teaching question-answering agents to ask clarifying questions, using offline RL objectives that function as reward-weighted supervised fine-tuning to optimize rewards and language quality more effectively than existing methods.
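A minimal reward-weighted SFT objective of the kind the abstract describes: each simulated conversation's token-level negative log-likelihood is scaled by its reward. Shapes and the reward source are illustrative assumptions.

```python
# Reward-weighted SFT loss (illustrative shapes).
import torch
import torch.nn.functional as F

def reward_weighted_sft_loss(logits: torch.Tensor,   # (B, T, V)
                             targets: torch.Tensor,  # (B, T)
                             rewards: torch.Tensor   # (B,)
                             ) -> torch.Tensor:
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)                 # per-token negative log-likelihood
    per_seq = nll.mean(dim=1)             # average over each conversation
    return (rewards * per_seq).mean()     # upweight high-reward conversations
```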
Authors:Jingyuan Qi, Zhiyang Xu, Qifan Wang, Lifu Huang
Abstract:
We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. Unlike prior methods that perform a single, static retrieval before generation and condition the entire generation on fixed reference images, AR-RAG performs context-aware retrievals at each generation step, using prior-generated patches as queries to retrieve and incorporate the most relevant patch-level visual references, enabling the model to respond to evolving generation needs while avoiding limitations (e.g., over-copying, stylistic bias, etc.) prevalent in existing methods. To realize AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in Decoding (DAiD), a training-free plug-and-use decoding strategy that directly merges the distribution of model-predicted patches with the distribution of retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a parameter-efficient fine-tuning method that progressively smooths the features of retrieved patches via multi-scale convolution operations and leverages them to augment the image generation process. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models.
中文: AR-RAG是一种创新的图像生成范式,通过在每个生成步骤自回归地整合补丁级最近邻检索,能动态适应不断变化的生成需求,并借助提出的DAiD和FAiD框架有效克服过度复制和风格偏差等局限,在多个基准测试中实现了卓越性能。
English: AR-RAG is a novel image generation paradigm that autoregressively integrates patch-level nearest neighbor retrievals at each step, enabling dynamic adaptation to evolving generation needs and overcoming limitations like over-copying and stylistic bias through two proposed frameworks, DAiD and FAiD, achieving superior performance on multiple benchmarks.
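The training-free DAiD variant amounts to mixing two categorical distributions at each decoding step. A minimal sketch, assuming a discrete patch vocabulary, similarity-weighted retrieved patches, and a fixed mixing weight `lam` (all assumptions; the paper's exact merge rule may differ):

```python
# Hedged sketch of DAiD-style distribution merging at one decoding step.
import torch

def daid_merge(model_probs, retrieved_ids, retrieved_sims, lam=0.3):
    """model_probs: (V,) softmax over the patch vocabulary;
    retrieved_ids: (k,) long tensor of retrieved patch ids;
    retrieved_sims: (k,) similarity scores for those patches."""
    retrieval_probs = torch.zeros_like(model_probs)
    weights = torch.softmax(retrieved_sims, dim=0)          # normalize neighbors
    retrieval_probs.scatter_add_(0, retrieved_ids, weights)
    return (1 - lam) * model_probs + lam * retrieval_probs  # still sums to 1
```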
Authors:HaoYang Shang, Xuan Liu, Zi Liang, Jie Zhang, Haibo Hu, Song Guo
Abstract:
Large Language Models (LLMs) exhibit a notable performance ceiling on complex, multi-faceted tasks, as they often fail to integrate diverse information or adhere to multiple constraints. We posit that such limitation arises when the demands of a task exceed the LLM's effective cognitive load capacity. This interpretation draws a strong analogy to Cognitive Load Theory (CLT) in cognitive science, which explains similar performance boundaries in the human mind, and is further supported by emerging evidence that reveals LLMs have bounded working memory characteristics. Building upon this CLT-grounded understanding, we introduce CoThinker, a novel LLM-based multi-agent framework designed to mitigate cognitive overload and enhance collaborative problem-solving abilities. CoThinker operationalizes CLT principles by distributing intrinsic cognitive load through agent specialization and managing transactional load via structured communication and a collective working memory. We empirically validate CoThinker on complex problem-solving tasks and fabricated high cognitive load scenarios, demonstrating improvements over existing multi-agent baselines in solution quality and efficiency. Our analysis reveals characteristic interaction patterns, providing insights into the emergence of collective cognition and effective load management, thus offering a principled approach to overcoming LLM performance ceilings.
中文摘要:大语言模型在复杂任务上因认知过载存在性能瓶颈,CoThinker多智能体框架通过专业化分工和结构化通信来分配认知负荷,经实验验证可有效提升协作解决问题的质量与效率。
English Summary: Large Language Models face performance limits on complex tasks due to cognitive overload, which the CoThinker multi-agent framework addresses by distributing cognitive load through specialized agents and structured communication, showing improved problem-solving in empirical tests.
Authors:Lei Lan, Zixuan Lu, Chun Yuan, Weiwei Xu, Hao Su, Huamin Wang, Chenfanfu Jiang, Yin Yang
Abstract:
In parallel simulation, convergence and parallelism are often seen as inherently conflicting objectives. Improved parallelism typically entails lighter local computation and weaker coupling, which unavoidably slow the global convergence. This paper presents a novel GPU algorithm that achieves convergence rates comparable to fullspace Newton's method while maintaining good parallelizability just like the Jacobi method. Our approach is built on a key insight into the phenomenon of overshoot. Overshoot occurs when a local solver aggressively minimizes its local energy without accounting for the global context, resulting in a local update that undermines global convergence. To address this, we derive a theoretically second-order optimal solution to mitigate overshoot. Furthermore, we adapt this solution into a pre-computable form. Leveraging Cubature sampling, our runtime cost is only marginally higher than the Jacobi method, yet our algorithm converges nearly quadratically, like Newton's method. We also introduce a novel full-coordinate formulation for more efficient pre-computation. Our method integrates seamlessly with the incremental potential contact method and achieves second-order convergence for both stiff and soft materials. Experimental results demonstrate that our approach delivers high-quality simulations and outperforms state-of-the-art GPU methods with 50 to 100 times better convergence.
Chinese: 本文提出一种GPU算法,通过理论上最优的解决方案和高效预计算来抑制超调现象,在保持雅可比方法良好并行性的同时,实现了与牛顿法相媲美的二阶收敛速度。
English: This paper introduces a GPU algorithm that achieves second-order convergence comparable to Newton's method while maintaining the parallel efficiency of the Jacobi method by mitigating overshoot through theoretically optimal solutions and efficient pre-computation.
Authors:Zijian Yang, Minh-Nghia Phan, Ralf Schlüter, Hermann Ney
Abstract:
Although connectionist temporal classification (CTC) has the label context independence assumption, it can still implicitly learn a context-dependent internal language model (ILM) due to modern powerful encoders. In this work, we investigate the implicit context dependency modeled in the ILM of CTC. To this end, we propose novel context-dependent ILM estimation methods for CTC based on knowledge distillation (KD) with theoretical justifications. Furthermore, we introduce two regularization methods for KD. We conduct experiments on Librispeech and TED-LIUM Release 2 datasets for in-domain and cross-domain evaluation, respectively. Experimental results show that context-dependent ILMs outperform the context-independent priors in cross-domain evaluation, indicating that CTC learns a context-dependent ILM. The proposed label-level KD with smoothing method surpasses other ILM estimation approaches, with more than 13% relative improvement in word error rate compared to shallow fusion.
中文: 本研究证明连接时序分类虽假设标签独立,却隐含学习了上下文相关的内部语言模型,并提出基于知识蒸馏的估计方法,在跨领域评估中显著优于现有方法。
English: This study demonstrates that connectionist temporal classification (CTC) implicitly learns a context-dependent internal language model despite its label independence assumption, and proposes knowledge distillation-based estimation methods that significantly outperform prior approaches in cross-domain evaluations.
Authors:Binghao Ye, Wenjuan Li, Dong Wang, Man Yao, Bing Li, Weiming Hu, Dong Liang, Kun Shang
Abstract:
Spiking Neural Networks (SNNs) are noted for their brain-like computation and energy efficiency, but their performance lags behind Artificial Neural Networks (ANNs) in tasks like image classification and object detection due to the limited representational capacity. To address this, we propose a novel spiking neuron, Integer Binary-Range Alignment Leaky Integrate-and-Fire, to exponentially expand the information expression capacity of spiking neurons with only a slight energy increase. This is achieved through Integer Binary Leaky Integrate-and-Fire and a range alignment strategy. The Integer Binary Leaky Integrate-and-Fire allows integer-valued activation during training and maintains spike-driven dynamics during inference through a binary conversion that expands virtual timesteps. The range alignment strategy is designed to solve the spike activation limitation problem where neurons fail to activate high integer values. Experiments show our method outperforms previous SNNs, achieving 74.19% accuracy on ImageNet, and 66.2% mAP@50 and 49.1% mAP@50:95 on COCO, surpassing previous bests with the same architecture by +3.45%, +1.6%, and +1.8%, respectively. Notably, our SNNs match or exceed ANNs' performance with the same architecture, and energy efficiency is improved by 6.3x.
中文: 脉冲神经网络(SNN)在图像分类等任务中通常表现不如人工神经网络(ANN),但提出的整数二进制范围对齐泄漏积分发放神经元显著提升了SNN的性能和能效,在ImageNet和COCO数据集上取得了具有竞争力的结果。
English: Spiking Neural Networks (SNNs) traditionally underperform Artificial Neural Networks (ANNs) in tasks like image classification, but the proposed Integer Binary-Range Alignment Leaky Integrate-and-Fire neuron significantly boosts SNN performance and energy efficiency, achieving competitive results on ImageNet and COCO datasets.
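The integer-activation-and-binary-unrolling mechanism can be sketched in a few lines. The following is one plausible reading with a hypothetical threshold, leak factor, and maximum level D; it is not the authors' implementation:

```python
# Sketch of an integer LIF step (training) and its binary unrolling (inference).
import torch

def integer_lif_step(v, x, threshold=1.0, leak=0.5, D=4):
    """v: membrane potential; x: input current; returns (spike count, new v)."""
    v = leak * v + x
    spikes = torch.clamp(torch.floor(v / threshold), 0, D)  # integer level in [0, D]
    v = v - spikes * threshold                               # soft reset
    return spikes, v

def to_binary_spikes(spikes, D=4):
    """Unroll an integer count s into D binary spikes (s ones, then zeros),
    preserving spike-driven dynamics over D virtual timesteps."""
    steps = torch.arange(D, device=spikes.device)
    return (steps < spikes.unsqueeze(-1)).float()            # (..., D)
```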
Authors:Lianming Huang, Haibo Hu, Yufei Cui, Jiacheng Zuo, Shangyu Wu, Nan Guan, Chun Jason Xue
Abstract:
With the rapid advancement of autonomous driving, deploying Vision-Language Models (VLMs) to enhance perception and decision-making has become increasingly common. However, the real-time application of VLMs is hindered by high latency and computational overhead, limiting their effectiveness in time-critical driving scenarios. This challenge is particularly evident when VLMs exhibit over-inference, continuing to process unnecessary layers even after confident predictions have been reached. To address this inefficiency, we propose AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving and leverages causal inference to identify optimal exit layers. We evaluate our method on large-scale real-world autonomous driving datasets, including Waymo and the corner-case-focused CODA, as well as on a real vehicle running the Autoware Universe platform. Extensive experiments across multiple VLMs show that our method significantly reduces latency, with maximum improvements reaching up to 57.58%, and enhances object detection accuracy, with maximum gains of up to 44%.
Chinese: AD-EE框架通过引入自动驾驶领域特性与因果推理,使视觉语言模型能够在推理过程中提前退出,在真实场景测试中最高可降低57.58%延迟并提升44%检测精度。
English: The AD-EE framework is introduced to reduce latency and improve efficiency in autonomous driving by enabling Vision-Language Models to exit early during inference, achieving up to 57.58% faster processing and 44% higher detection accuracy in real-world tests.
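Stripped of the causal-inference layer selection, the early-exit remedy for over-inference is a short control loop: probe an intermediate classifier and stop once the prediction is confident. A hedged sketch, where `exit_head` is a hypothetical lightweight classifier over pooled features:

```python
# Confidence-based early exit over a stack of transformer layers (sketch).
import torch

@torch.no_grad()
def early_exit_forward(x, layers, exit_head, threshold=0.9):
    """x: (B, T, d) features; layers: iterable of blocks; returns (probs, depth)."""
    for i, layer in enumerate(layers):
        x = layer(x)
        probs = torch.softmax(exit_head(x.mean(dim=1)), dim=-1)
        if probs.max().item() >= threshold:
            return probs, i            # confident: skip the remaining layers
    return probs, len(layers) - 1      # fell through: full depth was needed
```

AD-EE's contribution is choosing where to probe via causal inference; the loop above only shows the exit mechanics.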
Authors:Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang
Abstract:
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism that reuses recent rollouts, lowering per-step computation while maintaining stable updates. Extensive experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 25% to 65% to reach the same level of performance as the original GRPO algorithm.
Chinese Summary: 本文提出两种数据高效技术——基于自适应难度的在线数据选择和回放机制,可将大型语言模型的强化学习微调时间减少25%至65%,同时保持原有性能水平。
English Summary: This paper introduces two data-efficient techniques—adaptive difficulty-targeted online data selection and rollout replay—that reduce reinforcement learning fine-tuning time for large language models by 25% to 65% while maintaining performance levels.
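The attention-based difficulty estimator admits a small sketch: embed the questions, attend over a reference set whose online solve rates are known from rollouts, and keep the moderately difficult ones. Names and the temperature are assumptions:

```python
# Sketch of adaptive-difficulty estimation by similarity to a reference set.
import torch
import torch.nn.functional as F

def estimate_solve_rate(q_emb, ref_emb, ref_solve_rate, temp=0.1):
    """q_emb: (N, d) question embeddings; ref_emb: (M, d) reference embeddings;
    ref_solve_rate: (M,) empirical solve rates from rollouts."""
    sims = F.normalize(q_emb, dim=-1) @ F.normalize(ref_emb, dim=-1).T  # (N, M)
    attn = torch.softmax(sims / temp, dim=-1)   # attention over references
    return attn @ ref_solve_rate                # predicted per-question solve rate

def select_moderate(solve_rate, low=0.2, high=0.8):
    # Moderate difficulty carries the most learning signal per rollout.
    return ((solve_rate > low) & (solve_rate < high)).nonzero(as_tuple=True)[0]
```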
Authors:Violet Xiang, Chase Blagden, Rafael Rafailov, Nathan Lile, Sang Truong, Chelsea Finn, Nick Haber
Abstract:
Large reasoning models (LRMs) achieve higher performance on challenging reasoning tasks by generating more tokens at inference time, but this verbosity often wastes computation on easy problems. Existing solutions, including supervised fine-tuning on shorter traces, user-controlled budgets, or RL with uniform penalties, either require data curation, manual configuration, or treat all problems alike regardless of difficulty. We introduce Adaptive Length Penalty (ALP), a reinforcement learning objective tailoring generation length to per-prompt solve rate. During training, ALP monitors each prompt's online solve rate through multiple rollouts and adds a differentiable penalty whose magnitude scales inversely with that rate, so confident (easy) prompts incur a high cost for extra tokens while hard prompts remain unhindered. Post-training DeepScaleR-1.5B with ALP cuts average token usage by 50% without significantly dropping performance. Relative to fixed-budget and uniform penalty baselines, ALP redistributes its reduced budget more intelligently by cutting compute on easy prompts and reallocating saved tokens to difficult ones, delivering higher accuracy on the hardest problems at higher cost.
中文摘要:自适应长度惩罚(ALP)是一种强化学习方法,根据每个问题的难度动态调整生成长度,在保持性能的同时将令牌使用量减少50%,并将节省的计算资源从简单问题重新分配给困难问题以提高准确性。
English Summary: Adaptive Length Penalty (ALP) is a reinforcement learning method that dynamically adjusts generation length based on each prompt's difficulty, reducing token usage by 50% while maintaining performance and reallocating computational resources from easy to hard problems for improved accuracy.
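One plausible instantiation of the per-prompt penalty follows. Note the abstract's wording ("inversely with that rate") and its stated effect (easy prompts pay more per token) pull in different directions; this sketch follows the stated effect and scales the per-token cost with the online solve rate, which is an interpretation rather than the paper's exact objective:

```python
# ALP-style reward sketch: length cost grows with the prompt's solve rate.
def alp_reward(correct, n_tokens, solve_rate, alpha=1e-3):
    """correct: 1.0 if this rollout solved the prompt, else 0.0;
    solve_rate: fraction of this prompt's rollouts that succeeded."""
    return correct - alpha * solve_rate * n_tokens

# Example: with alpha=1e-3, an easy prompt (solve_rate 0.9) pays 9x more per
# extra token than a hard one (solve_rate 0.1), so the model learns to be
# brief when confident and to keep reasoning when the problem is hard.
```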
Authors:Zhaoxuan Tan, Zheng Li, Tianyi Liu, Haodong Wang, Hyokun Yun, Ming Zeng, Pei Chen, Zhihan Zhang, Yifan Gao, Ruijie Wang, Priyanka Nigam, Bing Yin, Meng Jiang
Abstract:
Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at https://zhaoxuan.info/PUGC.github.io/
中文摘要:PUGC框架创新性地利用未标注用户生成内容中的隐含人类偏好来训练大语言模型,无需昂贵的人工标注数据即可实现显著的性能提升和可扩展的领域对齐能力。
English Summary: The PUGC framework introduces a novel approach to training large language models by leveraging implicit human preferences from unlabeled user-generated content, achieving significant performance improvements and scalable domain-specific alignment without costly curated data.
Authors:Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi
Abstract:
We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.
中文:MedAgentGym是一个可扩展的训练环境,通过涵盖129个类别的72,413项任务增强大型语言模型的生物医学推理能力,其强化学习方法显著提升了性能表现,并成为替代专有模型的高性价比方案。
English: MedAgentGym is a scalable training environment that enhances biomedical reasoning in LLMs through 72,413 tasks across 129 categories, demonstrating significant performance improvements via reinforcement learning and serving as a cost-effective alternative to proprietary models.
Authors:Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Zifeng Wang, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Xin Liu, Carl Yang, Yang Xie, Wenqi Shi
Abstract:
We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating MedAgentGym as an effective training ground while establishing itself as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.
中文:MedAgentGym是一个可扩展的训练环境,通过涵盖129个类别的72,413项任务增强大型语言模型的生物医学推理能力,其强化学习方法显著提升了性能表现,并成为替代专有模型的高性价比方案。
English: MedAgentGym is a scalable training environment that enhances biomedical reasoning in LLMs through 72,413 tasks across 129 categories, demonstrating significant performance improvements via reinforcement learning and serving as a cost-effective alternative to proprietary models.
Authors:Wei Luo, Haiming Yao, Yunkang Cao, Qiyu Chen, Ang Gao, Weiming Shen, Wenyong Yu
Abstract:
Anomaly detection (AD) is essential for industrial inspection and medical diagnosis, yet existing methods typically rely on "comparing" test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure INPs can faithfully represent normality for the testing image. These INPs then guide the INP-guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning it as a versatile and universal solution for AD. Remarkably, INP-Former also demonstrates some zero-shot AD capability. Furthermore, we propose a soft version of the INP Coherence Loss and enhance INP-Former by incorporating residual learning, leading to the development of INP-Former++. The proposed method significantly improves detection performance across single-class, multi-class, semi-supervised, few-shot, and zero-shot settings.
Chinese: 异常检测在工业检测和医疗诊断中至关重要,但现有方法常因外观和位置变化难以对齐测试图像与正常参考,从而限制了检测精度。
English: Anomaly detection is crucial in fields like industrial inspection and medical diagnosis, but current methods often struggle with aligning test images to normal references due to appearance and positioning variations, limiting accuracy.
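The intrinsic-prototype idea can be illustrated compactly: form prototypes as convex combinations of the test image's own tokens, then score each token by its distance to the nearest prototype, used here as a stand-in for the paper's INP-guided reconstruction error. Shapes and names are assumptions:

```python
# Sketch of Intrinsic Normal Prototype extraction and token-level scoring.
import torch

def extract_inps(tokens, mix_logits):
    """tokens: (N, d) patch tokens from the test image;
    mix_logits: (P, N) weights produced by the INP Extractor."""
    weights = torch.softmax(mix_logits, dim=-1)  # each prototype is a convex
    return weights @ tokens                      # combination of the tokens

def anomaly_scores(tokens, inps):
    dists = torch.cdist(tokens, inps)            # (N, P) token-prototype distances
    return dists.min(dim=-1).values              # distance to the nearest INP
```

Because the prototypes come from the test image itself, local anomalies sit far from every prototype while normal regions stay close, with no external normal reference required.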
Authors:Jiewen Hu, Leena Mathur, Paul Pu Liang, Louis-Philippe Morency
Abstract:
In recent years, there has been increasing interest in automatic facial behavior analysis systems from computing communities such as vision, multimodal interaction, robotics, and affective computing. Building upon the widespread utility of prior open-source facial analysis systems, we introduce OpenFace 3.0, an open-source toolkit capable of facial landmark detection, facial action unit detection, eye-gaze estimation, and facial emotion recognition. OpenFace 3.0 contributes a lightweight unified model for facial analysis, trained with a multi-task architecture across diverse populations, head poses, lighting conditions, video resolutions, and facial analysis tasks. By leveraging the benefits of parameter sharing through a unified model and training paradigm, OpenFace 3.0 exhibits improvements in prediction performance, inference speed, and memory efficiency over similar toolkits and rivals state-of-the-art models. OpenFace 3.0 can be installed and run with a single line of code and operate in real-time without specialized hardware. OpenFace 3.0 code for training models and running the system is freely available for research purposes and supports contributions from the community.
Chinese: OpenFace 3.0 是一款开源工具包,提供统一模型进行实时面部分析,涵盖特征点检测、动作单元识别和情绪分析,具有更优的性能与效率。
English: OpenFace 3.0 is an open-source toolkit that provides a unified model for real-time facial analysis, including landmark detection, action unit recognition, and emotion analysis, with enhanced performance and efficiency.
Authors:Yuxuan Wu, Le Wang, Sanping Zhou, Mengnan Liu, Gang Hua, Haoxiang Li
Abstract:
Controllable layout generation aims to create plausible visual arrangements of element bounding boxes within a graphic design according to certain optional constraints, such as the type or position of a specific component. While recent diffusion or flow-matching models have achieved considerable advances in multifarious conditional generation tasks, there remains considerable room for generating optimal arrangements under given conditions. In this work, we propose to carry out layout generation through retrieving by conditions and reference-guided generation. Specifically, we retrieve appropriate layout templates according to given conditions as references. The references are then utilized to guide the denoising or flow-based transport process. By retrieving layouts compatible with the given conditions, we can uncover the potential information not explicitly provided in the given condition. Such an approach offers more effective guidance to the model during the generation process, in contrast to previous models that feed the condition to the model and let the model infer the unprovided layout attributes directly. Meanwhile, we design a condition-modulated attention that selectively absorbs retrieval knowledge, adapting to the difference between retrieved templates and given conditions. Extensive experiment results show that our method successfully produces high-quality layouts that meet the given conditions and outperforms existing state-of-the-art models. Code will be released upon acceptance.
中文: 本研究提出了一种基于检索的布局生成方法,通过条件匹配的模板指导扩散或流匹配模型,利用选择性注意力机制融合隐含信息,从而显著提升生成布局的质量。
English: This study introduces a retrieval-based layout generation method that uses condition-compatible templates to guide diffusion or flow-matching models, enhancing arrangement quality by incorporating implicit information through selective attention mechanisms.
Authors:Chunkit Chan, Yauwai Yim, Hongchuan Zeng, Zhiying Zou, Xinyuan Cheng, Zhifan Sun, Zheye Deng, Kawai Chung, Yuzhuo Ao, Yixiang Fan, Cheng Jiayang, Ercong Nie, Ginny Y. Wong, Helmut Schmid, Hinrich Schütze, Simon See, Yangqiu Song
Abstract:
Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.
中文: 本研究提出XToM多语言基准,发现尽管大语言模型在多语言理解上表现优异,但其推断心理状态的能力在不同语言间差异显著,揭示了跨语言复制类人心理理论能力的局限性。
English: The study introduces XToM, a multilingual benchmark revealing that while large language models demonstrate strong multilingual comprehension, their ability to infer mental states varies significantly across languages, highlighting limitations in replicating human-like Theory of Mind across diverse linguistic contexts.
Authors:Qin Xie, Qinghua Zhang, Shuyin Xia, Xinran Zhou, Guoyin Wang
Abstract:
Adaptive Boosting (AdaBoost) faces significant challenges posed by label noise, especially in multiclass classification tasks. Existing methods either lack mechanisms to handle label noise effectively or suffer from high computational costs due to redundant data usage. Inspired by granular computing, this paper proposes granular adaptive boosting (GAdaBoost), a novel two-stage framework comprising a data granulation stage and an adaptive boosting stage, to enhance efficiency and robustness under noisy conditions. To validate its feasibility, an extension of SAMME, termed GAdaBoost.SA, is proposed. Specifically, first, a granular-ball generation method is designed to compress data while preserving diversity and mitigating label noise. Second, the granular ball-based SAMME algorithm focuses on granular balls rather than individual samples, improving efficiency and reducing sensitivity to noise. Experimental results on noisy datasets show that the proposed approach achieves superior robustness and efficiency compared with existing methods, demonstrating that this work effectively extends AdaBoost and SAMME.
中文: 本文提出GAdaBoost这一两阶段粒计算框架,通过将数据压缩为粒球并优化提升过程,有效增强了AdaBoost在多类分类任务中处理标签噪声的鲁棒性和效率。
English: This paper introduces GAdaBoost, a two-stage granular computing framework that enhances AdaBoost's efficiency and robustness against label noise in multiclass classification by compressing data into granular balls and optimizing the boosting process.
Authors:Muhammad Qasim Ali, Saeejith Nair, Alexander Wong, Yuchen Cui, Yuhao Chen
Abstract:
Structured scene representations are a core component of embodied agents, helping to consolidate raw sensory streams into readable, modular, and searchable formats. Due to their high computational overhead, many approaches build such representations in advance of the task. However, when the task specifications change, such static approaches become inadequate as they may miss key objects, spatial relations, and details. We introduce GraphPad, a modifiable structured memory that an agent can tailor to the needs of the task through API calls. It comprises a mutable scene graph representing the environment, a navigation log indexing frame-by-frame content, and a scratchpad for task-specific notes. Together, GraphPad serves as a dynamic workspace that remains complete, current, and aligned with the agent's immediate understanding of the scene and its task. On the OpenEQA benchmark, GraphPad attains 55.3%, a +3.0% increase over an image-only baseline using the same vision-language model, while operating with five times fewer input frames. These results show that allowing online, language-driven refinement of 3-D memory yields more informative representations without extra training or data collection.
Chinese: GraphPad是一种动态结构化记忆系统,通过API调用使具身智能体能够实时调整场景表征,以更少的计算资源提升任务性能,并在OpenEQA基准测试中实现了3.0%的性能提升。
English: GraphPad is a dynamic structured memory system that enables embodied agents to adapt scene representations in real-time through API calls, enhancing task performance with fewer computational resources and achieving a 3.0% improvement on the OpenEQA benchmark.
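A GraphPad-like memory is essentially three mutable stores behind a small tool API. The method names below are hypothetical; the sketch only shows the shape of the workspace the abstract describes:

```python
# Sketch of a mutable structured memory: scene graph + navigation log + scratchpad.
from dataclasses import dataclass, field

@dataclass
class GraphPadMemory:
    nodes: dict = field(default_factory=dict)       # object_id -> attributes
    edges: list = field(default_factory=list)       # (id_a, relation, id_b)
    nav_log: list = field(default_factory=list)     # (frame_idx, caption)
    scratchpad: list = field(default_factory=list)  # task-specific notes

    def add_object(self, obj_id, **attrs):
        self.nodes.setdefault(obj_id, {}).update(attrs)

    def add_relation(self, a, relation, b):
        self.edges.append((a, relation, b))

    def log_frame(self, frame_idx, caption):
        self.nav_log.append((frame_idx, caption))

    def note(self, text):
        self.scratchpad.append(text)

mem = GraphPadMemory()
mem.add_object("mug_1", color="red", room="kitchen")
mem.add_relation("mug_1", "on", "table_2")
mem.log_frame(17, "kitchen table with a red mug")
mem.note("task asks about red objects; mug_1 is a candidate")
```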
Authors:Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Abstract:
Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. In this paper, we propose DS-VTON, a dual-scale virtual try-on framework that explicitly disentangles these objectives for more effective modeling. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. The second stage introduces a residual-guided diffusion process that reconstructs high-resolution outputs by refining the residual between the two scales, focusing on texture fidelity. In addition, our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks. By leveraging the semantic priors embedded in pretrained diffusion models, this design more effectively preserves the person's appearance and geometric consistency. Extensive experiments demonstrate that DS-VTON achieves state-of-the-art performance in both structural alignment and texture preservation across multiple standard virtual try-on benchmarks.
中文摘要:DS-VTON采用双尺度虚拟试穿框架,先在低分辨率下实现服装结构对齐,再通过扩散模型细化纹理,无需分割掩码即可获得卓越的结构对齐与纹理保真度。
English Summary: DS-VTON is a dual-scale virtual try-on framework that first aligns garment structure at low resolution, then refines textures through diffusion to achieve superior structural and textural fidelity without requiring segmentation masks.
Authors:Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Abstract:
Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. These two requirements map directly onto a coarse-to-fine generation paradigm, where the coarse stage handles structural alignment and the fine stage recovers rich garment details. Motivated by this observation, we propose DS-VTON, an enhanced dual-scale coarse-to-fine framework that tackles the try-on problem more effectively. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. In the second stage, a blend-refine diffusion process reconstructs high-resolution outputs by refining the residual between scales through noise-image blending, emphasizing texture fidelity and effectively correcting fine-detail errors from the low-resolution stage. In addition, our method adopts a fully mask-free generation strategy, eliminating reliance on human parsing maps or segmentation masks. Extensive experiments show that DS-VTON not only achieves state-of-the-art performance but consistently and significantly surpasses prior methods in both structural alignment and texture fidelity across multiple standard virtual try-on benchmarks.
中文摘要:DS-VTON采用双尺度虚拟试穿框架,先在低分辨率下实现服装结构对齐,再通过扩散模型细化纹理,无需分割掩码即可获得卓越的结构对齐与纹理保真度。
English Summary: DS-VTON is a dual-scale virtual try-on framework that first aligns garment structure at low resolution, then refines textures through diffusion to achieve superior structural and textural fidelity without requiring segmentation masks.
Authors:Yexiao He, Ang Li, Boyi Liu, Zhewei Yao, Yuxiong He
Abstract:
Healthcare decision-making represents one of the most challenging domains for Artificial Intelligence (AI), requiring the integration of diverse knowledge sources, complex reasoning, and various external analytical tools. Current AI systems often rely on either task-specific models, which offer limited adaptability, or general language models without grounding in specialized external knowledge and tools. We introduce MedOrch, a novel framework that orchestrates multiple specialized tools and reasoning agents to provide comprehensive medical decision support. MedOrch employs a modular, agent-based architecture that facilitates the flexible integration of domain-specific tools without altering the core system. Furthermore, it ensures transparent and traceable reasoning processes, enabling clinicians to meticulously verify each intermediate step underlying the system's recommendations. We evaluate MedOrch across three distinct medical applications: Alzheimer's disease diagnosis, chest X-ray interpretation, and medical visual question answering, using authentic clinical datasets. The results demonstrate MedOrch's competitive performance across these diverse medical tasks. Notably, in Alzheimer's disease diagnosis, MedOrch achieves an accuracy of 93.26%, surpassing the state-of-the-art baseline by over four percentage points. For predicting Alzheimer's disease progression, it attains a 50.35% accuracy, marking a significant improvement. In chest X-ray analysis, MedOrch exhibits superior performance with a Macro AUC of 61.2% and a Macro F1-score of 25.5%. Moreover, in complex multimodal visual question answering (Image+Table), MedOrch achieves an accuracy of 54.47%. These findings underscore MedOrch's potential to advance healthcare AI by enabling reasoning-driven tool utilization for multimodal medical data processing and supporting intricate cognitive tasks in clinical decision-making.
中文: MedOrch是一种新型人工智能框架,通过协调专业工具和智能代理提升医疗决策,在阿尔茨海默病诊断和胸部X光解读中凭借透明的多模态推理实现了卓越准确率。
English: MedOrch is a novel AI framework that orchestrates specialized tools and agents to enhance medical decision-making, achieving superior accuracy in diagnosing Alzheimer's disease and interpreting chest X-rays through transparent, multimodal reasoning.
Authors:Ruofan Wu, Youngwon Lee, Fan Shu, Danmei Xu, Seung-won Hwang, Zhewei Yao, Yuxiong He, Feng Yan
Abstract:
Retrieval-Augmented Generation (RAG) systems are increasingly diverse, yet many suffer from monolithic designs that tightly couple core functions like query reformulation, retrieval, reasoning, and verification. This limits their interpretability, systematic evaluation, and targeted improvement, especially for complex multi-hop question answering. We introduce ComposeRAG, a novel modular abstraction that decomposes RAG pipelines into atomic, composable modules. Each module, such as Question Decomposition, Query Rewriting, Retrieval Decision, and Answer Verification, acts as a parameterized transformation on structured inputs/outputs, allowing independent implementation, upgrade, and analysis. To enhance robustness against errors in multi-step reasoning, ComposeRAG incorporates a self-reflection mechanism that iteratively revisits and refines earlier steps upon verification failure. Evaluated on four challenging multi-hop QA benchmarks, ComposeRAG consistently outperforms strong baselines in both accuracy and grounding fidelity. Specifically, it achieves up to a 15% accuracy improvement over fine-tuning-based methods and up to a 5% gain over reasoning-specialized pipelines under identical retrieval conditions. Crucially, ComposeRAG significantly enhances grounding: its verification-first design reduces ungrounded answers by over 10% in low-quality retrieval settings, and by approximately 3% even with strong corpora. Comprehensive ablation studies validate the modular architecture, demonstrating distinct and additive contributions from each component. These findings underscore ComposeRAG's capacity to deliver flexible, transparent, scalable, and high-performing multi-hop reasoning with improved grounding and interpretability.
中文: ComposeRAG提出模块化框架,将RAG系统分解为原子组件并引入自反思机制,在多跳问答基准测试中实现了更高的准确性和答案可追溯性。
English: ComposeRAG introduces a modular framework that decomposes RAG systems into atomic components with self-reflection, achieving superior accuracy and grounding in multi-hop QA benchmarks.
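The verification-first, self-reflective composition reduces to a short control loop over atomic modules. The module implementations below are placeholders; only the composition pattern is taken from the abstract:

```python
# Sketch of a ComposeRAG-style pipeline with a self-reflection loop.
def compose_rag(question, modules, max_reflections=2):
    """modules: dict of callables, one per atomic module in the pipeline."""
    for _ in range(max_reflections + 1):
        sub_qs = modules["question_decomposition"](question)
        sub_qs = [modules["query_rewriting"](q) for q in sub_qs]
        ctx = [modules["retrieve"](q) for q in sub_qs
               if modules["retrieval_decision"](q)]      # retrieve only when useful
        answer = modules["generate"](question, ctx)
        if modules["answer_verification"](question, ctx, answer):
            return answer                                # grounded: accept
        question = modules["reflect"](question, answer)  # revisit earlier steps
    return answer  # best effort once the reflection budget is spent
```

Because each key is an independent callable, any module can be swapped, upgraded, or ablated without touching the rest, which is the point of the modular abstraction.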
Authors:Jiahe Wang, Chenda Li, Wei Wang, Wangyou Zhang, Samuele Cornell, Marvin Sach, Robin Scheibler, Kohei Saijo, Yihui Fu, Zhaoheng Ni, Anurag Kumar, Tim Fingscheidt, Shinji Watanabe, Yanmin Qian
Abstract:
The Mean Opinion Score (MOS) is fundamental to speech quality assessment. However, its acquisition requires significant human annotation. Although deep neural network approaches, such as DNSMOS and UTMOS, have been developed to predict MOS to avoid this issue, they often suffer from insufficient training data. Recognizing that the comparison of speech enhancement (SE) systems prioritizes a reliable system comparison over absolute scores, we propose URGENT-PK, a novel ranking approach leveraging pairwise comparisons. URGENT-PK takes homologous enhanced speech pairs as input to predict relative quality rankings. This pairwise paradigm efficiently utilizes limited training data, as all pairwise permutations of multiple systems constitute a training instance. Experiments across multiple open test sets demonstrate URGENT-PK's superior system-level ranking performance over state-of-the-art baselines, despite its simple network architecture and limited training data.
中文: 针对平均意见得分获取困难和深度神经网络训练数据不足的问题,提出了URGENT-PK这一利用同源增强语音对进行两两比较的新型排序方法,尽管网络结构简单且训练数据有限,但在多个测试集上展现出优于现有基准模型的系统级排序性能。
English: To address the limitations of Mean Opinion Score (MOS) acquisition and insufficient training data in deep neural network approaches, URGENT-PK is proposed as a novel ranking method using pairwise comparisons of homologous enhanced speech pairs, which demonstrates superior system-level ranking performance across multiple test sets despite its simple architecture and limited training data.
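The data-efficiency claim follows from simple combinatorics: k systems enhancing the same utterance yield k(k-1) ordered training pairs. A sketch of pair construction and a minimal pairwise ranker, with a hypothetical waveform encoder and labels derived from MOS annotations (an assumption about the training supervision):

```python
# Sketch of URGENT-PK-style pairwise training data and ranking model.
import itertools
import torch
import torch.nn as nn

def make_pairs(enhanced, mos):
    """enhanced: {system: waveform tensor} for one utterance; mos: {system: score}."""
    pairs = []
    for a, b in itertools.permutations(enhanced, 2):
        label = 1.0 if mos[a] > mos[b] else 0.0
        pairs.append((enhanced[a], enhanced[b], label))
    return pairs  # k systems -> k*(k-1) instances from a single utterance

class PairwiseRanker(nn.Module):
    def __init__(self, encoder, dim=128):
        super().__init__()
        self.encoder = encoder              # any fixed-size waveform encoder
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, wav_a, wav_b):
        h = torch.cat([self.encoder(wav_a), self.encoder(wav_b)], dim=-1)
        return torch.sigmoid(self.head(h))  # P(system A beats system B)
```

System-level rankings then follow by aggregating pairwise win probabilities across a test set.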
Authors:Chenda Li, Wangyou Zhang, Wei Wang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Yihui Fu, Marvin Sach, Zhaoheng Ni, Anurag Kumar, Tim Fingscheidt, Shinji Watanabe, Yanmin Qian
Abstract:
The vast majority of modern speech enhancement systems rely on data-driven neural network models. Conventionally, larger datasets are presumed to yield superior model performance, an observation empirically validated across numerous tasks in other domains. However, recent studies reveal diminishing returns when scaling speech enhancement data. We focus on a critical factor: prevalent quality issues in "clean" training labels within large-scale datasets. This work re-examines this phenomenon and demonstrates that, within large-scale training sets, prioritizing high-quality training data is more important than merely expanding the data volume. Experimental findings suggest that models trained on a carefully curated subset of 700 hours can outperform models trained on the 2,500-hour full dataset. This outcome highlights the crucial role of data curation in scaling speech enhancement systems effectively.
中文: 现代语音增强系统主要依赖神经网络,但单纯扩大数据规模收益递减;研究表明,在大规模训练中注重高质量数据筛选比增加数据量更重要,精心筛选的700小时数据可超越2500小时完整数据集的性能。
English: Modern speech enhancement systems increasingly depend on neural networks, yet simply expanding dataset size yields diminishing returns, as prioritizing high-quality curated data proves more effective than sheer volume, with a 700-hour subset outperforming a 2,500-hour full dataset.
Authors:Roy Colglazier, Jisoo Lee, Haoyu Dong, Hanxue Gu, Yaqian Chen, Joseph Cao, Zafer Yildiz, Zhonghao Liu, Nicholas Konz, Jichen Yang, Jikai Zhang, Yuwen Chen, Lin Li, Adrian Camarena, Maciej A. Mazurowski
Abstract:
The quantity and quality of muscles are increasingly recognized as important predictors of health outcomes. While MRI offers a valuable modality for such assessments, obtaining precise quantitative measurements of musculature remains challenging. This study aimed to develop a publicly available model for muscle segmentation in MRIs and demonstrate its applicability across various anatomical locations and imaging sequences. A total of 362 MRIs from 160 patients at a single tertiary center (Duke University Health System, 2016-2020) were included, with 316 MRIs from 114 patients used for model development. The model was tested on two separate sets: one with 28 MRIs representing common sequence types, achieving an average Dice Similarity Coefficient (DSC) of 88.45%, and another with 18 MRIs featuring less frequent sequences and abnormalities such as muscular atrophy, hardware, and significant noise, achieving 86.21% DSC. These results demonstrate the feasibility of a fully automated deep learning algorithm for segmenting muscles on MRI across diverse settings. The public release of this model enables consistent, reproducible research into the relationship between musculature and health.
Chinese: 本研究开发了一个公开可用的深度学习模型,用于在磁共振成像中自动分割肌肉,在不同成像条件和解剖部位均表现出高精度,从而为肌肉与健康关系的持续研究提供了可靠工具。
English: This study developed a publicly available deep learning model for automated muscle segmentation in MRIs, achieving high accuracy across diverse imaging conditions and anatomical locations, thereby facilitating consistent research on musculature and health outcomes.
Authors:Jun Zhu, Yin Xu, Dazhi He, Haoyang Li, Yunfeng Guan, Wenjun Zhang
Abstract:
Affine frequency division multiplexing (AFDM) is a promising chirp-assisted multicarrier waveform for future high mobility communications. A significant challenge in MIMO-AFDM systems is the multi-user interference (MUI), which can be effectively addressed by employing precoding techniques. However, the complexity introduced by AFDM makes the precoding process computationally expensive and challenging. To overcome this issue, we exploit the sparsity of the AFDM channel and apply the Preconditioned Conjugate Gradient (PCG) method to compute the precoding iteratively, thereby reducing the complexity of the precoding design. Simulation results demonstrate that the proposed sparsification approach, coupled with the PCG method, achieves strong precoding performance while significantly reducing computational complexity. This makes the application of AFDM more feasible and efficient for high-mobility communication scenarios, paving the way for its broader implementation in next-generation communication systems.
中文: AFDM是一种适用于高移动通信的波形,通过结合其信道稀疏性和PCG方法,在保持性能的同时显著降低了预编码复杂度。
English: AFDM is a promising waveform for high-mobility communications, and by leveraging its channel sparsity with the PCG method, precoding complexity is reduced while maintaining performance.
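For context, the PCG routine itself is standard; the contribution described above is pairing it with a sparsified AFDM channel so the matrix-vector products stay cheap. A generic solver sketch, assuming an RZF-style system (H^H H + sigma I) x = H^H s and a simple preconditioner such as the inverse of the diagonal (both assumptions):

```python
# Generic preconditioned conjugate gradient for a Hermitian positive-definite A.
import numpy as np

def pcg(A, b, M_inv, x0=None, tol=1e-8, max_iter=200):
    """Solve A x = b; M_inv approximates A^{-1} (e.g. inverse diagonal of A)."""
    x = np.zeros_like(b) if x0 is None else x0
    r = b - A @ x
    z = M_inv @ r
    p = z.copy()
    rz = np.vdot(r, z)                  # vdot conjugates its first argument
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / np.vdot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv @ r
        rz_new = np.vdot(r, z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```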
Authors:Kevin Duh, Eugene Yang, Orion Weller, Andrew Yates, Dawn Lawrie
Abstract:
The HLTCOE LiveRAG submission utilized the GPT-researcher framework for researching the context of the question, filtering the returned results, and generating the final answer. The retrieval system was a ColBERT bi-encoder architecture, which represents a passage with many dense tokens. Retrieval used a local, compressed index of the FineWeb10-BT collection created with PLAID-X, using a model fine-tuned for multilingual retrieval. Query generation from context was done with Qwen2.5-7B-Instruct, while filtering was accomplished with m2-bert-80M-8k-retrieval. Up to nine passages were used as context to generate an answer using Falcon3-10B. This system placed 5th in the LiveRAG automatic evaluation for correctness with a score of 1.07.
Chinese: HLTCOE LiveRAG系统采用多阶段处理流程,通过GPT-researcher分析问题、ColBERT从压缩多语言索引中检索、Falcon3-10B生成答案,在自动评估中以1.07的正确性得分位列第五。
English: The HLTCOE LiveRAG system employed a multi-stage pipeline using GPT-researcher for question analysis, ColBERT for retrieval from a compressed multilingual index, and Falcon3-10B for answer generation, achieving fifth place in the automatic evaluation with a correctness score of 1.07.
Authors:Mengqi Zhou, Xipeng Wang, Yuxi Wang, Zhaoxiang Zhang
Abstract:
Generating realistic 3D indoor scenes from user inputs remains a challenging problem in computer vision and graphics, requiring careful balance of geometric consistency, spatial relationships, and visual realism. While neural generation methods often produce repetitive elements due to limited global spatial reasoning, procedural approaches can leverage constraints for controllable generation but struggle with multi-constraint scenarios. When constraints become numerous, object collisions frequently occur, forcing the removal of furniture items and compromising layout completeness.
To address these limitations, we propose RoomCraft, a multi-stage pipeline that converts real images, sketches, or text descriptions into coherent 3D indoor scenes. Our approach combines a scene generation pipeline with a constraint-driven optimization framework. The pipeline first extracts high-level scene information from user inputs and organizes it into a structured format containing room type, furniture items, and spatial relations. It then constructs a spatial relationship network to represent furniture arrangements and generates an optimized placement sequence using a heuristic-based depth-first search (HDFS) algorithm to ensure layout coherence. To handle complex multi-constraint scenarios, we introduce a unified constraint representation that processes both formal specifications and natural language inputs, enabling flexible constraint-oriented adjustments through a comprehensive action space design. Additionally, we propose a Conflict-Aware Positioning Strategy (CAPS) that dynamically adjusts placement weights to minimize furniture collisions and ensure layout completeness.
Extensive experiments demonstrate that RoomCraft significantly outperforms existing methods in generating realistic, semantically coherent, and visually appealing room layouts across diverse input modalities.
中文:RoomCraft是一种多阶段流程,通过结合场景生成与约束驱动优化,将多种输入转换为连贯的3D室内场景,在真实性和布局完整性上显著优于现有方法。
English: RoomCraft is a multi-stage pipeline that converts various inputs into coherent 3D indoor scenes by combining scene generation with constraint-driven optimization, significantly outperforming existing methods in realism and layout completeness.
Authors:Tianshu Yu, Chao Xiang, Mingchuan Yang, Pei Ke, Bosi Wen, Cunxiang Wang, Jiale Cheng, Li Zhang, Xinyu Mu, Chuxiong Sun, Minlie Huang
Abstract:
Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks, i.e., dialog generation, summarization, question answering, mathematical reasoning, and code generation, and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method's effectiveness in enhancing LLM critique-refinement loops.
中文: RCO框架通过精炼信号训练评论模型,专注于推动显著改进的评论,在多项任务中超越传统方法,有效提升大语言模型的评论与优化循环。
English: The RCO framework trains critic models using refinement signals to enhance LLM critiques, outperforming traditional methods across multiple tasks by focusing on critiques that drive meaningful improvements.
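The critique utility (CU) signal reduces to a before-and-after comparison under a task reward. A minimal sketch with placeholder generation and reward calls (the names are assumptions):

```python
# Sketch of computing critique utility (CU) for one (prompt, response) pair.
def critique_utility(actor, critic, reward_fn, prompt, response):
    critique = critic.generate(prompt, response)        # critic proposes feedback
    refined = actor.refine(prompt, response, critique)  # actor applies it
    cu = reward_fn(prompt, refined) - reward_fn(prompt, response)
    return critique, cu  # CU > 0 only when the critique drove a real improvement
```

Training the critic to maximize CU rewards critiques by their downstream effect, which is why no direct critique-preference labels are needed.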
Authors:Ernie Chang, Yang Li, Patrick Huber, Vish Vogeti, David Kant, Yangyang Shi, Vikas Chandra
Abstract:
In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities as the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework yields significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.
Chinese: 本研究提出一种利用训练轨迹中的检查点模型优化数据混合的框架,通过聚合其对源数据的影响,在推理基准测试中实现了高达1.93%的性能提升。
English: This study introduces a framework that leverages checkpoint models from training trajectories to optimize data mixtures, achieving up to 1.93% performance improvement on reasoning benchmarks by using their aggregated influence on source data.
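A first-order influence approximation of the kind named above can be sketched as a gradient-alignment score, aggregated over checkpoints and converted into mixture weights. The function names and the softmax weighting are assumptions:

```python
# Sketch: score a source-data batch by gradient alignment with a benchmark batch.
import torch

def influence_score(model, bench_loss, domain_loss):
    """First-order influence: grad(benchmark loss) . grad(source-domain loss)."""
    g_bench = torch.autograd.grad(bench_loss, model.parameters(), retain_graph=True)
    g_dom = torch.autograd.grad(domain_loss, model.parameters())
    return sum((gb * gd).sum() for gb, gd in zip(g_bench, g_dom)).item()

def mixture_weights(influences):
    """influences: {domain: influence summed across checkpoint models}."""
    vals = torch.tensor(list(influences.values()))
    return dict(zip(influences, torch.softmax(vals, dim=0).tolist()))
```

Each checkpoint contributes scores for the capabilities it exhibits, so the aggregated weights reflect which source domains support which emerging capability.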
Authors:Yuxiang Ge, Jionghao Cheng, Ruiquan Ge, Zhaojie Fang, Gangyong Jia, Xiang Wan, Nannan Li, Ahmed Elazab, Changmiao Wang
Abstract:
Reconstructing 3D visual stimuli from Electroencephalography (EEG) data holds significant potential for applications in Brain-Computer Interfaces (BCIs) and aiding individuals with communication disorders. Traditionally, efforts have focused on converting brain activity into 2D images, neglecting the translation of EEG data into 3D objects. This limitation is noteworthy, as the human brain inherently processes three-dimensional spatial information regardless of whether observing 2D images or the real world. The neural activities captured by EEG contain rich spatial information that is inevitably lost when reconstructing only 2D images, thus limiting its practical applications in BCI. The transition from EEG data to 3D object reconstruction faces considerable obstacles. These include the presence of extensive noise within EEG signals and a scarcity of datasets that include both EEG and 3D information, which complicates the extraction process of 3D visual data. Addressing this challenging task, we propose an innovative EEG encoder architecture that integrates a dual self-attention mechanism. We use a hybrid training strategy to train the EEG Encoder, which includes cross-attention, contrastive learning, and self-supervised learning techniques. Additionally, by employing stable diffusion as a prior distribution and utilizing Variational Score Distillation to train a neural radiance field, we successfully generate 3D objects with similar content and structure from EEG data.
中文: 本研究提出了一种创新的脑电图编码器,采用双重自注意力机制和混合训练策略,通过结合稳定扩散和变分分数蒸馏方法,成功从脑电信号中重建出内容与结构相似的3D物体,克服了信号噪声和数据稀缺等难题。
English: This study introduces a novel EEG encoder with a dual self-attention mechanism and hybrid training strategy to reconstruct 3D objects from EEG data, overcoming challenges like signal noise and data scarcity by integrating stable diffusion and Variational Score Distillation techniques.
Authors:Xin Lu, Xueyang Fu, Jie Xiao, Zihao Fan, Yurui Zhu, Zheng-Jun Zha
Abstract:
While diffusion models demonstrate strong generative capabilities in image restoration (IR) tasks, their complex architectures and iterative processes limit their practical application compared to mainstream reconstruction-based general ordinary IR networks. Existing approaches primarily focus on optimizing network architecture and diffusion paths but overlook the integration of the diffusion training paradigm within general ordinary IR frameworks. To address these challenges, this paper elucidates key principles for adapting the diffusion training paradigm to general IR training through systematic analysis of time-step dependencies, network hierarchies, noise-level relationships, and multi-restoration task correlations, proposing a new IR framework supported by diffusion-based training. To enable IR networks to simultaneously restore images and model generative representations, we introduce a series of regularization strategies that align diffusion objectives with IR tasks, improving generalization in single-task scenarios. Furthermore, recognizing that diffusion-based generation exerts varying influences across different IR tasks, we develop an incremental training paradigm and task-specific adaptors, further enhancing performance in multi-task unified IR. Experiments demonstrate that our method significantly improves the generalization of IR networks in single-task IR and achieves superior performance in multi-task unified IR. Notably, the proposed framework can be seamlessly integrated into existing general IR architectures.
中文摘要:本文提出一种基于扩散模型的训练框架,通过系统化正则化策略与任务适配机制,将生成式训练范式融入通用图像修复网络,在提升单任务泛化能力的同时,显著增强了多任务统一修复性能,且能无缝兼容现有架构。
English Summary: This paper introduces a diffusion-based training framework that enhances image restoration networks by integrating generative principles through systematic regularization and task-specific adaptations, improving both single-task generalization and multi-task unified performance within existing architectures.
Authors:Gongjian Sun, Mingyu Yan, Dengke Han, Runzhen Xue, Duo Wang, Xiaochun Ye, Dongrui Fan
Abstract:
Graph Neural Networks (GNNs) have demonstrated significant success in graph learning and are widely adopted across various critical domains. However, the irregular connectivity between vertices leads to inefficient neighbor aggregation, resulting in substantial irregular and coarse-grained DRAM accesses. This lack of data locality presents significant challenges for execution platforms, ultimately degrading performance. While previous accelerator designs have leveraged on-chip memory and data access scheduling strategies to address this issue, they still inevitably access features at irregular addresses from DRAM. In this work, we propose LiGNN, a hardware-based solution that improves data locality by applying dropout and merge techniques during neighbor aggregation to accelerate GNN training. Unlike conventional algorithm-level dropout methods that primarily aim to improve accuracy while overlooking hardware costs, LiGNN introduces a locality-aware feature dropout mechanism. This approach selectively drops node features with data locality awareness, effectively reducing irregular DRAM accesses without compromising model accuracy. Moreover, by leveraging detailed knowledge of memory layout and organization, including critical alignment constraints, LiGNN strategically merges memory accesses during neighbor aggregation at the DRAM row level, guided by GNN-level semantics. This optimization significantly improves data locality with minimal additional cost. Under the commonly adopted 0.5 dropout rate, LiGNN outperforms state-of-the-art methods, delivering a 1.48x to 3.02x speedup, reducing DRAM accesses by 34% to 55%, and lowering DRAM row activations by 59% to 82%, all while maintaining model accuracy.
Chinese: LiGNN是一种基于硬件的解决方案,通过局部感知特征丢弃和策略性内存访问合并技术,在保持模型精度的同时显著提升了图神经网络训练的数据局部性,实现了性能加速并大幅降低了DRAM访问开销。
English: LiGNN is a hardware-based solution that enhances data locality in Graph Neural Network training through locality-aware feature dropout and strategic memory access merging, achieving significant speedups and reduced DRAM usage without compromising model accuracy.
Authors:Jiuyu Liu, Yi Ma, Qihao Peng, Rahim Tafazolli
Abstract:
In this paper, a cluster-aware two-stage multiple-input multiple-output (MIMO) detection method is proposed for direct-to-cell satellite communications. The method achieves computational efficiency by exploiting a distinctive property of satellite MIMO channels: users within the same geographical cluster exhibit highly correlated channel characteristics due to their physical proximity, which typically impedes convergence in conventional iterative MIMO detectors. The proposed method implements a two-stage strategy that first eliminates intra-cluster interference using computationally efficient small matrix inversions, then utilizes these pre-computed matrices to accelerate standard iterative MIMO detectors such as Gauss-Seidel (GS) and symmetric successive over-relaxation (SSOR) for effective inter-cluster interference cancellation. Computer simulations demonstrate that the proposed method achieves more than 12 times faster convergence under perfect channel state information. Even when accounting for channel estimation errors, the method maintains 9 times faster convergence, demonstrating its robustness and effectiveness for next-generation satellite MIMO communications.
中文: 本文提出了一种面向卫星通信的集群感知两阶段MIMO检测方法,通过利用地理集群内用户信道高度相关的特性,在理想条件下实现12倍以上的加速收敛,并在存在信道估计误差时仍保持9倍的加速效果。
English: This paper introduces a cluster-aware two-stage MIMO detection method for satellite communications that accelerates convergence by leveraging user channel correlation within geographical clusters, achieving over 12 times faster processing under ideal conditions and maintaining 9 times speedup with estimation errors.
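To make the two-stage structure concrete, here is a minimal NumPy sketch (an illustration, not the authors' implementation): small per-cluster MMSE blocks are inverted once, then reused inside block Gauss-Seidel sweeps that cancel inter-cluster interference.

```python
import numpy as np

def two_stage_detect(H, y, clusters, n_iters=10, noise_var=1e-2):
    """Minimal sketch of cluster-aware two-stage detection. H: (M, K)
    channel; y: (M,) received vector; clusters: per-cluster index arrays."""
    A = H.conj().T @ H + noise_var * np.eye(H.shape[1])  # MMSE system matrix
    b = H.conj().T @ y
    # Stage 1: small per-cluster inversions remove intra-cluster interference.
    blk_inv = [np.linalg.inv(A[np.ix_(c, c)]) for c in clusters]
    x = np.zeros(H.shape[1], dtype=complex)
    # Stage 2: block Gauss-Seidel sweeps reuse the pre-computed inverses to
    # cancel inter-cluster interference and accelerate convergence.
    for _ in range(n_iters):
        for c, inv in zip(clusters, blk_inv):
            residual = b[c] - A[c] @ x + A[np.ix_(c, c)] @ x[c]
            x[c] = inv @ residual
    return x

# Toy example: 8 antennas, 4 users in two geographical clusters.
rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4)) + 1j * rng.normal(size=(8, 4))
x_true = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=4)
x_hat = two_stage_detect(H, H @ x_true, [np.array([0, 1]), np.array([2, 3])])
```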
Authors:Jan Ackermann, Jonas Kulhanek, Shengqu Cai, Haofei Xu, Marc Pollefeys, Gordon Wetzstein, Leonidas Guibas, Songyou Peng
Abstract:
In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks.
中文:本文提出的CL-Splats方法通过局部优化和变化检测,实现了对3D高斯溅射场景的高效增量更新,在保持重建质量的同时显著超越了现有技术水平。
English: This paper presents CL-Splats, an incremental update method for 3D Gaussian splatting that efficiently maintains scene reconstructions through localized optimization and change detection, achieving superior performance over existing techniques.
Authors:Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang
Abstract:
Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning--especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
中文: UniVLA是一种统一的视觉-语言-动作模型,通过将多模态信号自回归建模为离散标记序列,能够灵活学习视频数据中的时序因果关系,在机器人操作和自动驾驶任务中实现了最先进的性能表现。
English: UniVLA is a unified vision-language-action model that autoregressively processes multimodal signals as discrete tokens, enabling flexible learning from video data and achieving state-of-the-art performance in robotic manipulation and autonomous driving tasks.
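As a toy illustration of the unified token formulation, the sketch below flattens instruction, observation, and action tokens into one autoregressive sequence; the interleaving order and token ids are assumptions, not the paper's exact scheme.

```python
def build_unified_sequence(text_tokens, frame_tokens, action_tokens):
    """Toy sketch of a unified VLA token stream (layout is an assumption):
    instruction tokens first, then alternating observation/action chunks,
    so a single autoregressive model covers all three modalities."""
    seq = list(text_tokens)
    for obs, act in zip(frame_tokens, action_tokens):
        seq += list(obs) + list(act)  # model learns p(next | all past tokens)
    return seq

# Example: 2 frames of 4 visual tokens each, each followed by a 2-token action.
seq = build_unified_sequence(
    text_tokens=[101, 7592, 102],
    frame_tokens=[[9000, 9001, 9002, 9003], [9010, 9011, 9012, 9013]],
    action_tokens=[[500, 501], [502, 503]],
)
```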
Authors:Zhihao Sui, Liang Hu, Jian Cao, Usman Naseem, Zhongyuan Lai, Qi Zhang
Abstract:
Large deep learning models have achieved significant success in various tasks. However, a model's performance can degrade significantly if it must be trained on datasets whose noisy labels carry misleading or ambiguous information. To date, there has been limited investigation into how to restore performance once model degradation has been incurred by noisy-label data. Inspired by the ``forgetting mechanism'' in neuroscience, which accelerates the relearning of correct knowledge by unlearning wrong knowledge, we propose a robust model restoration and refinement (MRR) framework, COLUR (Confidence-Oriented Learning, Unlearning and Relearning). Specifically, we implement COLUR with an efficient co-training architecture to unlearn the influence of label noise, and then refine model confidence on each label for relearning. Extensive experiments on four real datasets show that COLUR consistently outperforms other SOTA methods after MRR.
中文摘要:受神经科学“遗忘机制”启发,COLUR框架通过置信度导向的遗忘与再学习过程,有效修复因噪声标签导致的模型性能退化,在广泛实验中持续优于现有最优方法。
English Summary: The COLUR framework, inspired by neuroscience's forgetting mechanism, effectively restores model performance degraded by noisy labels through a confidence-oriented process of unlearning and relearning, outperforming existing methods in extensive testing.
Authors:Zhihao Sui, Liang Hu, Jian Cao, Dora D. Liu, Usman Naseem, Zhongyuan Lai, Qi Zhang
Abstract:
Machine Unlearning (MU) technology facilitates the removal of the influence of specific data instances from trained models on request. Despite rapid advancements in MU technology, its vulnerabilities are still underexplored, posing potential risks of privacy breaches through leaks of ostensibly unlearned information. Current limited research on MU attacks requires access to original models containing privacy data, which violates the critical privacy-preserving objective of MU. To address this gap, we initiate an innovative study on recalling the forgotten class memberships from unlearned models (ULMs) without requiring access to the original one. Specifically, we implement a Membership Recall Attack (MRA) framework with a teacher-student knowledge distillation architecture, where ULMs serve as noisy labelers to transfer knowledge to student models. Then, it is translated into a Learning with Noisy Labels (LNL) problem for inferring the correct labels of the forgetting instances. Extensive experiments on state-of-the-art MU methods with multiple real datasets demonstrate that the proposed MRA strategy exhibits high efficacy in recovering class memberships of unlearned instances. As a result, our study and evaluation have established a benchmark for future research on MU vulnerabilities.
Chinese: 本文提出成员回忆攻击(MRA)框架,无需访问原始模型,即可通过师生知识蒸馏从遗忘模型中恢复被遗忘实例的类别归属,揭示了机器遗忘技术的隐私漏洞,并为后续研究建立了基准。
English: This paper proposes a Membership Recall Attack (MRA) framework that recovers the class memberships of unlearned instances from unlearned models via teacher-student knowledge distillation, without access to the original model, exposing privacy vulnerabilities in Machine Unlearning and establishing a benchmark for future research.
Authors:Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai, Yafeng Sun, Shang Gao, Songning Lai, Shanliang Yao, Xuming Hu, Ryan Wen Liu, Yutao Yue, Hui Xiong
Abstract:
Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it includes 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, for which we propose a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.
中文: 该摘要介绍了首个专为水道环境设计的标注数据集WaterCaption,以及可边缘部署的多模态模型“大禹”,其采用新型Nano Transformer适配器,在保持性能与效率平衡的同时显著提升了长文本生成能力。
English: This abstract introduces WaterCaption, the first captioning dataset for waterway environments, and Da Yu, an edge-deployable multimodal model with a novel Nano Transformer Adaptor that excels in generating detailed descriptions while balancing performance and efficiency.
Authors:Zheng Zhan, Liliang Ren, Shuohang Wang, Liyuan Liu, Yang Liu, Yeyun Gong, Yanzhi Wang, Yelong Shen
Abstract:
Linear State Space Models (SSMs) offer remarkable performance gains in efficient sequence modeling, with constant inference-time computation and memory complexity. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations, positioning them as strong alternatives to Transformers for long sequence modeling. However, efficiently scaling the expressive power of SSMs, particularly with Mixture of Experts (MoE), remains challenging, as naive integration attempts often falter or degrade performance. In this work, we introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts. By sharing routing decisions between projection layers and lightweight sub-modules within Mamba across experts, RoM leverages synergies among linear projection experts for effective and efficient sparse scaling of Mamba layers. At a scale of 1.3B active parameters (10B total) and 16K training sequence length, RoM achieves language modeling performance equivalent to a dense Mamba model requiring over 2.3x more active parameters, and demonstrates consistent perplexity across context lengths. Experimental results further show RoM effectively scales hybrid language models, yielding a 23% FLOPS saving compared to dense Mamba scaling for similar performance.
中文:RoM提出了一种利用稀疏线性投影专家混合的新方法,有效扩展了线性状态空间模型的参数规模,在显著减少激活参数和计算成本的同时,实现了优于稠密模型的性能表现。
English: RoM introduces a novel approach to efficiently scale Linear State Space Models using sparse mixtures of linear projection experts, achieving superior language modeling performance with significantly fewer active parameters and computational costs compared to dense models.
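The core routing idea can be sketched in a few lines of PyTorch; this is a hedged simplification that shares one routing decision between paired input/output projection experts and omits the SSM core entirely, with expert count and top-k chosen arbitrarily.

```python
import torch
import torch.nn as nn

class SharedRoutedProjections(nn.Module):
    """Sketch of RoM-style shared routing (not the paper's code): the same
    routing decision selects the expert for both the input and the output
    projection, so the paired experts act in concert."""
    def __init__(self, d_model=64, d_inner=128, n_experts=4, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # one shared router
        self.in_proj = nn.ModuleList([nn.Linear(d_model, d_inner) for _ in range(n_experts)])
        self.out_proj = nn.ModuleList([nn.Linear(d_inner, d_model) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        gates = torch.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        y = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():
                sel = idx[:, k] == e              # tokens routed to expert e
                h = self.in_proj[e](x[sel])       # the *same* e indexes both
                y[sel] += weights[sel, k].unsqueeze(-1) * self.out_proj[e](h)
        return y

moe = SharedRoutedProjections()
y = moe(torch.randn(10, 64))
```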
Authors:Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang
Abstract:
Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o's image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
Chinese: 为普及先进的多模态图像生成技术,我们推出了基于GPT-4o合成的ShareGPT-4o-Image数据集和Janus-4o模型,该模型不仅提升了文生图能力,还新增了图文生图功能,并通过高效训练实现了卓越性能。
English: To democratize advanced multimodal image generation, we introduce ShareGPT-4o-Image, a dataset synthesized with GPT-4o, and Janus-4o, a model that enhances text-to-image capabilities and introduces text-and-image-to-image generation with efficient training.
Authors:Haoxuan Che, Haibo Jin, Zhengrui Guo, Yi Lin, Cheng Jin, Hao Chen
Abstract:
LLMs have demonstrated significant potential in Medical Report Generation (MRG), yet their development requires large amounts of medical image-report pairs, which are commonly scattered across multiple centers. Centralizing these data is exceptionally challenging due to privacy regulations, thereby impeding model development and broader adoption of LLM-driven MRG models. To address this challenge, we present FedMRG, the first framework that leverages Federated Learning (FL) to enable privacy-preserving, multi-center development of LLM-driven MRG models, specifically designed to overcome the critical challenge of communication-efficient LLM training under multi-modal data heterogeneity. To start with, our framework tackles the fundamental challenge of communication overhead in FL-LLM tuning by employing low-rank factorization to efficiently decompose parameter updates, significantly reducing gradient transmission costs and making LLM-driven MRG feasible in bandwidth-constrained FL settings. Furthermore, we observed the dual heterogeneity in MRG under the FL scenario: varying image characteristics across medical centers, as well as diverse reporting styles and terminology preferences. To address this, we further enhance FedMRG with (1) client-aware contrastive learning in the MRG encoder, coupled with diagnosis-driven prompts, which capture both globally generalizable and locally distinctive features while maintaining diagnostic accuracy; and (2) a dual-adapter mutual boosting mechanism in the MRG decoder that harmonizes generic and specialized adapters to address variations in reporting styles and terminology. Through extensive evaluation of our established FL-MRG benchmark, we demonstrate the generalizability and adaptability of FedMRG, underscoring its potential in harnessing multi-center data and generating clinically accurate reports while maintaining communication efficiency.
中文摘要:FedMRG是首个利用联邦学习实现多中心隐私保护的医疗报告生成框架,通过低秩分解降低通信成本,并采用双重适配器机制解决数据异质性,从而在保护隐私的同时生成临床准确的医疗报告。
English Summary: FedMRG is a pioneering federated learning framework that enables privacy-preserving, multi-center development of LLM-driven medical report generation models by employing low-rank factorization for communication efficiency and dual-adapter mechanisms to handle data heterogeneity.
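The communication-saving step can be illustrated with a LoRA-style low-rank wrapper; this is a plausible stand-in for the paper's factorization, not its released code, and the rank is an arbitrary choice.

```python
import torch
import torch.nn as nn

class LowRankUpdate(nn.Module):
    """Sketch of communication-efficient tuning via low-rank factorization.
    Only A and B -- r*(d_in + d_out) values instead of d_in*d_out -- would
    need to be exchanged with the federated server each round."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pre-trained weight
        self.A = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T  # W0 x + (B A) x

layer = LowRankUpdate(nn.Linear(768, 768), rank=8)
n_sent = sum(p.numel() for p in (layer.A, layer.B))  # 12,288 vs 589,824 values
```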
Authors:Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian, Hao Si, Manabu Tsukada
Abstract:
3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, the perception capability of a single vehicle is inherently constrained by occlusion, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy. In the absence of a dedicated dataset for collaborative 3D semantic occupancy prediction, we augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor to provide dense and comprehensive occupancy annotations. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. Experimental results demonstrate that our baseline model consistently outperforms single-agent models, with increasing gains observed as the prediction range expands.
中文: 协作式3D语义占据预测通过多智能体数据融合克服单车感知局限,新建立的基准数据集和基线模型展现出更优性能,且在长距离预测中优势更为显著。
English: Collaborative 3D semantic occupancy prediction overcomes single-vehicle limitations by enabling multi-agent data fusion, with a new benchmark dataset and baseline model showing superior performance, especially over longer ranges.
Authors:Fanchen Bu, Kijung Shin
Abstract:
In unsupervised combinatorial optimization (UCO), during training, one aims to have continuous decisions that are promising in a probabilistic sense for each training instance, which enables end-to-end training on initially discrete and non-differentiable problems. At the test time, for each test instance, starting from continuous decisions, derandomization is typically applied to obtain the final deterministic decisions. Researchers have developed more and more powerful test-time derandomization schemes to enhance the empirical performance and the theoretical guarantee of UCO methods. However, we notice a misalignment between training and testing in the existing UCO methods. Consequently, lower training losses do not necessarily entail better post-derandomization performance, even for the training instances without any data distribution shift. Empirically, we indeed observe such undesirable cases. We explore a preliminary idea to better align training and testing in UCO by including a differentiable version of derandomization into training. Our empirical exploration shows that such an idea indeed improves training-test alignment, but also introduces nontrivial challenges into training.
中文: 本研究揭示了无监督组合优化中训练与测试的不匹配问题,即较低的训练损失并不能保证去随机化后的更好性能,并提出在训练中引入可微分的去随机化过程以改善对齐,但这也带来了新的训练挑战。
English: The study identifies a misalignment between training and testing in unsupervised combinatorial optimization, where lower training losses do not guarantee better post-derandomization performance, and proposes integrating a differentiable derandomization process during training to improve alignment, though it introduces new training challenges.
Authors:Nadine Imholz, Maurice Brunner, Nicolas Baumann, Edoardo Ghignone, Michele Magno
Abstract:
Unrestricted multi-agent racing presents a significant research challenge, requiring decision-making at the limits of a robot's operational capabilities. While previous approaches have either ignored spatiotemporal information in the decision-making process or been restricted to single-opponent scenarios, this work enables arbitrary multi-opponent head-to-head racing while considering the opponents' future intent. The proposed method employs a KF-based multi-opponent tracker to effectively perform opponent ReID by associating them across observations. Simultaneously, spatial and velocity GPR is performed on all observed opponent trajectories, providing predictive information to compute the overtaking maneuvers. This approach has been experimentally validated on a physical 1:10 scale autonomous racing car, achieving an overtaking success rate of up to 91.65% and demonstrating an average 10.13%-point improvement in safety at the same speed as the previous SotA. These results highlight its potential for high-performance autonomous racing.
中文: 本研究提出一种多智能体自主赛车方法,通过基于卡尔曼滤波的对手追踪和高斯过程回归轨迹预测来考量对手未来意图,在实物测试中实现了91.65%的超车成功率并显著提升了安全性。
English: This research introduces a method for multi-agent autonomous racing that incorporates opponent future intent prediction using KF-based tracking and GPR trajectory analysis, achieving a 91.65% overtaking success rate and improved safety in physical tests.
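The trajectory-prediction ingredient can be demonstrated with an off-the-shelf Gaussian process regressor; the RBF kernel and the toy circular trajectory below are assumptions for illustration, not the authors' tuned setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy opponent trajectory: 20 past (x, y) observations over 2 seconds.
t_past = np.linspace(0.0, 2.0, 20).reshape(-1, 1)
xy_past = np.column_stack([np.cos(t_past).ravel(), np.sin(t_past).ravel()])

# Spatial GPR over time; scikit-learn fits the 2-D output jointly.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(t_past, xy_past)

# Predict the opponent's next 0.5 s with uncertainty, the predictive
# information an overtaking planner would consume.
t_future = np.linspace(2.0, 2.5, 5).reshape(-1, 1)
mean, std = gpr.predict(t_future, return_std=True)
```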
Authors:Langzhang Liang, Fanchen Bu, Zixing Song, Zenglin Xu, Shirui Pan, Kijung Shin
Abstract:
The message-passing paradigm of Graph Neural Networks often struggles with exchanging information across distant nodes typically due to structural bottlenecks in certain graph regions, a limitation known as \textit{over-squashing}. To reduce such bottlenecks, \textit{graph rewiring}, which modifies graph topology, has been widely used. However, existing graph rewiring techniques often overlook the need to preserve critical properties of the original graph, e.g., \textit{spectral properties}. Moreover, many approaches rely on increasing edge count to improve connectivity, which introduces significant computational overhead and exacerbates the risk of over-smoothing. In this paper, we propose a novel graph rewiring method that leverages \textit{spectrum-preserving} graph \textit{sparsification}, for mitigating over-squashing. Our method generates graphs with enhanced connectivity while maintaining sparsity and largely preserving the original graph spectrum, effectively balancing structural bottleneck reduction and graph property preservation. Experimental results validate the effectiveness of our approach, demonstrating its superiority over strong baseline methods in classification accuracy and retention of the Laplacian spectrum.
中文: 所提出的图重布线方法采用保持谱特性的稀疏化技术,通过增强连通性缓解过挤压问题,同时维持图的稀疏性并保留原始谱特征,在分类精度和拉普拉斯谱保持方面优于基线方法。
English: The proposed graph rewiring method uses spectrum-preserving sparsification to mitigate over-squashing by enhancing connectivity while maintaining graph sparsity and preserving spectral properties, outperforming baselines in classification accuracy and Laplacian spectrum retention.
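One classical way to realize spectrum-preserving sparsification is effective-resistance edge sampling in the style of Spielman and Srivastava, sketched below as an illustrative stand-in (the paper's sparsifier may differ); nodes are assumed to be labeled 0..n-1.

```python
import numpy as np
import networkx as nx

def resistance_sparsify(G, keep_frac=0.6, seed=0):
    """Sample edges with probability proportional to effective resistance,
    which approximately preserves the Laplacian spectrum. Illustrative
    stand-in for the paper's method; nodes must be labeled 0..n-1."""
    rng = np.random.default_rng(seed)
    L = nx.laplacian_matrix(G).toarray().astype(float)
    Lp = np.linalg.pinv(L)  # pseudo-inverse yields effective resistances
    edges = list(G.edges())
    r = np.array([Lp[u, u] + Lp[v, v] - 2.0 * Lp[u, v] for u, v in edges])
    keep = rng.choice(len(edges), size=int(keep_frac * len(edges)),
                      replace=False, p=r / r.sum())
    H = nx.Graph()
    H.add_nodes_from(G.nodes())
    H.add_edges_from(edges[i] for i in keep)
    return H

H = resistance_sparsify(nx.erdos_renyi_graph(30, 0.3, seed=1))
```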
Authors:Ziran Zhu, Tongda Xu, Minye Huang, Dailan He, Xingtong Ge, Xinjie Zhang, Ling Li, Yan Wang
Abstract:
Training-free perceptual image codecs adopt a pre-trained unconditional generative model during decoding to avoid training a new conditional generative model. However, they rely heavily on diffusion inversion or sample communication, which take from one minute to an intractable amount of time to decode a single image. In this paper, we propose a training-free algorithm that improves the perceptual quality of any existing codec with a theoretical guarantee. We further propose different implementations for optimal perceptual quality when the decoding time budget is $\approx 0.1$s, $0.1-10$s, or $\ge 10$s. Our approach: 1) improves the decoding time of training-free codecs from 1 min to $0.1-10$s with comparable perceptual quality; 2) can be applied to non-differentiable codecs such as VTM; 3) can be used to improve previous perceptual codecs, such as MS-ILLM; and 4) can easily achieve a perception-distortion trade-off. Empirically, we show that our approach successfully improves the perceptual quality of ELIC, VTM, and MS-ILLM with fast decoding. It achieves FID comparable to previous training-free codecs with significantly less decoding time, and it still outperforms previous codecs based on conditional generative models, such as HiFiC and MS-ILLM, in terms of FID. The source code is provided in the supplementary material.
Chinese Summary: 本文提出一种无需训练的算法,可在0.1-10秒解码时间内提升现有图像编解码器的感知质量,在保持理论保证的同时,其FID指标优于传统基于条件生成模型的方法。
English Summary: This paper introduces a training-free algorithm that enhances the perceptual quality of existing image codecs with fast decoding times ranging from 0.1 to 10 seconds, while maintaining theoretical guarantees and outperforming previous methods in perceptual metrics like FID.
Authors:Ryota Okumura, Tadahiro Taniguchi, Akira Taniguchi, Yoshinobu Hagiwara
Abstract:
We propose co-creative learning as a novel paradigm where humans and AI, i.e., biological and artificial agents, mutually integrate their partial perceptual information and knowledge to construct shared external representations, a process we interpret as symbol emergence. Unlike traditional AI teaching based on unilateral knowledge transfer, this addresses the challenge of integrating information from inherently different modalities. We empirically test this framework using a human-AI interaction model based on the Metropolis-Hastings naming game (MHNG), a decentralized Bayesian inference mechanism. In an online experiment, 69 participants played a joint attention naming game (JA-NG) with one of three computer agent types (MH-based, always-accept, or always-reject) under partial observability. Results show that human-AI pairs with an MH-based agent significantly improved categorization accuracy through interaction and achieved stronger convergence toward a shared sign system. Furthermore, human acceptance behavior aligned closely with the MH-derived acceptance probability. These findings provide the first empirical evidence for co-creative learning emerging in human-AI dyads via MHNG-based interaction. This suggests a promising path toward symbiotic AI systems that learn with humans, rather than from them, by dynamically aligning perceptual experiences, opening a new venue for symbiotic AI alignment.
中文: 协同创造式学习通过人类与AI在部分可观测环境下基于Metropolis-Hastings命名游戏的交互,实现了双方感知经验的动态对齐,首次实证验证了人机通过双向符号涌现而非单向知识传授达成协同进化的新模式。
English: Co-creative learning enables humans and AI to jointly construct shared symbols by integrating their partial perceptions through decentralized Bayesian interaction, achieving improved categorization and sign convergence without unilateral knowledge transfer.
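The decentralized acceptance rule at the heart of the MHNG is compact enough to state directly; the probabilities below are toy values, whereas in the experiment they come from each agent's own categorization model.

```python
import numpy as np

def mh_accept(listener_prob_proposed, listener_prob_current, rng):
    """The Metropolis-Hastings acceptance rule behind the naming game:
    the listener accepts the speaker's proposed sign with probability
    min(1, P_listener(proposed) / P_listener(current)), evaluated under
    its own internal model -- no agent sees the other's perception."""
    ratio = listener_prob_proposed / max(listener_prob_current, 1e-12)
    return rng.random() < min(1.0, ratio)

rng = np.random.default_rng(0)
# A sign the listener finds half as plausible as its current one is
# accepted ~50% of the time; a more plausible sign is always accepted.
accepted = mh_accept(0.4, 0.8, rng)
```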
Authors:Yizhen Zhang, Yang Ding, Shuoshuo Zhang, Xinchen Zhang, Haoling Li, Zhong-zhi Li, Peijie Wang, Jie Wu, Lei Ji, Yelong Shen, Yujiu Yang, Yeyun Gong
Abstract:
Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent emerging research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts and struggle to generalize to more complex, real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks, together with a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships and explore more spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling that focuses on the trajectories contributing most to learning optimal behaviors, so as to exploit learned policies effectively. We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks while preserving comparable performance on single-image tasks.
Chinese: 近期研究提出PeRL强化学习方法,通过多阶段训练和图像序列排列增强多图像位置推理能力,在多图像基准测试中实现最优性能,同时保持单图像任务的竞争力。
English: Recent research proposes PeRL, a reinforcement learning approach with multi-stage training and image sequence permutation, which achieves state-of-the-art performance on multi-image reasoning benchmarks while maintaining strong single-image task capabilities.
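The permutation augmentation is straightforward to sketch; the `<image_i>` placeholder syntax below is an assumption about the prompt format, not the paper's exact convention.

```python
import random

def permute_images(example, seed=0):
    """Sketch of PeRL-style positional augmentation: shuffle the image list
    and remap the prompt's placeholders so the QA pair stays consistent."""
    rng = random.Random(seed)
    order = list(range(len(example["images"])))
    rng.shuffle(order)
    prompt = example["prompt"]
    # Two passes avoid clobbering placeholders that were already remapped.
    for new_pos, old_pos in enumerate(order):
        prompt = prompt.replace(f"<image_{old_pos}>", f"<TMP_{new_pos}>")
    prompt = prompt.replace("<TMP_", "<image_")
    return {"images": [example["images"][i] for i in order], "prompt": prompt}

aug = permute_images({"images": ["a.jpg", "b.jpg", "c.jpg"],
                      "prompt": "Is <image_0> left of <image_2>?"})
```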
Authors:Xueyang Feng, Jingsen Zhang, Jiakai Tang, Wei Li, Guohao Cai, Xu Chen, Quanyu Dai, Yue Zhu, Zhenhua Dong
Abstract:
Recent advancements in Large Language Models (LLMs) have significantly propelled the development of Conversational Recommendation Agents (CRAs). However, these agents often generate short-sighted responses that fail to sustain user guidance and meet expectations. Although preference optimization has proven effective in aligning LLMs with user expectations, it remains costly and performs poorly in multi-turn dialogue. To address this challenge, we introduce a novel multi-turn preference optimization (MTPO) paradigm ECPO, which leverages Expectation Confirmation Theory to explicitly model the evolution of user satisfaction throughout multi-turn dialogues, uncovering the underlying causes of dissatisfaction. These causes can be utilized to support targeted optimization of unsatisfactory responses, thereby achieving turn-level preference optimization. ECPO ingeniously eliminates the significant sampling overhead of existing MTPO methods while ensuring the optimization process drives meaningful improvements. To support ECPO, we introduce an LLM-based user simulator, AILO, to simulate user feedback and perform expectation confirmation during conversational recommendations. Experimental results show that ECPO significantly enhances CRA's interaction capabilities, delivering notable improvements in both efficiency and effectiveness over existing MTPO methods.
中文: 本文提出ECPO这一新型多轮偏好优化范式,利用期望确认理论建模用户满意度演变,在消除现有方法采样开销的同时,有效提升了对话推荐系统的交互能力。
English: This paper introduces ECPO, a novel multi-turn preference optimization paradigm that models user satisfaction evolution using Expectation Confirmation Theory to efficiently enhance conversational recommendation agents' performance while eliminating the sampling overhead of existing methods.
Authors:Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
Abstract:
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from few hours to a week.
中文: Ego-R1框架采用结构化工具思维链方法,通过强化学习智能体动态选择工具进行分步推理,有效解决了超长第一人称视频的理解难题,将处理时长从数小时扩展至数周。
English: The Ego-R1 framework introduces a structured Chain-of-Tool-Thought process, where a reinforcement learning agent dynamically selects tools for step-by-step reasoning to effectively handle ultra-long egocentric video understanding tasks, extending coverage from hours to weeks.
Authors:Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen
Abstract:
Dynamic facial emotion is essential for believable AI-generated avatars, yet most systems remain visually static, limiting their use in simulations like virtual training for investigative interviews with abused children. We present a real-time architecture combining Unreal Engine 5 MetaHuman rendering with NVIDIA Omniverse Audio2Face to generate facial expressions from vocal prosody in photorealistic child avatars. Due to limited TTS options, both avatars were voiced using young adult female models from two systems to better fit character profiles, introducing a voice-age mismatch. This confound may affect audiovisual alignment. We used a two-PC setup to decouple speech generation from GPU-intensive rendering, enabling low-latency interaction in desktop and VR. A between-subjects study (N=70) compared audio+visual vs. visual-only conditions as participants rated emotional clarity, facial realism, and empathy for avatars expressing joy, sadness, and anger. While emotions were generally recognized - especially sadness and joy - anger was harder to detect without audio, highlighting the role of voice in high-arousal expressions. Interestingly, silencing clips improved perceived realism by removing mismatches between voice and animation, especially when tone or age felt incongruent. These results emphasize the importance of audiovisual congruence: mismatched voice undermines expression, while a good match can enhance weaker visuals - posing challenges for emotionally coherent avatars in sensitive contexts.
中文摘要:本研究开发了一种实时系统,通过语音韵律驱动生成具有动态面部表情的逼真儿童虚拟形象,发现尽管悲伤和喜悦等情绪可被识别,但声音与视觉的不匹配会削弱愤怒情绪的感知并影响整体情感一致性,强调了在敏感应用场景中视听协调的至关重要性。
English Summary: This study introduces a real-time system for generating photorealistic child avatars with dynamic facial expressions driven by vocal prosody, revealing that while emotions like sadness and joy were recognizable, voice-visual mismatches impaired anger perception and overall emotional coherence, highlighting the critical need for audiovisual alignment in sensitive applications.
Authors:Tianze Wang, Yifei Liu, Chen Chen, Pengfei Zuo, Jiawei Zhang, Qizhen Weng, Yin Chen, Zhenhua Han, Jieru Zhao, Quan Chen, Minyi Guo
Abstract:
Modern AI clusters, which host diverse workloads like data pre-processing, training and inference, often store the large-volume data in cloud storage and employ caching frameworks to facilitate remote data access. To avoid code-intrusion complexity and minimize cache space wastage, it is desirable to maintain a unified cache shared by all the workloads. However, existing cache management strategies, designed for specific workloads, struggle to handle the heterogeneous AI workloads in a cluster -- which usually exhibit heterogeneous access patterns and item storage granularities. In this paper, we propose IGTCache, a unified, high-efficacy cache for modern AI clusters. IGTCache leverages a hierarchical access abstraction, AccessStreamTree, to organize the recent data accesses in a tree structure, facilitating access pattern detection at various granularities. Using this abstraction, IGTCache applies hypothesis testing to categorize data access patterns as sequential, random, or skewed. Based on these detected access patterns and granularities, IGTCache tailors optimal cache management strategies including prefetching, eviction, and space allocation accordingly. Experimental results show that IGTCache increases the cache hit ratio by 55.6% over state-of-the-art caching frameworks, reducing the overall job completion time by 52.2%.
中文摘要:IGTCache是一种面向AI集群的统一缓存系统,通过分层访问模式识别和自适应管理策略,大幅提升缓存效率并缩短作业完成时间。
English Summary: IGTCache is a unified caching system for AI clusters that uses hierarchical access pattern detection and adaptive strategies to significantly improve cache performance and reduce job completion times.
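A toy stand-in for the pattern-detection step might look as follows, with simple frequency thresholds in place of the paper's hypothesis tests.

```python
from collections import Counter

def classify_accesses(offsets, seq_thresh=0.8, skew_thresh=0.5):
    """Toy stand-in for IGTCache's pattern detection: label one access
    stream as sequential, skewed, or random, and let the label pick the
    prefetching/eviction policy. Thresholds are illustrative."""
    steps = [b - a for a, b in zip(offsets, offsets[1:])]
    if steps and sum(s == 1 for s in steps) / len(steps) >= seq_thresh:
        return "sequential"  # unit-stride dominates -> prefetch ahead
    hottest = Counter(offsets).most_common(1)[0][1] / len(offsets)
    if hottest >= skew_thresh:
        return "skewed"      # a few hot items -> pin them in cache
    return "random"          # no structure -> neutral eviction

print(classify_accesses([5, 6, 7, 8, 9, 10]))  # sequential
print(classify_accesses([3, 3, 3, 9, 3, 3]))   # skewed
```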
Authors:Haoyu Dong, Yuwen Chen, Hanxue Gu, Nicholas Konz, Yaqian Chen, Qihang Li, Maciej A. Mazurowski
Abstract:
The widespread use of Magnetic Resonance Imaging (MRI) in combination with deep learning shows promise for many high-impact automated diagnostic and prognostic tools. However, training new models requires large amounts of labeled data, a challenge due to the high cost of precise annotations and data privacy. To address this issue, we introduce MRI-CORE, a vision foundation model trained using more than 6 million slices from over 110 thousand MRI volumes across 18 body locations. Our experiments show notable improvements in performance over state-of-the-art methods in 13 data-restricted segmentation tasks, as well as in image classification and zero-shot segmentation, showing the strong potential of MRI-CORE to enable data-efficient development of artificial intelligence models. We also present data on which strategies yield the most useful foundation models, and a novel analysis relating the similarity between pre-training and downstream task data to transfer learning performance. Our model is publicly available with a permissive license.
中文: MRI-CORE基础模型基于超过600万张MRI切片训练而成,在数据受限任务中显著优于现有方法,并展现出在医学影像领域开发数据高效人工智能模型的强大潜力。
English: The MRI-CORE foundation model, trained on over 6 million MRI slices, significantly outperforms existing methods in data-restricted tasks and demonstrates strong potential for developing data-efficient AI models in medical imaging.
Authors:Hanxue Gu, Yaqian Chen, Jisoo Lee, Diego Schaps, Regina Woody, Roy Colglazier, Maciej A. Mazurowski, Christopher Mantyh
Abstract:
Objective: To evaluate whether preoperative body composition metrics automatically extracted from CT scans can predict postoperative outcomes after colectomy, either alone or combined with clinical variables or existing risk predictors. Main outcomes and measures: The primary outcome was the predictive performance for 1-year all-cause mortality following colectomy. A Cox proportional hazards model with 1-year follow-up was used, and performance was evaluated using the concordance index (C-index) and Integrated Brier Score (IBS). Secondary outcomes included postoperative complications, unplanned readmission, blood transfusion, and severe infection, assessed using AUC and Brier Score from logistic regression. Odds ratios (OR) described associations between individual CT-derived body composition metrics and outcomes. Over 300 features were extracted from preoperative CTs across multiple vertebral levels, including skeletal muscle area, density, fat areas, and inter-tissue metrics. NSQIP scores were available for all surgeries after 2012.
中文摘要:本研究评估术前CT提取的身体成分指标能否独立或结合临床变量预测结肠切除术后结果,通过统计模型分析一年死亡率和并发症等指标。
English Summary: This study assesses whether preoperative CT-derived body composition metrics can predict postoperative outcomes after colectomy, either independently or combined with clinical variables, using statistical models to evaluate mortality and complications.
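A hedged sketch of the primary analysis pipeline, using synthetic stand-ins for the study's 300+ CT-derived features (which are not reproduced here): fit a Cox proportional hazards model with lifelines and read off the concordance index.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
muscle = rng.normal(140, 20, n)  # hypothetical skeletal muscle area (cm^2)
fat = rng.normal(100, 25, n)     # hypothetical visceral fat area (cm^2)
# Synthetic survival times: more muscle lowers hazard, more fat raises it.
hazard = np.exp(-0.02 * (muscle - 140) + 0.01 * (fat - 100))
time = rng.exponential(365 / hazard)
df = pd.DataFrame({
    "duration": np.minimum(time, 365.0),  # 1-year administrative censoring
    "event": (time <= 365).astype(int),   # 1 = death within follow-up
    "muscle_area": muscle,
    "visceral_fat": fat,
})
cph = CoxPHFitter().fit(df, duration_col="duration", event_col="event")
print(f"C-index: {cph.concordance_index_:.3f}")  # the study's primary metric
```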
Authors:Zhaoyang Wang, Wen Lu, Jie Li, Lihuo He, Maoguo Gong, Xinbo Gao
Abstract:
Free-energy-guided self-repair mechanisms have shown promising results in image quality assessment (IQA), but remain under-explored in video quality assessment (VQA), where temporal dynamics and model constraints pose unique challenges. Unlike static images, video content exhibits richer spatiotemporal complexity, making perceptual restoration more difficult. Moreover, VQA systems often rely on pre-trained backbones, which limits the direct integration of enhancement modules without affecting model stability. To address these issues, we propose EyeSimVQA, a novel VQA framework that incorporates free-energy-based self-repair. It adopts a dual-branch architecture, with an aesthetic branch for global perceptual evaluation and a technical branch for fine-grained structural and semantic analysis. Each branch integrates specialized enhancement modules tailored to distinct visual inputs (resized full-frame images and patch-based fragments) to simulate adaptive repair behaviors. We also explore a principled strategy for incorporating high-level visual features without disrupting the original backbone. In addition, we design a biologically inspired prediction head that models sweeping gaze dynamics to better fuse global and local representations for quality prediction. Experiments on five public VQA benchmarks demonstrate that EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods, while offering improved interpretability through its biologically grounded design.
中文: EyeSimVQA提出了一种基于自由能自修复的双分支视频质量评估框架,通过结合美学与技术分析及仿生眼动建模,在解决时序动态挑战的同时实现了领先的性能和可解释性。
English: EyeSimVQA introduces a dual-branch VQA framework using free-energy-based self-repair to address temporal challenges, combining aesthetic and technical analysis with biologically inspired gaze modeling for state-of-the-art performance and interpretability.
Authors:Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
Abstract:
Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.
Chinese: 研究表明,采用覆盖语言更广的通用分词器能显著提升模型的语言可塑性,使新语言适应率最高提升20.2%,同时对原始语言性能影响极小。
English: This study demonstrates that employing a universal tokenizer trained on more languages than the primary pretraining set significantly enhances language plasticity, enabling up to 20.2% higher adaptation to new languages with minimal performance loss on original languages.
Authors:Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li
Abstract:
Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. Additionally, these signals exhibit strong non-stationarity, with sharp peaks and abrupt changes that evolve continuously, making them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach for physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Leveraging this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time, achieving superior performance and setting new baselines in downstream tasks. Additionally, a unified multi-modal framework is constructed by integrating a pretrained EEG model, where each modality is guided through its dedicated branch and fused via learnable weighted fusion. This design effectively addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for analysis of diverse physiological signals, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications. Code and data are available at: github.com/ForeverBlue816/PhysioWave
中文摘要:提出了一种基于小波的新方法,通过捕捉多尺度时频特征来分析生理信号,引入了预训练的EMG和ECG模型及统一多模态框架,在处理噪声和变异性方面优于现有方法。
English Summary: A novel wavelet-based method is proposed to analyze physiological signals by capturing multi-scale time-frequency features, introducing pretrained EMG and ECG models and a unified multimodal framework that outperforms existing approaches in handling noise and variability.
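The wavelet front end can be approximated in a few lines with PyWavelets; the wavelet family, decomposition depth, and per-band energy features below are illustrative choices, far simpler than the pretrained PhysioWave models.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_band_energies(signal, wavelet="db4", level=5):
    """Minimal sketch of multi-scale time-frequency feature extraction:
    a discrete wavelet decomposition whose per-band energies summarize
    a non-stationary physiological signal."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([float(np.sum(c ** 2)) for c in coeffs])

# Toy "ECG-like" trace: sharp periodic peaks riding on a drifting baseline.
t = np.linspace(0, 4, 2048)
sig = 0.2 * t + np.where(np.abs((t % 1) - 0.5) < 0.02, 1.0, 0.0)
print(wavelet_band_energies(sig))  # one energy value per frequency band
```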
Authors:Jing Liu, Toshiaki Koike-Akino, Ye Wang, Hassan Mansour, Matthew Brand
Abstract:
To address the enormous size of Large Language Models (LLMs), model compression methods, such as quantization and pruning, are often deployed, especially on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees of the proposed method for pruning are also provided.
中文: 本文提出AWP方法,通过投影梯度下降统一实现激活感知权重剪枝与量化,在性能上超越现有技术,并为剪枝提供了理论收敛保证。
English: This paper introduces AWP, a unified method for activation-aware weight pruning and quantization using projected gradient descent, which outperforms existing techniques and includes theoretical convergence guarantees for pruning.
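A simplified reading of the IHT-style procedure, offered as a sketch rather than the released AWP code: alternate gradient steps on an activation-aware reconstruction loss with hard-thresholding projections onto the sparsity constraint.

```python
import torch

def iht_prune(W0, X, sparsity=0.5, lr=1e-2, steps=200):
    """Sketch of activation-aware pruning via projected gradient descent:
    descend on the activation-aware reconstruction error against the dense
    weight W0, then hard-threshold (the IHT projection) every step."""
    W = W0.clone().requires_grad_(True)
    k = max(1, int(W.numel() * (1 - sparsity)))  # number of weights to keep
    for _ in range(steps):
        loss = ((X @ W.T - X @ W0.T) ** 2).mean()  # activation-aware error
        loss.backward()
        with torch.no_grad():
            W -= lr * W.grad
            cutoff = W.abs().flatten().kthvalue(W.numel() - k + 1).values
            W[W.abs() < cutoff] = 0.0  # projection: keep the k largest weights
        W.grad = None
    return W.detach()

W0 = torch.randn(64, 64)
X = torch.randn(256, 64)  # calibration activations
W_sparse = iht_prune(W0, X, sparsity=0.5)
```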
Authors:Maurice Brunner, Edoardo Ghignone, Nicolas Baumann, Michele Magno
Abstract:
Autonomous racing has emerged as a crucial testbed for autonomous driving algorithms, necessitating a simulation environment for both vehicle dynamics and sensor behavior. Striking the right balance between vehicle dynamics and sensor accuracy is crucial for pushing vehicles to their performance limits. However, autonomous racing developers often face a trade-off between accurate vehicle dynamics and high-fidelity sensor simulations. This paper introduces R-CARLA, an enhancement of the CARLA simulator that supports holistic full-stack testing, from perception to control, using a single system. By seamlessly integrating accurate vehicle dynamics with sensor simulations, opponent simulation as NPCs, and a pipeline for creating digital twins from real-world robotic data, R-CARLA empowers researchers to push the boundaries of autonomous racing development. Furthermore, it is developed using CARLA's rich suite of sensor simulations. Our results indicate that incorporating the proposed digital-twin framework into R-CARLA enables more realistic full-stack testing, reducing the Sim-to-Real gap by 42% for car dynamics simulation and by 82% for sensor simulation across various testing scenarios.
中文摘要:R-CARLA作为CARLA模拟器的增强版本,通过整合精确车辆动力学与高保真传感器模拟及数字孪生框架,显著缩小了自动驾驶赛车开发中的仿真与现实差距。
English Summary: R-CARLA enhances the CARLA simulator by integrating accurate vehicle dynamics with high-fidelity sensor simulations and a digital-twin framework, significantly reducing the simulation-to-reality gap for autonomous racing development.
Authors:Zhengyuan Liu, Stella Xin Yin, Dion Hoe-Lian Goh, Nancy F. Chen
Abstract:
While Generative AI has demonstrated strong potential and versatility in content generation, its application to educational contexts presents several challenges. Models often fail to align with curriculum standards and maintain grade-appropriate reading levels consistently. Furthermore, STEM education poses additional challenges in balancing scientific explanations with everyday language when introducing complex and abstract ideas and phenomena to younger students. In this work, we propose COGENT, a curriculum-oriented framework for generating grade-appropriate educational content. We incorporate three curriculum components (science concepts, core ideas, and learning objectives), control readability through length, vocabulary, and sentence complexity, and adopt a ``wonder-based'' approach to increase student engagement and interest. We conduct a multi-dimensional evaluation via both LLM-as-a-judge and human expert analysis. Experimental results show that COGENT consistently produces grade-appropriate passages that are comparable or superior to human references. Our work establishes a viable approach for scaling adaptive and high-quality learning resources.
Chinese: COGENT是一个面向课程的框架,通过整合课程要素、控制可读性并采用基于好奇心的教学方法,生成适合年级的教育内容,实验证明其在制作适应性强的优质学习资源方面与人类参考材料相当或更优。
English: COGENT is a curriculum-oriented framework that generates grade-appropriate educational content by incorporating curriculum components, controlling readability, and using a wonder-based approach, proving to be comparable or superior to human references in producing adaptive, high-quality learning resources.
Authors:Abigail Copiaco, Christian Ritz, Yassine Himeur, Valsamma Eapen, Ammar Albanna, Wathiq Mansoor
Abstract:
The prevalence of Autism Spectrum Disorder (ASD) has surged rapidly over the past decade, posing significant challenges in communication, behavior, and focus for affected individuals. Current diagnostic techniques, though effective, are time-intensive, leading to high social and economic costs. This work introduces an AI-powered assistive technology designed to streamline ASD diagnosis and management, enhancing convenience for individuals with ASD and efficiency for caregivers and therapists. The system integrates transfer learning with image transforms derived from eye gaze variables to diagnose ASD. This facilitates and opens opportunities for in-home periodical diagnosis, reducing stress for individuals and caregivers, while also preserving user privacy through the use of image transforms. The accessibility of the proposed method also offers opportunities for improved communication between guardians and therapists, ensuring regular updates on progress and evolving support needs. Overall, the approach proposed in this work ensures timely, accessible diagnosis while protecting the subjects' privacy, improving outcomes for individuals with ASD.
中文: 本研究提出一种基于眼动数据图像转换和迁移学习的AI辅助系统,能够实现高效的家庭自闭症谱系障碍诊断与管理,在保障隐私的同时提升诊断可及性,并改善护理人员与治疗师之间的沟通。
English: This study presents an AI-assisted system using transfer learning and image transforms from eye gaze data to enable efficient, in-home ASD diagnosis and management, enhancing accessibility while safeguarding privacy and improving communication between caregivers and therapists.
Authors:Xingbo Fu, Zehong Wang, Zihan Chen, Jiazheng Li, Yaochen Zhu, Zhenyu Lei, Cong Shen, Yanfang Ye, Chuxu Zhang, Jundong Li
Abstract:
Graph learning models have demonstrated great prowess in learning expressive representations from large-scale graph data in a wide variety of real-world scenarios. As a prevalent strategy for training powerful graph learning models, the "pre-training, adaptation" scheme first pre-trains graph learning models on unlabeled graph data in a self-supervised manner and then adapts them to specific downstream tasks. During the adaptation phase, graph prompting emerges as a promising approach that learns trainable prompts while keeping the pre-trained graph learning models unchanged. In this paper, we present a systematic review of recent advancements in graph prompting. First, we introduce representative graph pre-training methods that serve as the foundation step of graph prompting. Next, we review mainstream techniques in graph prompting and elaborate on how they design learnable prompts for graph prompting. Furthermore, we summarize the real-world applications of graph prompting from different domains. Finally, we discuss several open challenges in existing studies with promising future directions in this field.
中文: 图提示是一种高效的适配方法,它在保持预训练图模型不变的同时学习可训练的提示,本文系统综述了该领域最新进展,包括预训练基础、提示设计技术及多样化的实际应用。
English: Graph prompting is an efficient adaptation method that learns trainable prompts while keeping pre-trained graph models fixed, with recent advancements systematically reviewed including pre-training foundations, prompt design techniques, and diverse real-world applications.
Authors:Yijie Deng, Shuaihang Yuan, Congcong Wen, Hao Huang, Anthony Tzes, Geeta Chandra Raju Bethala, Yi Fang
Abstract:
Spatial awareness is a critical capability for embodied agents, as it enables them to anticipate and reason about unobserved regions. The primary challenge arises from learning the distribution of indoor semantics, complicated by sparse, imbalanced object categories and diverse spatial scales. Existing methods struggle to robustly generate unobserved areas in real time and do not generalize well to new environments. To this end, we propose \textbf{MapBERT}, a novel framework designed to effectively model the distribution of unseen spaces. Motivated by the observation that the one-hot encoding of semantic maps aligns naturally with the binary structure of bit encoding, we, for the first time, leverage a lookup-free BitVAE to encode semantic maps into compact bitwise tokens. Building on this, a masked transformer is employed to infer missing regions and generate complete semantic maps from limited observations. To enhance object-centric reasoning, we propose an object-aware masking strategy that masks entire object categories concurrently and pairs them with learnable embeddings, capturing implicit relationships between object embeddings and spatial tokens. By learning these relationships, the model more effectively captures indoor semantic distributions crucial for practical robotic tasks. Experiments on Gibson benchmarks show that MapBERT achieves state-of-the-art semantic map generation, balancing computational efficiency with accurate reconstruction of unobserved regions.
中文:MapBERT是一种创新框架,通过BitVAE和掩码变换器有效建模并生成未观测的室内语义地图,在Gibson基准测试中实现了精度与计算效率的最优平衡。
English: MapBERT is a novel framework that uses a BitVAE and masked transformer to efficiently model and generate unobserved indoor semantic maps, achieving state-of-the-art performance in accuracy and computational efficiency on Gibson benchmarks.
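The object-aware masking strategy is easy to sketch on a tokenized semantic map; the mask token id and mask ratio below are illustrative assumptions.

```python
import torch

def object_aware_mask(token_map, mask_ratio=0.3, seed=0):
    """Sketch of object-aware masking: instead of masking random positions,
    mask every token of randomly chosen object categories at once, forcing
    the masked transformer to infer whole objects. MASK_ID is illustrative."""
    MASK_ID = -1
    g = torch.Generator().manual_seed(seed)
    cats = token_map.unique()
    n_mask = max(1, int(mask_ratio * len(cats)))
    chosen = cats[torch.randperm(len(cats), generator=g)[:n_mask]]
    masked = token_map.clone()
    masked[torch.isin(token_map, chosen)] = MASK_ID
    return masked, chosen

# 8x8 toy semantic map with categories {0: floor, 1: chair, 2: table}.
grid = torch.randint(0, 3, (8, 8))
masked, hidden_cats = object_aware_mask(grid)
```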
Authors:Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang
Abstract:
Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
中文:LaTtE-Flow模型通过基于流的创新架构统一图像理解与生成,其按时间步激活专用层组的设计在保持竞争力的同时,实现了比现有统一模型快6倍的推理速度。
English: The proposed LaTtE-Flow model unifies image understanding and generation through a novel flow-based architecture that activates specialized layer groups per timestep, achieving competitive performance with 6x faster inference than existing unified models.
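The layerwise timestep-expert routing can be pictured as a simple timestep-to-layer-slice map. Below is a hedged sketch; the contiguous, equally sized groups and the constants are assumptions, not the paper's exact partition.

```python
def expert_layers(timestep: float, n_layers: int = 24, n_groups: int = 6) -> list:
    """Route a flow-matching timestep in [0, 1) to its expert group of
    contiguous Transformer layers; only that slice is activated at this step."""
    group = min(int(timestep * n_groups), n_groups - 1)
    per_group = n_layers // n_groups
    return list(range(group * per_group, (group + 1) * per_group))

# e.g. expert_layers(0.05) -> [0, 1, 2, 3]; expert_layers(0.95) -> [20, 21, 22, 23]
```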
Authors:Wafaa Kasri, Yassine Himeur, Abigail Copiaco, Wathiq Mansoor, Ammar Albanna, Valsamma Eapen
Abstract:
Accurate Autism Spectrum Disorder (ASD) diagnosis is vital for early intervention. This study presents a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba to detect ASD using eye-tracking data. The model uses attention-based fusion to integrate visual, speech, and facial cues, capturing both spatial and temporal dynamics. Unlike traditional handcrafted methods, it applies state-of-the-art deep learning and explainable AI techniques to enhance diagnostic accuracy and transparency. Tested on the Saliency4ASD dataset, the proposed ViT-Mamba model outperformed existing methods, achieving 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity. These findings show the model's promise for scalable, interpretable ASD screening, especially in resource-constrained or remote clinical settings where access to expert diagnosis is limited.
中文: 本研究提出了一种结合视觉Transformer和视觉Mamba的混合深度学习模型,通过眼动追踪数据和多模态融合实现了卓越的自闭症谱系障碍检测性能,在资源有限环境中展现出高精度和可解释性的临床应用潜力。
English: This study introduces a hybrid deep learning model combining Vision Transformers and Vision Mamba that uses eye-tracking data with multimodal fusion to achieve superior ASD detection performance, demonstrating high accuracy and interpretability for potential clinical applications in resource-limited settings.
Authors:Xiaokun Zhang, Bo Xu, Fenglong Ma, Zhizheng Wang, Liang Yang, Hongfei Lin
Abstract:
Session-based recommendation aims to predict the intents of anonymous users based on limited behaviors. With its ability to alleviate data sparsity, contrastive learning is prevalent in this task. However, we observe that existing contrastive learning based methods still suffer from three obstacles: (1) they overlook item-level sparsity and primarily focus on session-level sparsity; (2) they typically augment sessions with ID-based operations like crop, mask and reorder, failing to ensure the semantic consistency of augmented views; (3) they treat all positive-negative signals equally, without considering their varying utility. To this end, we propose a novel multi-modal adaptive contrastive learning framework called MACL for session-based recommendation. In MACL, a multi-modal augmentation is devised to generate semantically consistent views at both item and session levels by leveraging item multi-modal features. Besides, we present an adaptive contrastive loss that distinguishes varying contributions of positive-negative signals to improve self-supervised learning. Extensive experiments on three real-world datasets demonstrate the superiority of MACL over state-of-the-art methods.
中文: 本文提出MACL多模态自适应对比学习框架,通过生成语义一致的视图并采用自适应对比损失区分信号贡献,解决了会话推荐中的关键问题,在多个数据集上验证了其优越性能。
English: This paper introduces MACL, a multi-modal adaptive contrastive learning framework that addresses limitations in session-based recommendation by generating semantically consistent views and distinguishing signal contributions through adaptive contrastive loss, demonstrating superior performance over existing methods.
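To make the "varying utility of positive-negative signals" concrete, here is a hedged sketch of a utility-weighted InfoNCE term (NumPy; MACL's actual weighting is learned, and the name `adaptive_info_nce` is ours).

```python
import numpy as np

def adaptive_info_nce(anchor, positive, negatives, neg_weights, tau=0.2):
    """InfoNCE variant in which each negative is scaled by a utility weight,
    so uninformative negatives contribute less to the loss."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos_term = np.exp(cos(anchor, positive) / tau)
    neg_term = sum(w * np.exp(cos(anchor, n) / tau)
                   for n, w in zip(negatives, neg_weights))
    return -np.log(pos_term / (pos_term + neg_term))
```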
Authors:Bimsara Pathiraja, Maitreya Patel, Shivam Singh, Yezhou Yang, Chitta Baral
Abstract:
Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit -- an instruction-based editing model trained on our scalable synthetic data generation pipeline. Our RefEdit, trained on only 20,000 editing triplets, outperforms the Flux/SD3 model-based baselines trained on millions of data. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release data \& checkpoint for reproducibility.
中文:现有图像编辑方法难以处理复杂多对象场景,为此我们提出RefEdit-Bench基准和RefEdit模型,该模型仅需少量训练数据即可实现最优性能,并显著提升传统基准表现。
English: Current image editing methods struggle with complex multi-object scenes, prompting the introduction of RefEdit-Bench benchmark and RefEdit model, which achieves state-of-the-art performance with minimal training data while enhancing traditional benchmarks.
Authors:Wenshuo Chen, Kuimou Yu, Haozhe Jia, Kaishen Yuan, Zexu Huang, Bowen Tian, Songning Lai, Hongru Xiao, Erhang Zhang, Lei Wang, Yutao Yue
Abstract:
While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulation governing morphological specialization, we propose **ANT**, an **A**daptive **N**eural **T**emporal-Aware architecture. ANT orchestrates semantic granularity through: **(i) Semantic Temporally Adaptive (STA) Module:** Automatically partitions denoising into low-frequency structural planning and high-frequency refinement via spectral analysis. **(ii) Dynamic Classifier-Free Guidance scheduling (DCFG):** Adaptively adjusts the conditional-to-unconditional ratio, enhancing efficiency while maintaining fidelity. Extensive experiments show that ANT can be applied to various baselines, significantly improving model performance and achieving state-of-the-art semantic alignment on StableMoFusion.
中文:ANT架构通过语义时序自适应模块和动态无分类器引导,解决了扩散模型中语义条件与时序频率需求不匹配的问题,在分阶段去噪中实现了结构规划与细节优化的自适应协调,显著提升了运动生成的语义对齐性能。
English: The proposed ANT architecture addresses the temporal-frequency mismatch in diffusion models by adaptively partitioning denoising stages for structural planning and detail refinement, achieving state-of-the-art performance through semantic-temporal modules and dynamic guidance scheduling.
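As a rough illustration of DCFG-style scheduling, the sketch below interpolates the classifier-free guidance weight across denoising steps. The linear schedule, its direction, and the endpoint values are all assumptions; ANT derives its partition adaptively via spectral analysis.

```python
def dcfg_weight(step: int, total_steps: int, w_early: float = 7.5, w_late: float = 2.5) -> float:
    """Interpolate the guidance weight over denoising (step 0 = start):
    stronger conditioning early for structure, weaker late for detail."""
    progress = step / max(total_steps - 1, 1)
    return w_early + (w_late - w_early) * progress

# guided estimate: eps = eps_uncond + dcfg_weight(t, T) * (eps_cond - eps_uncond)
```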
Authors:Zhengyuan Liu, Geyu Lin, Hui Li Tan, Huayun Zhang, Yanfeng Lu, Xiaoxue Gao, Stella Xin Yin, He Sun, Hock Huan Goh, Lung Hsiang Wong, Nancy F. Chen
Abstract:
The integration of generative artificial intelligence into educational applications has enhanced personalized and interactive learning experiences, and it shows strong potential to promote young learners' language acquisition. However, it is still challenging to ensure consistent and robust performance across different languages and cultural contexts, and kid-friendly design requires simplified instructions, engaging interactions, and age-appropriate scaffolding to maintain motivation and optimize learning outcomes. In this work, we introduce SingaKids, a dialogic tutor designed to facilitate language learning through picture description tasks. Our system integrates dense image captioning, multilingual dialogic interaction, speech understanding, and engaging speech generation to create an immersive learning environment in four languages: English, Mandarin, Malay, and Tamil. We further improve the system through multilingual pre-training, task-specific tuning, and scaffolding optimization. Empirical studies with elementary school students demonstrate that SingaKids provides effective dialogic teaching, benefiting learners at different performance levels.
中文:生成式人工智能在教育中提升了语言学习的个性化,但需克服跨文化和儿童友好设计的挑战;SingaKids系统通过多语言沉浸式交互和分层支持,实证研究证实其对不同水平的小学生均能实现有效教学。
English: Generative AI in education enhances personalized language learning but faces challenges in cross-cultural adaptability and child-friendly design, addressed by the multilingual SingaKids system through immersive, scaffolded interactions proven effective for young learners.
Authors:Zicheng Xu, Guanchu Wang, Guangyao Zheng, Yu-Neng Chuang, Alexander Szalay, Xia Hu, Vladimir Braverman
Abstract:
Although Large Language Models (LLMs) perform well in general fields, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to substantially degraded performance. To solve this problem, we propose Self-ensemble in this work. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. The advantage of Self-ensemble is its plug-and-play nature, where it can be integrated into existing LLM architectures based on a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem of LLMs, outperforming standard inference as well as baseline methods.
中文: 大语言模型在多选项问答中存在置信度失真问题,而提出的自集成方法通过分组选项并集成预测有效解决了该问题,无需参数调优即可实现。
English: Large Language Models suffer from confidence distortion in multi-choice questions with many options, but the proposed Self-ensemble method effectively mitigates this by grouping choices and ensembling predictions without requiring parameter tuning.
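The grouping idea is easy to sketch as plain inference-time logic. In the snippet below, `score_fn` is a hypothetical stand-in for an LLM scoring pass; the paper instead realizes the split inside a single forward pass via a designed attention mask and positional encoding.

```python
def self_ensemble(question, choices, score_fn, group_size=4):
    """Split a long choice list into small groups, pick a local winner per
    group, then decide among the winners in a final round."""
    groups = [choices[i:i + group_size] for i in range(0, len(choices), group_size)]
    winners = []
    for group in groups:
        scores = score_fn(question, group)           # one confidence per choice
        winners.append(group[scores.index(max(scores))])
    final_scores = score_fn(question, winners)
    return winners[final_scores.index(max(final_scores))]
```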
Authors:Young Jin Park, Francois Germain, Jing Liu, Ye Wang, Toshiaki Koike-Akino, Gordon Wichern, Navid Azizan, Christopher R. Laughman, Ankush Chakrabarty
Abstract:
Decision-making in building energy systems critically depends on the predictive accuracy of relevant time-series models. In scenarios lacking extensive data from a target building, foundation models (FMs) represent a promising technology that can leverage prior knowledge from vast and diverse pre-training datasets to construct accurate probabilistic predictors for use in decision-making tools. This paper investigates the applicability and fine-tuning strategies of time-series foundation models (TSFMs) in building energy forecasting. We analyze both full fine-tuning and parameter-efficient fine-tuning approaches, particularly low-rank adaptation (LoRA), by using real-world data from a commercial net-zero energy building to capture signals such as room occupancy, carbon emissions, plug loads, and HVAC energy consumption. Our analysis reveals that the zero-shot predictive performance of TSFMs is generally suboptimal. To address this shortcoming, we demonstrate that employing either full fine-tuning or parameter-efficient fine-tuning significantly enhances forecasting accuracy, even with limited historical data. Notably, fine-tuning with low-rank adaptation (LoRA) substantially reduces computational costs without sacrificing accuracy. Furthermore, fine-tuned TSFMs consistently outperform state-of-the-art deep forecasting models (e.g., temporal fusion transformers) in accuracy, robustness, and generalization across varying building zones and seasonal conditions. These results underline the efficacy of TSFMs for practical, data-constrained building energy management systems, enabling improved decision-making in pursuit of energy efficiency and sustainability.
中文摘要:通过对时序基础模型进行微调,即使在数据有限的情况下也能显著提升建筑能耗预测的准确性和计算效率,其表现优于现有模型,为能源管理决策提供有力支持。
English Summary: Fine-tuning time-series foundation models significantly enhances building energy forecasting accuracy and computational efficiency, outperforming existing models even with limited data to support energy management decisions.
Authors:Dake Guo, Jixun Yao, Linhan Ma, He Wang, Lei Xie
Abstract:
Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms.\footnote{Speech samples: https://dukguo.github.io/StreamFlow/}
Chinese: StreamFlow提出了一种新颖的神经网络架构,通过扩散变换器和局部块状感受野策略实现高质量流式语音生成,其性能媲美非流式方法,且延迟仅为180毫秒。
English: StreamFlow introduces a novel neural architecture using diffusion transformers with a local block-wise receptive field strategy to enable high-quality streaming speech generation, achieving performance comparable to non-streaming methods and a low latency of 180 ms.
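The block-wise receptive field reduces to a structured attention mask. A minimal NumPy sketch of the backward-looking variant is below (`lookback` blocks of history; StreamFlow also combines masks that admit the subsequent block, which this sketch omits).

```python
import numpy as np

def block_attention_mask(seq_len: int, block_size: int, lookback: int = 1) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask: position i may attend to position j
    iff j lies in i's block or in one of the `lookback` preceding blocks."""
    blocks = np.arange(seq_len) // block_size
    diff = blocks[:, None] - blocks[None, :]   # i's block index minus j's
    return (diff >= 0) & (diff <= lookback)
```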
Authors:Feng Shu, Jiatong Bai, Di Wu, Wei Zhu, Bin Deng, Fuhui Zhou, Jiangzhou Wang
Abstract:
As a green MIMO structure, massive H$^2$AD is viewed as a potential technology for the future 6G wireless network. For such a structure, it is a challenging task to design a low-complexity and high-performance fusion of target direction values sensed by different sub-array groups while making little use of prior knowledge. To address this issue, a lightweight Cramer-Rao lower bound (CRLB)-ratio-weight fusion (WF) method is proposed, which approximates the inverse CRLB of each subarray using antenna-number reciprocals to eliminate real-time CRLB computation. This reduces complexity and prior-knowledge dependence while preserving fusion performance. Moreover, a multi-branch deep neural network (MBDNN) is constructed to further enhance direction-of-arrival (DOA) sensing by leveraging candidate angles from multiple subarrays. The subarray-specific branch networks are integrated with a shared regression module to effectively eliminate pseudo-solutions and fuse true angles. Simulation results show that the proposed CRLB-ratio-WF method achieves DOA sensing performance comparable to CRLB-based methods, while significantly reducing the reliance on prior knowledge. More notably, the proposed MBDNN has superior performance in low-SNR ranges. At SNR $= -15$ dB, it achieves an order-of-magnitude improvement in estimation accuracy compared to the CRLB-ratio-WF method.
中文: 提出的CRLB-ratio-WF方法在降低复杂度和先验知识依赖的同时,保持了与基于CRLB方法相当的DOA感知性能,而MBDNN进一步提升了精度,尤其在低信噪比下(如-15 dB)实现了数量级的改进。
English: The proposed CRLB-ratio-WF method reduces complexity and prior knowledge dependence in massive H²AD systems while maintaining DOA sensing performance comparable to CRLB-based methods, with the MBDNN further enhancing accuracy, especially achieving an order-of-magnitude improvement at low SNR levels like -15 dB.
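The fusion rule itself is lightweight enough to state in a few lines. The sketch below weights each subarray's DOA estimate by its antenna count in place of a real-time CRLB; proportional weighting is an illustrative assumption about the exact CRLB-ratio rule.

```python
import numpy as np

def crlb_ratio_fusion(doa_estimates, antenna_counts):
    """Fuse per-subarray DOA estimates with antenna-count-derived weights,
    avoiding any online CRLB evaluation."""
    weights = np.asarray(antenna_counts, dtype=float)
    weights /= weights.sum()
    return float(weights @ np.asarray(doa_estimates, dtype=float))

# e.g. crlb_ratio_fusion([30.2, 29.8, 30.5], [16, 32, 8]) -> count-weighted angle
```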
Authors:Taijin Zhao, Heqian Qiu, Yu Dai, Lanxiao Wang, Fanman Meng, Qingbo Wu, Hongliang Li
Abstract:
Few-shot object detection (FSOD) aims to detect objects with limited samples for novel classes, while relying on abundant data for base classes. Existing FSOD approaches, predominantly built on the Faster R-CNN detector, entangle objectness recognition and foreground classification within shared feature spaces. This paradigm inherently establishes class-specific objectness criteria and suffers from unrepresentative novel class samples. To resolve this limitation, we propose a Uniform Orthogonal Feature Space (UOFS) optimization framework. First, UOFS decouples the feature space into two orthogonal components, where magnitude encodes objectness and angle encodes classification. This decoupling enables transferring class-agnostic objectness knowledge from base classes to novel classes. Moreover, implementing the disentanglement requires careful attention to two challenges: (1) Base set images contain unlabeled foreground instances, causing confusion between potential novel class instances and backgrounds. (2) Angular optimization depends exclusively on base class foreground instances, inducing overfitting of angular distributions to base classes. To address these challenges, we propose a Hybrid Background Optimization (HBO) strategy: (1) Constructing a pure background base set by removing unlabeled instances in original images to provide unbiased magnitude-based objectness supervision. (2) Incorporating unlabeled foreground instances in the original base set into angular optimization to enhance distribution uniformity. Additionally, we propose a Spatial-wise Attention Disentanglement and Association (SADA) module to address task conflicts between class-agnostic and class-specific tasks. Experiments demonstrate that our method significantly outperforms existing approaches based on entangled feature spaces.
中文摘要:本文提出均匀正交特征空间(UOFS)框架,通过解耦目标性和分类特征改进小样本目标检测,采用混合背景优化策略和空间注意力解耦模块解决核心难题,显著超越了现有基于耦合特征空间的方法。
English Summary: This paper introduces a Uniform Orthogonal Feature Space (UOFS) framework that decouples objectness and classification features to improve few-shot object detection, addressing key challenges through Hybrid Background Optimization and a Spatial-wise Attention module to outperform existing methods.
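The core decoupling is a polar-style split of each feature vector. A minimal sketch (NumPy; names ours) mirroring the magnitude/angle decomposition:

```python
import numpy as np

def decouple_feature(f: np.ndarray, eps: float = 1e-8):
    """Split features into a magnitude term (class-agnostic objectness cue)
    and a unit direction (classification cue)."""
    magnitude = np.linalg.norm(f, axis=-1, keepdims=True)
    direction = f / (magnitude + eps)
    return magnitude, direction
```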
Authors:Xin Wang, Jiyao Liu, Yulong Xiao, Junzhi Ning, Lihao Liu, Junjun He, Botian Shi, Kaicheng Yu
Abstract:
Large Language Models (LLMs) are accelerating scientific idea generation, but rigorously evaluating these numerous, often superficial, AI-generated propositions for novelty and factual accuracy is a critical bottleneck; manual verification is too slow. Existing validation methods are inadequate: LLMs as standalone verifiers may hallucinate and lack domain knowledge (our findings show 60% unawareness of relevant papers in specific domains), while traditional citation networks lack explicit causality and narrative surveys are unstructured. This underscores a core challenge: the absence of structured, verifiable, and causally-linked historical data of scientific evolution. To address this, we introduce \textbf{THE-Tree} (\textbf{T}echnology \textbf{H}istory \textbf{E}volution Tree), a computational framework that constructs such domain-specific evolution trees from scientific literature. THE-Tree employs a search algorithm to explore evolutionary paths. During its node expansion, it utilizes a novel "Think-Verbalize-Cite-Verify" process: an LLM proposes potential advancements and cites supporting literature. Critically, each proposed evolutionary link is then validated for logical coherence and evidential support by a recovered natural language inference mechanism that interrogates the cited literature, ensuring that each step is grounded. We construct and validate 88 THE-Trees across diverse domains and release a benchmark dataset including up to 71k fact verifications covering 27k papers to foster further research. Experiments demonstrate that i) in graph completion, our THE-Tree improves hit@1 by 8% to 14% across multiple models compared to traditional citation networks; ii) for predicting future scientific developments, it improves the hit@1 metric by nearly 10%; and iii) when combined with other methods, it boosts the performance of evaluating important scientific papers by almost 100%.
中文: 大语言模型虽加速了科学思想的生成,但在验证新颖性和准确性方面存在瓶颈,因此我们提出了THE-Tree计算框架,它从科学文献构建领域特定的进化树,并通过“思考-表述-引用-验证”流程确保每个进化步骤的逻辑连贯性和证据支持。
English: Large Language Models (LLMs) accelerate scientific idea generation but face a bottleneck in verifying novelty and accuracy, leading to the introduction of THE-Tree, a computational framework that constructs domain-specific evolution trees from literature and validates each step through a "Think-Verbalize-Cite-Verify" process to ensure logical coherence and evidential support.
Authors:Junqi Jiang, Tom Bewley, Salim I. Amoukou, Francesco Leofante, Antonio Rago, Saumitra Mishra, Francesca Toni
Abstract:
Test-time scaling improves large language models' (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model's internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model's representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.
Chinese: 提出的表示一致性方法通过结合答案频率和内部激活一致性来增强测试时扩展,无需额外模型查询即可实现高达4%的准确率提升。
English: The proposed representation consistency method enhances test-time scaling by aggregating LLM responses based on both answer frequency and internal activation consistency, requiring no extra model queries and achieving up to 4% accuracy improvements.
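The aggregation rule can be sketched directly from the description: each distinct answer is weighted by its count times the consistency of the activations behind it. The exact weighting in the paper may differ; cosine similarity over cached activations is our assumption.

```python
import numpy as np
from collections import defaultdict

def rc_aggregate(answers, activations):
    """Pick the answer maximizing count x mean pairwise cosine similarity of
    the (unit-normalized) activations of the responses that produced it."""
    groups = defaultdict(list)
    for ans, act in zip(answers, activations):
        groups[ans].append(act / np.linalg.norm(act))
    def consistency(vecs):
        if len(vecs) == 1:
            return 1.0
        sims = np.stack(vecs) @ np.stack(vecs).T
        n = len(vecs)
        return float((sims.sum() - n) / (n * (n - 1)))  # mean off-diagonal cosine
    scores = {a: len(v) * consistency(v) for a, v in groups.items()}
    return max(scores, key=scores.get)
```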
Authors:Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
Abstract:
Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as "faster" or "calmer", adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.
中文: 本研究提出了一种新颖的基于Transformer的模型,通过文本提示(如“更快”或“更平静”)动态控制对话系统中的话轮转换预测,利用大量人类对话数据和LLM生成的合成提示,有效提高了预测准确性并适应不同对话场景。
English: This study introduces a novel transformer-based model that dynamically controls turn-taking prediction in dialogue systems using textual prompts like "faster" or "calmer," improving accuracy and adaptability based on extensive human dialogue data and LLM-generated synthetic prompts.
Authors:Junqi Jiang, Antonio Rago, Francesco Leofante, Francesca Toni
Abstract:
In machine learning, it is common to obtain multiple equally performing models for the same prediction task, e.g., when training neural networks with different random seeds. Model multiplicity (MM) is the situation which arises when these competing models differ in their predictions for the same input, for which ensembling is often employed to determine an aggregation of the outputs. Providing recourse recommendations via counterfactual explanations (CEs) under MM thus becomes complex, since the CE may not be valid across all models, i.e., the CEs are not robust under MM. In this work, we formalise the problem of providing recourse under MM, which we name recourse-aware ensembling (RAE). We propose the idea that under MM, CEs for each individual model should be considered alongside their predictions so that the aggregated prediction and recourse are decided in tandem. Centred around this intuition, we introduce six desirable properties for solutions to this problem. For solving RAE, we propose a novel argumentative ensembling method which guarantees the robustness of CEs under MM. Specifically, our method leverages computational argumentation to explicitly represent the conflicts between models and counterfactuals regarding prediction results and CE validity. It then uses argumentation semantics to resolve the conflicts and obtain the final solution, in a manner which is parametric to the chosen semantics. Our method also allows for the specification of preferences over the models under MM, allowing further customisation of the ensemble. In a comprehensive theoretical analysis, we characterise the behaviour of argumentative ensembling with four different argumentation semantics. We then empirically demonstrate the effectiveness of our approach in satisfying desirable properties with eight instantiations of our method. (Abstract is shortened for arXiv.)
中文: 本文针对模型多重性下反事实解释的鲁棒性问题,提出资源感知集成方法,通过论证式集成技术保证解释的鲁棒性,同时支持对模型的偏好设置。
English: This paper introduces recourse-aware ensembling (RAE) to address the challenge of providing robust counterfactual explanations under model multiplicity, proposing an argumentative ensembling method that guarantees explanation robustness while allowing model preferences.
Authors:Christiaan Lamers, Ahmed Nabil Belbachir, Thomas Bäck, Niki van Stein
Abstract:
Catastrophic forgetting can be trivially alleviated by keeping all data from previous tasks in memory. Therefore, minimizing the memory footprint while maximizing the amount of relevant information is crucial to the challenge of continual learning. This paper aims to decrease the memory required by memory-based continual learning algorithms. We explore options for extracting a minimal amount of information while maximally alleviating forgetting. We propose the usage of lightweight generators based on Singular Value Decomposition to enhance existing continual learning methods, such as A-GEM and Experience Replay. These generators need a minimal amount of memory while being maximally effective. They require no training time, just a single linear-time fitting step, and can capture a distribution effectively from a small number of data samples. Depending on the dataset and network architecture, our results show a significant increase in average accuracy compared to the original methods. Our method shows great potential in minimizing the memory footprint of memory-based continual learning algorithms.
中文: 本文提出基于奇异值分解的轻量生成器,在持续学习中大幅降低内存需求,无需训练时间即可显著提升准确率。
English: This paper introduces lightweight generators using Singular Value Decomposition to reduce memory requirements in continual learning, significantly boosting accuracy without training time.
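The generator idea admits a compact sketch: fit a truncated SVD to a handful of stored samples in a single step, then replay pseudo-samples from the captured subspace. How the pseudo-samples plug into A-GEM or Experience Replay is method-specific and omitted here.

```python
import numpy as np

class SVDGenerator:
    """Lightweight replay generator: one SVD fit, no training loop."""
    def __init__(self, rank: int = 8):
        self.rank = rank
    def fit(self, X: np.ndarray) -> "SVDGenerator":
        self.mean = X.mean(axis=0)
        _, S, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = Vt[: self.rank]                      # (rank, D) basis
        self.scales = S[: self.rank] / np.sqrt(max(len(X) - 1, 1))
        return self
    def sample(self, n: int, rng=None) -> np.ndarray:
        rng = rng or np.random.default_rng()
        coeffs = rng.standard_normal((n, self.rank)) * self.scales
        return self.mean + coeffs @ self.components            # (n, D) pseudo-samples
```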
Authors:Wenhan Han, Yifan Zhang, Zhixun Chen, Binbin Liu, Haobin Lin, Bingni Zhang, Taifeng Wang, Mykola Pechenizkiy, Meng Fang, Yin Zheng
Abstract:
Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench's alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. Finally, we pretrain a suite of 1.2B-parameter models on English and Chinese with 500B tokens, varying language ratios and parallel data proportions to investigate cross-lingual transfer dynamics.
中文:多语言大语言模型发展迅速,但现有评估数据集有限且缺乏跨语言对齐,为此我们推出MuBench基准,涵盖61种语言并评估广泛能力,发现模型在低资源语言上表现显著落后于英语,并提出多语言一致性作为补充指标以指导模型优化。
English: Multilingual large language models are rapidly evolving but face evaluation challenges due to limited and misaligned datasets, prompting the introduction of MuBench—a comprehensive benchmark covering 61 languages that reveals performance gaps, especially for low-resource languages, and proposes Multilingual Consistency as a metric to guide improvements.
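One plausible reading of the Multilingual Consistency (MLC) metric is agreement across aligned language variants of each item; the paper's exact definition may differ. A minimal sketch:

```python
def multilingual_consistency(preds_by_lang: dict) -> float:
    """Fraction of aligned items on which every language yields the same
    answer. preds_by_lang: language -> list of predictions, index-aligned."""
    lists = list(preds_by_lang.values())
    n_items = len(lists[0])
    agree = sum(len({preds[i] for preds in lists}) == 1 for i in range(n_items))
    return agree / n_items

# multilingual_consistency({"en": ["A", "B"], "zh": ["A", "C"]}) -> 0.5
```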
Authors:Seunghun Lee, Jihong Park, Jinho Choi, Hyuncheol Park
Abstract:
Tokens are fundamental processing units of generative AI (GenAI) and large language models (LLMs), and token communication (TC) is essential for enabling remote AI-generated content (AIGC) and wireless LLM applications. Unlike traditional bits, each of which is independently treated, the semantics of each token depends on its surrounding context tokens. This inter-token dependency makes TC vulnerable to outage channels, where the loss of a single token can significantly distort the original message semantics. Motivated by this, this paper focuses on optimizing token packetization to maximize the average token similarity (ATS) between the original and received token messages under outage channels. Due to inter-token dependency, this token grouping problem is combinatorial, with complexity growing exponentially with message length. To address this, we propose a novel framework of semantic packet aggregation with lookahead search (SemPA-Look), built on two core ideas. First, it introduces the residual semantic score (RSS) as a token-level surrogate for the message-level ATS, allowing robust semantic preservation even when a certain token packet is lost. Second, instead of full search, SemPA-Look applies a lookahead search-inspired algorithm that samples intra-packet token candidates without replacement (fixed depth), conditioned on inter-packet token candidates sampled with replacement (fixed width), thereby achieving linear complexity. Experiments on a remote AIGC task with the MS-COCO dataset (text captioned images) demonstrate that SemPA-Look achieves high ATS and LPIPS scores comparable to exhaustive search, while reducing computational complexity by up to 40$\times$. Compared to other linear-complexity algorithms such as the genetic algorithm (GA), SemPA-Look achieves 10$\times$ lower complexity, demonstrating its practicality for remote AIGC and other TC applications.\footnote{Speech samples: https://dukguo.github.io/StreamFlow/}
Chinese: 本文提出SemPA-Look框架,通过残差语义评分和前向搜索算法优化令牌分组,在易中断信道中保持语义相似性,以大幅降低的计算复杂度实现接近最优的性能。
English: This paper introduces SemPA-Look, a framework that optimizes token packetization to preserve semantic similarity in outage-prone channels by using residual semantic scores and a lookahead search algorithm, achieving near-optimal performance with significantly reduced computational complexity.
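A loose sketch of the packetization loop follows; `score` is a hypothetical surrogate for the residual semantic score (RSS, and must accept an empty list), tokens are assumed unique, and the sampling scheme is simplified relative to the paper's fixed-width/fixed-depth design.

```python
import random

def sempa_look_sketch(tokens, packet_size, score, width=8):
    """Greedy packet aggregation with sampled lookahead: for each packet,
    sample `width` candidate groupings, score each together with a sampled
    continuation, and commit the best."""
    remaining = list(tokens)
    packets = []
    while remaining:
        size = min(packet_size, len(remaining))
        best, best_val = None, float("-inf")
        for _ in range(width):
            cand = random.sample(remaining, size)
            rest = [t for t in remaining if t not in cand]
            lookahead = random.sample(rest, min(size, len(rest)))
            if (val := score(cand) + score(lookahead)) > best_val:
                best, best_val = cand, val
        packets.append(best)
        remaining = [t for t in remaining if t not in best]
    return packets
```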
Authors:Chi Xie, Shuang Liang, Jie Li, Feng Zhu, Rui Zhao, Yichen Wei, Shengjie Zhao
Abstract:
Human-Object Interaction (HOI) detection has seen substantial advances in recent years. However, existing works focus on the standard setting with ideal images and natural distribution, far from practical scenarios with inevitable distribution shifts. This hampers the practical applicability of HOI detection. In this work, we investigate this issue by benchmarking, analyzing, and enhancing the robustness of HOI detection models under various distribution shifts. We start by proposing a novel automated approach to create the first robustness evaluation benchmark for HOI detection. Subsequently, we evaluate more than 40 existing HOI detection models on this benchmark, showing their insufficiency, analyzing the features of different frameworks, and discussing how the robustness in HOI is different from other tasks. With the insights from such analyses, we propose to improve the robustness of HOI detection methods through: (1) a cross-domain data augmentation integrated with mixup, and (2) a feature fusion strategy with frozen vision foundation models. Both are simple, plug-and-play, and applicable to various methods. Our experimental results demonstrate that the proposed approach significantly increases the robustness of various methods, with benefits on standard benchmarks, too. The dataset and code will be released.
中文: 本研究针对人-物交互检测在分布偏移下的不足,建立了首个鲁棒性评估基准,并通过跨域数据增强与特征融合策略显著提升了多种方法的检测稳健性。
English: This study addresses the limitations of Human-Object Interaction detection under distribution shifts by creating a robustness benchmark and proposing cross-domain data augmentation with feature fusion to significantly enhance model performance.
Authors:Chuhao Jin, Haosen Li, Bingzi Zhang, Che Liu, Xiting Wang, Ruihua Song, Wenbing Huang, Ying Qin, Fuzheng Zhang, Di Zhang
Abstract:
Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs' autoregressive capabilities to hierarchically generate motion tokens by starting from sparse global plans and iteratively refining them into full sequences. Second, our flow-enhanced tokenizer doubles the downsampling resolution and expands the codebook size by eight times, minimizing detail loss during discretization, while a flow-enhanced decoder recovers motion nuances. Extensive experiments on text-to-motion benchmarks demonstrate that it achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation while enhancing motion diversity by 49.9% compared to existing methods. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches, establishing new standards for text-to-motion generation.
中文摘要:PlanMoGPT框架通过渐进式规划和流增强标记化解决了基于大语言模型的文本到动作生成中的关键瓶颈,在质量和多样性方面均实现了最先进的性能表现。
English Summary: The proposed PlanMoGPT framework overcomes the motion tokenization bottleneck in LLM-based text-to-motion generation through progressive planning and flow-enhanced tokenization, achieving state-of-the-art performance with significant improvements in both motion quality and diversity.
Authors:Danielle R. Thomas, Conrad Borchers, Jionghao Lin, Sanjit Kakarla, Shambhavi Bhushan, Erin Gatz, Shivang Gupta, Ralph Abboud, Kenneth R. Koedinger
Abstract:
Tutoring improves student achievement, but identifying and studying what tutoring actions are most associated with student learning at scale, based on audio transcriptions, is an open research problem. This study investigates the feasibility and scalability of using generative AI to identify and evaluate specific tutor moves in real-life math tutoring. We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics. Using GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM, we assess tutors' application of two tutor skills: delivering effective praise and responding to student math errors. All models reliably detected relevant situations, for example, a tutor providing praise to a student (94-98% accuracy) or a student making a math error (82-88% accuracy), and effectively evaluated the tutors' adherence to tutoring best practices, aligning closely with human judgments (83-89% and 73-77%, respectively). We propose a cost-effective prompting strategy and discuss practical implications for using large language models to support scalable assessment in authentic settings. This work further contributes LLM prompts to support reproducibility and research in AI-supported learning.
中文摘要:本研究证实了使用生成式AI模型有效识别和评估数学辅导中特定教学技巧的可行性,在检测教师表扬和学生错误回应方面达到高准确率,并与人工评估结果高度吻合。
English Summary: This study demonstrates the feasibility of using generative AI models to effectively identify and evaluate specific tutoring techniques in math sessions, achieving high accuracy in detecting praise delivery and error responses while closely aligning with human assessments.
Authors:Jionghao Lin, Jiarui Rao, Yiyang Zhao, Yuting Wang, Ashish Gurung, Amanda Barany, Jaclyn Ocumpaugh, Ryan S. Baker, Kenneth R. Koedinger
Abstract:
We explore the automatic generation of interactive, scenario-based lessons designed to train novice human tutors who teach middle school mathematics online. Employing prompt engineering through a Retrieval-Augmented Generation approach with GPT-4o, we developed a system capable of creating structured tutor training lessons. Our study generated lessons in English for three key topics: Encouraging Students' Independence, Encouraging Help-Seeking Behavior, and Turning on Cameras, using a task decomposition prompting strategy that breaks lesson generation into sub-tasks. The generated lessons were evaluated by two human evaluators, who provided both quantitative and qualitative evaluations using a comprehensive rubric informed by lesson design research. Results demonstrate that the task decomposition strategy led to higher-rated lessons compared to single-step generation. Human evaluators identified several strengths in the LLM-generated lessons, including well-structured content and time-saving potential, while also noting limitations such as generic feedback and a lack of clarity in some instructional sections. These findings underscore the potential of hybrid human-AI approaches for generating effective lessons in tutor training.
中文摘要:本研究利用GPT-4o和任务分解提示技术开发了自动生成互动式导师培训课程的系统,人工评估显示生成课程结构清晰且节省时间,但也存在反馈模板化等局限性。
English Summary: This study develops an AI system using GPT-4o and task decomposition prompting to automatically generate interactive tutor training lessons, with human evaluation showing structured content and time efficiency despite some generic feedback limitations.
Authors:Zeyun Deng, Jasorsi Ghosh, Fiona Xie, Yuzhe Lu, Katia Sycara, Joseph Campbell
Abstract:
Reinforcement learning algorithms often suffer from poor sample efficiency, making them challenging to apply in multi-task or continual learning settings. Efficiency can be improved by transferring knowledge from a previously trained teacher policy to guide exploration in new but related tasks. However, if the new task sufficiently differs from the teacher's training task, the transferred guidance may be sub-optimal and bias exploration toward low-reward behaviors. We propose an energy-based transfer learning method that uses out-of-distribution detection to selectively issue guidance, enabling the teacher to intervene only in states within its training distribution. We theoretically show that energy scores reflect the teacher's state-visitation density and empirically demonstrate improved sample efficiency and performance across both single-task and multi-task settings.
中文: 提出的基于能量的迁移学习方法通过离群分布检测选择性应用教师指导,理论上利用状态访问密度,实证上在多种场景中提升了强化学习的效率和性能。
English: The proposed energy-based transfer learning method improves reinforcement learning efficiency by using out-of-distribution detection to selectively apply teacher guidance, theoretically leveraging state-visitation density and empirically enhancing performance in various settings.
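The gating mechanism is straightforward to sketch: compute an energy score from the teacher's logits and defer to the teacher only when the state looks in-distribution. The `teacher.logits`/`.action` hooks and the threshold are hypothetical.

```python
import numpy as np

def energy_score(logits: np.ndarray, temperature: float = 1.0) -> float:
    """Energy = -T * logsumexp(logits / T); lower energy indicates a state
    the teacher visited often during training."""
    z = logits / temperature
    m = float(np.max(z))
    return -temperature * (m + float(np.log(np.exp(z - m).sum())))

def act(state, teacher, student, threshold: float):
    """Issue teacher guidance only for in-distribution states."""
    if energy_score(teacher.logits(state)) < threshold:
        return teacher.action(state)
    return student.action(state)
```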
Authors:Tom Jeleniewski, Hamied Nabizada, Jonathan Reif, Felix Gehlhoff, Alexander Fay
Abstract:
The formalization of process knowledge using ontologies enables consistent modeling of parameter interdependencies in manufacturing. These interdependencies are typically represented as mathematical expressions that define relations between process parameters, supporting tasks such as calculation, validation, and simulation. To support cross-context application and knowledge reuse, such expressions are often defined in a generic form and applied across multiple process contexts. This highlights the necessity of a consistent and semantically coherent model to ensure the correctness of data retrieval and interpretation. Consequently, dedicated mechanisms are required to address key challenges such as selecting context-relevant data, ensuring unit compatibility between variables and data elements, and verifying the completeness of input data required for evaluating mathematical expressions. This paper presents a set of verification mechanisms for a previously developed ontology-based process model that integrates standardized process semantics, data element definitions, and formal mathematical constructs. The approach includes (i) SPARQL-based filtering to retrieve process-relevant data, (ii) a unit consistency check based on expected-unit annotations and semantic classification, and (iii) a data completeness check to validate the evaluability of interdependencies. The applicability of the approach is demonstrated with a use case from Resin Transfer Molding (RTM), supporting the development of machine-interpretable and verifiable engineering models.
中文: 本文提出了一套基于本体过程模型的验证机制,包括SPARQL过滤、单位一致性检查和数据完整性验证,通过树脂传递模塑用例展示了其在制造过程中确保数据准确检索和解释的适用性。
English: This paper introduces verification mechanisms for an ontology-based process model to ensure accurate data retrieval and interpretation in manufacturing, including SPARQL filtering, unit consistency checks, and data completeness validation, demonstrated through a Resin Transfer Molding use case.
Authors:Peizhi Wu, Rong Kang, Tieying Zhang, Jianjun Chen, Ryan Marcus, Zachary G. Ives
Abstract:
Cardinality estimation (CardEst) is a critical aspect of query optimization. Traditionally, it leverages statistics built directly over the data. However, organizational policies (e.g., regulatory compliance) may restrict global data access. Fortunately, query-driven cardinality estimation can learn CardEst models using query workloads. However, existing query-driven models often require access to data or summaries for best performance, and they assume perfect training workloads with complete and balanced join templates (or join graphs). Such assumptions rarely hold in real-world scenarios, in which join templates are incomplete and imbalanced. We present GRASP, a data-agnostic cardinality learning system designed to work under these real-world constraints. GRASP's compositional design generalizes to unseen join templates and is robust to join template imbalance. It also introduces a new per-table CardEst model that handles value distribution shifts for range predicates, and a novel learned count sketch model that captures join correlations across base relations. Across three database instances, we demonstrate that GRASP consistently outperforms existing query-driven models on imperfect workloads, both in terms of estimation accuracy and query latency. Remarkably, GRASP achieves performance comparable to, or even surpassing, traditional approaches built over the underlying data on the complex CEB-IMDb-full benchmark -- despite operating without any data access and using only 10% of all possible join templates.
中文摘要:GRASP是一种无需数据访问的基数估计系统,能有效应对现实场景中连接模板不完整与数据访问受限的挑战,在估计精度和查询延迟上均优于现有方法,其性能甚至可与依赖底层数据的传统方法相媲美。
English Summary: GRASP is a data-agnostic cardinality estimation system that effectively handles real-world constraints like incomplete join templates and data access restrictions, outperforming existing methods in accuracy and latency while matching traditional data-dependent approaches.
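For intuition about the learned count sketch component, here is the classic (non-learned) count sketch it generalizes; per the abstract, GRASP replaces the fixed updates with trained components, so treat this purely as background.

```python
import numpy as np

class CountSketch:
    """Classic count sketch over join-key values: signed hashed counters,
    with a median-of-rows estimator."""
    def __init__(self, rows: int = 4, cols: int = 1024, seed: int = 0):
        self.table = np.zeros((rows, cols))
        self.salts = np.random.default_rng(seed).integers(1, 2**31, size=rows)
        self.cols = cols
    def _slot(self, r: int, key):
        h = hash((int(self.salts[r]), key))
        return h % self.cols, 1 if (h >> 20) % 2 == 0 else -1
    def update(self, key, count: int = 1) -> None:
        for r in range(len(self.salts)):
            col, sign = self._slot(r, key)
            self.table[r, col] += sign * count
    def estimate(self, key) -> float:
        vals = []
        for r in range(len(self.salts)):
            col, sign = self._slot(r, key)
            vals.append(sign * self.table[r, col])
        return float(np.median(vals))
```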
Authors:Ankan Deria, Adinath Madhavrao Dukre, Feilong Tang, Sara Atito, Sudipta Roy, Muhammad Awais, Muhammad Haris Khan, Imran Razzak
Abstract:
Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce \textbf{Value-guided Inference with Margin-based Reward (ViMaR)}, a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4$\times$ speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B, \textit{generalizes effectively to guide decoding in a stronger unseen model}. To further validate this, we adapt the ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
中文摘要:ViMaR是一种新颖的双阶段推理框架,通过结合价值引导选择与边界感知奖励机制,显著提升了视觉语言模型的推理效率和输出可靠性,在实现加速的同时生成更准确、详细的描述,并展现出优秀的跨模型泛化能力。
English Summary: ViMaR is a novel two-stage inference framework that enhances efficiency and output reliability in vision-language models by combining value-guided selection with margin-based penalties, achieving significant speedup and improved caption accuracy while demonstrating strong cross-model generalization.
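Stage 1 of the framework reduces to a value-guided argmax with a margin penalty. The sketch below uses hypothetical `value_fn`/`confidence_fn` hooks in place of the temporal-difference value model and generator confidences; the constants are illustrative.

```python
def select_caption(candidates, value_fn, confidence_fn, margin=0.2, penalty=0.5):
    """Pick the highest-value caption, penalizing low-confidence candidates
    so they are down-ranked rather than merely tied."""
    def adjusted(caption):
        v = value_fn(caption)
        return v - penalty if confidence_fn(caption) < margin else v
    return max(candidates, key=adjusted)
```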
Authors:Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Bram Grooten, Meng Fang, Yali Du, Mykola Pechenizkiy
Abstract:
Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms, with environment availability strongly impacting research. One particularly underexplored intersection is continual learning (CL) in cooperative multi-agent settings. To remedy this, we introduce MEAL (Multi-agent Environments for Adaptive Learning), the first benchmark tailored for continual multi-agent reinforcement learning (CMARL). Existing CL benchmarks run environments on the CPU, leading to computational bottlenecks and limiting the length of task sequences. MEAL leverages JAX for GPU acceleration, enabling continual learning across sequences of 100 tasks on a standard desktop PC in a few hours. We show that naively combining popular CL and MARL methods yields strong performance on simple environments, but fails to scale to more complex settings requiring sustained coordination and adaptation. Our ablation study identifies architectural and algorithmic features critical for CMARL on MEAL.
中文:MEAL是首个专为持续多智能体强化学习设计的基准平台,通过JAX实现GPU加速,能在普通电脑上高效完成百项任务序列学习,并发现现有方法在复杂协作场景中的组合存在明显局限性。
English: The MEAL benchmark is introduced as the first dedicated testbed for continual multi-agent reinforcement learning, utilizing JAX for GPU acceleration to enable efficient learning across 100 tasks and revealing limitations in naive combinations of existing methods for complex cooperative settings.
Authors:Yang Yao, Lingyu Li, Jiaxin Song, Chiyu Chen, Zhenqi He, Yixu Wang, Xin Wang, Tianle Gu, Jie Li, Yan Teng, Yingchun Wang
Abstract:
As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.
Chinese: 本文提出具有双重难度级别的多模态基准Argus Inspection及Eye of Panoptes框架,旨在解决多模态大语言模型在视觉细粒度感知与常识因果推理方面的不足,通过对26个主流模型的实验表明,其最高推理性能仅为0.46,揭示出巨大的改进空间。
English: This paper introduces Argus Inspection, a multimodal benchmark with dual difficulty levels, and the Eye of Panoptes framework to address MLLMs' limitations in visual fine-grained perception and commonsense causal inference, revealing through experiments on 26 models that the highest reasoning performance remains at 0.46, indicating significant room for improvement.
Authors:Jiaming Yu, Le Liang, Hao Ye, Shi Jin
Abstract:
High-density Wi-Fi deployments often result in significant co-channel interference, which degrades overall network performance. To address this issue, coordination of multiple access points (APs) has been considered to enable coordinated spatial reuse (CSR) in next-generation wireless local area networks. This paper tackles the challenge of downlink spatial reuse in Wi-Fi networks, specifically in scenarios involving overlapping basic service sets, by employing hierarchical multi-agent reinforcement learning (HMARL). We decompose the CSR process into two phases, i.e., a polling phase and a decision phase, and introduce the HMARL algorithm to enable efficient CSR. To enhance training efficiency, the proposed HMARL algorithm employs a hierarchical structure, where station selection and power control are determined by high- and low-level policy networks, respectively. Simulation results demonstrate that this approach consistently outperforms baseline methods in terms of throughput and latency across various network topologies. Moreover, the algorithm exhibits robust performance when coexisting with legacy APs. Additional experiments in a representative topology further reveal that the carefully designed reward function not only maximizes the overall network throughput, but also improves fairness in transmission opportunities for APs in high-interference regions.
中文摘要:本文采用分层多智能体强化学习(HMARL)方法优化Wi-Fi网络下行空间复用,在提升网络吞吐量和降低延迟的同时,与传统接入点保持良好兼容性,并通过精心设计的奖励函数增强了高干扰区域接入点的传输公平性。
English Summary: This paper proposes a hierarchical multi-agent reinforcement learning (HMARL) approach to optimize downlink spatial reuse in Wi-Fi networks, significantly improving throughput and latency while maintaining robust performance with legacy access points.
Authors:Milapji Singh Gill, Tom Jeleniewski, Felix Gehlhoff, Alexander Fay
Abstract:
Time-continuous dynamic models are essential for various Cyber-Physical System (CPS) applications. To ensure effective usability in different lifecycle phases, such behavioral information in the form of differential equations must be contextualized and integrated with further CPS information. While knowledge graphs provide a formal description and structuring mechanism for this task, there is a lack of reusable ontological artifacts and methods to reduce manual instantiation effort. Hence, this contribution introduces two artifacts: Firstly, a modular semantic model based on standards is introduced to represent differential equations directly within knowledge graphs and to enrich them semantically. Secondly, a method for efficient knowledge graph generation is presented. A validation of these artifacts was conducted in the domain of aviation maintenance. Results show that differential equations of a complex Electro-Hydraulic Servoactuator can be formally represented in a knowledge graph and be contextualized with other lifecycle data, proving the artifacts' practical applicability.
中文摘要:本文提出了一种模块化语义模型和高效方法,将微分方程融入知识图谱以情境化信息物理系统的生命周期数据,并通过航空维护领域的验证证明了其实际应用价值。
English Summary: This paper introduces a modular semantic model and an efficient method for integrating differential equations into knowledge graphs to contextualize Cyber-Physical System lifecycle data, with aviation maintenance validation demonstrating practical applicability.
Authors:Zhenyan Lu, Daliang Xu, Dongqi Cai, Zexi Li, Wei Liu, Fangming Liu, Shangguang Wang, Mengwei Xu
Abstract:
Large language models (LLMs) are deployed on mobile devices to power killer applications such as intelligent assistants. LLMs pre-trained on general corpora often hallucinate when handling personalized or unseen queries, leading to incorrect or outdated responses. Knowledge editing addresses this by identifying and adjusting a small crucial portion of model weights, without compromising the general knowledge. However, prior knowledge editing methods are impractical to run on local devices due to the resource-heavy backpropagation (BP) needed for updates. We present MobiEdit, the first mobile knowledge editing framework that enables efficient LLM personalization on commercial off-the-shelf (COTS) mobile devices. MobiEdit replaces full-precision BP with quantized forward-only gradient estimation, making it compatible with energy-efficient mobile neural processing units (NPUs). To further improve gradient estimation efficiency, we introduce two optimizations: an early stopping mechanism that adaptively terminates editing upon success and a prefix cache that reuses computation across steps. Our approach enables real-time editing of a 3B-parameter model (Qwen2.5-3B-Instruct) on COTS mobile devices with 7.6$\times$ less memory, 14.7$\times$ less energy and 3.6$\times$ less latency compared to previous knowledge editing methods.
中文: 大型语言模型在处理个性化查询时易产生错误响应,而MobiEdit通过用量化前向梯度估计替代高资源消耗的反向传播,首次在商用移动设备上实现高效知识编辑,大幅降低了内存占用、能耗和延迟。
English: Large language models often produce inaccurate responses to personalized queries, but MobiEdit enables efficient on-device knowledge editing by replacing resource-heavy backpropagation with quantized forward-only gradient estimation, significantly reducing memory, energy, and latency on mobile devices.
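The forward-only estimator is in the family of zeroth-order methods such as SPSA, sketched below; MobiEdit's quantized, NPU-friendly estimator and its early-stopping/prefix-cache optimizations are not reproduced here.

```python
import numpy as np

def spsa_gradient(loss_fn, weights: np.ndarray, eps: float = 1e-3, rng=None):
    """Estimate a gradient from two forward passes (no backpropagation):
    perturb all weights along a random +/-1 direction and difference the losses."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=weights.shape)
    diff = loss_fn(weights + eps * delta) - loss_fn(weights - eps * delta)
    return (diff / (2 * eps)) * delta   # elementwise: delta^{-1} == delta for +/-1
```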
Authors:Guy Laban, Micol Spitale, Minja Axelsson, Nida Itrat Abbasi, Hatice Gunes
Abstract:
Social robots are increasingly being explored as tools to support emotional wellbeing, particularly in non-clinical settings. Drawing on a range of empirical studies and practical deployments, this paper outlines six key insights that highlight both the opportunities and challenges in using robots to promote mental wellbeing. These include (1) the lack of a single, objective measure of wellbeing, (2) the fact that robots don't need to act as companions to be effective, (3) the growing potential of virtual interactions, (4) the importance of involving clinicians in the design process, (5) the difference between one-off and long-term interactions, and (6) the idea that adaptation and personalization are not always necessary for positive outcomes. Rather than positioning robots as replacements for human therapists, we argue that they are best understood as supportive tools that must be designed with care, grounded in evidence, and shaped by ethical and psychological considerations. Our aim is to inform future research and guide responsible, effective use of robots in mental health and wellbeing contexts.
中文: 本文概述了利用社交机器人支持心理健康的六个关键见解,强调其应作为辅助工具而非人类治疗师的替代品,并指出设计需基于证据并符合伦理考量。
English: This paper outlines six key insights on using social robots to support mental wellbeing, emphasizing their role as supportive tools rather than replacements for human therapists, and stressing the need for evidence-based, ethically-informed design.
Authors:Fen Liu, Shenghai Yuan, Thien-Minh Nguyen, Rong Su
Abstract:
Commercial UAVs are an emerging security threat as they are capable of carrying hazardous payloads or disrupting air traffic. To counter UAVs, we introduce an autonomous 3D target encirclement and interception strategy. Unlike traditional ground-guided systems, this strategy employs autonomous drones to track and engage non-cooperative hostile UAVs, which is effective in non-line-of-sight conditions, GPS denial, and radar jamming, where conventional detection and neutralization from ground guidance fail. Using two noisy real-time distances measured by drones, guardian drones estimate the target's position relative to their own using observation and velocity compensation methods, based on anti-synchronization (AS) and an X$-$Y circular motion combined with vertical jitter. An encirclement control mechanism is proposed to enable UAVs to adaptively transition from encircling and protecting a target to encircling and monitoring a hostile target. Upon breaching a warning threshold, the UAVs may even employ a suicide attack to neutralize the hostile target. We validate this strategy through real-world UAV experiments and simulated analysis in MATLAB, demonstrating its effectiveness in detecting, encircling, and intercepting hostile drones. More details: https://youtu.be/5eHW56lPVto.
中文摘要:本文提出了一种自主三维环绕拦截策略,利用守护无人机在GPS拒止和雷达干扰等恶劣环境下有效对抗敌对无人机,并通过真实实验和仿真验证了其探测、包围及拦截的有效性。
English Summary: This paper presents an autonomous 3D encirclement and interception strategy using guardian drones to counter hostile UAVs, which operates effectively in challenging conditions such as GPS denial and radar jamming, and has been validated through real-world experiments and simulations.
Authors:Wei Li, Yunyao Cheng, Xinli Hao, Chaohong Ma, Yuxuan Liang, Bin Yang, Christian S. Jensen, Xiaofeng Meng
Abstract:
Recent advances in Large Language Models (LLMs) have enabled unprecedented capabilities for time-series reasoning in diverse real-world applications, including medical, financial, and spatio-temporal domains. However, existing approaches typically focus on task-specific model customization, such as forecasting and anomaly detection, while overlooking the data itself, referred to as time-series primitives, which are essential for in-depth reasoning. This position paper advocates a fundamental shift in approaching time-series reasoning with LLMs: prioritizing alignment paradigms grounded in the intrinsic primitives of time series data over task-specific model customization. This realignment addresses the core limitations of current time-series reasoning approaches, which are often costly, inflexible, and inefficient, by systematically accounting for the intrinsic structure of the data before task engineering. To this end, we propose three alignment paradigms: Injective Alignment, Bridging Alignment, and Internal Alignment, each prioritizing a different aspect of time-series primitives (domain, characteristic, and representation, respectively) to activate the time-series reasoning capabilities of LLMs and enable economical, flexible, and efficient reasoning. We further recommend that practitioners adopt an alignment-oriented approach, using this guidance to select an appropriate alignment paradigm. Additionally, we categorize relevant literature into these alignment paradigms and outline promising research directions.
中文摘要:本立场论文主张从任务定制转向基于时间序列本原(领域、特征和表示)的对齐范式,提出了三种对齐方法以增强大语言模型的时间序列推理能力,实现更经济、灵活和高效的时序分析。
English Summary: This position paper advocates shifting from task-specific customization to alignment paradigms based on time-series primitives—domain, characteristic, and representation—to enhance LLMs' reasoning capabilities, proposing three alignment methods for more economical, flexible, and efficient time-series analysis.
Authors:Yihan Wu, Xuehao Cui, Ruibo Chen, Georgios Milis, Heng Huang
Abstract:
The rapid evolution of image generation models has revolutionized visual content creation, enabling the synthesis of highly realistic and contextually accurate images for diverse applications. However, the potential for misuse, such as deepfake generation, image-based phishing attacks, and fabrication of misleading visual evidence, underscores the need for robust authenticity verification mechanisms. While traditional statistical watermarking techniques have proven effective for autoregressive language models, their direct adaptation to image generation models encounters significant challenges due to a phenomenon we term retokenization mismatch, a disparity between original and retokenized sequences during the image generation process. To overcome this limitation, we propose C-reweight, a novel, distortion-free watermarking method explicitly designed for image generation models. By leveraging a clustering-based strategy that treats tokens within the same cluster equivalently, C-reweight mitigates retokenization mismatch while preserving image fidelity. Extensive evaluations on leading image generation platforms reveal that C-reweight not only maintains the visual quality of generated images but also improves detectability over existing distortion-free watermarking techniques, setting a new standard for secure and trustworthy image synthesis.
Chinese Summary: 本研究提出C-reweight这一新型无失真水印技术,通过基于聚类的等效标记处理解决了图像生成模型中的重标记化失配问题,在保持视觉质量的同时显著提升了水印检测能力。
English Summary: The study introduces C-reweight, a novel distortion-free watermarking technique that overcomes retokenization mismatch in image generation models by using cluster-based token equivalence, enhancing detectability without compromising visual quality.
Authors:Ziyi Zhang, Ziheng Jiang, Chengquan Jiang, Menghan Yu, Size Zheng, Haibin Lin, Henry Hoffmann, Xin Liu
Abstract:
Low-latency decoding for large language models (LLMs) is crucial for applications like chatbots and code assistants, yet generating long outputs remains slow in single-query settings. Speculative decoding (which combines a small draft model with a larger target model) and tensor parallelism have each been used to accelerate decoding. However, conventional approaches fail to apply both simultaneously due to imbalanced compute requirements (between draft and target models), KV-cache inconsistencies, and communication overheads under small-batch tensor-parallelism. This paper introduces SwiftSpec, a system that targets ultra-low latency for LLM decoding. SwiftSpec redesigns the speculative decoding pipeline in an asynchronous and disaggregated manner, so that each component can be scaled flexibly and draft overhead is removed from the critical path. To realize this design, SwiftSpec proposes parallel tree generation, tree-aware KV cache management, and fused, latency-optimized kernels to overcome the challenges listed above. Across 5 model families and 6 datasets, SwiftSpec achieves an average of 1.75x speedup over state-of-the-art speculative decoding systems and, as a highlight, serves Llama3-70B at 348 tokens/s on 8 Nvidia Hopper GPUs, making it the fastest known system for low-latency LLM serving at this scale.
中文: SwiftSpec提出了一种异步解耦的推测解码系统,通过并行树生成和优化的KV缓存管理解决了计算不均衡与通信瓶颈,实现了超低延迟,在Llama3-70B模型上达到348 tokens/秒的推理速度,性能超越现有最优系统1.75倍。
English: SwiftSpec introduces an asynchronous, disaggregated speculative decoding system that overcomes computational imbalances and communication bottlenecks to achieve ultra-low latency, serving models like Llama3-70B at 348 tokens/s and delivering 1.75x speedup over existing methods.
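For readers unfamiliar with the underlying primitive, the following is a minimal greedy speculative-decoding skeleton: a cheap draft model proposes gamma tokens, and the target verifies all of them in a single pass, keeping the longest agreeing prefix. This is only the textbook loop; SwiftSpec's asynchronous disaggregation, parallel tree generation, and fused kernels are not represented, and `draft_next`/`verify` are hypothetical callables.

```python
from typing import Callable, List

def speculative_decode(draft_next: Callable[[List[int]], int],
                       verify: Callable[[List[int]], List[int]],
                       prompt: List[int],
                       gamma: int = 4,
                       steps: int = 8) -> List[int]:
    """Greedy speculative decoding skeleton. `draft_next` returns the draft
    model's next token for a context; `verify` takes the context plus the
    gamma proposed tokens and returns the target model's greedy token at each
    proposal position (computable in one target forward pass).
    """
    tokens = list(prompt)
    for _ in range(steps):
        proposal, ctx = [], list(tokens)
        for _ in range(gamma):               # cheap draft pass, token by token
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        target_tokens = verify(tokens + proposal)  # one expensive pass
        for i, t in enumerate(proposal):
            if target_tokens[i] == t:
                tokens.append(t)             # draft agreed with target: keep it
            else:
                tokens.append(target_tokens[i])  # take target's token, stop
                break
    return tokens

# Toy usage: both "models" count upward modulo 100, so all drafts are accepted.
GAMMA = 4
draft = lambda ctx: (ctx[-1] + 1) % 100
verify = lambda seq: [(seq[len(seq) - GAMMA + i - 1] + 1) % 100 for i in range(GAMMA)]
print(speculative_decode(draft, verify, [0], gamma=GAMMA, steps=3))
# -> [0, 1, 2, ..., 12]
```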
Authors:Wenlong Hou, Guangqian Yang, Ye Du, Yeung Lau, Lihao Liu, Junjun He, Ling Long, Shujun Wang
Abstract:
Alzheimer's disease (AD) is a progressive and irreversible neurodegenerative disease. Early and precise diagnosis of AD is crucial for timely intervention and treatment planning to alleviate the progressive neurodegeneration. However, most existing methods rely on single-modality data, which contrasts with the multifaceted approach used by medical experts. While some deep learning approaches process multi-modal data, they are limited to specific tasks with a small set of input modalities and cannot handle arbitrary combinations. This highlights the need for a system that can address diverse AD-related tasks, process multi-modal or missing input, and integrate multiple advanced methods for improved performance. In this paper, we propose ADAgent, the first specialized AI agent for AD analysis, built on a large language model (LLM) to address user queries and support decision-making. ADAgent integrates a reasoning engine, specialized medical tools, and a collaborative outcome coordinator to facilitate multi-modal diagnosis and prognosis tasks in AD. Extensive experiments demonstrate that ADAgent outperforms SOTA methods, achieving significant improvements in accuracy, including a 2.7% increase in multi-modal diagnosis, a 0.7% improvement in multi-modal prognosis, and enhancements in MRI and PET diagnosis tasks.
中文: 本文提出了ADAgent,首个基于大语言模型的阿尔茨海默病分析AI智能体,它整合了多模态数据处理,在诊断和预后任务中显著优于现有最优方法。
English: This paper introduces ADAgent, the first AI agent based on a large language model designed for Alzheimer's disease analysis, which integrates multi-modal data processing and outperforms state-of-the-art methods in diagnosis and prognosis tasks.
Authors:Hanxi Guo, Siyuan Cheng, Kaiyuan Zhang, Guangyu Shen, Xiangyu Zhang
Abstract:
Large language models (LLMs) have become integral to modern software development, producing vast amounts of AI-generated source code. While these models boost programming productivity, their misuse introduces critical risks, including code plagiarism, license violations, and the propagation of insecure programs. As a result, robust detection of AI-generated code is essential. To support the development of such detectors, a comprehensive benchmark that reflects real-world conditions is crucial. However, existing benchmarks fall short -- most cover only a limited set of programming languages and rely on less capable generative models. In this paper, we present CodeMirage, a comprehensive benchmark that addresses these limitations through three major advancements: (1) it spans ten widely used programming languages, (2) includes both original and paraphrased code samples, and (3) incorporates outputs from ten state-of-the-art production-level LLMs, including both reasoning and non-reasoning models from six major providers. Using CodeMirage, we evaluate ten representative detectors across four methodological paradigms under four realistic evaluation configurations, reporting results using three complementary metrics. Our analysis reveals nine key findings that uncover the strengths and weaknesses of current detectors, and identify critical challenges for future work. We believe CodeMirage offers a rigorous and practical testbed to advance the development of robust and generalizable AI-generated code detectors.
中文摘要:大语言模型提升了编程效率,但也带来了代码抄袭和安全漏洞等风险,因此需要可靠的检测工具;CodeMirage作为全面基准,覆盖十种编程语言和前沿模型,评估检测器性能并揭示其不足,为未来发展提供关键指导。
English Summary: Large language models boost programming productivity but introduce risks like plagiarism and insecure code, necessitating robust detection methods, which CodeMirage addresses as a comprehensive benchmark spanning ten languages and state-of-the-art models to evaluate detectors and uncover their limitations.
Authors:Alireza Salemi, Mukta Maddipatla, Hamed Zamani
Abstract:
This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework's strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.
中文摘要:本文提出mRAG,一种多智能体检索增强生成框架,通过专业化分工与基于奖励的自训练机制优化协作,在SIGIR 2025竞赛中验证了其相较于传统方法的优越性。
English Summary: This paper introduces mRAG, a multi-agent RAG framework that employs specialized agents and self-training with reward-guided sampling to optimize collaboration and improve response generation, demonstrating superior performance over traditional RAG systems in evaluations.
Authors:Wenxuan Song, Jiayi Chen, Wenxue Li, Xu He, Han Zhao, Can Cui, Pengxiang Ding, Shiyan Su, Feilong Tang, Xuelian Cheng, Donglin Wang, Zongyuan Ge, Xinhu Zheng, Zhe Liu, Hesheng Wang, Haoang Li
Abstract:
A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions. Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment. This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible. To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected. In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context. We further propose the Rational Vision-Language-Action model (RationalVLA). It is a dual system for robotic arms that integrates the high-level vision-language model with the low-level manipulation policy by introducing learnable latent space embeddings. This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively. Experiments demonstrate that RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate and 0.94 average task length, while maintaining competitive performance on standard manipulation tasks. Real-world trials further validate its effectiveness and robustness in practical applications. Our project page is https://irpn-eai.github.io/RationalVLA.
中文:RAMA基准和RationalVLA模型通过让机器人能够拒绝不可行指令并有效执行有效指令,解决了机器人指令模糊性问题,在操作任务中实现了卓越性能。
English: The RAMA benchmark and RationalVLA model address robotic instruction ambiguity by enabling robots to reject infeasible commands and effectively execute valid ones, achieving superior performance in manipulation tasks.
Authors:Yucheng Yang, Tianyi Zhou, Qiang He, Lei Han, Mykola Pechenizkiy, Meng Fang
Abstract:
Unsupervised reinforcement learning (URL) aims to learn general skills for unseen downstream tasks. Mutual Information Skill Learning (MISL) addresses URL by maximizing the mutual information between states and skills but lacks sufficient theoretical analysis, e.g., how well its learned skills can initialize a downstream task's policy. Our new theoretical analysis in this paper shows that the diversity and separability of learned skills are fundamentally critical to downstream task adaptation but MISL does not necessarily guarantee these properties. To complement MISL, we propose a novel disentanglement metric LSEPIN. Moreover, we build an information-geometric connection between LSEPIN and downstream task adaptation cost. For better geometric properties, we investigate a new strategy that replaces the KL divergence in information geometry with Wasserstein distance. We extend the geometric analysis to it, which leads to a novel skill-learning objective WSEP. It is theoretically justified to be helpful to downstream task adaptation and it is capable of discovering more initial policies for downstream tasks than MISL. We finally propose another Wasserstein distance-based algorithm PWSEP that can theoretically discover all optimal initial policies.
中文摘要:本文对无监督强化学习进行了理论分析,提出了新度量指标LSEPIN和算法WSEP、PWSEP,通过确保技能多样性和可分离性,在适应下游任务方面优于现有方法。
English Summary: This paper theoretically analyzes unsupervised reinforcement learning and proposes new metrics and algorithms, including WSEP and PWSEP, that outperform existing methods by ensuring better skill diversity and separability for downstream task adaptation.
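For context, the MISL objective that the analysis builds on is standardly written as the mutual information between states $S$ and skills $Z$, with its entropy decomposition (a textbook identity, not a formula quoted from this paper):

```latex
% MISL: choose the skill-conditioned policy \pi that makes skills
% recoverable from the states they visit.
\max_{\pi} \; I(S; Z) \;=\; H(Z) - H(Z \mid S)
```

The paper's point is that maximizing this quantity alone does not guarantee the diversity and separability properties it identifies as critical for downstream adaptation.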
Authors:Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li, Xiao Huang
Abstract:
Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is equally important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses the following challenges: the complexity of graph structures and the existence of multiple generated paths, making it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.
中文摘要:RRP框架通过结合知识图谱与大语言模型,挖掘并优化推理路径,有效解决复杂问题中的结构挑战,以即插即用方式显著提升模型的推理能力,并在实验中达到领先性能。
English Summary: The RRP framework enhances large language models by integrating knowledge graphs to generate refined reasoning paths, addressing challenges in complex question-solving and achieving state-of-the-art performance through semantic and structural synergy.
Authors:Yukun Chen, Zihuan Qiu, Fanman Meng, Hongliang Li, Linfeng Xu, Qingbo Wu
Abstract:
Unlike traditional Multimodal Class-Incremental Learning (MCIL) methods that focus only on vision and text, this paper explores MCIL across vision, audio and text modalities, addressing challenges in integrating complementary information and mitigating catastrophic forgetting. To tackle these issues, we propose an MCIL method based on multimodal pre-trained models. Firstly, a Multimodal Incremental Feature Extractor (MIFE) based on Mixture-of-Experts (MoE) structure is introduced to achieve effective incremental fine-tuning for AudioCLIP. Secondly, to enhance feature discriminability and generalization, we propose an Adaptive Audio-Visual Fusion Module (AAVFM) that includes a masking threshold mechanism and a dynamic feature fusion mechanism, along with a strategy to enhance text diversity. Thirdly, a novel multimodal class-incremental contrastive training loss is proposed to optimize cross-modal alignment in MCIL. Finally, two MCIL-specific evaluation metrics are introduced for comprehensive assessment. Extensive experiments on three multimodal datasets validate the effectiveness of our method.
中文摘要:本文提出了一种基于预训练模型的新型多模态类增量学习方法,通过增量特征提取器、自适应融合模块和对比训练损失,有效解决了视觉、音频和文本模态的整合与灾难性遗忘问题。
English Summary: This paper introduces a novel multimodal class-incremental learning method using pre-trained models, featuring specialized modules for incremental fine-tuning, adaptive fusion, and contrastive training to address challenges in integrating vision, audio, and text modalities while preventing catastrophic forgetting.
Authors:Yitao Xu, Tong Zhang, Ehsan Pajouheshgar, Sabine Süsstrunk
Abstract:
Conditional diffusion models (CDMs) have shown impressive performance across a range of generative tasks. Their ability to model the full data distribution has opened new avenues for analysis-by-synthesis in downstream discriminative learning. However, this same modeling capacity causes CDMs to entangle the class-defining features with irrelevant context, posing challenges to extracting robust and interpretable representations. To this end, we identify Canonical LAtent Representations (CLAReps), latent codes whose internal CDM features preserve essential categorical information while discarding non-discriminative signals. When decoded, CLAReps produce representative samples for each class, offering an interpretable and compact summary of the core class semantics with minimal irrelevant details. Exploiting CLAReps, we develop a novel diffusion-based feature-distillation paradigm, CaDistill. While the student has full access to the training set, the CDM as teacher transfers core class knowledge only via CLAReps, which amounts to merely 10% of the training data in size. After training, the student achieves strong adversarial robustness and generalization ability, focusing more on the class signals instead of spurious background cues. Our findings suggest that CDMs can serve not just as image generators but also as compact, interpretable teachers that can drive robust representation learning.
中文: 条件扩散模型可能混淆类别特征与无关背景,但提出的CLAReps方法能提取保留核心类别语义的规范潜在表征,并通过名为CaDistill的特征蒸馏范式,使学生模型专注于关键类别信号,从而显著提升对抗鲁棒性和泛化能力。
English: Conditional diffusion models can entangle class features with irrelevant context, but the proposed CLAReps method extracts canonical latent representations that preserve core class semantics and enable a feature-distillation paradigm called CaDistill, enhancing student models' robustness and generalization by focusing on essential class signals.
Authors:Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang
Abstract:
Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at https://seaweed-apt.com/2
中文: 本研究提出自回归对抗后训练方法,将预训练视频扩散模型转化为实时生成器,在单个H100 GPU上实现736x416分辨率24帧/秒的视频生成,通过单步推理和KV缓存优化支持交互式流媒体生成。
English: This work introduces autoregressive adversarial post-training (AAPT) to convert pre-trained video diffusion models into real-time generators that produce 24fps video at 736x416 resolution on a single H100 GPU, enabling interactive streaming with minimal computational cost.
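As a rough schematic of the inference loop the abstract describes (one network evaluation per latent frame, with user controls folded in between frames), consider the hedged sketch below; all callables are placeholders rather than the paper's actual interfaces.

```python
from typing import Callable, Iterator

def stream_frames(generate_latent: Callable, decode: Callable,
                  read_control: Callable, z0, n_frames: int) -> Iterator:
    """Schematic of a 1-NFE autoregressive streaming loop: each latent frame
    costs a single forward pass, and user input arrives between frames.
    (KV-cache reuse, adversarial training, and multi-GPU serving from the
    paper are not modeled here.)
    """
    z = z0
    for _ in range(n_frames):
        control = read_control()          # interactive input, e.g. keystrokes
        z = generate_latent(z, control)   # single network function evaluation
        yield decode(z)                   # stream the decoded frame immediately

# Toy usage with trivial stand-ins for the three callables.
frames = list(stream_frames(lambda z, c: z + c, lambda z: z, lambda: 1, 0, 5))
print(frames)  # -> [1, 2, 3, 4, 5]
```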
Authors:Wenzhuo Liu, Fei Zhu, Haiyang Guo, Longhui Wei, Cheng-Lin Liu
Abstract:
Multimodal models like LLaVA-1.5 achieve state-of-the-art visual understanding through visual instruction tuning on multitask datasets, enabling strong instruction-following and multimodal performance. However, multitask learning faces challenges such as task balancing, requiring careful adjustment of data proportions, and expansion costs, where new tasks risk catastrophic forgetting and need costly retraining. Continual learning provides a promising alternative, acquiring new knowledge incrementally while preserving existing capabilities. However, current methods prioritize task-specific performance, neglecting base model degradation from overfitting to specific instructions, which undermines general capabilities. In this work, we propose a simple but effective method with two modifications on LLaVA-1.5: spectral-aware consolidation for improved task balance and unsupervised inquiry regularization to prevent base model degradation. We evaluate both general and task-specific performance across continual pretraining and fine-tuning. Experiments demonstrate that the resulting model, LLaVA-c, consistently enhances standard benchmark performance and preserves general capabilities. For the first time, we show that task-by-task continual learning can achieve results that match or surpass multitask joint learning. The code will be publicly released.
中文: 本文提出一种简单有效的多模态模型持续学习方法,通过频谱感知巩固和无监督查询正则化来改善任务平衡并防止基础模型退化,实现了与多任务联合学习相当甚至更优的性能。
English: This paper introduces a simple yet effective continual learning method for multimodal models, featuring spectral-aware consolidation and unsupervised inquiry regularization to enhance task balance and prevent base model degradation, achieving performance comparable to or surpassing multitask joint learning.
Authors:Rishabh Ranjan, Likhith Ayinala, Mayank Vatsa, Richa Singh
Abstract:
This paper introduces a novel multimodal framework for hate speech detection in deepfake audio, excelling even in zero-shot scenarios. Unlike previous approaches, our method uses contrastive learning to jointly align audio and text representations across languages. We present the first benchmark dataset with 127,290 paired text and synthesized speech samples in six languages: English and five low-resource Indian languages (Hindi, Bengali, Marathi, Tamil, Telugu). Our model learns a shared semantic embedding space, enabling robust cross-lingual and cross-modal classification. Experiments on two multilingual test sets show our approach outperforms baselines, achieving accuracies of 0.819 and 0.701, and generalizes well to unseen languages. This demonstrates the advantage of combining modalities for hate speech detection in synthetic media, especially in low-resource settings where unimodal models falter. The Dataset is available at https://www.iab-rubric.org/resources.
本文提出了一种针对深度伪造音频的多模态仇恨言论检测框架,通过对比学习跨语言对齐音频与文本表征,在零样本场景和低资源环境下实现了优越性能。
This paper presents a multimodal hate speech detection framework for deepfake audio that uses contrastive learning to align audio-text representations across languages, achieving superior performance in zero-shot scenarios and low-resource settings.
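The abstract does not give the training objective, but joint audio-text alignment via contrastive learning is typically implemented as a CLIP-style symmetric InfoNCE loss over paired embeddings. A minimal PyTorch sketch, assuming row-aligned batches (an assumption, not the paper's code):

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE over a batch of paired audio/text embeddings.
    Row i of each tensor is assumed to describe the same underlying sample,
    so matched pairs sit on the diagonal of the similarity matrix.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                  # (B, B) similarities
    labels = torch.arange(a.size(0), device=a.device)
    # Supervise both retrieval directions: audio->text and text->audio.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

# Toy usage with random 16-dim embeddings for a batch of 4 pairs.
loss = symmetric_contrastive_loss(torch.randn(4, 16), torch.randn(4, 16))
```

Pulling both modalities into one embedding space is what enables the zero-shot, cross-lingual classification the paper reports.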
Authors:Zikai Xiao, Ziyang Wang, Wen Ma, Yan Zhang, Wei Shen, Yan Wang, Luqi Gong, Zuozhu Liu
Abstract:
While Large Language Models (LLMs) support long contexts, they struggle with performance degradation within the context window. Current solutions incur prohibitive training costs, leaving statistical behaviors and cost-effective approaches underexplored. From the decoding perspective, we identify the Posterior Salience Attenuation (PSA) phenomenon, where the salience ratio correlates with long-text performance degradation. Notably, despite the attenuation, gold tokens still occupy high-ranking positions in the decoding space. Motivated by it, we propose the training-free Positional Contrastive Decoding (PCD) that contrasts the logits derived from long-aware attention with those from designed local-aware attention, enabling the model to focus on the gains introduced by large-scale short-to-long training. Through the analysis of long-term decay simulation, we demonstrate that PCD effectively alleviates attention score degradation. Experimental results show that PCD achieves state-of-the-art performance on long-context benchmarks.
大型语言模型在处理长文本时存在性能下降问题,而无需训练的对比位置解码方法通过对比不同注意力机制,有效缓解了注意力分数衰减,在长文本基准测试中取得了最优性能。
Large Language Models (LLMs) face performance degradation in long contexts, but the proposed training-free Positional Contrastive Decoding (PCD) method effectively mitigates this issue by contrasting attention mechanisms, achieving state-of-the-art results on benchmarks.
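The abstract specifies that PCD contrasts logits from long-aware attention against logits from local-aware attention, but not the exact combination rule; a common contrastive-decoding form is an additive contrast, sketched below as an assumption.

```python
import numpy as np

def positional_contrastive_logits(logits_long: np.ndarray,
                                  logits_local: np.ndarray,
                                  alpha: float = 1.0) -> np.ndarray:
    """One plausible contrast rule (an assumption, not the paper's formula):
    amplify what the long-aware view knows beyond the local-aware view.
    """
    return logits_long + alpha * (logits_long - logits_local)

# Toy vocabulary of 5 tokens: the gold token (index 2) is high-ranking but
# barely salient under long-aware attention; contrasting widens its margin,
# echoing the paper's observation that gold tokens stay near the top.
long_l = np.array([2.0, 1.9, 2.1, 0.5, 0.3])
local_l = np.array([2.2, 2.0, 1.0, 0.5, 0.3])
print(np.argmax(long_l))                                           # 2, barely
print(np.argmax(positional_contrastive_logits(long_l, local_l)))   # 2, clearly
```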
Authors:Lorenzo Arboit, Dennis N. Schneider, Toby Collins, Daniel A. Hashimoto, Silvana Perretta, Bernard Dallemagne, Jacques Marescaux, EAES Working Group, Nicolas Padoy, Pietro Mascagni
Abstract:
Artificial Intelligence (AI) is transforming medicine, with generative AI models like ChatGPT reshaping perceptions of its potential. This study examines surgeons' awareness, expectations, and involvement with AI in surgery through comparative surveys conducted in 2021 and 2024. Two cross-sectional surveys were distributed globally in 2021 and 2024, the first before an IRCAD webinar and the second during the annual EAES meeting. The surveys assessed demographics, AI awareness, expectations, involvement, and ethics (2024 only). The surveys collected a total of 671 responses from 98 countries, 522 in 2021 and 149 in 2024. Awareness of AI courses rose from 14.5% in 2021 to 44.6% in 2024, while course attendance increased from 12.9% to 23%. Despite this, familiarity with foundational AI concepts remained limited. Expectations for AI's role shifted in 2024, with hospital management gaining relevance. Ethical concerns gained prominence, with 87.2% of 2024 participants emphasizing accountability and transparency. Infrastructure limitations remained the primary obstacle to implementation. Interdisciplinary collaboration and structured training were identified as critical for successful AI adoption. Optimism about AI's transformative potential remained high, with 79.9% of respondents believing AI would positively impact surgery and 96.6% willing to integrate AI into their clinical practice. Surgeons' perceptions of AI are evolving, driven by the rise of generative AI and advancements in surgical data science. While enthusiasm for integration is strong, knowledge gaps and infrastructural challenges persist. Addressing these through education, ethical frameworks, and infrastructure development is essential.
中文: 本研究显示,从2021年到2024年,外科医生对AI在手术中应用的认知度和接受度显著提升,尽管存在知识差距和基础设施挑战,但他们对AI整合持高度乐观态度,需要通过教育和伦理框架加以解决。
English: This study reveals that surgeons' awareness and acceptance of AI in surgery have significantly increased from 2021 to 2024, with strong optimism about its integration despite persistent knowledge gaps and infrastructure challenges requiring educational and ethical solutions.
Authors:Chenxi Liu, Tianyi Xiong, Yanshuo Chen, Ruibo Chen, Yihan Wu, Junfeng Guo, Tianyi Zhou, Heng Huang
Abstract:
The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality imbalance during reasoning, i.e., outweighing language prior biases over visual inputs, which bottlenecks their generalization to downstream tasks and causes hallucinations. However, existing preference optimization approaches for LMMs do not focus on restraining the internal biases of their Large Language Model (LLM) backbones when curating the training data. Moreover, they heavily rely on offline data and lack the capacity to explore diverse responses adaptive to dynamic distributional shifts during training. Meanwhile, Group Relative Policy Optimization (GRPO), a recent method using online-generated data and verified rewards to improve reasoning capabilities, remains largely underexplored in LMM alignment. In this paper, we propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases due to limited usage of visual information, through adversarial perturbation of input images. Moreover, MBPO leverages the easy-to-verify nature of close-ended tasks to generate online responses with verified rewards. GRPO is then employed to train the model with offline-online hybrid data. Extensive experiments demonstrate that MBPO can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
中文: 提出的模态平衡偏好优化(MBPO)框架通过对抗性图像扰动生成困难负样本,并利用带有验证奖励的在线响应,解决了大型多模态模型中的模态不平衡问题,有效提升了视觉语言任务性能并减少了幻觉现象。
English: The proposed Modality-Balancing Preference Optimization (MBPO) framework addresses modality imbalance in Large Multimodal Models by generating hard negatives through adversarial image perturbations and leveraging online responses with verified rewards, enhancing performance on vision-language tasks and reducing hallucinations.
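MBPO's hard negatives come from "adversarial perturbation of input images"; the abstract does not name the attack, so the sketch below uses a one-step sign-gradient (FGSM-style) perturbation purely as an illustration of the idea.

```python
import torch

def fgsm_perturb(images: torch.Tensor, loss: torch.Tensor,
                 eps: float = 4 / 255) -> torch.Tensor:
    """One-step sign-gradient perturbation (FGSM); an assumed stand-in for
    MBPO's unspecified attack. `images` must have requires_grad=True and
    `loss` must depend on it.
    """
    grad, = torch.autograd.grad(loss, images)
    adv = images + eps * grad.sign()      # nudge the input toward higher loss
    return adv.clamp(0.0, 1.0).detach()

# Toy usage: any differentiable loss of the image works.
img = torch.rand(1, 3, 8, 8, requires_grad=True)
toy_loss = (img.mean() - 0.5) ** 2
degraded = fgsm_perturb(img, toy_loss)
```

Responses generated on such degraded inputs lean on the LLM backbone's language priors rather than the visual evidence, which is what makes them informative rejected samples for preference learning.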
Authors:Yang Wang, Yin Xu, Cixiao Zhang, Zhiyong Chen, Mingzeng Dai, Haiming Wang, Bingchao Liu, Dazhi He, Meixia Tao
Abstract:
Reconfigurable intelligent surface (RIS) has been recognized as a promising technology for next-generation wireless communications. However, the performance of RIS-assisted systems critically depends on accurate channel state information (CSI). To address this challenge, this letter proposes a novel channel estimation method for RIS-aided millimeter-wave (mmWave) systems based on diffusion models (DMs). Specifically, the forward diffusion process of the original signal is formulated to model the received signal as a noisy observation within the framework of DMs. Subsequently, the channel estimation task is formulated as the reverse diffusion process, and a sampling algorithm based on denoising diffusion implicit models (DDIMs) is developed to enable effective inference. Furthermore, a lightweight neural network, termed BRCNet, is introduced to replace the conventional U-Net, significantly reducing the number of parameters and computational complexity. Extensive experiments conducted under various scenarios demonstrate that the proposed method consistently outperforms existing baselines.
中文: 本文提出了一种基于扩散模型的新型信道估计方法,用于RIS辅助的毫米波系统,该方法将信道估计构建为反向扩散过程,并采用轻量级BRCNet降低复杂度,在多种场景下均优于现有基准方法。
English: This letter introduces a novel channel estimation method for RIS-assisted mmWave systems using diffusion models, which formulates channel estimation as a reverse diffusion process and employs a lightweight BRCNet to reduce complexity while outperforming existing baselines across various scenarios.
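Since the estimator runs the reverse diffusion with a DDIM sampler, its core computational step is the deterministic DDIM update. A minimal sketch (eta = 0) follows; the noise prediction `eps_pred` stands in for the output of a denoiser such as the paper's BRCNet, and the scalar schedule values in the usage are illustrative.

```python
import torch

@torch.no_grad()
def ddim_step(x_t: torch.Tensor, eps_pred: torch.Tensor,
              alpha_bar_t: torch.Tensor, alpha_bar_prev: torch.Tensor) -> torch.Tensor:
    """One deterministic DDIM update (eta = 0): recover the clean-signal
    estimate from the predicted noise, then re-noise it to the previous
    timestep. Any epsilon-prediction network fits this interface.
    """
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    return alpha_bar_prev.sqrt() * x0_hat + (1 - alpha_bar_prev).sqrt() * eps_pred

# Toy usage on a fake 4x4 "channel matrix" with made-up schedule values.
x_t = torch.randn(4, 4)
eps = torch.randn(4, 4)
x_prev = ddim_step(x_t, eps, torch.tensor(0.5), torch.tensor(0.7))
```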
Authors:Ren-Jian Wang, Ke Xue, Zeyu Qin, Ziniu Li, Sheng Tang, Hao-Tian Li, Shengcai Liu, Chao Qian
Abstract:
Ensuring safety of large language models (LLMs) is important. Red teaming--a systematic approach to identifying adversarial prompts that elicit harmful responses from target LLMs--has emerged as a crucial safety evaluation method. Within this framework, the diversity of adversarial prompts is essential for comprehensive safety assessments. We find that previous approaches to red-teaming may suffer from two key limitations. First, they often pursue diversity through simplistic metrics like word frequency or sentence embedding similarity, which may not capture meaningful variation in attack strategies. Second, the common practice of training a single attacker model restricts coverage across potential attack styles and risk categories. This paper introduces Quality-Diversity Red-Teaming (QDRT), a new framework designed to address these limitations. QDRT achieves goal-driven diversity through behavior-conditioned training and implements a behavioral replay buffer in an open-ended manner. Additionally, it trains multiple specialized attackers capable of generating high-quality attacks across diverse styles and risk categories. Our empirical evaluation demonstrates that QDRT generates attacks that are both more diverse and more effective against a wide range of target LLMs, including GPT-2, Llama-3, Gemma-2, and Qwen2.5. This work advances the field of LLM safety by providing a systematic and effective approach to automated red-teaming, ultimately supporting the responsible deployment of LLMs.
中文: 本文提出的质量多样性红队测试(QDRT)框架通过行为条件训练和多个专业攻击者,克服了以往红队测试在攻击多样性和有效性上的局限,显著提升了大型语言模型的安全评估能力。
English: This paper introduces Quality-Diversity Red-Teaming (QDRT), a framework that overcomes previous limitations in red-teaming by generating diverse and effective adversarial prompts through behavior-conditioned training and multiple specialized attackers, enhancing LLM safety evaluations.
Authors:Rishabh Ranjan, Kishan Pipariya, Mayank Vatsa, Richa Singh
Abstract:
The rise of deepfake audio and hate speech, powered by advanced text-to-speech, threatens online safety. We present SynHate, the first multilingual dataset for detecting hate speech in synthetic audio, spanning 37 languages. SynHate uses a novel four-class scheme: Real-normal, Real-hate, Fake-normal, and Fake-hate. Built from MuTox and ADIMA datasets, it captures diverse hate speech patterns globally and in India. We evaluate five leading self-supervised models (Whisper-small/medium, XLS-R, AST, mHuBERT), finding notable performance differences by language, with Whisper-small performing best overall. Cross-dataset generalization remains a challenge. By releasing SynHate and baseline code, we aim to advance robust, culturally sensitive, and multilingual solutions against synthetic hate speech. The dataset is available at https://www.iab-rubric.org/resources.
中文:SynHate是首个针对37种语言合成音频中仇恨言论检测的多语言数据集,采用创新的四分类方案,评估显示Whisper-small模型表现最佳但跨数据集泛化仍存挑战,旨在推动针对合成仇恨言论的稳健解决方案。
English: The SynHate dataset, the first multilingual resource for detecting hate speech in synthetic audio across 37 languages, introduces a novel four-class categorization and reveals performance variations among self-supervised models, with Whisper-small achieving the best results, while cross-dataset generalization remains a challenge.
Authors:Bikash Dutta, Rishabh Ranjan, Shyam Sathvik, Mayank Vatsa, Richa Singh
Abstract:
Quantization is essential for deploying large audio language models (LALMs) efficiently in resource-constrained environments. However, its impact on complex tasks, such as zero-shot audio spoofing detection, remains underexplored. This study evaluates the zero-shot capabilities of five LALMs (GAMA, LTU-AS, MERaLiON, Qwen-Audio, and SALMONN) across three distinct datasets: ASVspoof2019, In-the-Wild, and WaveFake, and investigates their robustness to quantization (FP32, FP16, INT8). Despite high initial spoof detection accuracy, our analysis demonstrates severe predictive biases toward spoof classification across all models, rendering their practical performance equivalent to random classification. Interestingly, quantization to FP16 precision resulted in negligible performance degradation compared to FP32, effectively halving memory and computational requirements without materially impacting accuracy. However, INT8 quantization intensified model biases, significantly degrading balanced accuracy. These findings highlight critical architectural limitations and emphasize FP16 quantization as an optimal trade-off, providing guidelines for practical deployment and future model refinement.
中文: 量化对于大型音频语言模型的高效部署至关重要,其中FP16量化在资源减半的同时性能损失可忽略,但INT8量化会加剧模型偏差并显著降低准确性。
English: Quantization is crucial for efficient deployment of large audio language models, with FP16 offering minimal performance loss while halving resource requirements, though INT8 exacerbates model biases and severely degrades accuracy.
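The FP16 and INT8 settings in the study correspond to two standard PyTorch conversions, sketched below on a toy module; this loosely mirrors the precision comparison and is not the study's evaluation code.

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))

# FP16: weights stored in half precision, halving memory; the study found
# this cost almost no accuracy relative to FP32.
model_fp16 = copy.deepcopy(model).half()

# Dynamic INT8: Linear weights quantized to 8 bits; in the study this
# setting amplified the models' bias toward the "spoof" class.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
out_fp16 = model_fp16(x.half())
out_int8 = model_int8(x)
```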
Authors:Hamied Nabizada, Tom Jeleniewski, Lasse Beers, Maximilian Weigand, Felix Gehlhoff, Alexander Fay
Abstract:
This paper presents a SysML profile that enables the direct integration of planning semantics based on the Planning Domain Definition Language (PDDL) into system models. Reusable stereotypes are defined for key PDDL concepts such as types, predicates, functions and actions, while formal OCL constraints ensure syntactic consistency. The profile was derived from the Backus-Naur Form (BNF) definition of PDDL 3.1 to align with SysML modeling practices. A case study from aircraft manufacturing demonstrates the application of the profile: a robotic system with interchangeable end effectors is modeled and enriched to generate both domain and problem descriptions in PDDL format. These are used as input to a PDDL solver to derive optimized execution plans. The approach supports automated and model-based generation of planning descriptions and provides a reusable bridge between system modeling and AI planning in engineering design.
中文: 本文提出了一种SysML配置文件,将PDDL规划语义直接集成到系统模型中,通过飞机制造的案例研究,实现了规划描述的自动生成,并在工程设计中架起了系统建模与AI规划之间的可重用桥梁。
English: This paper introduces a SysML profile that integrates PDDL planning semantics into system models, enabling automated generation of planning descriptions and bridging system modeling with AI planning through a case study in aircraft manufacturing.
Authors:Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, Si Liu
Abstract:
Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities, characterized by deliberative, goal-directed thinking, remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1-System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.
中文: 本文提出RoboCerebra基准,通过包含更长任务序列和分层规划框架,旨在解决视觉语言模型在长程机器人操作中高级推理能力未被充分利用的问题。
English: This paper introduces RoboCerebra, a benchmark designed to address the underutilization of vision-language models' high-level reasoning capabilities in long-horizon robotic manipulation by featuring extended task sequences and a hierarchical planning framework.
Authors:Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M. Alvarez, Zuxuan Wu
Abstract:
In complex driving environments, autonomous vehicles must navigate safely. Regression-based approaches that rely on a single predicted path usually do not explicitly assess the safety of the predicted trajectory. Selection-based methods address this by generating and scoring multiple trajectory candidates and predicting the safety score for each, but face optimization challenges in precisely selecting the best option from thousands of possibilities and distinguishing subtle but safety-critical differences, especially in rare or underrepresented scenarios. We propose DriveSuprim to overcome these challenges and advance the selection-based paradigm through a coarse-to-fine scheme for progressive candidate filtering, a rotation-based augmentation method to improve robustness in out-of-distribution scenarios, and a self-distillation framework to stabilize training. DriveSuprim achieves state-of-the-art performance, reaching 93.5% PDMS in NAVSIM v1 and 87.1% EPDMS in NAVSIM v2 without extra data, demonstrating superior safety-critical capabilities, including collision avoidance and compliance with rules, while maintaining high trajectory quality in various driving scenarios.
Chinese: DriveSuprim通过采用从粗到精的候选轨迹筛选、基于旋转的数据增强和自蒸馏框架,显著提升了自动驾驶车辆的安全性,在多种驾驶场景中实现了最先进的碰撞规避和规则遵循性能。
English: DriveSuprim enhances autonomous vehicle safety by employing a coarse-to-fine candidate filtering, rotation-based augmentation, and self-distillation, achieving state-of-the-art performance in collision avoidance and rule compliance across diverse driving scenarios.
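The coarse-to-fine filtering idea can be made concrete with a two-stage selector: a cheap score prunes thousands of candidates to a short list, then an expensive score picks the winner. The scoring functions below are placeholders for the paper's learned heads, and the trajectory format is an assumption.

```python
import numpy as np

def coarse_to_fine_select(candidates: np.ndarray, coarse_score, fine_score, k: int = 32):
    """Two-stage selection over trajectory candidates of shape (N, T, 2).

    Stage 1 keeps the k best under a cheap score; stage 2 re-scores only the
    survivors with a more expensive criterion, so fine-grained comparison
    happens among a manageable short list rather than thousands of options.
    """
    coarse = np.array([coarse_score(c) for c in candidates])
    keep = np.argsort(coarse)[-k:]                     # top-k survivors
    fine = np.array([fine_score(candidates[i]) for i in keep])
    return candidates[keep[np.argmax(fine)]]

# Toy usage: 1000 random 2D trajectories; prefer short, then smooth, paths.
rng = np.random.default_rng(0)
cands = rng.normal(size=(1000, 20, 2)).cumsum(axis=1)
length = lambda c: -np.linalg.norm(np.diff(c, axis=0), axis=1).sum()
smooth = lambda c: -np.abs(np.diff(c, 2, axis=0)).sum()
best = coarse_to_fine_select(cands, length, smooth, k=32)
```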
Authors:Sam Earle, Ahmed Khalifa, Muhammad Umair Nasir, Zehua Jiang, Graham Todd, Andrzej Banburski-Fahey, Julian Togelius
Abstract:
There is much interest in using large pre-trained models in Automatic Game Design (AGD), whether via the generation of code, assets, or more abstract conceptualization of design ideas. But so far this interest largely stems from the ad hoc use of such generative models under persistent human supervision. Much work remains to show how these tools can be integrated into longer-time-horizon AGD pipelines, in which systems interface with game engines to test generated content autonomously. To this end, we introduce ScriptDoctor, a Large Language Model (LLM)-driven system for automatically generating and testing games in PuzzleScript, an expressive but highly constrained description language for turn-based puzzle games over 2D gridworlds. ScriptDoctor generates and tests game design ideas in an iterative loop, where human-authored examples are used to ground the system's output, compilation errors from the PuzzleScript engine are used to elicit functional code, and search-based agents play-test generated games. ScriptDoctor serves as a concrete example of the potential of automated, open-ended LLM-based workflows in generating novel game content.
中文:目前大型预训练模型在自动游戏设计中的应用多依赖人工监督,而ScriptDoctor作为一个基于大语言模型的系统,能在PuzzleScript中通过迭代生成与测试游戏,展现了自动化生成新颖游戏内容的潜力。
English: There is growing interest in using large pre-trained models for Automatic Game Design, but current applications rely heavily on human supervision, whereas ScriptDoctor demonstrates an automated LLM-driven system that iteratively generates and tests games in PuzzleScript, showcasing the potential for open-ended content creation.
Authors:Stefano Scanzio, Gabriele Formis, Pietro Chiavassa, Lukasz Wisniewski, Gianluca Cena
Abstract:
Executable QR codes, or sQRy, is a technology introduced in 2022 that makes it possible to embed a runnable program inside a QR code, enabling interaction with the user even in the absence of an Internet connection. sQRy enables a range of practical applications, including network equipment configuration, diagnostics, and enhanced smart manuals in industrial contexts. Many other non-industry-related fields can also benefit from this technology. Regardless of where sQRy are used, text strings are among the most commonly embedded data. However, due to strict limitations on the available payload, the space occupied by strings limits the length of the programs that can be embedded. In this work, we propose a simple yet effective strategy that reduces the space taken by strings, hence broadening sQRy applicability.
Chinese: 2022年问世的sQRy技术可在二维码中嵌入可执行程序实现离线交互,但字符串数据因有效载荷限制影响程序长度,为此提出一种节省空间的策略以拓宽其应用范围。
English: The 2022-developed sQRy technology embeds executable programs in QR codes for offline use, but string data limits program length due to payload constraints, prompting a new space-saving strategy to enhance its applicability.
Authors:Pietro Chiavassa, Stefano Scanzio, Gianluca Cena
Abstract:
To ensure an unprecedented degree of flexibility, next-generation Industry 4.0/5.0 production plants increasingly rely on mobile devices, e.g., autonomous mobile robots and wearables. In these cases, a major requirement is getting rid of cables through the adoption of wireless networks. To this purpose, Wi-Fi is currently deemed one of the most promising solutions. Achieving reliable over-the-air communication for distributed real-time control applications is, however, not without difficulties. In fact, bounded transmission latency must be ensured for most of the exchanged packets. Moreover, for battery-powered devices, energy consumption also needs to be taken into account. In this paper, a joint simulated analysis of these aspects is carried out to quantitatively evaluate what we can practically expect from Wi-Fi technology.
中文摘要:下一代工业4.0/5.0工厂为提升灵活性日益采用Wi-Fi无线网络连接移动设备,但确保可靠的低延迟通信与能效平衡仍面临挑战。
English Summary: Next-generation Industry 4.0/5.0 plants increasingly use wireless Wi-Fi networks for mobile devices to enhance flexibility, though ensuring reliable low-latency communication and energy efficiency remains challenging.
Authors:John Kirchenbauer, Janny Mongkolsupawan, Yuxin Wen, Tom Goldstein, Daphne Ippolito
Abstract:
When language models are trained on textual data, they acquire both knowledge about the structure of language as well as knowledge of facts about the world. At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users. It is well known that language models can verbatim memorize long sequences from their training data. However, it is much less well understood how language models memorize facts seen during training. In this work, we propose a new dataset to specifically empower researchers to study the dual processes of fact memorization and verbatim sequence memorization. The dataset consists of synthetically-generated, webtext-like documents about fictional events, as well as question-answer pairs about the events. We conduct training experiments showing how synthetic data about fictional events can be effective in teasing apart different forms of memorization. We also document the challenges in effectively building realistic, fictional synthetic data.
Chinese: 语言模型在训练中同时学习语言结构和事实知识,本研究通过构建合成数据集来区分事实记忆与逐字序列记忆,并揭示了创建逼真虚构数据的挑战。
English: Language models acquire both linguistic structure and factual knowledge during training, and this study introduces a synthetic dataset to distinguish between fact memorization and verbatim sequence memorization, revealing the challenges in creating realistic fictional data.
Authors:Ziyue Zhu, Shenlong Wang, Jin Xie, Jiang-jiang Liu, Jingdong Wang, Jian Yang
Abstract:
Recent advancements in camera-based occupancy prediction have focused on the simultaneous prediction of 3D semantics and scene flow, a task that presents significant challenges due to specific difficulties, e.g., occlusions and unbalanced dynamic environments. In this paper, we analyze these challenges and their underlying causes. To address them, we propose a novel regularization framework called VoxelSplat. This framework leverages recent developments in 3D Gaussian Splatting to enhance model performance in two key ways: (i) Enhanced Semantics Supervision through 2D Projection: During training, our method decodes sparse semantic 3D Gaussians from 3D representations and projects them onto the 2D camera view. This provides additional supervision signals in the camera-visible space, allowing 2D labels to improve the learning of 3D semantics. (ii) Scene Flow Learning: Our framework uses the predicted scene flow to model the motion of Gaussians, and is thus able to learn the scene flow of moving objects in a self-supervised manner using the labels of adjacent frames. Our method can be seamlessly integrated into various existing occupancy models, enhancing performance without increasing inference time. Extensive experiments on benchmark datasets demonstrate the effectiveness of VoxelSplat in improving the accuracy of both semantic occupancy and scene flow estimation. The project page and codes are available at https://zzy816.github.io/VoxelSplat-Demo/.
中文摘要:VoxelSplat框架通过3D高斯泼溅技术,利用2D投影监督增强3D语义学习,并以自监督方式实现场景流估计,在不增加推理时间的情况下显著提升了占据预测的准确性。
English Summary: The VoxelSplat framework enhances camera-based occupancy prediction by leveraging 3D Gaussian Splatting to improve 3D semantic learning through 2D projection supervision and enable self-supervised scene flow estimation, boosting performance without added inference cost.
Authors:Xingjian Ran, Yixuan Li, Linning Xu, Mulin Yu, Bo Dai
Abstract:
Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layout that sacrifice flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization and physical plausibility.
中文: DirectLayout是一种创新框架,利用大型语言模型的空间推理能力,直接从文本描述生成开放词汇的3D室内布局,通过多阶段生成和迭代对齐实现了卓越的泛化能力和物理合理性。
English: DirectLayout is a novel framework that leverages large language models for spatial reasoning to generate open-vocabulary 3D indoor layouts directly from text descriptions, achieving superior generalization and physical realism through multi-stage generation and iterative alignment.
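To make the three-stage decomposition concrete, here is a minimal sketch of the inference flow, assuming a generic `llm(prompt) -> str` callable; the function name, prompt wording, and JSON schema are illustrative, not the paper's actual interface.

```python
import json

def generate_layout(description: str, llm) -> list[dict]:
    # Stage 1: produce a Bird's-Eye View (BEV) layout as JSON objects.
    bev = llm(f"Scene: {description}\n"
              "List each object as JSON {name, x, y, w, d, orientation}.")
    # Stage 2: lift the BEV layout into 3D by adding elevation and height.
    lifted = llm(f"BEV layout: {bev}\n"
                 "Add z and height fields so every object rests on a plausible support.")
    # Stage 3: refine placements (resolve collisions, align to walls).
    refined = llm(f"3D layout: {lifted}\n"
                  "Fix overlaps and out-of-room placements; return JSON only.")
    return json.loads(refined)
```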
Authors:Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
Abstract:
Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.
Chinese: Common Pile v0.1 是一个八万亿字节的开放许可文本集合,用于大语言模型预训练,基于此训练的模型(如 Comma v0.1)在性能上可与使用非许可数据的模型相媲美。
English: The Common Pile v0.1 is an eight-terabyte collection of openly licensed text for LLM pretraining, and models trained on it, like Comma v0.1, achieve performance competitive with those using unlicensed data.
Authors:Zihan Xu, Mengxian Hu, Kaiyan Xiao, Qin Fang, Chengju Liu, Qijun Chen
Abstract:
Human motion retargeting for humanoid robots, transferring human motion data to robots for imitation, presents significant challenges but offers considerable potential for real-world applications. Traditionally, this process relies on human demonstrations captured through pose estimation or motion capture systems. In this paper, we explore a text-driven approach to mapping human motion to humanoids. To address the inherent discrepancies between the generated motion representations and the kinematic constraints of humanoid robots, we propose an angle signal network based on norm-position and rotation loss (NPR Loss). It generates joint angles, which serve as inputs to a reinforcement learning-based whole-body joint motion control policy. The policy ensures tracking of the generated motions while maintaining the robot's stability during execution. Our experimental results demonstrate the efficacy of this approach, successfully transferring text-driven human motion to a real humanoid robot NAO.
中文摘要:本文提出了一种基于文本驱动的人形机器人运动重定向方法,通过采用带NPR损失的角信号网络和强化学习控制策略,成功将人类动作迁移至NAO机器人并保持其运动稳定性。
English Summary: This paper introduces a text-driven method for humanoid robot motion retargeting using an angle signal network with NPR Loss and reinforcement learning control, successfully transferring human motions to a NAO robot while maintaining stability.
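The abstract does not spell out the NPR loss, so the following is only a plausible sketch of a norm-position-and-rotation objective in PyTorch; the term weighting, rotation parameterization, and tensor shapes are all assumptions.

```python
import torch

def npr_loss(pred_pos, tgt_pos, pred_rot, tgt_rot, w_pos=1.0, w_rot=1.0):
    """Hypothetical shapes: pred_pos/tgt_pos are (B, J, 3) joint positions;
    pred_rot/tgt_rot are (B, J, 6) rotations in a 6D representation."""
    pos_term = torch.norm(pred_pos - tgt_pos, dim=-1).mean()  # position norm
    rot_term = torch.norm(pred_rot - tgt_rot, dim=-1).mean()  # rotation error
    return w_pos * pos_term + w_rot * rot_term
```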
Authors:Tobias Pielok, Bernd Bischl, David Rügamer
Abstract:
Semi-implicit variational inference (SIVI) is a powerful framework for approximating complex posterior distributions, but training with the Kullback-Leibler (KL) divergence can be challenging due to high variance and bias in high-dimensional settings. While current state-of-the-art semi-implicit variational inference methods, particularly Kernel Semi-Implicit Variational Inference (KSIVI), have been shown to work in high dimensions, training remains moderately expensive. In this work, we propose a kernelized KL divergence estimator that stabilizes training through nonparametric smoothing. To further reduce the bias, we introduce an importance sampling correction. We provide a theoretical connection to the amortized version of the Stein variational gradient descent, which estimates the score gradient via Stein's identity, showing that both methods minimize the same objective, but our semi-implicit approach achieves lower gradient variance. In addition, our method's bias in function space is benign, leading to more stable and efficient optimization. Empirical results demonstrate that our method outperforms or matches state-of-the-art SIVI methods in both performance and training efficiency.
Chinese: 本研究提出的核化KL散度估计器通过重要性采样校正降低了梯度方差和良性偏差,从而稳定了半隐式变分推断训练,在效率和性能上均优于现有最优方法。
English: The proposed kernelized KL divergence estimator with importance sampling correction stabilizes SIVI training by reducing gradient variance and benign bias, outperforming state-of-the-art methods in both efficiency and performance.
Authors:Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang
Abstract:
We introduce SeedEdit 3.0, as a companion to our T2I model Seedream 3.0, which significantly improves over our previous SeedEdit versions in both aspects of edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. In addition to the T2I model upgrade, in this report we present several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and the meta information helps connect the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and reward losses. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks for real/synthetic image editing, where it achieves the best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%), and Gemini 2.0 (30.3%).
中文: SeedEdit 3.0 与文生图模型 Seedream 3.0 配合,通过优化数据构建和联合学习流程,显著提升了真实图像编辑的指令遵循与内容保持能力,实现了 56.1% 的最高可用率。
English: SeedEdit 3.0, paired with the T2I model Seedream 3.0, enhances edit instruction adherence and content preservation on real images, achieving a 56.1% usability rate through improved data curation and joint learning pipelines.
Authors:Jinting Wang, Shan Yang, Chenxing Li, Dong Yu, Li Liu
Abstract:
Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. Most existing research focuses on CS Recognition (CSR), which transcribes video content into text. Consequently, a common solution for CSV2S is to integrate CSR with a text-to-speech (TTS) system. However, this pipeline relies on text as an intermediate medium, which may lead to error propagation and temporal misalignment between speech and CS video dynamics. In contrast, directly generating audio speech from CS video (direct CSV2S) often suffers from the inherent multimodal complexity and the limited availability of CS data. To address these challenges, we propose UniCUE, the first unified framework for CSV2S that directly generates speech from CS videos without relying on intermediate text. The core innovation of UniCUE lies in integrating an understanding task (CSR) that provides fine-grained CS visual-semantic cues to guide speech generation. Specifically, UniCUE incorporates a pose-aware visual processor, a semantic alignment pool that enables precise visual-semantic mapping, and a VisioPhonetic adapter to bridge the understanding and generation tasks within a unified architecture. To support this framework, we construct UniCUE-HI, a large-scale Mandarin CS dataset containing 11282 videos from 14 cuers, including both hearing-impaired and normal-hearing individuals. Extensive experiments on this dataset demonstrate that UniCUE achieves state-of-the-art performance across multiple evaluation metrics.
中文:UniCUE是一种统一框架,通过整合视觉语义理解直接从语音提示视频生成语音,无需中间文本,并在新构建的大规模汉语数据集上实现了最优性能。
English: UniCUE is a unified framework that directly generates speech from Cued Speech videos by integrating visual-semantic understanding without intermediate text, achieving state-of-the-art performance on a newly constructed Mandarin dataset.
Authors:Tobias Pielok, Bernd Bischl, David Rügamer
Abstract:
Recent years have witnessed growing interest in semi-implicit variational inference (SIVI) methods due to their ability to rapidly generate samples from complex distributions. However, since the likelihood of these samples is non-trivial to estimate in high dimensions, current research focuses on finding effective SIVI training routines. Although unbiased implicit variational inference (UIVI) has largely been dismissed as imprecise and computationally prohibitive because of its inner MCMC loop, we revisit this method and show that UIVI's MCMC loop can be effectively replaced via importance sampling and the optimal proposal distribution can be learned stably by minimizing an expected forward Kullback-Leibler divergence without bias. Our refined approach demonstrates superior performance or parity with state-of-the-art methods on established SIVI benchmarks.
Chinese: 最新研究重新审视了无偏隐式变分推断(UIVI),通过重要性采样替代其MCMC循环并稳定最小化期望前向K-L散度来学习最优提案分布,从而克服计算瓶颈,在现有SIVI基准测试中展现出优于或持平最先进方法的性能。
English: Recent research revisits unbiased implicit variational inference (UIVI), overcoming its computational limitations by replacing the MCMC loop with importance sampling and learning the optimal proposal distribution through stable minimization of the expected forward Kullback-Leibler divergence, achieving performance on par with or superior to current SIVI benchmarks.
Authors:Man Luo, David Cobbley, Xin Su, Shachar Rosenman, Vasudev Lal, Shao-Yen Tseng, Phillip Howard
Abstract:
Computer use agents (CUA) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUAs have made significant progress with the advent of large vision-language models (VLMs). However, these agents typically rely on cloud-based inference with substantial compute demands, raising critical privacy and scalability concerns, especially when operating on personal devices. In this work, we take a step toward privacy-preserving and resource-efficient agents by developing a lightweight vision-language model that runs entirely on local machines. To train this compact agent, we introduce an LLM-as-Judge framework that automatically evaluates and filters synthetic interaction trajectories, producing high-quality data for reinforcement learning without human annotation. Experiments on the OS-World benchmark demonstrate that our fine-tuned local model outperforms existing baselines, highlighting a promising path toward private, efficient, and generalizable GUI agents.
中文: 本研究开发了一种轻量级视觉语言模型,通过本地化运行提升计算机使用代理的隐私性与效率,并采用自动数据筛选框架进行训练,在基准测试中展现出优越性能。
English: This work introduces a lightweight vision-language model that operates locally to enhance privacy and efficiency in computer use agents, utilizing an automated data filtering framework for training and demonstrating superior performance on benchmarks.
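A minimal sketch of the LLM-as-Judge filtering step, assuming a `judge(task, trajectory) -> float` scorer in [0, 1]; the interface and threshold are illustrative placeholders.

```python
def filter_trajectories(trajectories, judge, threshold=0.8):
    """Keep only the synthetic GUI trajectories the judge rates as successful,
    yielding training data for RL without human annotation."""
    kept = []
    for task, trajectory in trajectories:
        if judge(task, trajectory) >= threshold:
            kept.append((task, trajectory))
    return kept
```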
Authors:Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu
Abstract:
Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning--especially during inference with extremely long outputs--has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model's System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model's reasoning capability. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels. Our method significantly reduces the number of output tokens by nearly 40% while maintaining the accuracy of the reasoning. Our code and data will be available soon.
中文: 本研究提出了一种动态比例训练方法,在DeepSeek模型和多种基准测试中验证了其能在保持推理准确性的同时,将输出标记数量减少近40%。
English: This study introduces a dynamic ratio-based training pipeline that reduces output tokens by nearly 40% while preserving reasoning accuracy, validated across DeepSeek models and multiple benchmarks.
Authors:Luca Arnaboldi, Bruno Loureiro, Ludovic Stephan, Florent Krzakala, Lenka Zdeborova
Abstract:
We study the dynamics of stochastic gradient descent (SGD) for a class of sequence models termed Sequence Single-Index (SSI) models, where the target depends on a single direction in input space applied to a sequence of tokens. This setting generalizes classical single-index models to the sequential domain, encompassing simplified one-layer attention architectures. We derive a closed-form expression for the population loss in terms of a pair of sufficient statistics capturing semantic and positional alignment, and characterize the induced high-dimensional SGD dynamics for these coordinates. Our analysis reveals two distinct training phases: escape from uninformative initialization and alignment with the target subspace, and demonstrates how the sequence length and positional encoding influence convergence speed and learning trajectories. These results provide a rigorous and interpretable foundation for understanding how sequential structure in data can be beneficial for learning with attention-based models.
中文: 本研究分析了序列单指标模型中随机梯度下降的动态过程,揭示了两个不同的训练阶段,并证明了序列长度和位置编码如何影响基于注意力架构的学习轨迹。
English: This research analyzes stochastic gradient descent dynamics in Sequence Single-Index models, revealing two distinct training phases and demonstrating how sequence length and positional encoding affect learning trajectories in attention-based architectures.
Authors:Zhaoyang Li, Haodong Zhou, Longjie Luo, Xiaoxiao Li, Yongxin Chen, Lin Li, Qingyang Hong
Abstract:
This paper presents the system developed for Task 1 of the Multi-modal Information-based Speech Processing (MISP) 2025 Challenge. We introduce CASA-Net, an embedding fusion method designed for end-to-end audio-visual speaker diarization (AVSD) systems. CASA-Net incorporates a cross-attention (CA) module to effectively capture cross-modal interactions in audio-visual signals and employs a self-attention (SA) module to learn contextual relationships among audio-visual frames. To further enhance performance, we adopt a training strategy that integrates pseudo-label refinement and retraining, improving the accuracy of timestamp predictions. Additionally, median filtering and overlap averaging are applied as post-processing techniques to eliminate outliers and smooth prediction labels. Our system achieved a diarization error rate (DER) of 8.18% on the evaluation set, representing a relative improvement of 47.3% over the baseline DER of 15.52%.
Chinese: 本文提出CASA-Net嵌入融合方法,通过交叉注意力和自注意力模块捕捉视听信号的跨模态交互与上下文关系,在说话人日志任务中实现了8.18%的日志错误率,相比基线相对提升47.3%。
English: This paper introduces CASA-Net, an embedding fusion method for audio-visual speaker diarization that uses cross-attention and self-attention modules to capture cross-modal interactions and contextual relationships, achieving a 47.3% relative improvement over the baseline with a diarization error rate of 8.18%.
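A simplified PyTorch sketch of the fusion idea: a cross-attention (CA) module lets audio frames attend to the visual stream, and a self-attention (SA) module then models context over the fused sequence. Dimensions, head counts, and residual placement are illustrative, not the challenge system's exact configuration.

```python
import torch
import torch.nn as nn

class CASABlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # CA: audio frames query the visual stream (cross-modal interaction).
        fused, _ = self.cross_attn(query=audio, key=visual, value=visual)
        fused = self.norm1(audio + fused)
        # SA: contextual relationships among fused audio-visual frames.
        ctx, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + ctx)

# Fused audio-visual embeddings of shape (2, 100, 256).
x = CASABlock()(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
```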
Authors:Zhaoyang Li, Jie Wang, XiaoXiao Li, Wangjie Li, Longjie Luo, Lin Li, Qingyang Hong
Abstract:
In speaker diarization, traditional clustering-based methods remain widely used in real-world applications. However, these methods struggle with the complex distribution of speaker embeddings and overlapping speech segments. To address these limitations, we propose an Overlapping Community Detection method based on Graph Attention networks and the Label Propagation Algorithm (OCDGALP). The proposed framework comprises two key components: (1) a graph attention network that refines speaker embeddings and node connections by aggregating information from neighboring nodes, and (2) a label propagation algorithm that assigns multiple community labels to each node, enabling simultaneous clustering and overlapping community detection. Experimental results show that the proposed method significantly reduces the Diarization Error Rate (DER), achieving a state-of-the-art 15.94% DER on the DIHARD-III dataset without oracle Voice Activity Detection (VAD), and an impressive 11.07% with oracle VAD.
Chinese: 针对说话人日志中传统聚类方法处理复杂嵌入分布和重叠语音的不足,我们提出OCDGALP框架,通过图注意力网络优化嵌入表示和标签传播算法检测重叠社区,在DIHARD-III数据集上无需预知语音活动即实现15.94%的领先错误率。
English: To overcome the limitations of traditional clustering methods in speaker diarization, we introduce OCDGALP, a novel framework combining graph attention networks for refining embeddings and label propagation for overlapping community detection, which achieves state-of-the-art performance with a 15.94% DER on DIHARD-III without oracle VAD.
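A toy sketch of the overlapping label propagation step: each node (speaker segment) keeps every community label whose neighbor support clears a threshold, so overlapped speech can carry multiple speaker labels. In the actual system the edge weights come from the GAT-refined graph; the threshold and update rule below are assumptions.

```python
import numpy as np

def propagate(adj: np.ndarray, labels: list[set], threshold=0.3, iters=10):
    """adj: (n, n) weighted adjacency; labels[i]: set of community labels."""
    n = adj.shape[0]
    for _ in range(iters):
        for i in range(n):
            votes: dict = {}
            for j in np.nonzero(adj[i])[0]:
                for lab in labels[j]:
                    votes[lab] = votes.get(lab, 0.0) + adj[i, j]
            total = sum(votes.values())
            if total > 0:  # keep all labels with enough neighbor support
                labels[i] = {l for l, v in votes.items() if v / total >= threshold}
    return labels
```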
Authors:Zehua Liu, Xiaolou Li, Li Guo, Lantian Li, Dong Wang
Abstract:
Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to effectively utilize LLMs in VSR tasks remains unexplored. This paper systematically explores how to better leverage LLMs for VSR tasks and provides three key contributions: (1) Scaling Test: We study how the LLM size affects VSR performance, confirming a scaling law in the VSR task. (2) Context-Aware Decoding: We add contextual text to guide the LLM decoding, improving recognition accuracy. (3) Iterative Polishing: We propose iteratively refining LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that by these designs, the great potential of LLMs can be largely harnessed, leading to significant VSR performance improvement.
Chinese: 本文通过引入规模测试、上下文感知解码和迭代优化,系统性地探索了如何利用大型语言模型提升视觉语音识别性能,从而显著挖掘其潜力并改善识别效果。
English: This paper systematically explores leveraging Large Language Models (LLMs) for Visual Speech Recognition (VSR) by introducing scaling tests, context-aware decoding, and iterative polishing, which collectively enhance performance by harnessing LLMs' potential.
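A minimal sketch of the iterative polishing loop, assuming a generic `llm(prompt) -> str` callable; the prompts and stopping rule are illustrative, not the paper's exact procedure.

```python
def iterative_polish(initial_transcript: str, llm, rounds=3) -> str:
    hypothesis = initial_transcript
    for _ in range(rounds):
        revised = llm(f"Candidate lipreading transcript: {hypothesis}\n"
                      "Correct likely recognition errors; return the transcript only.")
        if revised == hypothesis:  # converged, stop early
            break
        hypothesis = revised
    return hypothesis
```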
Authors:Zehua Liu, Xiaolou Li, Chen Chen, Lantian Li, Dong Wang
Abstract:
This paper presents the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), which builds on CNVSRC 2023 to advance research in Chinese Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR). The challenge evaluates two test scenarios: reading in recording studios and Internet speech. CNVSRC 2024 uses the same datasets as its predecessor CNVSRC 2023, which involves CN-CVS for training and CNVSRC-Single/Multi for development and evaluation. However, CNVSRC 2024 introduced two key improvements: (1) a stronger baseline system, and (2) an additional dataset, CN-CVS2-P1, for open tracks to improve data volume and diversity. The new challenge has demonstrated several important innovations in data preprocessing, feature extraction, model design, and training strategies, further pushing the state-of-the-art in Chinese LVC-VSR. More details and resources are available at the official website.
中文:CNVSRC 2024通过引入更强基线系统和新增数据集,推动了中文连续视觉语音识别在数据处理与模型训练方面的创新进展。
English: CNVSRC 2024 advances Chinese visual speech recognition by introducing enhanced baseline systems and additional datasets, driving innovations in data processing and model training.
Authors:Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Satoshi Asakawa
Abstract:
This paper reports on the development of a large-scale speech recognition model, Whale. Similar to models such as Whisper and OWSM, Whale leverages both a large model size and a diverse, extensive dataset. Whale's architecture integrates a w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises varied speech data drawn not only from public corpora but also from in-house data, thereby enhancing the model's robustness to different speaking styles and acoustic conditions. Through evaluations on multiple benchmarks, Whale achieved comparable performance to existing models. In particular, it achieves a word error rate of 2.4% on the Librispeech test-clean set and a character error rate of 3.4% on the CSJ eval3 set, outperforming Whisper large-v3 and OWSM v3.1.
中文: Whale语音识别模型融合先进架构与多样化训练数据,在多项基准测试中性能超越Whisper和OWSM等现有模型。
English: The Whale speech recognition model combines advanced architecture and diverse training data to achieve state-of-the-art performance, surpassing models like Whisper and OWSM on key benchmarks.
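Joint CTC-attention decoding is conventionally an interpolation of the two log-probabilities; a standard formulation (the abstract does not give Whale's exact weighting) is:

```latex
\hat{Y} = \arg\max_{Y} \bigl\{ \lambda \log p_{\mathrm{CTC}}(Y \mid X) + (1 - \lambda)\,\log p_{\mathrm{att}}(Y \mid X) \bigr\}, \qquad \lambda \in [0, 1]
```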
Authors:Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, Jiang Bian
Abstract:
Autoregressive video models offer distinct advantages over bidirectional diffusion models in creating interactive video content and supporting streaming applications with arbitrary duration. In this work, we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling iterative sampling and efficient inference via parallel token generation within each frame. Nonetheless, achieving real-time video generation remains a significant challenge for such models, primarily due to the high computational cost associated with diffusion sampling and the hardware inefficiencies inherent to autoregressive generation. To address this, we introduce two innovations: (1) We extend consistency distillation to the video domain and adapt it specifically for video models, enabling efficient inference with few sampling steps; (2) To fully leverage parallel computation, motivated by the observation that adjacent frames often share the identical action input, we propose speculative sampling. In this approach, the model generates the next few frames using the current action input and discards the speculatively generated frames if the input action differs. Experiments on a large-scale action-conditioned video generation benchmark demonstrate that NFD beats autoregressive baselines in terms of both visual quality and sampling efficiency. For the first time, we achieve autoregressive video generation at over 30 Frames Per Second (FPS) on an A100 GPU using a 310M model.
Chinese: Next-Frame Diffusion (NFD) 提出了一种具有块级因果注意力的自回归扩散变换器,并引入了针对视频的一致性蒸馏和推测采样创新,在A100 GPU上实现了超过30 FPS的高效高质量视频生成。
English: Next-Frame Diffusion (NFD) introduces an autoregressive diffusion transformer with block-wise causal attention and innovations like consistency distillation and speculative sampling, enabling efficient, high-quality video generation at over 30 FPS on an A100 GPU.
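A sketch of action-conditioned speculative sampling: generate the next k frames assuming the current action repeats, then keep only the prefix for which the action actually matched. `model.sample` is a placeholder interface, not NFD's real API.

```python
def speculative_generate(model, frames, actions, k=4):
    """frames: initial context; actions: list of per-frame action inputs."""
    out, t = list(frames), 0
    while t < len(actions):
        # Speculate k frames assuming actions[t] repeats (parallel-friendly).
        spec = model.sample(context=out, action=actions[t], num_frames=k)
        accepted = 0
        for i, frame in enumerate(spec):
            if t + i >= len(actions) or actions[t + i] != actions[t]:
                break                 # input action changed: discard the rest
            out.append(frame)
            accepted += 1
        t += max(accepted, 1)         # always make progress
    return out
```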
Authors:Pengyu Ren, Wenhao Guan, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li
Abstract:
In recent years, diffusion-based generative models have demonstrated remarkable performance in speech conversion, including Denoising Diffusion Probabilistic Models (DDPM) and others. However, the advantages of these models come at the cost of requiring a large number of sampling steps. This limitation hinders their practical application in real-world scenarios. In this paper, we introduce ReFlow-VC, a novel high-fidelity speech conversion method based on rectified flow. Specifically, ReFlow-VC is an Ordinary Differential Equation (ODE) model that transforms a Gaussian distribution to the true Mel-spectrogram distribution along the most direct path. Furthermore, we propose a modeling approach that optimizes speaker features by utilizing both content and pitch information, allowing speaker features to reflect the properties of the current speech more accurately. Experimental results show that ReFlow-VC performs exceptionally well in small datasets and zero-shot scenarios.
中文:ReFlow-VC是一种基于整流流的高保真语音转换方法,通过优化包含内容和音高信息的说话人特征,能够沿最直接路径将高斯分布转换为梅尔频谱图分布,在小数据集和零样本场景中表现优异。
English: ReFlow-VC is a high-fidelity speech conversion method using rectified flow that enables efficient transformation from Gaussian noise to Mel-spectrograms while optimizing speaker features with content and pitch information, achieving strong performance in small datasets and zero-shot scenarios.
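A generic rectified-flow sampler sketch: Euler-integrate a learned velocity field along the near-straight path from Gaussian noise to the mel-spectrogram distribution. `velocity` is a placeholder for the conditional network; the step count and conditioning interface are illustrative.

```python
import torch

@torch.no_grad()
def reflow_sample(velocity, cond, shape, steps=10):
    x = torch.randn(shape)                    # start from N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + velocity(x, t, cond) * dt     # Euler step along the ODE
    return x                                  # approximate mel-spectrogram
```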
Authors:Shenghui Lu, Hukai Huang, Jinanglong Yao, Kaidi Wang, Qingyang Hong, Lin Li
Abstract:
This paper proposes a model that integrates sub-band processing and deep filtering to fully exploit information from the target time-frequency (TF) bin and its surrounding TF bins for single-channel speech enhancement. The sub-band module captures surrounding frequency bin information at the input, while the deep filtering module applies filtering at the output to both the target TF bin and its surrounding TF bins. To further improve the model performance, we decouple deep filtering into temporal and frequency components and introduce a two-stage framework, reducing the complexity of filter coefficient prediction at each stage. Additionally, we propose the TAConv module to strengthen convolutional feature extraction. Experimental results demonstrate that the proposed hierarchical deep filtering network (HDF-Net) effectively utilizes surrounding TF bin information and outperforms other advanced systems while using fewer resources.
Chinese: 本文提出了一种分层深度滤波网络(HDF-Net),通过结合子带处理和深度滤波来利用目标时频单元及其周围信息进行语音增强,实验证明该模型在减少资源消耗的同时优于其他先进系统。
English: This paper introduces a hierarchical deep filtering network (HDF-Net) that enhances speech by integrating sub-band processing and deep filtering to utilize surrounding time-frequency bin information, achieving superior performance with reduced complexity and resource usage.
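The deep filtering output can be sketched as a complex-weighted sum over the target TF bin and its past frames; the paper's two-stage temporal/frequency decoupling is collapsed into a single temporal filter here for brevity, and edge frames wrap for simplicity.

```python
import torch

def deep_filter(noisy_stft, coeffs, order=5):
    """noisy_stft: (B, F, T) complex spectrogram; coeffs: (B, F, T, order)
    complex filters over the current and previous order-1 frames."""
    frames = [torch.roll(noisy_stft, shifts=k, dims=-1) for k in range(order)]
    stacked = torch.stack(frames, dim=-1)     # (B, F, T, order) past context
    return (coeffs * stacked).sum(dim=-1)     # filtered TF bins
```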
Authors:Songtao Jiang, Chenyi Zhou, Yan Zhang, Yeying Jin, Zuozhu Liu
Abstract:
Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. This issue stems primarily from a lack of focus on key visual elements, raising two important questions: Are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. For straightforward questions, FOCUS supports efficient zero-shot reasoning. For more complex tasks, it employs the conceptualizing before observation strategy to highlight critical elements. Extensive experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs, achieving significant gains across all datasets. Ablation studies further validate the importance of combining diverse cognitive strategies with refined visual information for superior performance. Code will be released.
中文: 当前多模态大语言模型在视觉问答中因无差别标注对象而存在局限,但提出的FOCUS方法通过结合直觉与分析推理动态适应问题复杂度,在多个基准测试中显著提升了性能。
English: Current multimodal large language models face limitations in Visual Question Answering due to indiscriminate object annotation, but the proposed FOCUS approach dynamically adapts to question complexity by integrating intuitive and analytical reasoning, significantly improving performance across multiple benchmarks.
Authors:Jiahui Geng, Thy Thy Tran, Preslav Nakov, Iryna Gurevych
Abstract:
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, we exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction. We optimize these adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental implications of MLLMs' sophisticated understanding. Unlike prior work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLM safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new Attack Response Categorization (ARC) framework, which evaluates both the quality of the model's response and its relevance to the malicious instructions. Experimental results demonstrate that Con Instruction effectively bypasses safety mechanisms in multiple vision- and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, evaluated on two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). On the defense side, we explore various countermeasures against our attacks and uncover a substantial performance gap among existing techniques. Our implementation is made publicly available.
Chinese: 本研究提出Con Instruction方法,通过生成对抗性图像或音频利用多模态语言模型解析非文本指令的能力,有效绕过安全机制,在LLaVA-v1.5等模型上实现了高达86.6%的攻击成功率。
English: This study introduces Con Instruction, a novel method that generates adversarial images or audio to exploit multimodal language models' ability to interpret non-textual instructions, effectively bypassing safety mechanisms and achieving high attack success rates on models like LLaVA-v1.5.
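A generic sketch of embedding-space alignment for an adversarial image: optimize pixels so the vision encoder's output matches the target instruction's embedding. The encoders, resolution, and optimizer settings are placeholders; the actual attack involves model-specific details.

```python
import torch

def craft_adversarial_image(vision_encoder, target_embedding, steps=500, lr=0.01):
    img = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        # Maximize cosine similarity to the target instruction embedding.
        loss = 1 - torch.cosine_similarity(
            vision_encoder(img.clamp(0, 1)).flatten(1),
            target_embedding.flatten(1)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return img.detach().clamp(0, 1)
```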
Authors:Lilit Grigoryan, Vladimir Bataev, Andrei Andrusenko, Hainan Xu, Vitaly Lavrukhin, Boris Ginsburg
Abstract:
Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capabilities, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++. The proposed method utilizes batch operations, a tree-based hypothesis structure, novel blank scoring for enhanced shallow fusion, and CUDA graph execution for efficient GPU inference. This narrows the speed gap between beam and greedy modes to only 10-20% for the whole system, achieves a 14-30% relative improvement in WER compared to greedy decoding, and improves shallow fusion in low-resource settings by up to 11% compared to existing implementations. All the algorithms are open-sourced.
中文: 本文提出了一种通用方法,显著加速了ASR中Transducer模型的束搜索,在保持接近贪心解码速度的同时大幅提升了识别准确率,并增强了浅层融合性能,所有算法均已开源。
English: This paper introduces a universal method to accelerate beam search for Transducer models in ASR, achieving near-greedy decoding speeds with significant accuracy improvements and enhanced shallow fusion capabilities, all while being open-sourced.
Authors:Alireza Salemi, Hamed Zamani
Abstract:
Personalization is essential for question answering systems that are user-centric. Despite its importance, personalization in answer generation has been relatively underexplored. This is mainly due to a lack of resources for training and evaluating personalized question answering systems. We address this gap by introducing LaMP-QA -- a benchmark designed for evaluating personalized long-form answer generation. The benchmark covers questions from three major categories: (1) Arts & Entertainment, (2) Lifestyle & Personal Development, and (3) Society & Culture, encompassing over 45 subcategories in total. To assess the quality and potential impact of the LaMP-QA benchmark for personalized question answering, we conduct comprehensive human and automatic evaluations to compare multiple strategies for assessing generated personalized responses and measure their alignment with human preferences. Furthermore, we benchmark a number of non-personalized and personalized approaches based on open-source and proprietary large language models. Our results show that incorporating the personalized context provided leads to up to 39% performance improvements. The benchmark is publicly released to support future research in this area.
中文: LaMP-QA基准的推出填补了个性化长答案生成领域资源匮乏的空白,研究表明融入个性化上下文可使模型在三大类问题上的表现提升最高达39%。
English: The LaMP-QA benchmark is introduced to address the lack of resources for personalized long-form answer generation, showing that incorporating personal context can improve performance by up to 39% across diverse question categories.
Authors:Travis Dick, Alessandro Epasto, Adel Javanmard, Josh Karlin, Andres Munoz Medina, Vahab Mirrokni, Sergei Vassilvitskii, Peilin Zhong
Abstract:
The analysis of the privacy properties of Privacy-Preserving Ads APIs is an area of research that has received strong interest from academics, industry, and regulators. Despite this interest, the empirical study of these methods is hindered by the lack of publicly available data. Reliable empirical analysis of the privacy properties of an API, in fact, requires access to a dataset consisting of realistic API outputs; however, privacy concerns prevent the general release of such data to the public.
In this work, we develop a novel methodology to construct synthetic API outputs that are simultaneously realistic enough to enable accurate study and provide strong privacy protections. We focus on one Privacy-Preserving Ads API: the Topics API, part of Google Chrome's Privacy Sandbox. We develop a procedure to generate a differentially-private dataset that closely matches the re-identification risk properties of the real Topics API data. The use of differential privacy provides strong theoretical bounds on the leakage of private user information from this release.
Our methodology is based on first computing a large number of differentially-private statistics describing how output API traces evolve over time. Then, we design a parameterized distribution over sequences of API traces and optimize its parameters so that they closely match the statistics obtained. Finally, we create the synthetic data by drawing from this distribution.
Our work is complemented by an open-source release of the anonymized dataset obtained by this methodology. We hope this will enable external researchers to analyze the API in-depth and replicate prior and future work on a realistic large-scale dataset. We believe that this work will contribute to fostering transparency regarding the privacy properties of Privacy-Preserving Ads APIs.
中文: 本研究开发了一种新颖的方法,生成既逼真又具备差分隐私保护的合成API输出数据,通过开源发布该数据集来促进隐私保护广告API的透明度研究。
English: This study introduces a novel method to generate synthetic, differentially private API outputs that closely mimic real data, enabling accurate privacy analysis while protecting user information, and releases an open-source dataset to promote transparency in Privacy-Preserving Ads APIs research.
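The first step of the pipeline (differentially private statistics) can be illustrated with the Laplace mechanism; the sensitivity and epsilon below are placeholders, and the real release calibrates noise to each statistic's true sensitivity.

```python
import numpy as np

def dp_histogram(counts: np.ndarray, epsilon: float, sensitivity: float = 1.0):
    """Release topic-frequency statistics with Laplace noise; clipping at
    zero keeps the noisy counts plausible."""
    noise = np.random.laplace(scale=sensitivity / epsilon, size=counts.shape)
    return np.maximum(counts + noise, 0)
```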
Authors:Lars Ullrich, Walter Zimmer, Ross Greer, Knut Graichen, Alois C. Knoll, Mohan Trivedi
Abstract:
While artificial intelligence (AI) is advancing rapidly and mastering increasingly complex problems with astonishing performance, the safety assurance of such systems is a major concern. Particularly in the context of safety-critical, real-world cyber-physical systems, AI promises to achieve a new level of autonomy but is hampered by a lack of safety assurance. While data-driven control takes up recent developments in AI to improve control systems, control theory in general could be leveraged to improve AI safety. Therefore, this article outlines a new perspective on AI safety based on an interdisciplinary interpretation of the underlying data-generation process and the respective abstraction by AI systems in a system theory-inspired and system analysis-driven manner. In this context, the new perspective, also referred to as data control, aims to stimulate AI engineering to take advantage of existing safety analysis and assurance in an interdisciplinary way to drive the paradigm of data control. Following a top-down approach, a generic foundation for safety analysis and assurance is outlined at an abstract level that can be refined for specific AI systems and applications and is prepared for future innovation.
中文摘要:人工智能在快速发展的同时,其安全性问题日益凸显,特别是在关键网络物理系统中;为此提出的"数据控制"新视角,通过跨学科方式结合控制理论与AI工程,旨在建立系统化的安全分析保障框架。
English Summary: AI's rapid advancement raises significant safety concerns, especially in critical cyber-physical systems, prompting a new interdisciplinary approach called "data control" that combines AI engineering with control theory for enhanced safety assurance.
Authors:Qixuan Liu, Shi Qiu, Yinqiao Wang, Xiwen Wu, Kenneth Siu Ho Chok, Chi-Wing Fu, Pheng-Ann Heng
Abstract:
Volumetric medical imaging technologies produce detailed 3D representations of anatomical structures. However, effective medical data visualization and exploration pose significant challenges, especially for individuals with limited medical expertise. We introduce a novel XR-based system with two key innovations: (1) a coordinated visualization module integrating Multi-layered Multi-planar Reconstruction with 3D mesh models and (2) a multimodal interaction framework combining hand gestures with LLM-enabled voice commands. We conduct preliminary evaluations, including a 15-participant user study and expert interviews, to demonstrate the system's abilities to enhance spatial understanding and reduce cognitive load. Experimental results show notable improvements in task completion times, usability metrics, and interaction effectiveness enhanced by LLM-driven voice control. While identifying areas for future refinement, our findings highlight the potential of this immersive visualization system to advance medical training and clinical practice. Our demo application and supplemental materials are available for download at: https://osf.io/bpjq5/.
中文: 本文介绍了一种创新的XR系统,通过整合多层可视化与手势及LLM语音控制的多模态交互,用户研究证明其能有效提升医学应用中的空间理解能力并降低认知负荷。
English: This paper presents an innovative XR system that integrates multi-layered visualization with multimodal interaction using hand gestures and LLM-powered voice commands, demonstrating improved spatial understanding and reduced cognitive load in medical applications through user studies.
Authors:Yongchan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim
Abstract:
Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.
中文摘要:本研究提出一种基于检索的提示方法,利用句法相似性进行自动术语抽取,通过在三个专业基准测试中验证,证明句法检索相比语义方法能更可靠地识别术语边界并提升F1分数。
English Summary: This study introduces a retrieval-based prompting method using syntactic similarity for Automatic Term Extraction, demonstrating improved F1-scores across benchmarks by leveraging syntactic rather than semantic cues for better term boundary detection.
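A sketch of syntactic demonstration retrieval: represent each sentence by its POS-tag sequence and rank pool sentences by sequence similarity to the query. NLTK's tagger and `difflib`'s matcher stand in for whatever parser and similarity measure the paper actually uses.

```python
from difflib import SequenceMatcher
import nltk  # requires the punkt and POS-tagger data packages

def pos_tags(sentence: str) -> list[str]:
    return [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]

def retrieve_demonstrations(query: str, pool: list[str], k: int = 4) -> list[str]:
    q = pos_tags(query)
    score = lambda s: SequenceMatcher(None, q, pos_tags(s)).ratio()
    return sorted(pool, key=score, reverse=True)[:k]
```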
Authors:Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda
Abstract:
Large language models (LLMs) deliver strong performance but are difficult to deploy due to high memory and compute costs. While pruning reduces these demands, most methods ignore activation sparsity observed at runtime. We reinterpret activation sparsity as dynamic structured weight sparsity and propose DuoGPT, a unified framework that constructs dual-sparse (spMspV) workloads by combining unstructured weight pruning with activation sparsity. To preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with activation-aware calibration and introduce output residuals from the dense model as correction terms. We further optimize the solution for efficient GPU execution, enabling scalability to billion-parameter LLMs. Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39× compared to the baseline dense model. Code is available on GitHub.
中文: DuoGPT是一种双稀疏框架,通过结合非结构化权重剪枝和激活稀疏性,并采用激活感知校准及优化GPU执行,为大语言模型实现了更高的准确性和效率。
English: DuoGPT is a dual-sparse framework that combines unstructured weight pruning with activation sparsity, achieving superior accuracy and efficiency for large language models through activation-aware calibration and optimized GPU execution.
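A toy illustration of the dual-sparse (spMspV) access pattern: weights are pruned statically and zero activations are skipped dynamically, so only the columns matching nonzero inputs are touched. Real DuoGPT kernels fuse this on the GPU; this NumPy version only shows the arithmetic.

```python
import numpy as np

def spmspv(w_pruned: np.ndarray, x: np.ndarray) -> np.ndarray:
    nz = np.nonzero(x)[0]              # dynamic activation sparsity
    return w_pruned[:, nz] @ x[nz]     # touch only the needed columns

w = np.random.randn(8, 16)
w[np.abs(w) < 0.8] = 0.0               # unstructured weight pruning (static)
x = np.maximum(np.random.randn(16), 0) # ReLU-style sparse activations
y = spmspv(w, x)
```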
Authors:Pasquale De Rosa, Simon Queyrut, Yérom-David Bromberg, Pascal Felber, Valerio Schiavoni
Abstract:
The Ethereum Virtual Machine (EVM) is a decentralized computing engine. It enables the Ethereum blockchain to execute smart contracts and decentralized applications (dApps). The increasing adoption of Ethereum sparked the rise of phishing activities. Phishing attacks often target users through deceptive means, e.g., fake websites, wallet scams, or malicious smart contracts, aiming to steal sensitive information or funds. A timely detection of phishing activities in the EVM is therefore crucial to preserve user trust and network integrity. Some state-of-the-art approaches to phishing detection in smart contracts rely on the online analysis of transactions and their traces. However, replaying transactions often exposes sensitive user data and interactions, with several security concerns. In this work, we present PhishingHook, a framework that applies machine learning techniques to detect phishing activities in smart contracts by directly analyzing the contract's bytecode and its constituent opcodes. We evaluate the efficacy of such techniques in identifying malicious patterns, suspicious function calls, or anomalous behaviors within the contract's code itself before it is deployed or interacted with. We experimentally compare 16 techniques, belonging to four main categories (Histogram Similarity Classifiers, Vision Models, Language Models and Vulnerability Detection Models), using 7,000 real-world malware smart contracts. Our results demonstrate the effectiveness of PhishingHook as a phishing classification system, with about 90% average accuracy across all the models. We support experimental reproducibility, and we release our code and datasets to the research community.
中文: PhishingHook框架通过分析智能合约的字节码和操作码,运用机器学习技术检测以太坊中的钓鱼活动,在真实恶意合约测试中16种方法的平均准确率约达90%。
English: PhishingHook is a machine learning framework that detects phishing activities in Ethereum smart contracts by analyzing bytecode and opcodes, achieving about 90% accuracy across 16 techniques tested on real-world malware contracts.
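The histogram-classifier family in the comparison can be sketched as follows: featurize each contract as a normalized opcode histogram and fit a standard classifier. The opcode vocabulary, model choice, and the `train_opcode_seqs`/`train_labels` variables are hypothetical stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative subset of the EVM opcode vocabulary.
VOCAB = ["PUSH1", "MSTORE", "CALL", "SSTORE", "JUMPI", "DELEGATECALL", "RETURN"]

def histogram(opcodes: list[str]) -> np.ndarray:
    h = np.array([opcodes.count(op) for op in VOCAB], dtype=float)
    return h / max(h.sum(), 1.0)       # normalized opcode frequencies

# train_opcode_seqs / train_labels: hypothetical disassembled contracts + labels.
X = np.stack([histogram(seq) for seq in train_opcode_seqs])
clf = RandomForestClassifier(n_estimators=200).fit(X, train_labels)
```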
Authors:Florian Grötschla, Ahmet Solak, Luca A. Lanzendörfer, Roger Wattenhofer
Abstract:
Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In this work, we generate 6k songs using 12 state-of-the-art models and conduct a survey of 15k pairwise audio comparisons with 2.5k human participants to evaluate the correlation between human preferences and widely used metrics. To the best of our knowledge, this work is the first to rank current state-of-the-art music generation models and metrics based on human preference. To further the field of subjective metric evaluation, we provide open access to our dataset of generated music and human evaluations.
中文: 本研究通过15,000次人工对比评估了12种主流音乐生成模型,首次基于人类偏好对模型与指标进行排序,并公开数据集以推动主观评价标准的发展。
English: This study evaluates 12 leading music generation models through 15,000 human comparisons, revealing discrepancies between subjective preferences and existing metrics while providing the first human-based ranking and an open dataset for further research.
Authors:Wenxu Qian, Chaoyue Wang, Hou Peng, Zhiyu Tan, Hao Li, Anxiang Zeng
Abstract:
Video generation techniques have achieved remarkable advancements in visual quality, yet faithfully reproducing real-world physics remains elusive. Preference-based model post-training may improve physical consistency, but requires costly human-annotated datasets or reward models that are not yet feasible. To address these challenges, we present Real Data Preference Optimisation (RDPO), an annotation-free framework that distills physical priors directly from real-world videos. Specifically, the proposed RDPO reverse-samples real video sequences with a pre-trained generator to automatically build preference pairs that are statistically distinguishable in terms of physical correctness. A multi-stage iterative training schedule then guides the generator to obey physical laws increasingly well. Benefiting from the dynamic information mined from real videos, our proposed RDPO significantly improves the action coherence and physical realism of the generated videos. Evaluations on multiple benchmarks, together with human studies, demonstrate that RDPO achieves improvements across multiple dimensions. The source code and demonstration of this paper are available at: https://wwenxu.github.io/RDPO/
Chinese: 本文提出真实数据偏好优化(RDPO),一种无需标注的框架,通过从真实视频中反向采样提取物理先验并进行多阶段迭代训练,显著提升了生成视频的动作连贯性和物理真实性,无需依赖昂贵的人工标注。
English: The paper introduces Real Data Preference Optimisation (RDPO), an annotation-free framework that enhances video generation by extracting physical priors from real videos through reverse-sampling and iterative training, significantly improving physical realism and coherence without costly human input.
Authors:Pieter van Goor, Robert Mahony, Manuel Schaller, Karl Worthmann
Abstract:
Koopman-based methods leverage a nonlinear lifting to enable linear regression techniques. Consequently, data generation, learning and prediction is performed through the lens of this lifting, giving rise to a nonlinear manifold that is invariant under the Koopman operator. In data-driven approximation such as Extended Dynamic Mode Decomposition, this invariance is typically lost due to the presence of (finite-data) approximation errors. In this work, we show that reprojections are crucial for reliable predictions. We provide an approach via closest-point projections that ensure consistency with this nonlinear manifold, which is strongly related to a Riemannian metric and maximum likelihood estimates. While these results are already novel for autonomous systems, we present our approach for parametric systems, providing the basis for data-driven bifurcation analysis and control applications.
中文: 库普曼方法通过非线性提升实现线性回归,本研究提出最近点投影法确保与不变流形的一致性,从而提供可靠预测,并将该方法推广至参数化系统,为数据驱动的分岔分析和控制应用奠定基础。
English: Koopman-based methods use nonlinear lifting for linear regression, and this study introduces closest-point projections to maintain consistency with the invariant manifold, enabling reliable predictions and extending the approach to parametric systems for data-driven analysis and control.
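The reprojection step can be written as a closest-point problem: propagate the lifted state with the learned Koopman matrix K, then project back onto the nonlinear manifold swept out by the dictionary of observables Φ. Notation here follows standard EDMD conventions and is assumed, not copied from the paper:

```latex
x_{k+1} = \arg\min_{x} \bigl\| K\,\Phi(x_k) - \Phi(x) \bigr\|^2, \qquad z_{k+1} = \Phi(x_{k+1})
```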
Authors:Shahab Rahimirad, Guven Gergerli, Lucia Romero, Angela Qian, Matthew Lyle Olson, Simon Stepputtis, Joseph Campbell
Abstract:
Social reasoning - inferring unobservable beliefs and intentions from partial observations of other agents - remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves competitive performance with much larger models in Agent-Agent play and, notably, is the first language agent to defeat human players in a controlled study - achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents, which can be found at https://camp-lab-purdue.github.io/bayesian-social-deduction/
Chinese Summary: 该研究提出了一种混合推理框架,将大型语言模型与结构化概率推理相结合,在社交推理游戏《阿瓦隆》中不仅取得了与更大模型相媲美的表现,还首次在受控研究中击败了人类玩家。
English Summary: The study introduces a hybrid reasoning framework that combines large language models with structured probabilistic inference, achieving competitive performance and even surpassing human players in the social deduction game Avalon.
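The externalized belief inference reduces to a Bayesian filter over role assignments; a minimal sketch, where `likelihood` is a stand-in for the paper's structured probabilistic model of votes and statements:

```python
def update_beliefs(posterior: dict, observation, likelihood) -> dict:
    """posterior: {role_assignment: prob}; likelihood(obs, assignment) -> float."""
    updated = {a: p * likelihood(observation, a) for a, p in posterior.items()}
    z = sum(updated.values())
    # Renormalize; if the observation is impossible under the model, keep the prior.
    return {a: p / z for a, p in updated.items()} if z > 0 else posterior
```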
Authors:Zhuo He, Shuang Li, Wenze Song, Longhui Yuan, Jian Liang, Han Li, Kun Gai
Abstract:
Endowing deep models with the ability to generalize in dynamic scenarios is of vital significance for real-world deployment, given the continuous and complex changes in data distribution. Recently, evolving domain generalization (EDG) has emerged to address distribution shifts over time, aiming to capture evolving patterns for improved model generalization. However, existing EDG methods may suffer from spurious correlations by modeling only the dependence between data and targets across domains, creating a shortcut between task-irrelevant factors and the target, which hinders generalization. To this end, we design a time-aware structural causal model (SCM) that incorporates dynamic causal factors and the causal mechanism drifts, and propose Static-DYNamic Causal Representation Learning (SYNC), an approach that effectively learns time-aware causal representations. Specifically, it integrates specially designed information-theoretic objectives into a sequential VAE framework which captures evolving patterns, and produces the desired representations by preserving intra-class compactness of causal factors both across and within domains. Moreover, we theoretically show that our method can yield the optimal causal predictor for each time domain. Results on both synthetic and real-world datasets exhibit that SYNC can achieve superior temporal generalization performance.
中文摘要:SYNC方法通过构建时间感知因果模型,结合信息论目标学习动态因果表征,有效解决数据分布随时间变化的问题,在合成和真实数据集上均展现出优越的时序泛化性能。
English Summary: The proposed SYNC method uses a time-aware causal model and information-theoretic objectives to learn evolving causal representations, effectively addressing distribution shifts over time and achieving superior temporal generalization performance.
Authors:Bohan Tang, Dezhao Luo, Jingxuan Chen, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao
Abstract:
The advent of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with closed LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs solves these limitations. However, current fine-tuning methods use a syntax learning paradigm that forces agents to reproduce exactly the ground truth action strings, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework, where the learning objective is capturing the semantics of the ground truth actions. Specifically, inspired by the programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. With this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic reward to train the App agents in generating actions aligned with the semantics of ground truth actions, even when the syntactic forms differ. To support the effectiveness of ASL, we theoretically demonstrate the superior robustness of ASL for the OOD problem compared with the existing syntax learning paradigm. Extensive experiments on offline and online smartphone App operation benchmarks show that ASL significantly improves the accuracy and generalisation of App agents over existing methods.
中文摘要:本文提出动作语义学习(ASL)框架,通过训练App智能体理解界面动作的语义而非复制精确语法,显著提升了智能手机应用操作的准确性和跨场景泛化能力。
English Summary: The paper introduces Action Semantics Learning (ASL), a framework that trains App agents to understand the semantic meaning of interface actions rather than replicating exact syntax, improving robustness and generalization across smartphone applications.
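The core of the semantic reward can be sketched in a few lines: two syntactically different action strings earn the same credit when they induce the same UI state transition. `env.simulate` is a placeholder for however the SEmantic Estimator (SEE) evaluates transitions.

```python
def semantic_reward(env, state, pred_action, gt_action):
    next_pred = env.simulate(state, pred_action)  # UI state after prediction
    next_gt = env.simulate(state, gt_action)      # UI state after ground truth
    # Same induced transition -> same reward, even if the strings differ.
    return 1.0 if next_pred == next_gt else 0.0
```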
Authors:Chuxue Cao, Mengze Li, Juntao Dai, Jinluan Yang, Zijian Zhao, Shengyu Zhang, Weijie Shi, Chengzhong Liu, Sirui Han, Yike Guo
Abstract:
Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B's low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs' mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.
中文: 大语言模型在复杂数学推理中因证明策略单一和错误传播而表现不佳,为此提出的DREAM自适应解决方案通过增强策略多样性和错误反馈机制,有效提升了推理性能。
English: Large language models face challenges in complex mathematical reasoning due to limited proof diversity and error propagation, prompting the development of DREAM, a self-adaptive solution that enhances strategic diversity and error correction to improve performance.
Authors:Jiashun Cheng, Aochuan Chen, Nuo Chen, Ziqi Gao, Yuhan Li, Jia Li, Fugee Tsung
Abstract:
Low-Rank Adaptation (LoRA) has emerged as a prominent technique for fine-tuning large foundation models. Despite its successes, the substantial parameter redundancy, which limits the capacity and efficiency of LoRA, has been recognized as a bottleneck. In this work, we systematically investigate the impact of redundancy in fine-tuning LoRA and reveal that reducing density redundancy does not degrade expressiveness. Based on this insight, we introduce Spectral-encoding Low-Rank Adaptation (SeLoRA), which harnesses the robust expressiveness of spectral bases to re-parameterize LoRA from a sparse spectral subspace. Designed with simplicity, SeLoRA enables seamless integration with various LoRA variants for performance boosting, serving as a scalable plug-and-play framework. Extensive experiments substantiate that SeLoRA achieves greater efficiency with fewer parameters, delivering superior performance enhancements over strong baselines on various downstream tasks, including commonsense reasoning, math reasoning, and code generation.
中文: SeLoRA是一种新颖的低秩自适应方法,通过使用谱基对LoRA进行重参数化来减少参数冗余,以更少的参数在多项任务中实现了更高效率和更优性能。
English: SeLoRA is a novel low-rank adaptation method that reduces parameter redundancy by re-parameterizing LoRA using spectral bases, achieving greater efficiency and superior performance across various tasks with fewer parameters.
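One plausible reading of "re-parameterizing LoRA from a sparse spectral subspace" is to freeze the down-projection as a sparsely sampled cosine (DCT-style) basis and train only the up-projection. The sketch below follows that reading; the paper's exact parameterization may differ, and all names here are illustrative.

```python
import math
import torch
import torch.nn as nn

class SpectralLoRALinear(nn.Module):
    """Sketch of a spectral-encoding low-rank adapter: the down-projection is a
    frozen, sparsely sampled DCT-II basis; only the up-projection is trained."""

    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base, self.scale = base, scale
        d_in, d_out = base.in_features, base.out_features
        # Pick `rank` random frequencies: a sparse subset of the spectral basis.
        freqs = torch.randperm(d_in)[:rank].float()
        n = torch.arange(d_in, dtype=torch.float32)
        basis = torch.cos(math.pi / d_in * (n + 0.5)[None, :] * freqs[:, None])
        self.register_buffer("A", basis / math.sqrt(d_in))  # (rank, d_in), frozen
        self.B = nn.Parameter(torch.zeros(d_out, rank))     # zero-init, trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T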
Authors:Siru Ouyang, Xinyu Zhu, Zilin Xiao, Minhao Jiang, Yu Meng, Jiawei Han
Abstract:
Reinforcement learning (RL) has become a powerful approach for improving the reasoning capabilities of large language models (LLMs), as evidenced by recent successes such as OpenAI's o1 and Deepseek-R1. However, applying RL at scale remains intimidatingly resource-intensive, requiring multiple model copies and extensive GPU workloads. On the other hand, recent studies suggest that RL, while powerful, does not fundamentally endow models with new knowledge; rather, it primarily reshapes the model's output distribution to activate reasoning capabilities latent in the base model. Building on this insight, we hypothesize that the changes in output probabilities induced by RL are largely model-size invariant, opening the door to a more efficient paradigm: training a small model with RL and transferring its induced probability shifts to larger base models. To verify this, we conduct a token-level analysis of decoding trajectories and find high alignment in RL-induced output distributions across model scales, validating our hypothesis. Motivated by this, we propose RAST, a simple yet effective method that transfers reasoning behaviors by injecting RL-induced probability adjustments from a small RL-trained model into larger models. Experiments across multiple mathematical reasoning benchmarks show that RAST substantially and consistently enhances the reasoning capabilities of base models while requiring significantly lower GPU memory than direct RL training, sometimes even yielding better performance than the RL-trained counterparts. Our findings offer new insights into the nature of RL-driven reasoning and practical strategies for scaling its benefits without incurring its full computational cost. The project page of RAST is available at https://ozyyshr.github.io/RAST/.
中文摘要:强化学习通过重塑输出分布而非增加新知识来提升大语言模型的推理能力,而RAST方法可将从小模型习得的概率调整迁移至大模型,从而以更低资源成本显著增强推理性能。
English Summary: Reinforcement learning enhances LLMs' reasoning by reshaping output distributions rather than adding new knowledge, and the proposed RAST method efficiently transfers these adjustments from small to large models to boost performance with lower resource costs.
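The transfer can be sketched as simple logit arithmetic at decoding time, in the spirit of proxy-tuning: measure the shift RL induces on a small model pair and add it to the large base model's logits. The exact combination rule used by RAST may differ; `alpha` is an assumed scaling knob.

```python
import torch

def rast_next_token_logits(logits_large_base: torch.Tensor,
                           logits_small_rl: torch.Tensor,
                           logits_small_base: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """Inject the RL-induced shift measured on a small model pair into the
    large base model's next-token logits (all tensors share the vocab dim)."""
    delta = logits_small_rl - logits_small_base  # RL-induced distribution shift
    return logits_large_base + alpha * delta     # reshaped large-model logits
```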
Authors:William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane
Abstract:
Existing work on mitigating catastrophic forgetting during large language models (LLMs) fine-tuning for new knowledge instances has primarily focused on preserving performance on previously seen data, while critically overlooking the collapse of essential capabilities instilled through alignment, most notably the model's ability to faithfully express epistemic uncertainty (a property we term 'Ignorance Awareness'). In this work, we formalize the notion of Ignorance Awareness and illustrate that conventional fine-tuning methods can result in substantial activation displacement. This displacement undermines the critical capability of ignorance awareness, leading to undesirable behaviors such as hallucinations. To address this challenge, we introduce SEAT, a simple and principled fine-tuning approach that not only enables the model to effectively acquire new knowledge instances but also preserves its aligned ignorance awareness. SEAT integrates two key components: (1) sparse tuning that constrains activation drift, and (2) a novel entity perturbation method designed to counter knowledge entanglement. Experimental results demonstrate that, across both real-world and synthetic datasets, SEAT significantly outperforms baselines in preserving ignorance awareness while retaining optimal fine-tuning performance, offering a more robust solution for LLM fine-tuning.
中文摘要:本文提出SEAT方法,通过稀疏调优和实体扰动技术,在让大语言模型学习新知识的同时保持其"无知感知"能力——即表达认知不确定性的关键属性,有效解决了传统微调导致的关键能力退化问题。
English Summary: This paper introduces SEAT, a novel fine-tuning method for large language models that preserves their ignorance awareness—the ability to express epistemic uncertainty—while effectively learning new knowledge, addressing the issue of catastrophic forgetting overlooked by conventional approaches.
Authors:Xinyang Li, Siqi Liu, Bochao Zou, Jiansheng Chen, Huimin Ma
Abstract:
As large language models evolve, there is growing anticipation that they will emulate human-like Theory of Mind (ToM) to assist with routine tasks. However, existing methods for evaluating machine ToM focus primarily on unimodal models and largely treat these models as black boxes, lacking an interpretative exploration of their internal mechanisms. In response, this study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of ToM in multimodal large language models (MLLMs). Specifically, we first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities. Furthermore, we present a lightweight, training-free approach that significantly enhances the model's exhibited ToM by adjusting activations along the direction identified by the attention heads.
中文: 本研究采用基于内部机制的可解释性方法,通过构建多模态测试数据集GridToM评估多模态大语言模型的心理理论能力,发现注意力头能区分不同视角的认知信息,且无需训练即可通过调整注意力方向显著提升模型表现。
English: This study introduces an interpretability-driven approach using a multimodal test dataset, GridToM, to evaluate Theory of Mind in multimodal large language models, revealing that attention heads can distinguish cognitive perspectives and be adjusted to enhance ToM capabilities without training.
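The training-free enhancement reads like standard activation steering: estimate a direction that separates the two belief conditions in a head's activation space, then nudge that head's output along it at inference. A hedged sketch, with `alpha` an assumed strength parameter:

```python
import torch

def belief_direction(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """Mean-difference direction separating two belief conditions;
    acts_*: (n_samples, head_dim) activations of one attention head."""
    d = acts_a.mean(dim=0) - acts_b.mean(dim=0)
    return d / d.norm()

def steer_head_output(head_out: torch.Tensor, direction: torch.Tensor,
                      alpha: float = 4.0) -> torch.Tensor:
    """Shift a head's output (batch, seq, head_dim) along the belief direction."""
    return head_out + alpha * direction  # broadcasts over batch and sequence
```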
Authors:Jiawei Chen, Zhengwei Fang, Xiao Yang, Chao Yu, Zhaoxia Yin, Hang Su
Abstract:
Ensuring the safety and alignment of Large Language Models is a significant challenge with their growing integration into critical applications and societal functions. While prior research has primarily focused on jailbreak attacks, less attention has been given to non-adversarial failures that subtly emerge during benign interactions. We introduce secondary risks, a novel class of failure modes marked by harmful or misleading behaviors during benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we introduce two risk primitives, verbose response and speculative advice, that capture the core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors by optimizing task relevance, risk activation, and linguistic plausibility. To support reproducible evaluation, we release SecRiskBench, a benchmark dataset of 650 prompts covering eight diverse real-world risk categories. Experimental results from extensive evaluations on 16 popular models demonstrate that secondary risks are widespread, transferable across models, and modality independent, emphasizing the urgent need for enhanced safety mechanisms to address benign yet harmful LLM behaviors in real-world deployments.
中文摘要:该研究提出"次级风险"作为良性交互中出现的新型非对抗性大模型失效模式,通过SecLens框架和SecRiskBench基准系统评估这类普遍存在且可迁移的安全漏洞。
English Summary: The study identifies "secondary risks" as a novel class of non-adversarial LLM failures that emerge during benign interactions, proposing the SecLens framework and SecRiskBench benchmark to systematically evaluate these widespread and transferable safety vulnerabilities.
Authors:Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, Liqiang Nie
Abstract:
Direct Preference Optimization (DPO) has emerged as an effective approach for mitigating hallucination in Multimodal Large Language Models (MLLMs). Although existing methods have achieved significant progress by utilizing vision-oriented contrastive objectives to enhance MLLMs' attention to visual inputs and hence reduce hallucination, they suffer from a non-rigorous optimization objective and indirect preference supervision. To address these limitations, we propose Symmetric Multimodal Preference Optimization (SymMPO), which conducts symmetric preference learning with direct preference supervision (i.e., response pairs) for visual understanding enhancement, while maintaining rigorous theoretical alignment with standard DPO. In addition to conventional ordinal preference learning, SymMPO introduces a preference margin consistency loss to quantitatively regulate the preference gap between symmetric preference pairs. Comprehensive evaluations across five benchmarks demonstrate SymMPO's superior performance, validating its effectiveness in hallucination mitigation for MLLMs.
中文: SymMPO提出了一种对称多模态偏好优化方法,通过直接监督和一致性损失来严格减少多模态大语言模型的幻觉,并在多个基准测试中展现出卓越性能。
English: SymMPO introduces a symmetric multimodal preference optimization method with direct supervision and a consistency loss to rigorously reduce hallucinations in MLLMs, demonstrating superior performance across multiple benchmarks.
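The abstract pins down two ingredients: symmetric DPO-style terms and a margin-consistency penalty between the symmetric pairs. A schematic torch sketch under those assumptions (the true objective and its inputs may be defined differently):

```python
import torch
import torch.nn.functional as F

def symmpo_loss(m_w1, m_l1, m_w2, m_l2, beta: float = 0.1, lam: float = 1.0):
    """Schematic SymMPO-style objective. Each m_* is the summed
    (policy - reference) log-prob of one response; pairs (1, 2) are symmetric."""
    margin1 = beta * (m_w1 - m_l1)            # preference margin, pair 1
    margin2 = beta * (m_w2 - m_l2)            # preference margin, pair 2
    dpo_terms = -(F.logsigmoid(margin1) + F.logsigmoid(margin2))
    consistency = (margin1 - margin2).pow(2)  # margin-consistency penalty
    return (dpo_terms + lam * consistency).mean()
```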
Authors:Boya Xiong, Shuo Wang, Weifeng Ge, Guanhua Chen, Yun Chen
Abstract:
Large language models (LLMs) achieve impressive performance on various knowledge-intensive and complex reasoning tasks in different domains. In certain scenarios like multi-tenant serving, a large number of LLMs finetuned from the same base model are deployed to meet complex requirements for users. Recent works explore delta-compression approaches to quantize and compress the delta parameters between the customized LLM and the corresponding base model. However, existing works either exhibit unsatisfactory performance at high compression ratios or depend on empirical bit allocation schemes. In this work, we propose ADAMIX, an effective adaptive mixed-precision delta-compression framework. We provide a mathematical derivation of quantization error to motivate our mixed-precision compression strategy and formulate the optimal mixed-precision bit allocation scheme as the solution to a 0/1 integer linear programming problem. Our derived bit allocation strategy minimizes the quantization error while adhering to a predefined compression ratio requirement. Experimental results on various models and benchmarks demonstrate that our approach surpasses the best baseline by a considerable margin. On tasks like AIME2024 and GQA, where the norm of $\Delta\mathbf{W}$ is large and the base model lacks sufficient ability, ADAMIX outperforms the best baseline Delta-CoMe by 22.3% and 6.1% with 7B models, respectively.
中文: ADAMIX是一种自适应混合精度增量压缩框架,通过求解0/1整数线性规划以最小化量化误差,在多个基准测试中大幅超越基线方法。
English: ADAMIX is an adaptive mixed-precision delta-compression framework that formulates optimal bit allocation as a 0/1 integer linear program to minimize quantization error, surpassing baseline methods by a considerable margin across benchmarks.
Authors:Boya Xiong, Shuo Wang, Weifeng Ge, Guanhua Chen, Yun Chen
Abstract:
Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, like multi-tenant serving, a large number of LLMs finetuned from the same base model are deployed to meet complex requirements for users. Recent works explore delta-compression approaches to quantize and compress the delta weights between the customized LLM and the corresponding base model. However, they exhibit inadequate performance at high compression ratios due to their empirical nature. In this work, we introduce DeltaMix, an adaptive mixed-precision delta-compression framework designed to minimize quantization error in the singular value decomposition (SVD) space without imposing additional assumptions. DeltaMix provides a theoretical justification for the necessity of mixed-precision compression and presents a practical quantization solution that involves solving a 0/1 linear integer programming problem alongside a reconstruction target correction method. Experimental results across multiple models and benchmarks illustrate that DeltaMix consistently outperforms all baseline methods. Notably, on tasks such as AIME2024 and GQA, DeltaMix exceeds the performance of the best baseline, Delta-CoMe, by 22.3% and 6.1% for 7B parameter models, respectively.
中文: DeltaMix是一种自适应混合精度增量压缩框架,通过在SVD空间最小化量化误差,在多个基准测试中始终优于基线方法。
English: DeltaMix is an adaptive mixed-precision delta-compression framework that minimizes quantization error in SVD space, consistently outperforming baseline methods across multiple benchmarks.
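Both versions of this work reduce bit allocation to a 0/1 integer linear program. In the special case where every weight group costs the same number of extra bits to upgrade, the ILP collapses to picking the groups with the largest error reduction, as in this simplified numpy sketch (the papers solve the general problem):

```python
import numpy as np

def allocate_bits(err_low: np.ndarray, err_high: np.ndarray,
                  n_high: int) -> np.ndarray:
    """0/1 allocation: each weight group is quantized at low or high precision,
    with at most `n_high` high-precision groups. Equal upgrade costs make the
    greedy choice exact; unequal costs need a real ILP/knapsack solver."""
    gain = err_low - err_high          # quantization error saved by upgrading
    choose_high = np.zeros(gain.size, dtype=bool)
    choose_high[np.argsort(-gain)[:n_high]] = True
    return choose_high
```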
Authors:Weichang Wu, Xiaolu Zhang, Jun Zhou, Yuchen Li, Wenwen Xia
Abstract:
User Behavior Sequence (UBS) modeling is crucial in industrial applications. As data scale and task diversity grow, UBS pretraining methods have become increasingly pivotal. State-of-the-art UBS pretraining methods rely on predicting behavior distributions. The key step in these methods is constructing a selected behavior vocabulary. However, this manual step is labor-intensive and prone to bias. The limitation of vocabulary capacity also directly affects models' generalization ability. In this paper, we introduce Bootstrapping Your Behavior, a novel UBS pretraining strategy that predicts an automatically constructed supervision embedding summarizing all behaviors' information within a future time window, eliminating the manual behavior vocabulary selection. In implementation, we incorporate a student-teacher encoder scheme to construct the pretraining supervision effectively. Experiments on two real-world industrial datasets and eight downstream tasks demonstrate that our method achieves an average improvement of 3.9% in AUC and 98.9% in training throughput. Notably, the model exhibits meaningful attention patterns and cluster representations during pretraining without any label supervision. In our online deployment over two months, the pretrained model improves the KS by about 2.7% and 7.1% over the baseline model for two financial overdue risk prediction tasks in the Alipay mobile application, which reduces bad debt risk by millions of dollars for Ant Group.
中文: 本文提出Bootstrapping Your Behavior这一新型用户行为序列预训练策略,通过自动构建监督嵌入替代人工词汇选择,在离线实验和支付宝金融风控场景中均取得显著性能提升。
English: The paper introduces Bootstrapping Your Behavior, a novel User Behavior Sequence pretraining strategy that automatically constructs supervision embeddings to eliminate manual vocabulary selection, achieving significant performance improvements in both offline experiments and real-world financial applications.
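The student-teacher scheme can be sketched in BYOL style: an EMA teacher embeds the future behavior window to produce the supervision target, and the student predicts it from the history. Details here (loss choice, momentum value) are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               tau: float = 0.996) -> None:
    """Teacher weights track the student as an exponential moving average."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

def byb_loss(student, teacher, history, future) -> torch.Tensor:
    """Student predicts, from the behavior history, the teacher's embedding of
    the future window; no behavior vocabulary is involved."""
    pred = student(history)              # (batch, dim)
    with torch.no_grad():
        target = teacher(future)         # (batch, dim) supervision embedding
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```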
Authors:Rui Zhang, Qi Meng, Han Wan, Yang Liu, Zhi-Ming Ma, Hao Sun
Abstract:
Computational fluid dynamics (CFD) drives progress in numerous scientific and engineering fields, yet high-fidelity simulations remain computationally prohibitive. While machine learning approaches offer computational acceleration, they typically specialize in single physical systems or require extensive training data, hindering their applicability in highly nonlinear and 3D flow scenarios. To overcome these limitations, we propose OmniFluids, a pure physics pre-trained model that captures fundamental fluid dynamics laws and adapts efficiently to diverse downstream tasks with minimal data. We develop a training framework combining physics-only pre-training, coarse-grid operator distillation, and few-shot fine-tuning. This enables OmniFluids to retain broad physics knowledge while delivering fast and accurate predictions. Architecturally, OmniFluids integrates a mixture of operators, a multi-frame decoder, and factorized Fourier layers, seamlessly incorporating physics-based supervision while allowing efficient and scalable modeling of diverse tasks. Extensive tests on a broad range of 2D and 3D benchmarks show that OmniFluids outperforms state-of-the-art AI-driven methods in flow field prediction and turbulence statistics. It delivers 10-100× speedups over traditional solvers while maintaining comparable accuracy and accurately identifies unknown physical parameters from sparse, noisy data. This work demonstrates the potential of training a unified CFD solver exclusively from physics knowledge, offering a new approach for efficient and generalizable modeling across complex fluid systems.
中文:OmniFluids是一种基于物理预训练的模型,能以少量数据高效适应多种流体动力学任务,在精度和速度上超越现有方法,并能从稀疏数据中快速预测和识别物理参数。
English: OmniFluids is a physics-pre-trained model that efficiently adapts to diverse fluid dynamics tasks with minimal data, outperforming existing methods in accuracy and speed while enabling rapid predictions and parameter identification from sparse data.
Authors:Dongwon Jung, Wenxuan Zhou, Muhao Chen
Abstract:
Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural-language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by curbing meaningless repetition and overthinking.
中文: 我们通过从代码执行中提取可验证的推理轨迹来生成高质量思维链监督数据,该方法能有效提升大语言模型在跨领域任务中的可迁移推理能力,同时通过更准确简洁的推理数据减少推理时的冗余输出。
English: Our method generates high-quality chain-of-thought supervision by extracting verifiable reasoning traces from code execution, effectively enhancing LLMs' transferable reasoning abilities across domains while reducing inference length through more accurate and concise reasoning data.
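A toy illustration of the core idea: because program execution is deterministic, each executed step can emit one verifiable natural-language reasoning step. The paper's pipeline extracts such traces automatically and at scale; this sketch only shows the shape of the transformation.

```python
def solve_with_trace() -> tuple[int, list[str]]:
    """Each program step appends one natural-language reasoning step,
    yielding a CoT that is correct by construction."""
    trace = []
    apples = 23
    trace.append(f"Start with {apples} apples.")
    apples -= 20
    trace.append(f"After using 20, {apples} remain.")
    apples += 6
    trace.append(f"Buying 6 more gives {apples}.")
    return apples, trace

answer, cot = solve_with_trace()
print(" ".join(cot) + f" The answer is {answer}.")
# -> Start with 23 apples. After using 20, 3 remain. Buying 6 more gives 9.
#    The answer is 9.
```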
Authors:Caijun Jia, Nan Xu, Jingxuan Wei, Qingli Wang, Lei Wang, Bihui Yu, Junnan Zhu
Abstract:
Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches transfer such visual reasoning tasks into textual reasoning tasks via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual detail. To bridge this gap, we propose ChartReasoner, a novel code-driven two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts code, preserving both layout and data semantics as losslessly as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained conversion model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset. Experimental results on four public benchmarks clearly demonstrate the effectiveness of the proposed ChartReasoner: it preserves the original details of the charts as much as possible and performs comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.
中文: ChartReasoner是一种创新的两阶段框架,通过将图表图像转换为结构化代码来保留视觉细节,并利用合成数据进行训练,以更少参数实现高性能,在跨域场景中接近GPT-4o的表现。
English: ChartReasoner is a novel two-stage framework that converts chart images into structured code to preserve visual details and uses synthesized data for training, achieving high performance with fewer parameters and approaching GPT-4o's capabilities in out-of-domain settings.
Authors:Zijie Wu, Chaohui Yu, Fan Wang, Xiang Bai
Abstract:
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released.
中文: AnimateAnyMesh提出了首个前馈式框架,通过创新的DyMeshVAE架构和Rectified Flow训练策略,能够根据文本输入高效生成任意三维网格的高质量动画,在数秒内实现语义准确且时序连贯的动态效果。
English: AnimateAnyMesh introduces the first feed-forward framework for efficiently animating any 3D mesh from text input, using a novel DyMeshVAE architecture and Rectified Flow training to generate high-quality, semantically accurate animations in seconds.
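Rectified-flow training in a learned latent space reduces to regressing a velocity field onto straight-line paths between noise and data. A minimal torch sketch; `model(x_t, t)` is a stand-in for the conditional denoiser operating on DyMeshVAE latents:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """One training step: sample noise x0 and time t, form the point on the
    straight path from x0 to data x1, and regress the predicted velocity
    onto the path's constant velocity (x1 - x0)."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1
    v_pred = model(xt, t.flatten())
    return F.mse_loss(v_pred, x1 - x0)
```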
Authors:Wei Zeng, Hengshu Zhu, Chuan Qin, Han Wu, Yihang Cheng, Sirui Zhang, Xiaowei Jin, Yinuo Shen, Zhenxing Wang, Feimin Zhong, Hui Xiong
Abstract:
The ongoing evolution of AI paradigms has propelled AI research into the agentic AI stage. Consequently, the focus of research has shifted from single agents and simple applications towards multi-agent autonomous decision-making and task collaboration in complex environments. As Large Language Models (LLMs) advance, their applications become more diverse and complex, leading to increasing situational and systemic risks. This has brought significant attention to value alignment for agentic AI systems, which aims to ensure that an agent's goals, preferences, and behaviors align with human values and societal norms. Addressing socio-governance demands through a Multi-level Value framework, this study comprehensively reviews value alignment in LLM-based multi-agent systems as the representative archetype of agentic AI systems. Our survey systematically examines three interconnected dimensions: First, value principles are structured via a top-down hierarchy across macro, meso, and micro levels. Second, application scenarios are categorized along a general-to-specific continuum explicitly mirroring these value tiers. Third, value alignment methods and evaluation are mapped to this tiered framework through systematic examination of benchmarking datasets and relevant methodologies. Additionally, we delve into value coordination among multiple agents within agentic AI systems. Finally, we propose several potential research directions in this field.
中文摘要:随着AI向智能体系统演进,研究重心已转向复杂环境中的多智能体协作,这要求通过多层次框架构建与人类价值观对齐的原则、应用场景及评估方法,确保系统目标符合社会规范。
English Summary: The evolution of AI into agentic systems has shifted focus to multi-agent collaboration in complex environments, necessitating value alignment with human principles through a multi-level framework that structures values, applications, and evaluation methods.
Authors:Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang
Abstract:
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap" -- a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we adopt a token-level MDP perspective of DAAs to analyze their limitations and introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one's length. When training with POET, where both responses in each sample are truncated to equal length (yielding diverse truncated lengths across samples), the optimization of the DAA objective is implicitly constrained to converge across all timesteps of the token-level MDP, thus paying more attention to prefix tokens than standard DAAs. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving gains of up to 15.6 points on AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
Chinese: 直接对齐算法(如DPO和SimPO)存在奖励与生成间的差距,归因于优化目标与生成性能不匹配,而提出的前缀导向等长训练(POET)方法通过截断响应至等长来解决此问题,在AlpacaEval 2等任务中性能提升高达15.6分。
English: Direct Alignment Algorithms (DAAs) like DPO and SimPO face a reward-generation gap due to misaligned optimization objectives and generation performance, which is addressed by the proposed Prefix-Oriented Equal-length Training (POET) method that truncates responses to equal lengths, improving performance in tasks like AlpacaEval 2 by up to 15.6 points.
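POET's core operation is a one-line data transform, sketched below: truncate both responses of each preference pair to the shorter one's length before computing the DAA loss.

```python
def poet_truncate(chosen_ids: list[int], rejected_ids: list[int]):
    """Truncate both responses in a preference pair to the shorter length,
    so optimization concentrates on prefix tokens."""
    n = min(len(chosen_ids), len(rejected_ids))
    return chosen_ids[:n], rejected_ids[:n]

# Lengths differ across samples, so truncated lengths remain diverse:
print(poet_truncate([5, 8, 2, 9, 1], [7, 3, 4]))  # ([5, 8, 2], [7, 3, 4])
```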
Authors:Yuxuan Kuang, Haoran Geng, Amine Elhafsi, Tan-Dzung Do, Pieter Abbeel, Jitendra Malik, Marco Pavone, Yue Wang
Abstract:
Humanoid robots hold significant potential in accomplishing daily tasks across diverse environments thanks to their flexibility and human-like morphology. Recent works have made significant progress in humanoid whole-body control and loco-manipulation leveraging optimal control or reinforcement learning. However, these methods require tedious task-specific tuning for each task to achieve satisfactory behaviors, limiting their versatility and scalability to diverse tasks in daily scenarios. To that end, we introduce SkillBlender, a novel hierarchical reinforcement learning framework for versatile humanoid loco-manipulation. SkillBlender first pretrains goal-conditioned task-agnostic primitive skills, and then dynamically blends these skills to accomplish complex loco-manipulation tasks with minimal task-specific reward engineering. We also introduce SkillBench, a parallel, cross-embodiment, and diverse simulated benchmark containing three embodiments, four primitive skills, and eight challenging loco-manipulation tasks, accompanied by a set of scientific evaluation metrics balancing accuracy and feasibility. Extensive simulated experiments show that our method significantly outperforms all baselines, while naturally regularizing behaviors to avoid reward hacking, resulting in more accurate and feasible movements for diverse loco-manipulation tasks in our daily scenarios. Our code and benchmark will be open-sourced to the community to facilitate future research. Project page: https://usc-gvl.github.io/SkillBlender-web/.
中文摘要:SkillBlender是一种分层强化学习框架,通过动态融合预训练的基础技能实现多功能人形机器人移动操作,在减少任务特定调整的同时显著优于现有方法,产生更自然可行的动作。
English Summary: SkillBlender is a hierarchical reinforcement learning framework that blends pretrained primitive skills to enable versatile humanoid loco-manipulation with minimal task-specific tuning, outperforming existing methods while producing more natural and feasible movements.
Authors:Maoyu Wang, Yao Lu, Jiaqi Nie, Zeyu Wang, Yun Lin, Qi Xuan, Guan Gui
Abstract:
With the rapid development of deep learning, a growing number of pre-trained models have become publicly available. However, deploying these fixed models in real-world IoT applications is challenging because different devices possess heterogeneous computational and memory resources, making it impossible to deploy a single model across all platforms. Although traditional compression methods, such as pruning, quantization, and knowledge distillation, can improve efficiency, they become inflexible once applied and cannot adapt to changing resource constraints. To address these issues, we propose ReStNet, a Reusable and Stitchable Network that dynamically constructs a hybrid network by stitching two pre-trained models together. Implementing ReStNet requires addressing several key challenges, including how to select the optimal stitching points, determine the stitching order of the two pre-trained models, and choose an effective fine-tuning strategy. To systematically address these challenges and adapt to varying resource constraints, ReStNet determines the stitching point by calculating layer-wise similarity via Centered Kernel Alignment (CKA). It then constructs the hybrid model by retaining early layers from a larger-capacity model and appending deeper layers from a smaller one. To facilitate efficient deployment, only the stitching layer is fine-tuned. This design enables rapid adaptation to changing budgets while fully leveraging available resources. Moreover, ReStNet supports both homogeneous (CNN-CNN, Transformer-Transformer) and heterogeneous (CNN-Transformer) stitching, allowing different model families to be combined flexibly. Extensive experiments on multiple benchmarks demonstrate that ReStNet achieves flexible accuracy-efficiency trade-offs at runtime while significantly reducing training cost.
中文: ReStNet提出了一种可重用和可拼接的网络,通过动态组合预训练模型来适应物联网中不同的计算资源,以最小的训练成本实现灵活的效率与精度权衡。
English: ReStNet introduces a reusable and stitchable network that dynamically combines pre-trained models to adapt to varying computational resources in IoT applications, achieving flexible efficiency-accuracy trade-offs with minimal training cost.
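Linear CKA, which ReStNet uses to score candidate stitching points, is a few lines of numpy: run the same batch through both pre-trained models, collect per-layer activations, and pick the layer pair with the highest alignment. Only the stitching layer is then fine-tuned.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between activations X (n, d1) and
    Y (n, d2) computed on the same n inputs."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```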
Authors:Rui Zhao, Xingjian Zhang, Yuhong Cao, Yizhuo Wang, Guillaume Sartoretti
Abstract:
In this work, we propose an attention-based deep reinforcement learning approach to address the adaptive informative path planning (IPP) problem in 3D space, where an aerial robot equipped with a downward-facing sensor must dynamically adjust its 3D position to balance sensing footprint and accuracy, and finally obtain a high-quality belief of an underlying field of interest over a given domain (e.g., presence of specific plants, hazardous gas, geological structures, etc.). In adaptive IPP tasks, the agent is tasked with maximizing information collected under time/distance constraints, continuously adapting its path based on newly acquired sensor data. To this end, we leverage attention mechanisms for their strong ability to capture global spatial dependencies across large action spaces, allowing the agent to learn an implicit estimation of environmental transitions. Our model builds a contextual belief representation over the entire domain, guiding sequential movement decisions that optimize both short- and long-term search objectives. Comparative evaluations against state-of-the-art planners demonstrate that our approach significantly reduces environmental uncertainty within constrained budgets, thus allowing the agent to effectively balance exploration and exploitation. We further show our model generalizes well to environments of varying sizes, highlighting its potential for many real-world applications.
中文: 本研究提出了一种基于注意力的深度强化学习方法,用于三维空间中的自适应信息路径规划,使空中机器人能在约束条件下有效平衡感知范围与精度,同时显著降低环境不确定性。
English: This study introduces an attention-based deep reinforcement learning method for adaptive informative path planning in 3D space, enabling aerial robots to efficiently balance sensing coverage and accuracy while reducing environmental uncertainty under constraints.
Authors:Ming-Feng Li, Xin Yang, Fu-En Wang, Hritam Basak, Yuyin Sun, Shreekant Gayaka, Min Sun, Cheng-Hao Kuo
Abstract:
6D object pose estimation has shown strong generalizability to novel objects. However, existing methods often require either a complete, well-reconstructed 3D model or numerous reference images that fully cover the object. Estimating 6D poses from partial references, which capture only fragments of an object's appearance and geometry, remains challenging. To address this, we propose UA-Pose, an uncertainty-aware approach for 6D object pose estimation and online object completion specifically designed for partial references. We assume access to either (1) a limited set of RGBD images with known poses or (2) a single 2D image. For the first case, we initialize a partial object 3D model based on the provided images and poses, while for the second, we use image-to-3D techniques to generate an initial object 3D model. Our method integrates uncertainty into the incomplete 3D model, distinguishing between seen and unseen regions. This uncertainty enables confidence assessment in pose estimation and guides an uncertainty-aware sampling strategy for online object completion, enhancing robustness in pose estimation accuracy and improving object completeness. We evaluate our method on the YCB-Video, YCBInEOAT, and HO3D datasets, including RGBD sequences of YCB objects manipulated by robots and human hands. Experimental results demonstrate significant performance improvements over existing methods, particularly when object observations are incomplete or partially captured. Project page: https://minfenli.github.io/UA-Pose/
中文: UA-Pose提出了一种基于不确定性感知的6D物体姿态估计和在线补全方法,专门针对部分参考数据设计,在物体观测不完整时显著提升了性能表现。
English: UA-Pose introduces an uncertainty-aware method for 6D object pose estimation and online completion using partial references, significantly enhancing accuracy and robustness even with incomplete object data.
Authors:Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, Hai-Bao Chen
Abstract:
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. However, rapid T2I model advancements reveal limitations in early benchmarks, which lack comprehensive evaluation of, for example, reasoning, text rendering, and style. Notably, recent state-of-the-art models, with their rich knowledge modeling capabilities, show promising results on image generation problems requiring strong reasoning ability, yet existing evaluation systems have not adequately addressed this frontier. To systematically address these gaps, we introduce OneIG-Bench, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including prompt-image alignment, text rendering precision, reasoning-generated content, stylization, and diversity. By structuring the evaluation, this benchmark enables in-depth analysis of model performance, helping researchers and practitioners pinpoint strengths and bottlenecks in the full pipeline of image generation. Specifically, OneIG-Bench enables flexible evaluation by allowing users to focus on a particular evaluation subset. Instead of generating images for the entire set of prompts, users can generate images only for the prompts associated with the selected dimension and complete the corresponding evaluation accordingly. Our codebase and dataset are now publicly available to facilitate reproducible evaluation studies and cross-model comparisons within the T2I research community.
Chinese: OneIG-Bench作为一个全面的基准框架被提出,用于在多维度上对文本到图像模型进行细粒度评估,解决了现有基准的不足,并支持灵活、可复现的模型性能分析。
English: OneIG-Bench is introduced as a comprehensive benchmark framework for fine-grained evaluation of text-to-image models across multiple dimensions, addressing limitations in existing benchmarks and enabling flexible, reproducible analysis of model performance.
Authors:Shamminuj Aktar, Andreas Bärtschi, Abdel-Hameed A. Badawy, Stephan Eidenbenz
Abstract:
Quantum machine learning is a promising direction for building more efficient and expressive models, particularly in domains where understanding complex, structured data is critical. We present the Quantum Graph Transformer (QGT), a hybrid graph-based architecture that integrates a quantum self-attention mechanism into the message-passing framework for structured language modeling. The attention mechanism is implemented using parameterized quantum circuits (PQCs), which enable the model to capture rich contextual relationships while significantly reducing the number of trainable parameters compared to classical attention mechanisms. We evaluate QGT on five sentiment classification benchmarks. Experimental results show that QGT consistently achieves higher or comparable accuracy than existing quantum natural language processing (QNLP) models, including both attention-based and non-attention-based approaches. When compared with an equivalent classical graph transformer, QGT yields an average accuracy improvement of 5.42% on real-world datasets and 4.76% on synthetic datasets. Additionally, QGT demonstrates improved sample efficiency, requiring nearly 50% fewer labeled samples to reach comparable performance on the Yelp dataset. These results highlight the potential of graph-based QNLP techniques for advancing efficient and scalable language understanding.
中文摘要:量子图变换器(QGT)通过量子自注意力机制在结构化语言建模中实现了更高的分类准确率和样本效率,相比经典模型在真实数据集上平均提升5.42%的准确率,并减少近50%的标注样本需求。
English Summary: The Quantum Graph Transformer (QGT) integrates quantum self-attention with graph structures to outperform classical and quantum models in sentiment classification, achieving higher accuracy with fewer parameters and improved sample efficiency.
Authors:Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu
Abstract:
The advancement in large language models (LLMs) and large vision models has fueled the rapid progress in multi-modal vision-text reasoning capabilities. However, existing vision-language models (VLMs) perform poorly on compositional reasoning. This paper presents VLAgent, a vision-language agent system for vision-text compositional reasoning with three novel features. First, VLAgent leverages a pre-trained LLM with few-shot in-context learning to generate the planning script for each compositional reasoning task, and provides a back-end engine that maps the planning script into executable code using the VLAgent library and runs it with the VLAgent executor. Second, VLAgent introduces the SS-parser, which identifies and corrects logic errors embedded in the LLM-generated planning script, to further enhance the quality of script-to-executable mapping. Third, VLAgent introduces the compositional reasoning output verifier, which validates and refines the output of complex compositional reasoning steps by leveraging complementary reasoning techniques, e.g., ensemble learning and caption analysis. Extensive experiments are conducted on six visual benchmarks, comparing VLAgent to a dozen SoTA visual reasoning models. The results show that VLAgent outperforms existing representative approaches for compositional text-visual reasoning. Our code and datasets with outputs will be made available upon acceptance.
中文总结:VLAgent采用神经符号方法,通过两阶段推理系统、语法语义解析器和执行验证器提升组合式视觉推理能力,在多项基准测试中表现优于现有先进模型。
English Summary: VLAgent is a neuro-symbolic system that enhances compositional visual reasoning through a two-stage reasoning process with syntax-semantic parsing and execution verification, demonstrating superior performance across multiple benchmarks.
Authors:Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu
Abstract:
The advancement in large language models (LLMs) and large vision models has fueled the rapid progress in multi-modal vision-language reasoning capabilities. However, existing vision-language models (VLMs) remain challenged by compositional visual reasoning. This paper presents VLAgent, a neuro-symbolic approach to developing a Vision-Language Agent system for efficient compositional visual reasoning with three novel features. First, VLAgent develops an interpretable, visualization-enhanced, two-stage neuro-symbolic reasoning system. The first stage is managed by a front-end engine that generates a structured visual reasoning plan (symbolic program script) for each compositional visual reasoning task by utilizing a pre-trained LLM powered with few-shot chain-of-thought in-context learning. The second stage is managed by a high-performance back-end engine. It transforms the planning script into executable code based on the visual input (image or video) and a combination of neural models and symbolic functions, and then performs a sequence of actions for the compositional visual reasoning task. Second, to ensure and enhance the quality of mapping the logic plan to a sequence of executable instructions, VLAgent introduces the SS-parser, which examines the syntactic and semantic correctness of the planning script and detects and repairs the logic errors found in the LLM-generated logic plan before generating the executable program. Third, VLAgent introduces an execution verifier at critical reasoning steps to validate and refine its compositional reasoning results in a stepwise manner, for example, using ensemble methods for critical visual reasoning and caption analysis for low-confidence compositional reasoning. Extensive experiments on six visual benchmarks against a dozen SoTA visual reasoning models show that VLAgent outperforms existing representative approaches to compositional visual reasoning.
中文总结:VLAgent采用神经符号方法,通过两阶段推理系统、语法语义解析器和执行验证器提升组合式视觉推理能力,在多项基准测试中表现优于现有先进模型。
English Summary: VLAgent is a neuro-symbolic system that enhances compositional visual reasoning through a two-stage reasoning process with syntax-semantic parsing and execution verification, demonstrating superior performance across multiple benchmarks.
Authors:Weijie Shi, Han Zhu, Jiaming Ji, Mengze Li, Jipeng Zhang, Ruiyuan Zhang, Jia Zhu, Jiajie Xu, Sirui Han, Yike Guo
Abstract:
Legal judgment prediction (LJP) aims to function as a judge by making final rulings based on case claims and facts, which plays a vital role in the judicial domain for supporting court decision-making and improving judicial efficiency. However, existing methods often struggle with logical errors when conducting complex legal reasoning. We propose LegalReasoner, which enhances LJP reliability through step-wise verification and correction of the reasoning process. Specifically, it first identifies dispute points to decompose complex cases, and then conducts step-wise reasoning while employing a process verifier to validate each step's logic from correctness, progressiveness, and potential perspectives. When errors are detected, expert-designed attribution and resolution strategies are applied for correction. To fine-tune LegalReasoner, we release the LegalHK dataset, containing 58,130 Hong Kong court cases with detailed annotations of dispute points, step-by-step reasoning chains, and process verification labels. Experiments demonstrate that LegalReasoner significantly improves concordance with court decisions from 72.37 to 80.27 on LLAMA-3.1-70B. The data is available at https://huggingface.co/datasets/weijiezz/LegalHK.
中文: LegalReasoner通过分解案件争议点并逐步验证和修正推理过程,提升了法律判决预测的可靠性,基于LegalHK数据集的实验显示其显著提高了与法庭裁决的一致性。
English: LegalReasoner enhances legal judgment prediction by decomposing cases into dispute points and validating each reasoning step through verification and correction, significantly improving concordance with court decisions as demonstrated on the LegalHK dataset.
Authors:Ruhan Wang, Zhiyong Wang, Chengkai Huang, Rui Wang, Tong Yu, Lina Yao, John C. S. Lui, Dongruo Zhou
Abstract:
For question-answering (QA) tasks, in-context learning (ICL) enables language models to generate responses without modifying their parameters by leveraging examples provided in the input. However, the effectiveness of ICL heavily depends on the availability of high-quality examples, which are often scarce due to data privacy constraints, annotation costs, and distribution disparities. A natural solution is to utilize examples stored on client devices, but existing approaches either require transmitting model parameters - incurring significant communication overhead - or fail to fully exploit local datasets, limiting their effectiveness. To address these challenges, we propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process. Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters. We establish theoretical guarantees for the convergence of Fed-ICL and conduct extensive experiments on standard QA benchmarks, demonstrating that our proposed approach achieves strong performance while maintaining low communication costs.
Chinese: Fed-ICL 是一种联邦学习框架,通过客户端与服务器的多轮交互优化问答任务中的上下文学习效果,无需传输模型参数即可提升回答质量,在保证低通信成本的同时实现了优异性能。
English: Fed-ICL is a federated framework that improves in-context learning for question-answering by enabling iterative client-server collaboration to refine responses without transmitting model parameters, achieving strong performance with low communication costs.
Authors:Rihui Jin, Zheyu Xin, Xing Xie, Zuoyi Li, Guilin Qi, Yongrui Chen, Xinbang Dai, Tongtong Wu, Gholamreza Haffari
Abstract:
Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerability to heterogeneity in table layouts, and (ii) inconsistency in reasoning due to limited code generation capability. We propose Table-r1, a two-stage P-TR method designed for SLMs. Stage 1 introduces an innovative self-supervised learning task, Layout Transformation Inference, to improve tabular layout generalization from a programmatic view. Stage 2 adopts a mix-paradigm variant of Group Relative Policy Optimization, enhancing P-TR consistency while allowing dynamic fallback to T-TR when needed. Experiments on four TR benchmarks demonstrate that Table-r1 outperforms all SLM-based methods, achieving at least a 15% accuracy improvement over the base model (LLaMA-8B) across all datasets and reaching performance competitive with LLMs.
中文摘要:Table-r1是一种新颖的两阶段程序化推理方法,通过提升表格布局泛化能力和推理一致性来增强小语言模型在表格推理任务中的表现,实现了与大型模型相媲美的准确率。
English Summary: Table-r1 is a novel two-stage program-based reasoning method designed to enhance small language models' performance on table reasoning tasks by improving layout generalization and reasoning consistency, achieving competitive accuracy with large models.
Authors:Yuxuan Yue, Zukang Xu, Zhihang Yuan, Dawei Yang, Jianlong Wu, Liqiang Nie
Abstract:
Large Language Models (LLMs) face significant challenges in edge deployment due to their massive parameter scale. Vector Quantization (VQ), a clustering-based quantization method, serves as a prevalent solution to this issue for its extremely low-bit (even 2-bit) quantization and considerable accuracy. Since a vector is a quantity in mathematics and physics that has both direction and magnitude, existing VQ works typically quantize the two in a coupled manner. However, we find that direction exhibits significantly greater sensitivity to quantization than magnitude. For instance, when separately clustering the directions and magnitudes of weight vectors in LLaMA-2-7B, the accuracy drops on zero-shot tasks are 46.5% and 2.3%, respectively. This gap even increases as the number of clustering centers is reduced. Further, Euclidean distance, a common metric to assess vector similarities in current VQ works, places greater emphasis on reducing the magnitude error. This property is contrary to the above finding, unavoidably leading to larger quantization errors. To these ends, this paper proposes Polar Coordinate Decoupled Vector Quantization (PCDVQ), an effective and efficient VQ framework consisting of two key modules: 1) Polar Coordinate Decoupling (PCD), which transforms vectors into their polar coordinate representations and performs independent quantization of the direction and magnitude parameters; and 2) Distribution-Aligned Codebook Construction (DACC), which optimizes the direction and magnitude codebooks in accordance with the source distribution. Experimental results show that PCDVQ outperforms baseline methods at the 2-bit level by at least 1.5% zero-shot accuracy, establishing a novel paradigm for accurate and highly compressed LLMs.
Chinese: 本文提出极坐标解耦向量量化(PCDVQ)框架,通过在极坐标系中独立量化向量的方向和幅度参数,显著提升了大型语言模型在2位量化下的精度,零样本准确率比基线方法至少高出1.5%。
English: This paper introduces Polar Coordinate Decoupled Vector Quantization (PCDVQ), a framework that independently quantizes vector directions and magnitudes in polar coordinates to enhance accuracy in 2-bit quantization of Large Language Models, achieving at least 1.5% higher zero-shot accuracy than baseline methods.
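The decoupling step itself is elementary, as the sketch below shows: split each weight vector into a unit direction and a scalar magnitude, then quantize the two with separate codebooks. The magnitude codebook here is a simple quantile grid standing in for the paper's distribution-aligned construction; the direction codebook (clustering on the unit sphere) is omitted.

```python
import numpy as np

def polar_decouple(W: np.ndarray):
    """Split weight vectors (n, d) into unit directions and magnitudes."""
    mag = np.linalg.norm(W, axis=1)
    direction = W / np.maximum(mag, 1e-12)[:, None]
    return direction, mag

def quantize_magnitudes(mag: np.ndarray, n_centers: int = 16):
    """Quantile-based 1-D codebook: a stand-in for the distribution-aligned
    codebook construction. Returns quantized magnitudes and the codebook."""
    codebook = np.quantile(mag, (np.arange(n_centers) + 0.5) / n_centers)
    idx = np.abs(mag[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx], codebook
```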
Authors:Gabriele Magrini, Niccolò Marini, Federico Becattini, Lorenzo Berlincioni, Niccolò Biondi, Pietro Pala, Alberto Del Bimbo
Abstract:
Small, fast, and lightweight drones present significant challenges for traditional RGB cameras due to their limitations in capturing fast-moving objects, especially under challenging lighting conditions. Event cameras offer an ideal solution, providing high temporal resolution and dynamic range, yet existing benchmarks often lack fine temporal resolution or drone-specific motion patterns, hindering progress in these areas. This paper introduces the Florence RGB-Event Drone dataset (FRED), a novel multimodal dataset specifically designed for drone detection, tracking, and trajectory forecasting, combining RGB video and event streams. FRED features more than 7 hours of densely annotated drone trajectories, using 5 different drone models and including challenging scenarios such as rain and adverse lighting conditions. We provide detailed evaluation protocols and standard metrics for each task, facilitating reproducible benchmarking. We hope FRED will advance research in high-speed drone perception and multimodal spatiotemporal understanding.
Chinese: 佛罗伦萨RGB-事件无人机数据集(FRED)通过提供超过7小时结合RGB和事件流的多模态数据,专门用于复杂条件下的无人机检测、跟踪与轨迹预测,弥补了现有基准数据在时间分辨率和无人机运动模式方面的不足。
English: The Florence RGB-Event Drone dataset (FRED) addresses the limitations of existing benchmarks by providing over 7 hours of multimodal data combining RGB and event streams, specifically designed for drone detection, tracking, and trajectory forecasting under challenging conditions.
Authors:Yutao Hou, Zeguan Xiao, Fei Yu, Yihan Jiang, Xuetao Wei, Hailiang Huang, Yun Chen, Guanhua Chen
Abstract:
Large language models (LLMs) have achieved distinguished performance on various reasoning-intensive tasks. However, LLMs still face robustness challenges and may fail unexpectedly on simple reasoning tasks. Previous works evaluate LLM robustness with hand-crafted templates or a limited set of perturbation rules, which risks data contamination from pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that maintain the semantic meaning of the original while potentially causing the LLMs to fail. The AR-Checker framework generates mathematical problem variants through multi-round parallel streams of LLM-based rewriting and verification. Our framework can generate benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also achieves strong performance, further proving the effectiveness of AR-Checker.
中文总结:AR-Checker框架通过动态生成保持语义的数学问题变体,在多领域推理任务中有效检测并揭示了大语言模型的鲁棒性缺陷。
English Summary: The AR-Checker framework dynamically generates semantic-preserving mathematical problem variants to effectively test and reveal robustness limitations in large language models across multiple reasoning tasks.
Authors:Gabriele Magrini, Federico Becattini, Luca Cultrera, Lorenzo Berlincioni, Pietro Pala, Alberto Del Bimbo
Abstract:
Event cameras offer significant advantages over traditional frame-based sensors, including higher temporal resolution, lower latency, and higher dynamic range. However, efficiently converting event streams into formats compatible with standard computer vision pipelines remains a challenging problem, particularly in the presence of noise. In this paper, we propose Spike-TBR, a novel event-based encoding strategy based on Temporal Binary Representation (TBR), addressing its vulnerability to noise by integrating spiking neurons. Spike-TBR combines the frame-based advantages of TBR with the noise-filtering capabilities of spiking neural networks, creating a more robust representation of event streams. We evaluate four variants of Spike-TBR, each using different spiking neurons, across multiple datasets, demonstrating superior performance in noise-affected scenarios while also improving results on clean data. Our method bridges the gap between spike-based and frame-based processing, offering a simple, noise-resilient solution for event-driven vision applications.
中文:Spike-TBR是一种新颖的事件驱动编码策略,通过将脉冲神经元与时间二进制表示相结合,有效提升了事件流处理在计算机视觉应用中的抗噪能力和鲁棒性。
English: Spike-TBR is a novel event-based encoding strategy that integrates spiking neurons with Temporal Binary Representation to enhance noise resilience and robustness in event stream processing for computer vision applications.
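A hedged numpy sketch of the two ingredients: a leaky integrate-and-fire stage that only passes temporally consistent events, followed by the standard TBR packing of binary frames into multi-bit pixels. Threshold and decay values are illustrative, not the paper's settings.

```python
import numpy as np

def lif_filter(event_frames: np.ndarray, threshold: float = 2.0,
               decay: float = 0.8) -> np.ndarray:
    """Per-pixel leaky integrate-and-fire: a pixel spikes only after enough
    temporally close events accumulate, suppressing isolated noise.
    event_frames: (T, H, W) event counts per time bin."""
    v = np.zeros(event_frames.shape[1:], dtype=float)
    spikes = np.zeros(event_frames.shape, dtype=float)
    for t, frame in enumerate(event_frames):
        v = decay * v + frame       # leaky integration of incoming events
        fired = v >= threshold
        spikes[t][fired] = 1.0
        v[fired] = 0.0              # reset membrane potential after a spike
    return spikes

def temporal_binary_representation(binary_frames: np.ndarray) -> np.ndarray:
    """Pack T binary frames into a single frame of T-bit pixel values (TBR)."""
    weights = 2 ** np.arange(binary_frames.shape[0])[:, None, None]
    return (binary_frames * weights).sum(axis=0)
```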
Authors:Yuzhen Ding, Jason Holmes, Hongying Feng, Martin Bues, Lisa A. McGee, Jean-Claude M. Rwigema, Nathan Y. Yu, Terence S. Sio, Sameer R. Keole, William W. Wong, Steven E. Schild, Jonathan B. Ashman, Sujay A. Vora, Daniel J. Ma, Samir H. Patel, Wei Liu
Abstract:
Purpose: Intensity-modulated proton therapy (IMPT) offers precise tumor coverage while sparing organs at risk (OARs) in head and neck (H&N) cancer. However, its sensitivity to anatomical changes requires frequent adaptation through online adaptive radiation therapy (oART), which depends on fast, accurate dose calculation via Monte Carlo (MC) simulations. Reducing particle count accelerates MC but degrades accuracy. To address this, denoising low-statistics MC dose maps is proposed to enable fast, high-quality dose generation.
Methods: We developed a diffusion transformer-based denoising framework. IMPT plans and 3D CT images from 80 H&N patients were used to generate noisy and high-statistics dose maps using MCsquare (1 min and 10 min per plan, respectively). Data were standardized into uniform chunks with zero-padding, normalized, and transformed into quasi-Gaussian distributions. Testing was done on 10 H&N, 10 lung, 10 breast, and 10 prostate cancer cases, preprocessed identically. The model was trained with noisy dose maps and CT images as input and high-statistics dose maps as ground truth, using a combined loss of mean square error (MSE), residual loss, and regional MAE (focusing on top/bottom 10% dose voxels). Performance was assessed via MAE, 3D Gamma passing rate, and DVH indices.
Results: The model achieved MAEs of 0.195 (H&N), 0.120 (lung), 0.172 (breast), and 0.376 Gy[RBE] (prostate). 3D Gamma passing rates exceeded 92% (3%/2mm) across all sites. DVH indices for clinical target volumes (CTVs) and OARs closely matched the ground truth.
Conclusion: A diffusion transformer-based denoising framework was developed and, though trained only on H&N data, generalizes well across multiple disease sites.
中文: 基于扩散变换器的去噪框架显著提升了低统计量蒙特卡罗剂量图的质量,尽管仅使用头颈部数据训练,该模型在多种癌症部位均能实现高精度剂量计算,适用于在线自适应质子治疗。
English: A diffusion transformer-based denoising framework effectively enhances low-statistics Monte Carlo dose maps for online adaptive proton therapy, achieving high accuracy across multiple cancer sites despite being trained solely on head and neck data.
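A sketch of the combined training loss described in the Methods, with assumed weights and the residual term interpreted as an L1 penalty; the paper's exact formulation and weighting may differ:

```python
import torch

def combined_dose_loss(pred, target, w_res=0.5, w_reg=0.5, frac=0.10):
    """MSE + residual (L1) loss + regional MAE on the top/bottom 10% dose voxels.
    w_res, w_reg, and the L1 reading of 'residual loss' are assumptions."""
    mse = torch.mean((pred - target) ** 2)
    residual = torch.mean(torch.abs(pred - target))
    k = max(1, int(frac * target.numel()))
    flat_t, flat_p = target.flatten(), pred.flatten()
    top = torch.topk(flat_t, k).indices           # highest-dose voxels
    bot = torch.topk(-flat_t, k).indices          # lowest-dose voxels
    regional = 0.5 * (torch.mean(torch.abs(flat_p[top] - flat_t[top]))
                      + torch.mean(torch.abs(flat_p[bot] - flat_t[bot])))
    return mse + w_res * residual + w_reg * regional

pred, target = torch.rand(2, 64, 64, 64), torch.rand(2, 64, 64, 64)
print(combined_dose_loss(pred, target))
```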
Authors:Yuxin Zhang, Yan Wang, Yongrui Chen, Shenyu Zhang, Xinbang Dai, Sheng Bi, Guilin Qi
Abstract:
Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external retrieved information, mitigating issues such as hallucination and outdated knowledge. However, RAG systems are highly sensitive to retrieval noise prevalent in real-world scenarios. Existing benchmarks fail to emulate the complex and heterogeneous noise distributions encountered in real-world retrieval environments, undermining reliable robustness assessment. In this paper, we define four categories of retrieval noise based on linguistic properties and noise characteristics, aiming to reflect the heterogeneity of noise in real-world scenarios. Building on this, we introduce Magic Mushroom, a benchmark for replicating "magic mushroom" noise: contexts that appear relevant on the surface but covertly mislead RAG systems. Magic Mushroom comprises 7,468 single-hop and 3,925 multi-hop question-answer pairs. More importantly, Magic Mushroom enables researchers to flexibly configure combinations of retrieval noise according to specific research objectives or application scenarios, allowing for highly controlled evaluation setups. We evaluate LLM generators of varying parameter scales and classic RAG denoising strategies under diverse noise distributions to investigate their performance dynamics during progressive noise encroachment. Our analysis reveals that both generators and denoising strategies have significant room for improvement and exhibit extreme sensitivity to noise distributions. Magic Mushroom emerges as a promising tool for evaluating and advancing noise-robust RAG systems, accelerating their widespread deployment in real-world applications. The Magic Mushroom benchmark is available at https://drive.google.com/file/d/1aP5kyPuk4L-L_uoI6T9UhxuTyt8oMqjT/view?usp=sharing.
中文: 检索增强生成系统通过整合外部信息提升大语言模型性能,但易受复杂检索噪声影响,因此推出Magic Mushroom基准来模拟现实噪声并评估系统鲁棒性。
English: Retrieval-Augmented Generation (RAG) systems improve Large Language Models by integrating external information but are vulnerable to complex retrieval noise, prompting the introduction of the Magic Mushroom benchmark to simulate real-world noise and assess robustness.
Authors:Zhenhui Liu, Chunyuan Yuan, Ming Pang, Zheng Fang, Li Yuan, Xue Jiang, Changping Peng, Zhangang Lin, Zheng Luo, Jingping Shao
Abstract:
Retrieval systems primarily address the challenge of matching user queries with the most relevant advertisements, playing a crucial role in e-commerce search advertising. The diversity of user needs and expressions often produces massive long-tail queries that cannot be matched with merchant bidwords or product titles, which results in some advertisements not being recalled, ultimately harming user experience and search efficiency. Existing query rewriting research focuses on various methods such as query log mining, query-bidword vector matching, or generation-based rewriting. However, these methods often fail to simultaneously optimize the relevance and authenticity between the user's original query and its rewrite while maximizing the revenue potential of recalled ads.
In this paper, we propose a Multi-objective aligned Bidword Generation Model (MoBGM), composed of a discriminator, a generator, and a preference alignment module, to address these challenges. To simultaneously improve the relevance and authenticity between the query and its rewrite and to maximize platform revenue, we design a discriminator that optimizes these key objectives. Using the discriminator's feedback signal, we train a multi-objective aligned bidword generator that aims to maximize the combined effect of the three objectives. Extensive offline and online experiments show that our proposed algorithm significantly outperforms the state of the art. After deployment, the algorithm has created substantial commercial value for the platform, further verifying its feasibility and robustness.
中文: 本文提出的多目标对齐竞价词生成模型(MoBGM)通过判别器-生成器框架同步优化查询相关性、真实性和广告收益,实验证明该模型显著提升了电商搜索广告效果并创造了巨大商业价值。
English: The proposed Multi-objective aligned Bidword Generation Model (MoBGM) enhances e-commerce search advertising by simultaneously optimizing query relevance, authenticity, and ad revenue through its discriminator-generator framework, demonstrating significant commercial value in experiments.
Authors:Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie
Abstract:
Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial uncertainty and data scarcity, limiting the 3D spatial reasoning capability of pre-trained vision-language models (VLMs). To address these challenges, we present a unified framework for enhancing 3D spatial reasoning in pre-trained VLMs without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process designed for fine-tuning. Extensive experiments across multiple benchmarks demonstrate the individual and combined effectiveness of our prompting and fine-tuning strategies, and yield insights that may inspire future research on visual-spatial understanding.
中文: 本文提出了一个结合SpatialMind结构化提示与ScanForgeQA自动生成数据集的统一框架,在不改变预训练视觉语言模型架构的前提下增强其三维空间推理能力,通过实验验证有效解决了空间不确定性和数据稀缺问题。
English: This paper introduces a unified framework combining SpatialMind's structured prompting and ScanForgeQA's automated dataset to enhance 3D spatial reasoning in pre-trained vision-language models without architectural changes, effectively addressing spatial uncertainty and data scarcity through experimental validation.
Authors:Shivani Chiranjeevi, Hossein Zaremehrjerdi, Zi K. Deng, Talukder Z. Jubery, Ari Grele, Arti Singh, Asheesh K Singh, Soumik Sarkar, Nirav Merchant, Harold F. Greeney, Baskar Ganapathysubramanian, Chinmay Hegde
Abstract:
The rapid global loss of biodiversity, particularly among insects, represents an urgent ecological crisis. Current methods for insect species discovery are manual, slow, and severely constrained by taxonomic expertise, hindering timely conservation actions. We introduce TerraIncognita, a dynamic benchmark designed to evaluate state-of-the-art multimodal models for the challenging problem of identifying unknown, potentially undescribed insect species from image data. Our benchmark dataset combines a mix of expertly annotated images of insect species likely known to frontier AI models, and images of rare and poorly known species, for which few/no publicly available images exist. These images were collected from underexplored biodiversity hotspots, realistically mimicking open-world discovery scenarios faced by ecologists. The benchmark assesses models' proficiency in hierarchical taxonomic classification, their capability to detect and abstain from out-of-distribution (OOD) samples representing novel species, and their ability to generate explanations aligned with expert taxonomic knowledge. Notably, top-performing models achieve over 90% F1 at the Order level on known species, but drop below 2% at the Species level, highlighting the sharp difficulty gradient from coarse to fine taxonomic prediction (Order → Family → Genus → Species). TerraIncognita will be updated regularly, and by committing to quarterly dataset expansions (of both known and novel species), will provide an evolving platform for longitudinal benchmarking of frontier AI methods. All TerraIncognita data, results, and future updates are available at https://baskargroup.github.io/TerraIncognita/.
中文: TerraIncognita是一个动态基准,旨在评估多模态AI模型从图像中识别未知昆虫物种的能力,尽管在精细分类级别性能急剧下降,但它能有效进行层级分类并检测新物种。
English: TerraIncognita is a dynamic benchmark that evaluates multimodal AI models for identifying unknown insect species from images, highlighting their proficiency in hierarchical classification and detection of novel species despite a sharp performance decline at finer taxonomic levels.
Authors:Rachid Zeghlache, Ikram Brahim, Pierre-Henri Conze, Mathieu Lamard, Mohammed El Amine Lazouni, Zineb Aziza Elaouaber, Leila Ryma Lazouni, Christopher Nielsen, Ahmad O. Ahsan, Matthias Wilms, Nils D. Forkert, Lovre Antonio Budimir, Ivana Matovinović, Donik Vršnak, Sven Lončarić, Philippe Zhang, Weili Jiang, Yihao Li, Yiding Hao, Markus Frohmann, Patrick Binder, Marcel Huber, Taha Emre, Teresa Finisterra Araújo, Marzieh Oghbaie, Hrvoje Bogunović, Amerens A. Bekkers, Nina M. van Liebergen, Hugo J. Kuijf, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Alberto J. Beltrán-Carrero, Juan J. Gómez-Valverde, Javier Torresano-Rodríguez, Álvaro Caballero-Sastre, María J. Ledesma Carbayo, Yosuke Yamagishi, Yi Ding, Robin Peretzke, Alexandra Ertl, Maximilian Fischer, Jessica Kächele, Sofiane Zehar, Karim Boukli Hacene, Thomas Monfort, Béatrice Cochener, Mostafa El Habib Daho, Anas-Alexis Benyoussef, Gwenolé Quellec
Abstract:
The MARIO challenge, held at MICCAI 2024, focused on advancing the automated detection and monitoring of age-related macular degeneration (AMD) through the analysis of optical coherence tomography (OCT) images. Designed to evaluate algorithmic performance in detecting neovascular activity changes within AMD, the challenge incorporated unique multi-modal datasets. The primary dataset, sourced from Brest, France, was used by participating teams to train and test their models. The final ranking was determined based on performance on this dataset. An auxiliary dataset from Algeria was used post-challenge to evaluate population and device shifts from submitted solutions. The MARIO challenge involved two tasks. The first was the classification of evolution between two consecutive 2D OCT B-scans. The second was the prediction of future AMD evolution over three months for patients undergoing anti-vascular endothelial growth factor (VEGF) therapy. Thirty-five teams participated, with the top 12 finalists presenting their methods. This paper outlines the challenge's structure, tasks, data characteristics, and winning methodologies, setting a benchmark for AMD monitoring using OCT, infrared imaging, and clinical data (such as the number of visits, age, gender, etc.). The results of this challenge indicate that artificial intelligence (AI) performs as well as a physician in measuring AMD progression (Task 1) but is not yet capable of predicting future evolution (Task 2).
中文摘要:MICCAI 2024的MARIO挑战赛通过多模态数据评估AMD监测算法,结果显示人工智能在追踪现有病变进展方面达到医生水平,但尚无法预测未来病情发展。
English Summary: The MARIO challenge at MICCAI 2024 benchmarked AI algorithms for detecting AMD progression through OCT analysis, demonstrating AI's parity with physicians in monitoring current disease activity but limitations in predicting future evolution.
Authors:Peiding Wang, Li Zhang, Fang Liu, Yinghao Zhu, Wang Xu, Lin Shi, Xiaoli Lian, Minxiao Li, Bo Shen, An Fu
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in code editing, substantially enhancing software development productivity. However, the inherent complexity of code editing tasks forces existing approaches to rely on LLMs' autoregressive end-to-end generation, where decoding speed plays a critical role in efficiency. While inference acceleration techniques like speculative decoding are applied to improve the decoding efficiency, these methods fail to account for the unique characteristics of code editing tasks, where changes are typically localized and existing code segments are reused. To address this limitation, we propose EfficientEdit, a novel method that improves LLM-based code editing efficiency through two key mechanisms based on speculative decoding: (1) effective reuse of original code segments while identifying potential edit locations, and (2) efficient generation of edit content via high-quality drafts from edit-oriented draft models and a dynamic verification mechanism that balances quality and acceleration. Experimental results show that EfficientEdit can achieve up to 10.38× and 13.09× speedup compared to standard autoregressive decoding on CanItEdit and CodeIF-Bench, respectively, outperforming state-of-the-art inference acceleration approaches by up to 90.6%.
中文摘要:EfficientEdit是一种新颖方法,通过基于推测解码重用原始代码段并生成高质量草稿,显著提升了基于大语言模型的代码编辑效率,相比现有方法实现了大幅加速。
English Summary: EfficientEdit is a novel method that enhances the efficiency of LLM-based code editing by leveraging speculative decoding to reuse original code segments and generate high-quality drafts, achieving significant speed improvements over existing approaches.
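The core accept-or-stop rule of speculative decoding, specialized to the observation above that most draft tokens can simply be copied from the original code, can be sketched as follows; `target_next_token` is a hypothetical callable returning the target model's greedy next token:

```python
def speculative_edit_step(target_next_token, prefix, draft_tokens):
    """Accept the longest prefix of draft_tokens (here: tokens copied from the
    original code as a cheap draft) that the target model would itself have
    produced greedily; stop at the first disagreement, where real editing
    (normal decoding) takes over."""
    accepted = []
    for tok in draft_tokens:
        if target_next_token(prefix + accepted) == tok:  # model agrees with draft
            accepted.append(tok)
        else:
            break                                        # edit location found
    return accepted
```

The key property is that every accepted token costs one cheap draft step instead of one full autoregressive step, which pays off precisely because code edits leave most of the file unchanged.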
Authors:Yirao Zhao, Guizhen Chen, Kenji Kawaguchi, Lidong Bing, Wenxuan Zhang
Abstract:
Large language models (LLMs) have revolutionized natural language processing, yet their substantial model sizes often demand considerable computational resources. To preserve computing resources and accelerate inference speed, it is crucial to prune redundant parameters, especially for experienced users who often need compact expert models tailored to specific downstream scenarios. However, most existing pruning methods focus on preserving the model's general capabilities, often requiring extensive post-training or suffering from degraded performance due to coarse-grained pruning. In this work, we design a Custom Pruning method (Cus-Prun) to prune a large general model into a smaller, lightweight expert model, which is positioned along the "language", "domain" and "task" dimensions. By identifying and pruning irrelevant neurons along each dimension, Cus-Prun creates expert models without any post-training. Our experiments demonstrate that Cus-Prun consistently outperforms other methods, achieving minimal loss in both expert and general capabilities across various models from different model families and sizes.
中文摘要:Cus-Prun方法通过剪裁语言、领域和任务维度上的无关神经元,将大型通用模型高效压缩为轻量级专家模型,无需后续训练即可实现卓越性能。
English Summary: The Cus-Prun method efficiently trims large language models into compact expert versions by removing irrelevant neurons across language, domain, and task dimensions, achieving superior performance without post-training.
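A minimal sketch of dimension-wise neuron pruning in this spirit: score each neuron by how much more it activates on expert-domain probe data than on general data, then zero out low-relevance rows. The scoring rule and keep ratio below are illustrative assumptions, not Cus-Prun's exact criterion:

```python
import torch

@torch.no_grad()
def prune_irrelevant_neurons(weight, acts_expert, acts_general, keep_ratio=0.7):
    """weight: (out_features, in_features) linear layer weight.
    acts_*: (n_samples, out_features) recorded post-activation magnitudes
    on expert-domain vs. general probe data."""
    expert_score = acts_expert.abs().mean(0)
    general_score = acts_general.abs().mean(0) + 1e-6
    relevance = expert_score / general_score          # expert-specific usage
    k = int(keep_ratio * weight.shape[0])
    keep = torch.topk(relevance, k).indices
    mask = torch.zeros(weight.shape[0], dtype=torch.bool)
    mask[keep] = True
    return weight * mask.unsqueeze(1), mask           # zero out pruned neurons

w = torch.randn(16, 8)
pruned_w, kept = prune_irrelevant_neurons(w, torch.randn(100, 16), torch.randn(100, 16))
print(kept.sum().item(), "of", w.shape[0], "neurons kept")
```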
Authors:Zhaorui Yang, Bo Pan, Han Wang, Yiyao Wang, Xingyu Liu, Minfeng Zhu, Bo Zhang, Wei Chen
Abstract:
Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite this progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. For the evaluation of generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics as inputs along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.
中文: 该摘要提出了一种名为"多模态深度研究器"的创新框架,通过引入结构化表示"可视化形式描述"(FDV)来解决文本与可视化内容自动生成的空白领域,使大语言模型能够创建高质量多模态报告,并在实验中展现出比基线方法高出82%的优越性能。
English: This abstract introduces a novel framework called Multimodal DeepResearcher that addresses the underexplored task of automated generation of interleaved texts and visualizations by proposing a structured representation called Formal Description of Visualization (FDV), which enables LLMs to create high-quality multimodal reports and demonstrates superior performance with an 82% win rate over baseline methods.
Authors:Saar Huberman, Or Patashnik, Omer Dahary, Ron Mokady, Daniel Cohen-Or
Abstract:
Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment to the textual prompt.
Chinese: 本文提出了一种阶段感知提示分解框架,利用大型语言模型解决文本到图像扩散模型中的上下文矛盾问题,通过将代理提示与去噪过程对齐来提升语义准确性。
English: This paper introduces a stage-aware prompt decomposition framework that uses large language models to resolve contextual contradictions in text-to-image diffusion models, improving semantic accuracy by aligning proxy prompts with the denoising process.
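The scheduling idea can be sketched as follows: a list of LLM-generated proxy prompts is walked through as denoising progresses, so early (high-noise) steps see a coarse, contradiction-free prompt and later steps see the full target prompt. The uniform stage split is an assumption for illustration:

```python
def select_proxy_prompt(t, T, proxy_prompts):
    """Map denoising step t (counting down from T) to one of the ordered
    proxy prompts: coarse scene first, full target prompt last."""
    stage = min(len(proxy_prompts) - 1, int((1 - t / T) * len(proxy_prompts)))
    return proxy_prompts[stage]

# schematic usage inside a standard denoising loop
proxies = [
    "a bird on a branch",                     # coarse layout, contradiction removed
    "a bright blue bird on a snowy branch",   # full target prompt
]
T = 50
for t in range(T, 0, -1):
    prompt = select_proxy_prompt(t, T, proxies)
    # latents = denoise_step(latents, prompt, t)   # model call omitted in this sketch
```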
Authors:Shikun Sun, Min Zhou, Zixuan Wang, Xubin Li, Tiezheng Ge, Zijie Ye, Xiaoyu Qin, Junliang Xing, Bo Zheng, Jia Jia
Abstract:
With the advancement of diffusion models, there is a growing demand for high-quality, controllable image generation, particularly through methods that utilize one or multiple control signals based on ControlNet. However, in current ControlNet training, each control is designed to influence all areas of an image, which can lead to conflicts when different control signals are expected to manage different parts of the image in practical applications. This issue is especially pronounced with edge-type control conditions, where regions lacking boundary information often represent low-frequency signals, referred to as silent control signals. When combining multiple ControlNets, these silent control signals can suppress the generation of textures in related areas, resulting in suboptimal outcomes. To address this problem, we propose Minimal Impact ControlNet. Our approach mitigates conflicts through three key strategies: constructing a balanced dataset, combining and injecting feature signals in a balanced manner, and addressing the asymmetry in the score function's Jacobian matrix induced by ControlNet. These improvements enhance the compatibility of control signals, allowing for freer and more harmonious generation in areas with silent control signals.
中文摘要:提出的最小影响ControlNet通过构建平衡数据集、均衡注入特征信号以及优化雅可比矩阵不对称性,解决了多控制图像生成中的信号冲突问题,显著提升了静默控制区域的生成兼容性。
English Summary: The proposed Minimal Impact ControlNet addresses conflicts in multi-control image generation by implementing balanced dataset construction, feature signal injection, and Jacobian matrix optimization to enhance compatibility in areas with silent control signals.
Authors:Zhiyuan Wang, Bokui Chen, Yinya Huang, Qingxing Cao, Ming He, Jianping Fan, Xiaodan Liang
Abstract:
Operations research (OR) is widely deployed to solve critical decision-making problems with complex objectives and constraints, impacting manufacturing, logistics, finance, and healthcare outcomes. While Large Language Models (LLMs) have shown promising results in various domains, their practical application to industry-relevant OR problems presents significant challenges and opportunities. Preliminary industrial applications of LLMs for operations research face two critical deployment challenges: 1) self-correction focuses on code syntax rather than mathematical accuracy, causing costly errors; 2) complex expert selection creates unpredictable workflows that reduce transparency and increase maintenance costs, making them impractical for time-sensitive business applications. To address these business limitations, we introduce ORMind, a cognitive-inspired framework that enhances optimization through counterfactual reasoning. Our approach emulates human cognition, implementing an end-to-end workflow that systematically transforms requirements into mathematical models and executable solver code. It is currently being tested internally in Lenovo's AI Assistant, with plans to enhance optimization capabilities for both business and consumer customers. Experiments demonstrate that ORMind outperforms existing methods, achieving a 9.5% improvement on the NL4Opt dataset and a 14.6% improvement on the ComplexOR dataset.
中文:ORMind这一受认知启发的框架通过反事实推理解决了大语言模型在运筹学中的数学精度和工作流透明度问题,在NL4Opt和ComplexOR数据集上分别实现了9.5%和14.6%的性能提升。
English: ORMind, a cognitive-inspired framework using counterfactual reasoning, addresses LLMs' limitations in operations research by improving mathematical accuracy and workflow transparency, achieving performance gains of 9.5% and 14.6% on benchmark datasets.
Authors:Sheng Liang, Yongyue Zhang, Yaxiong Wu, Ruiming Tang, Yong Liu
Abstract:
Universal information extraction (UIE) primarily employs an extractive generation approach with large language models (LLMs), typically outputting structured information based on predefined schemas such as JSON or tables. UIE suffers from a lack of adaptability when selecting between predefined schemas and on-the-fly schema generation within the in-context learning paradigm, especially when there are numerous schemas to choose from. In this paper, we propose a unified adaptive text-to-structure generation framework, called Schema as Parameterized Tools (SPT), which reimagines the tool-calling capability of LLMs by treating predefined schemas as parameterized tools for tool selection and parameter filling. Specifically, our SPT method can be applied to unify closed, open, and on-demand IE tasks by adopting Schema Retrieval by fetching the relevant schemas from a predefined pool, Schema Filling by extracting information and filling slots as with tool parameters, or Schema Generation by synthesizing new schemas with uncovered cases. Experiments show that the SPT method can handle four distinct IE tasks adaptively, delivering robust schema retrieval and selection performance. SPT also achieves comparable extraction performance to LoRA baselines and current leading UIE systems with significantly fewer trainable parameters.
中文: 本文提出Schema as Parameterized Tools (SPT)框架,将预定义模式视为参数化工具,统一处理封闭、开放和按需信息抽取任务,以更少参数实现稳健性能。
English: The paper introduces Schema as Parameterized Tools (SPT), a unified framework that treats predefined schemas as parameterized tools to adaptively handle closed, open, and on-demand information extraction tasks, achieving robust performance with fewer parameters.
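A condensed sketch of the schema-as-tool flow, collapsing retrieval, filling, and generation into one LLM call over retrieved candidates; `llm` and `retriever` are hypothetical callables, not SPT's actual interfaces:

```python
import json

def spt_extract(text, schema_pool, llm, retriever):
    """Schema Retrieval -> Schema Filling (tool-call style) -> Schema Generation
    as a fallback when no retrieved schema fits."""
    candidates = retriever(text, schema_pool, top_k=3)     # schema retrieval
    tool_specs = json.dumps(candidates, ensure_ascii=False)
    out = llm(
        "Treat each schema as a parameterized tool. Pick the best-matching "
        "schema and fill its slots from the text, or output a NEW schema if "
        f"none fits. Return JSON only.\nSchemas: {tool_specs}\nText: {text}"
    )
    return json.loads(out)   # filled schema, or a newly synthesized one
```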
Authors:Yongdong Chi, Hanqing Wang, Zonghan Yang, Jian Yang, Xiao Yan, Yun Chen, Guanhua Chen
Abstract:
Text-to-SQL transforms the user queries from natural language to executable SQL programs, enabling non-experts to interact with complex databases. Existing prompt-based methods craft meticulous text guidelines and examples to facilitate SQL generation, but their accuracy is hindered by the large semantic gap between the texts and the low-resource SQL programs. In this work, we propose Pi-SQL, which incorporates the high-resource Python program as a pivot to bridge between the natural language query and SQL program. In particular, Pi-SQL first generates Python programs that provide fine-grained step-by-step guidelines in their code blocks or comments, and then produces an SQL program following the guidance of each Python program. The final SQL program matches the reference Python program's query results and, through selection from candidates generated by different strategies, achieves superior execution speed, with a reward-based valid efficiency score up to 4.55 higher than the best-performing baseline. Extensive experiments demonstrate the effectiveness of Pi-SQL, which improves the execution accuracy of the best-performing baseline by up to 3.20.
中文: Pi-SQL通过引入Python程序作为中间桥梁,缩小自然语言与SQL之间的语义差距,利用逐步指导和候选选择策略,显著提升了执行准确性和效率。
English: Pi-SQL enhances Text-to-SQL conversion by using Python programs as an intermediary to bridge the semantic gap, improving accuracy and execution speed through step-by-step guidance and candidate selection.
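The two-stage pivot can be sketched in a few lines, with `llm` standing in for any completion function (hypothetical); Pi-SQL's additional selection among candidates generated by different strategies is omitted here:

```python
def pi_sql(question, db_schema, llm):
    """Python-as-pivot sketch: draft a step-by-step Python program first,
    then translate it into SQL under that fine-grained guidance."""
    python_pivot = llm(
        "Write a commented Python (pandas) program that answers the question "
        f"over these tables, one step per comment.\nSchema: {db_schema}\n"
        f"Question: {question}"
    )
    sql = llm(
        "Translate the following step-by-step program into a single SQL query, "
        f"following its comments exactly.\nProgram:\n{python_pivot}"
    )
    return sql
```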
Authors:Jingyu Liu, Jingquan Peng, xiaopeng Wu, Xubin Li, Tiezheng Ge, Bo Zheng, Yong Liu
Abstract:
Despite the widespread application of Large Language Models (LLMs) across various domains, they frequently exhibit overconfidence when encountering uncertain scenarios, and existing solutions, which primarily rely on evasive responses (e.g., "I don't know"), overlook the opportunity to identify and address the uncertainty in order to generate more satisfactory responses. To systematically investigate and improve LLMs' ability to recognize and address the source of uncertainty, we introduce ConfuseBench, a benchmark focusing on three types of uncertainty: document scarcity, limited capability, and query ambiguity. Experiments with ConfuseBench reveal that current LLMs struggle to accurately identify the root cause of uncertainty and resolve it. They prefer to attribute uncertainty to query ambiguity while overlooking capability limitations, especially for weaker models. To tackle this challenge, we first generate context-aware inquiries that highlight the confusing aspect of the original query. We then judge the source of uncertainty based on the uniqueness of the inquiry's answer. Further, we use an on-policy training method, InteractDPO, to generate better inquiries. Experimental results demonstrate the efficacy of our approach.
中文:大型语言模型常错误地将不确定性归因于查询模糊性而忽略自身能力局限,但采用情境感知询问和在线策略训练的新方法有望提升其识别与应对不确定性根源的能力。
English: Large Language Models often misidentify the root causes of uncertainty, primarily attributing it to query ambiguity while overlooking their own capability limitations, but a new approach using context-aware inquiries and on-policy training shows promise in improving their recognition and response accuracy.
Authors:Mingxuan Liu, Tyler L. Hayes, Massimiliano Mancini, Elisa Ricci, Riccardo Volpi, Gabriela Csurka
Abstract:
Open-vocabulary object detection models allow users to freely specify a class vocabulary in natural language at test time, guiding the detection of desired objects. However, vocabularies can be overly broad or even mis-specified, hampering the overall performance of the detector. In this work, we propose a plug-and-play Vocabulary Adapter (VocAda) to refine the user-defined vocabulary, automatically tailoring it to categories that are relevant for a given image. VocAda does not require any training; it operates at inference time in three steps: i) it uses an image captioner to describe visible objects, ii) it parses nouns from those captions, and iii) it selects relevant classes from the user-defined vocabulary, discarding irrelevant ones. Experiments on COCO and Objects365 with three state-of-the-art detectors show that VocAda consistently improves performance, proving its versatility. The code is open source.
中文:提出的词汇适配器(VocAda)通过分析图像描述自动筛选相关类别,在无需训练的情况下优化用户定义词汇,并在多个数据集上持续提升检测性能。
English: The proposed Vocabulary Adapter (VocAda) refines user-defined vocabularies at inference time by analyzing image captions to select relevant classes, consistently enhancing detection performance across multiple datasets without requiring training.
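Because VocAda is training-free, its third step reduces to simple set filtering. A toy sketch, with naive whitespace tokenization standing in for the caption noun parser:

```python
def vocada_filter(caption, vocabulary):
    """Keep only user classes whose name (or plural) appears among the caption
    words; real noun parsing would use a POS tagger, so this whitespace
    heuristic is a simplifying assumption."""
    words = {w.strip(".,").lower() for w in caption.split()}
    kept = [c for c in vocabulary if c.lower() in words or c.lower() + "s" in words]
    return kept or vocabulary   # fall back to the full vocabulary if nothing matches

caption = "Two dogs chase a frisbee on the beach."
print(vocada_filter(caption, ["dog", "cat", "frisbee", "car"]))  # ['dog', 'frisbee']
```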
Authors:Reihaneh Zohrabi, Hosein Hasani, Mahdieh Soleymani Baghshah, Anna Rohrbach, Marcus Rohrbach, Mohammad Hossein Rohban
Abstract:
Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications, where they frequently face data distributions unseen during training. Despite progress, existing methods are often vulnerable to spurious correlations that mislead models and compromise robustness. To address this, we propose SPROD, a novel prototype-based OOD detection approach that explicitly addresses the challenge posed by unknown spurious correlations. Our post-hoc method refines class prototypes to mitigate bias from spurious features without additional data or hyperparameter tuning, and is broadly applicable across diverse backbones and OOD detection settings. We conduct a comprehensive spurious correlation OOD detection benchmarking, comparing our method against existing approaches and demonstrating its superior performance across challenging OOD datasets, such as CelebA, Waterbirds, UrbanCars, Spurious Imagenet, and the newly introduced Animals MetaCoCo. On average, SPROD improves AUROC by 4.7% and FPR@95 by 9.3% over the second best.
中文: SPROD是一种基于原型的新型分布外检测方法,无需额外数据或参数调整即可有效消除伪相关带来的偏差,在多个基准测试中实现了显著的性能提升。
English: SPROD is a novel prototype-based OOD detection method that effectively mitigates bias from spurious correlations without requiring extra data or parameter adjustments, achieving significant performance improvements across multiple benchmarks.
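The basic post-hoc scoring rule behind prototype-based OOD detection can be sketched as follows; SPROD's actual contribution, refining prototypes per spurious subgroup, sits on top of this rule and is omitted here:

```python
import numpy as np

def prototype_ood_score(features, class_prototypes):
    """Cosine similarity to the nearest class prototype: a low maximum
    similarity indicates a likely OOD sample."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = class_prototypes / np.linalg.norm(class_prototypes, axis=1, keepdims=True)
    sims = f @ p.T                  # (n_samples, n_classes) cosine similarities
    return sims.max(axis=1)         # threshold these scores for detection

rng = np.random.default_rng(0)
protos = rng.normal(size=(10, 64))  # e.g., per-class mean features
x = rng.normal(size=(5, 64))        # test features from a frozen backbone
print(prototype_ood_score(x, protos))
```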
Authors:Yun Xing, Qing Guo, Xiaoguang Li, Yihao Huang, Xiaofeng Cao, Di Lin, Ivor Tsang, Lei Ma
Abstract:
In this work, we focus on a novel and practical task, i.e., Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging the complementary information from a reference image, where both images capture the same scene but with a significant time gap in between, i.e., time-variant images. Different from conventional reference-guided image inpainting, the reference image under the TAMP setup differs significantly in content from the target image and may itself suffer from damage. Such a situation frequently arises in daily life, where a damaged image must be restored by referring to another image whose source and quality are not guaranteed. In particular, our study finds that even state-of-the-art (SOTA) reference-guided image inpainting methods fail to achieve plausible results due to chaotic image complementation. To address this ill-posed problem, we propose a novel Interactive Distribution Transition Estimation (InDiTE) module which interactively complements the time-variant images with adaptive semantics, thus facilitating the restoration of damaged regions. To further boost performance, we propose our TAMP solution, namely Interactive Distribution Transition Estimation-driven Diffusion (InDiTE-Diff), which integrates InDiTE with a SOTA diffusion model and conducts latent cross-reference during sampling. Moreover, given the lack of benchmarks for the TAMP task, we newly assembled a dataset, TAMP-Street, based on existing image and mask datasets. We conduct experiments on the TAMP-Street dataset under two different time-variant image inpainting settings, which show that our method consistently outperforms SOTA reference-guided image inpainting methods for solving TAMP.
中文: 本研究提出了时变图像修复任务(TAMP),旨在利用时间差异显著的参考图像修复受损目标图像,并开发了InDiTE-Diff方法,通过交互式分布转换估计与扩散模型相结合,有效解决了内容差异问题,在性能上超越了现有最优方法。
English: This study introduces Time-vAriant iMage inPainting (TAMP), a task to restore damaged images using time-separated reference images, and proposes the InDiTE-Diff method that integrates interactive distribution transition estimation with diffusion models to effectively address content discrepancies and outperform existing techniques.
Authors:Parham Rezaei, Arash Marioriyad, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Abstract:
Despite the ability of text-to-image models to generate high-quality, realistic, and diverse images, they face challenges in compositional generation, often struggling to accurately represent details specified in the input prompt. A prevalent issue in compositional generation is the misalignment of spatial relationships, as models often fail to faithfully generate images that reflect the spatial configurations specified between objects in the input prompts. To address this challenge, we propose a novel probabilistic framework for modeling the relative spatial positioning of objects in a scene, leveraging the concept of Probability of Superiority (PoS). Building on this insight, we make two key contributions. First, we introduce a novel evaluation metric, PoS-based Evaluation (PSE), designed to assess the alignment of 2D and 3D spatial relationships between text and image, with improved adherence to human judgment. Second, we propose PoS-based Generation (PSG), an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning. PSG employs a PoS-based reward function that can be utilized in two distinct ways: (1) as a gradient-based guidance mechanism applied to the cross-attention maps during the denoising steps, or (2) as a search-based strategy that evaluates a set of initial noise vectors to select the best one. Extensive experiments demonstrate that the PSE metric exhibits stronger alignment with human judgment compared to traditional center-based metrics, providing a more nuanced and reliable measure of complex spatial relationship accuracy in text-image alignment. Furthermore, PSG significantly enhances the ability of text-to-image models to generate images with specified spatial configurations, outperforming state-of-the-art methods across multiple evaluation metrics and benchmarks.
中文: 针对文本到图像模型在组合生成中难以准确表达空间关系的问题,我们提出了基于优势概率(PoS)的概率框架,通过新型评估指标PSE(更符合人类判断)和无需微调的生成方法PSG,显著提升了模型对空间配置的生成准确性。
English: Text-to-image models often struggle with accurately representing spatial relationships in compositional generation, so we propose a novel probabilistic framework using Probability of Superiority (PoS) to introduce both an evaluation metric (PSE) that better aligns with human judgment and a generation method (PSG) that enhances spatial accuracy without requiring model fine-tuning.
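The PoS idea has a direct Monte-Carlo reading: sample pixel x-coordinates from each object's region and estimate the probability that one object lies to the left of the other. A toy sketch, assuming coordinate samples are already extracted (e.g., from segmentation masks or attention maps):

```python
import numpy as np

def probability_of_superiority(xs_a, xs_b):
    """Estimate P(x_A < x_B), i.e., how strongly 'A is left of B' holds,
    from coordinate samples of the two objects."""
    a = np.asarray(xs_a)[:, None]   # (n_a, 1)
    b = np.asarray(xs_b)[None, :]   # (1, n_b)
    return float((a < b).mean())    # fraction of sample pairs with A left of B

# toy example: A's x-coordinates cluster near 10, B's near 40
rng = np.random.default_rng(1)
pos = probability_of_superiority(rng.normal(10, 5, 500), rng.normal(40, 5, 500))
print(pos)   # close to 1.0 -> the relation is satisfied almost surely
```

Unlike a center-distance check, this score reflects the full overlap of the two spatial distributions, which is what lets it track human judgment more closely.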
Authors:Runtian Yuan, Qingqiu Li, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen
Abstract:
We present our solution for the Multi-Source COVID-19 Detection Challenge, which aims to classify chest CT scans into COVID and Non-COVID categories across data collected from four distinct hospitals and medical centers. A major challenge in this task lies in the domain shift caused by variations in imaging protocols, scanners, and patient populations across institutions. To enhance the cross-domain generalization of our model, we incorporate Variance Risk Extrapolation (VREx) into the training process. VREx encourages the model to maintain consistent performance across multiple source domains by explicitly minimizing the variance of empirical risks across environments. This regularization strategy reduces overfitting to center-specific features and promotes learning of domain-invariant representations. We further apply Mixup data augmentation to improve generalization and robustness. Mixup interpolates both the inputs and labels of randomly selected pairs of training samples, encouraging the model to behave linearly between examples and enhancing its resilience to noise and limited data. Our method achieves an average macro F1 score of 0.96 across the four sources on the validation set, demonstrating strong generalization.
中文: 针对多源COVID-19检测挑战,我们通过引入方差风险外推法和混合数据增强技术,有效提升了模型在四家医院数据间的跨域泛化能力,最终在验证集上取得0.96的平均宏观F1分数,成功学习了域不变特征并增强了鲁棒性。
English: Our solution for the Multi-Source COVID-19 Detection Challenge integrates Variance Risk Extrapolation and Mixup data augmentation to enhance cross-domain generalization, achieving a 0.96 average macro F1 score across four hospital datasets by learning domain-invariant features and improving robustness.
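Both ingredients are compact enough to sketch directly; the VREx penalty weight `beta` and Mixup `alpha` below are common defaults rather than this solution's tuned values, and labels are assumed one-hot for mixing:

```python
import torch

def vrex_objective(per_domain_losses, beta=10.0):
    """VREx: mean risk plus the variance of per-domain empirical risks,
    pushing the model toward equal performance across source hospitals
    (assumes at least two source domains)."""
    risks = torch.stack(per_domain_losses)
    return risks.mean() + beta * risks.var()

def mixup(x, y, alpha=0.2):
    """Standard Mixup: convex-combine random sample pairs and their labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]

# usage: losses = [loss_hospital_1, ..., loss_hospital_4]; total = vrex_objective(losses)
```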
Authors:Chengyu Dong, Huan Gui, Noveen Sachdeva, Long Jin, Ke Yin, Jingbo Shang, Lichan Hong, Ed H. Chi, Zhe Zhao
Abstract:
Knowledge distillation from pretrained visual representation models offers an effective approach to improve small, task-specific production models. However, the effectiveness of such knowledge transfer drops significantly when distilling from strong models that are pretrained at large scale. In this paper, we address this challenge for pretrained Vision Transformers (ViTs) by exploring methods to fine-tune them for more effective knowledge transfer. Motivated by the connection between mutual information and distillation effectiveness, we propose to employ mutual information-aware optimization during finetuning. For small or highly imbalanced downstream datasets where such optimization becomes less effective, we introduce a simple yet effective heuristic of reweighting MLP blocks. This approach is inspired by our observation that top MLP blocks are primarily responsible for mutual information loss. Our method enables small student models to benefit from pretrained models that are among the strongest.
Chinese: 本文通过引入互信息感知优化和MLP模块重加权方法,有效提升了大规模预训练视觉Transformer的知识蒸馏效果,尤其适用于小型或不平衡数据集的学生模型。
English: This paper enhances knowledge distillation from large-scale pretrained Vision Transformers by introducing mutual information-aware optimization and MLP block reweighting to improve transfer effectiveness, especially for small or imbalanced datasets.
Authors:Qi Li, Kun Li, Haozhi Han, Liang Yuan, Junshi Chen, Yunquan Zhang, Yifeng Chen, Hong An, Ting Cao, Mao Yang
Abstract:
Sparse Tensor Cores offer exceptional performance gains for AI workloads by exploiting structured 2:4 sparsity. However, their potential remains untapped for core scientific workloads such as stencil computations, which exhibit irregular sparsity patterns. This paper presents SparStencil, the first system to retarget sparse TCUs for scientific stencil computations through structured sparsity transformation. SparStencil introduces three key techniques: (1) Adaptive Layout Morphing, which restructures stencil patterns into staircase-aligned sparse matrices via a flatten-and-crush pipeline; (2) Structured Sparsity Conversion, which formulates transformation as a graph matching problem to ensure compatibility with 2:4 sparsity constraints; (3) Automatic Kernel Generation, which compiles transformed stencils into optimized sparse MMA kernels via layout search and table-driven memory mapping. Evaluated on 79 stencil kernels spanning diverse scientific domains, SparStencil achieves up to 7.1x speedup (3.1x on average) over a state-of-the-art framework while reducing code complexity and matching or exceeding expert-tuned performance in both compute throughput and memory efficiency.
中文: SparStencil是首个通过将不规则稀疏模式重构为结构化2:4稀疏矩阵来适配稀疏张量核心的科学模板计算系统,在降低代码复杂度的同时实现了最高7.1倍的性能提升。
English: SparStencil is the first system to adapt sparse tensor cores for scientific stencil computations by restructuring irregular sparsity into structured 2:4 patterns, achieving up to 7.1x speedup while reducing code complexity.
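The 2:4 constraint itself is easy to state in code: in every contiguous group of four values along a row, at most two may be nonzero. The magnitude-based projection below illustrates the constraint; SparStencil's contribution is rearranging stencil coefficients so that this projection loses no information:

```python
import numpy as np

def to_2_4_sparse(matrix):
    """Project a dense matrix onto the 2:4 pattern required by Sparse Tensor
    Cores: keep the 2 largest-magnitude values in each group of 4 per row."""
    m = matrix.copy()
    rows, cols = m.shape
    assert cols % 4 == 0, "columns must be a multiple of 4"
    for g in range(cols // 4):
        block = m[:, 4 * g: 4 * g + 4]               # view into m (in-place edit)
        drop = np.argsort(np.abs(block), axis=1)[:, :2]   # two smallest |values|
        np.put_along_axis(block, drop, 0.0, axis=1)
    return m

a = np.arange(1, 9, dtype=float).reshape(2, 4)
print(to_2_4_sparse(a))   # each row keeps its two largest entries per 4-group
```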
Authors:Tao Tang, Likui Zhang, Youpeng Wen, Kaidong Zhang, Jia-Wang Bian, Xia Zhou, Tianyi Yan, Kun Zhan, Peng Jia, Hefeng Wu, Liang Lin, Xiaodan Liang
Abstract:
The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, which demonstrate our satisfactory simulation performance.
中文: RoboPearls是一个基于3D高斯泼溅技术的可编辑视频仿真框架,能够从演示视频构建逼真的机器人操作模拟,并通过集成大语言模型实现自动化流程和视觉语言模型进行性能分析,有效弥合仿真与现实的差距。
English: RoboPearls is an editable video simulation framework that uses 3D Gaussian Splatting to create realistic robotic manipulation simulations from demonstration videos, incorporating large language models for automated production and vision-language models for performance analysis to bridge the sim-to-real gap.
Authors:Yannick Werner, Akash Malemath, Mengxi Liu, Vitor Fortes Rey, Nikolaos Palaiodimopoulos, Paul Lukowicz, Maximilian Kiefer-Emmanouilidis
Abstract:
Kolmogorov Arnold Networks (KANs), built upon the Kolmogorov Arnold representation theorem (KAR), have demonstrated promising capabilities in expressing complex functions with fewer neurons. This is achieved by implementing learnable parameters on the edges instead of on the nodes, unlike traditional networks such as Multi-Layer Perceptrons (MLPs). However, KANs' potential in quantum machine learning has not yet been well explored. In this work, we present an implementation of these KAN architectures in both hybrid and fully quantum forms using a Quantum Circuit Born Machine (QCBM). We adapt the KAN transfer using pre-trained residual functions, thereby exploiting the representational power of parametrized quantum circuits. In the hybrid model we combine classical KAN components with quantum subroutines, while in the fully quantum version the entire architecture of the residual function is translated into a quantum model. We demonstrate the feasibility, interpretability, and performance of the proposed Quantum KAN (QuKAN) architecture.
中文: 量子Kolmogorov Arnold网络(QuKAN)通过量子电路实现了混合和全量子架构,探索其在量子机器学习中的潜力,并展示了可行性、可解释性和性能。
English: Quantum Kolmogorov Arnold Networks (QuKAN) implement both hybrid and fully quantum architectures using quantum circuits to explore their potential in quantum machine learning, demonstrating feasibility, interpretability, and performance.
Authors:Matthias Tschöpe, Vitor Fortes Rey, Sogo Pierre Sanon, Paul Lukowicz, Nikolaos Palaiodimopoulos, Maximilian Kiefer-Emmanouilidis
Abstract:
Understanding the impact of small quantum gate perturbations, which are common in quantum digital devices but absent in classical computers, is crucial for identifying potential advantages in quantum machine learning. While these perturbations are typically seen as detrimental to quantum computation, they can actually enhance performance by serving as a natural source of data augmentation. Additionally, they can often be efficiently simulated on classical hardware, enabling quantum-inspired approaches to improve classical machine learning methods. In this paper, we investigate random Bloch sphere rotations, which are fundamental SU(2) transformations, as a simple yet effective quantum-inspired data augmentation technique. Unlike conventional augmentations such as flipping, rotating, or cropping, quantum transformations lack intuitive spatial interpretations, making their application to tasks like image classification less straightforward. While common quantum augmentation methods rely on applying quantum models or trainable quanvolutional layers to classical datasets, we focus on the direct application of small-angle Bloch rotations and their effect on classical data. Using the large-scale ImageNet dataset, we demonstrate that our quantum-inspired augmentation method improves image classification performance, increasing Top-1 accuracy by 3%, Top-5 accuracy by 2.5%, and the F1 score from 8% to 12% compared to standard classical augmentation methods. Finally, we examine the use of stronger unitary augmentations. Although these transformations preserve information in principle, they result in visually unrecognizable images with potential applications for privacy computations. However, we show that our augmentation approach and simple SU(2) transformations do not enhance differential privacy and discuss the implications of this limitation.
中文: 量子门扰动虽常被视为不利因素,但可作为数据增强提升机器学习性能,通过布洛赫球旋转的量子启发方法在ImageNet数据集上实现了比传统增强更高的图像分类准确率。
English: Quantum gate perturbations, often viewed as detrimental, can enhance machine learning performance by acting as data augmentation and are efficiently simulated on classical hardware, with Bloch sphere rotations improving image classification accuracy on ImageNet compared to classical methods.
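An augmentation of this kind can be simulated classically in a few lines: encode each normalized pixel as a single-qubit state, apply one small random SU(2) rotation, and read back the |1> population. The amplitude encoding below is an assumed choice for illustration, not necessarily the paper's encoding:

```python
import numpy as np

def su2_rotation(angle, axis):
    """SU(2) rotation exp(-i*angle/2 * n.sigma) about unit axis n."""
    X = np.array([[0, 1], [1, 0]], dtype=complex)
    Y = np.array([[0, -1j], [1j, 0]])
    Z = np.array([[1, 0], [0, -1]], dtype=complex)
    n = axis / np.linalg.norm(axis)
    H = n[0] * X + n[1] * Y + n[2] * Z
    return np.cos(angle / 2) * np.eye(2) - 1j * np.sin(angle / 2) * H

def bloch_augment(pixels, max_angle=0.1, seed=None):
    """Encode pixels in [0,1] as qubit states, apply one small random Bloch
    rotation, and decode via the |1> population (back in [0,1])."""
    rng = np.random.default_rng(seed)
    U = su2_rotation(rng.uniform(-max_angle, max_angle), rng.normal(size=3))
    theta = np.pi * np.clip(pixels, 0.0, 1.0)
    states = np.stack([np.cos(theta / 2), np.sin(theta / 2)]).astype(complex)
    rotated = U @ states.reshape(2, -1)
    return (np.abs(rotated[1]) ** 2).reshape(pixels.shape)

img = np.linspace(0, 1, 6).reshape(2, 3)
print(bloch_augment(img, seed=0))   # slightly perturbed copy of img
```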
Authors:Matteo Esposito, Alexander Bakhtin, Noman Ahmad, Mikel Robredo, Ruoyu Su, Valentina Lenarduzzi, Davide Taibi
Abstract:
While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.
中文: 该基于MAPE-K的框架利用智能代理AI自主检测和修复微服务异常,提供可定制方案以增强系统稳定性、安全性及性能。
English: The proposed MAPE-K-based framework utilizes agentic AI to autonomously detect and remediate anomalies in microservices, offering customizable solutions for enhancing system stability, security, and performance.
Authors:Li Zhou, Hao Jiang, Junjie Li, Zefeng Zhao, Feng Jiang, Wenyu Chen, Haizhou Li
Abstract:
Explicit structural information has been proven to be encoded by Graph Neural Networks (GNNs), serving as auxiliary knowledge to enhance model capabilities and improve performance in downstream NLP tasks. However, recent studies indicate that GNNs fail to fully utilize structural information, whereas Multi-Layer Perceptrons (MLPs), despite lacking the message-passing mechanisms inherent to GNNs, exhibit a surprising ability in structure-aware tasks. Motivated by these findings, this paper introduces a comprehensive probing framework from an information-theoretic perspective. The framework is designed to systematically assess the role of explicit structural modeling in enhancing language model (LM) representations and to investigate the potential of MLPs as efficient and scalable alternatives to GNNs. We extend traditional probing classifiers by incorporating a control module that allows for selective use of either the full GNN model or its decoupled components, specifically, the message-passing and feature-transformation operations. This modular approach isolates and assesses the individual contributions of these operations, avoiding confounding effects from the complete GNN architecture. Using the Edge Probing Suite, a diagnostic tool for evaluating the linguistic knowledge encoded in LMs, we find that MLPs, when used as feature-transformation modules, consistently improve the linguistic knowledge captured in LM representations across different architectures. They effectively encode both syntactic and semantic patterns. Similarly, GNNs that incorporate feature-transformation operations show beneficial effects. In contrast, models that rely solely on message-passing operations tend to underperform, often leading to negative impacts on probing task performance.
中文: 图神经网络虽能编码结构信息却未充分利用,而多层感知机在结构感知任务中表现优异;研究表明多层感知机可增强语言模型的语言学知识,而仅依赖消息传递的模型则效果不佳。
English: Graph Neural Networks (GNNs) encode structural information but underutilize it, whereas Multi-Layer Perceptrons (MLPs) surprisingly excel in structure-aware tasks, leading to a framework that shows MLPs enhance linguistic knowledge in language models while message-passing alone often harms performance.
Authors:Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, Qiang Li
Abstract:
We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.
中文:SAM4D是一种支持相机与激光雷达数据提示分割的基础模型,通过统一多模态位置编码实现跨模态对齐,结合运动感知记忆注意力增强时序一致性,其自动化数据引擎能快速生成高质量伪标签。
English: SAM4D is a promptable segmentation foundation model for camera and LiDAR data, utilizing UMPE for cross-modal alignment and MCMA for temporal consistency, with an automated data engine enabling rapid pseudo-label generation.
Authors:Matías Mattamala, Nived Chebrolu, Jonas Frey, Leonard Freißmuth, Haedam Oh, Benoit Casseau, Marco Hutter, Maurice Fallon
Abstract:
Legged robots are increasingly being adopted in industries such as oil, gas, mining, nuclear, and agriculture. However, new challenges exist when moving into natural, less-structured environments, such as forestry applications. This paper presents a prototype system for autonomous, under-canopy forest inventory with legged platforms. Motivated by the robustness and mobility of modern legged robots, we introduce a system architecture which enabled a quadruped platform to autonomously navigate and map forest plots. Our solution involves a complete navigation stack for state estimation, mission planning, and tree detection and trait estimation. We report the performance of the system from trials executed over one and a half years in forests in three European countries. Our results with the ANYmal robot demonstrate that we can survey plots of up to 1 ha in under 30 min, while also identifying trees with a typical DBH accuracy of 2 cm. The findings of this project are presented as five lessons and challenges. In particular, we discuss the maturity of hardware development, state estimation limitations, open problems in forest navigation, future avenues for robotic forest inventory, and more general challenges in assessing autonomous systems. By sharing these lessons and challenges, we offer insight and new directions for future research on legged robots, navigation systems, and applications in natural environments. Additional videos can be found at https://dynamic.robots.ox.ac.uk/projects/legged-robots
中文: 本文提出了一种基于腿式机器人的自主森林勘测系统,在欧洲森林试验中实现了高效测绘和树木识别,同时指出了硬件成熟度和导航技术等关键挑战以供未来研究参考。
English: This paper introduces an autonomous forest inventory system using legged robots, demonstrating successful navigation and tree mapping in European forests with high efficiency and accuracy while highlighting key challenges for future research.
Authors:Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Hujin Peng, Zeyang Sha, Yuyuan Li, Changting Lin, Xun Wang, Xuan Liu, Ningyu Zhang, Chaochao Chen, Muhammad Khurram Khan, Meng Han
Abstract:
In recent years, Large-Language-Model-driven AI agents have exhibited unprecedented intelligence and adaptability, and are rapidly changing human production and life. Nowadays, agents are undergoing a new round of evolution. They no longer act as isolated islands the way standalone LLMs did. Instead, they have started to communicate with diverse external entities, such as other agents and tools, to perform more complex tasks collectively. Under this trend, agent communication is regarded as a foundational pillar of the future AI ecosystem, and many organizations have begun intensively designing related communication protocols (e.g., Anthropic's MCP and Google's A2A) within the past few months. However, this new field exposes significant security hazards, which can cause severe damage to real-world scenarios. To help researchers quickly get up to speed on this promising topic and to benefit future agent communication development, this paper presents a comprehensive survey of agent communication security. More precisely, we first present a clear definition of agent communication and categorize its entire lifecycle into three stages: user-agent interaction, agent-agent communication, and agent-environment communication. Next, for each communication phase, we dissect related protocols and analyze the security risks according to the communication characteristics. Then, we summarize and outlook on possible defense countermeasures for each risk. In addition, we conduct experiments using MCP and A2A to help readers better understand the novel vulnerabilities brought by agent communication. Finally, we discuss open issues and future directions in this promising research field.
中文:大型语言模型驱动的AI智能体正通过与外部实体通信来执行复杂任务,但引发了严重的安全隐患,本文通过分析三个阶段通信协议、风险及防御措施,对该领域进行全面综述。
English: Large-Language-Model-driven AI agents are evolving to communicate with external entities, enabling complex tasks but introducing severe security risks, which this paper surveys by analyzing protocols, vulnerabilities, and defenses across three communication stages.
Authors:Yixuan Wang, Ziming Liu, Zongyi Li, Anima Anandkumar, Thomas Y. Hou
Abstract:
We investigate the high-precision training of Physics-Informed Neural Networks (PINNs) in unbounded domains, with a special focus on applications to singularity formation in PDEs. We propose a modularized approach and study the choices of neural network ansatz, sampling strategy, and optimization algorithm. When combined with rigorous computer-assisted proofs and PDE analysis, the numerical solutions identified by PINNs, provided they are of high precision, can serve as a powerful tool for studying singularities in PDEs. For the 1D Burgers equation, our framework can lead to a solution with very high precision, and for the 2D Boussinesq equation, which is directly related to singularity formation in the 3D Euler and Navier-Stokes equations, we obtain a solution whose loss is 4 digits smaller than that obtained in prior work (wang2023asymptotic) with fewer training steps. We also discuss potential directions for pushing towards machine precision for higher-dimensional problems.
中文: 本研究通过模块化方法改进了无界域中物理信息神经网络的高精度训练,结合网络结构设计、采样策略和优化算法,成功应用于偏微分方程奇异性分析,在一维Burgers方程和二维Boussinesq方程中均取得了优于现有研究的精度表现。
English: This study advances high-precision Physics-Informed Neural Networks (PINNs) for unbounded domains by introducing a modularized approach that integrates neural network design, sampling strategies, and optimization algorithms, enabling rigorous analysis of PDE singularities and achieving superior accuracy in both 1D Burgers and 2D Boussinesq equations compared to prior work.
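Code sketch: To make the modular pieces concrete, here is a minimal PINN residual-training loop for the 1D viscous Burgers equation $u_t + u u_x = \nu u_{xx}$. The network width, uniform collocation sampler, Adam optimizer, and value of $\nu$ are illustrative assumptions; the paper's high-precision choices of ansatz, sampling, and optimizer are considerably more careful.

    # Minimal PINN for 1D viscous Burgers: u_t + u*u_x = nu*u_xx.
    # All hyperparameters are illustrative, not the paper's setup.
    import torch
    import torch.nn as nn

    class PINN(nn.Module):
        def __init__(self, width=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2, width), nn.Tanh(),
                nn.Linear(width, width), nn.Tanh(),
                nn.Linear(width, 1))

        def forward(self, x, t):
            return self.net(torch.cat([x, t], dim=-1))

    def burgers_residual(model, x, t, nu=0.01):
        x.requires_grad_(True); t.requires_grad_(True)
        u = model(x, t)
        grad = lambda y, z: torch.autograd.grad(
            y, z, torch.ones_like(y), create_graph=True)[0]
        u_t, u_x = grad(u, t), grad(u, x)
        u_xx = grad(u_x, x)
        return u_t + u * u_x - nu * u_xx  # vanishes at the exact solution

    model = PINN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(1000):
        x = torch.rand(256, 1) * 2 - 1   # collocation points in [-1, 1]
        t = torch.rand(256, 1)           # times in [0, 1]
        loss = burgers_residual(model, x, t).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()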
Authors:Daniel M. Lang, Richard Osuala, Veronika Spieker, Karim Lekadir, Rickmer Braren, Julia A. Schnabel
Abstract:
Synthetic contrast enhancement offers fast image acquisition and eliminates the need for intravenous injection of contrast agent. This is particularly beneficial for breast imaging, where long acquisition times and high cost are significantly limiting the applicability of magnetic resonance imaging (MRI) as a widespread screening modality. Recent studies have demonstrated the feasibility of synthetic contrast generation. However, current state-of-the-art (SOTA) methods lack sufficient measures for consistent temporal evolution. Neural cellular automata (NCA) offer a robust and lightweight architecture to model evolving patterns between neighboring cells or pixels. In this work we introduce TeNCA (Temporal Neural Cellular Automata), which extends and further refines NCAs to effectively model temporally sparse, non-uniformly sampled imaging data. To achieve this, we advance the training strategy by enabling adaptive loss computation and define the iterative nature of the method to resemble a physical progression in time. This conditions the model to learn a physiologically plausible evolution of contrast enhancement. We rigorously train and test TeNCA on a diverse breast MRI dataset and demonstrate its effectiveness, surpassing the performance of existing methods in generation of images that align with ground truth post-contrast sequences.
Chinese: TeNCA通过模拟生理上合理的对比度演变,实现了快速、无需注射的MRI成像,在生成与真实对比后序列一致的乳腺筛查图像方面超越了现有方法。
English: Synthetic contrast enhancement using TeNCA enables fast, injection-free MRI imaging by modeling physiologically plausible contrast evolution, outperforming current methods in generating accurate post-contrast sequences for breast screening.
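Code sketch: A minimal neural-cellular-automaton step with an explicit time input, gesturing at how an NCA can be rolled out to non-uniformly sampled acquisition times and supervised only at steps where a ground-truth frame exists (the adaptive loss computation the abstract mentions). All architectural details below are illustrative assumptions, not TeNCA itself.

    import torch
    import torch.nn as nn

    class NCAStep(nn.Module):
        def __init__(self, channels=16):
            super().__init__()
            self.perceive = nn.Conv2d(channels, 3 * channels, 3, padding=1)
            self.update = nn.Conv2d(3 * channels + 1, channels, 1)
            nn.init.zeros_(self.update.weight)  # start as near-identity dynamics
            nn.init.zeros_(self.update.bias)

        def forward(self, state, t):
            # state: (B, C, H, W) cell grid; t: (B,) normalized scan time
            tmap = t.view(-1, 1, 1, 1).expand(-1, 1, *state.shape[-2:])
            dx = self.update(torch.cat([self.perceive(state), tmap], dim=1))
            return state + dx  # residual update resembling progression in time

    def rollout(model, state, times):
        # times: list of (B,) tensors; a reconstruction loss can be attached
        # only at steps with an acquired frame (adaptive loss computation).
        outs = []
        for t in times:
            state = model(state, t)
            outs.append(state)
        return outs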
Authors:Manhin Poon, XiangXiang Dai, Xutong Liu, Fang Kong, John C. S. Lui, Jinhang Zuo
Abstract:
Large language models (LLMs) exhibit diverse response behaviors, costs, and strengths, making it challenging to select the most suitable LLM for a given user query. We study the problem of adaptive multi-LLM selection in an online setting, where the learner interacts with users through multi-step query refinement and must choose LLMs sequentially without access to offline datasets or model internals. A key challenge arises from unstructured context evolution: the prompt dynamically changes in response to previous model outputs via a black-box process, which cannot be simulated, modeled, or learned. To address this, we propose the first contextual bandit framework for sequential LLM selection under unstructured prompt dynamics. We formalize a notion of myopic regret and develop a LinUCB-based algorithm that provably achieves sublinear regret without relying on future context prediction. We further introduce budget-aware and positionally-aware (favoring early-stage satisfaction) extensions to accommodate variable query costs and user preferences for early high-quality responses. Our algorithms are theoretically grounded and require no offline fine-tuning or dataset-specific training. Experiments on diverse benchmarks demonstrate that our methods outperform existing LLM routing strategies in both accuracy and cost-efficiency, validating the power of contextual bandits for real-time, adaptive LLM selection.
中文: 本文提出了一种上下文赌博机框架,用于在实时交互中动态选择大语言模型,无需离线数据即可实现次线性遗憾,并在准确性和成本效益上优于现有方法。
English: This paper introduces a contextual bandit framework for dynamically selecting large language models in real-time interactions, achieving sublinear regret without requiring offline data while outperforming existing methods in accuracy and cost-efficiency.
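Code sketch: A minimal LinUCB router in the spirit of the paper's contextual-bandit formulation. The context features (e.g., an embedding of the current prompt) and the scalar quality reward are assumptions for illustration, not the authors' code.

    import numpy as np

    class LinUCBRouter:
        def __init__(self, n_arms, dim, alpha=1.0):
            self.alpha = alpha
            self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm Gram matrix
            self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward sums

        def select(self, x):
            # Pick the LLM with the highest upper confidence bound for context x.
            best, best_score = 0, -np.inf
            for arm, (A, b) in enumerate(zip(self.A, self.b)):
                A_inv = np.linalg.inv(A)
                theta = A_inv @ b
                score = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
                if score > best_score:
                    best, best_score = arm, score
            return best

        def update(self, arm, x, reward):
            # Rank-one update with the observed (context, reward) pair.
            self.A[arm] += np.outer(x, x)
            self.b[arm] += reward * x

At each refinement step, the current (black-box-evolved) prompt is featurized into x, one LLM is queried, and the observed quality updates only that arm; the myopic-regret view compares each choice only against the best arm for the context actually seen, since future contexts cannot be predicted.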
Authors:Rohan Thakker, Adarsh Patnaik, Vince Kurtz, Jonas Frey, Jonathan Becktor, Sangwoo Moon, Rob Royce, Marcel Kaufmann, Georgios Georgakis, Pascal Roth, Joel Burdick, Marco Hutter, Shehryar Khattak
Abstract:
Safe, reliable navigation in extreme, unfamiliar terrain is required for future robotic space exploration missions. Recent generative-AI methods learn semantically aware navigation policies from large, cross-embodiment datasets, but offer limited safety guarantees. Inspired by human cognitive science, we propose a risk-guided diffusion framework that fuses a fast, learned "System-1" with a slow, physics-based "System-2", sharing computation at both training and inference to couple adaptability with formal safety. Hardware experiments conducted at the NASA JPL's Mars-analog facility, Mars Yard, show that our approach reduces failure rates by up to $4\times$ while matching the goal-reaching performance of learning-based robotic models by leveraging inference-time compute without any additional training.
中文:我们提出的风险引导扩散框架融合了快速的“系统1”学习与慢速的“系统2”物理模型,在火星模拟测试中将故障率降低高达四倍,同时保持目标达成能力,显著提升了机器人导航的安全性与适应性。
English: Our risk-guided diffusion framework combines a fast, learned "System-1" with a slow, physics-based "System-2" to enhance robotic navigation safety and adaptability, reducing failure rates by up to four times in Mars-analog tests while maintaining goal-reaching performance.
Authors:Guian Fang, Yuchao Gu, Mike Zheng Shou
Abstract:
Generating controllable character animation from a reference image and motion guidance remains a challenging task due to the inherent difficulty of injecting appearance and motion cues into video diffusion models. Prior works often rely on complex architectures, explicit guider modules, or multi-stage processing pipelines, which increase structural overhead and hinder deployment. Inspired by the strong visual context modeling capacity of pre-trained video diffusion transformers, we propose FramePrompt, a minimalist yet powerful framework that treats reference images, skeleton-guided motion, and target video clips as a unified visual sequence. By reformulating animation as a conditional future prediction task, we bypass the need for guider networks and structural modifications. Experiments demonstrate that our method significantly outperforms representative baselines across various evaluation metrics while also simplifying training. Our findings highlight the effectiveness of sequence-level visual conditioning and demonstrate the potential of pre-trained models for controllable animation without architectural changes.
中文摘要:FramePrompt 是一个极简而强大的框架,通过将参考图像、骨骼引导动作和目标视频统一为视觉序列,利用预训练视频扩散变换器的能力,无需结构修改即可实现可控角色动画,并在实验中显著优于现有方法。
English Summary: FramePrompt is a minimalist framework that unifies reference images, motion guidance, and target videos into a visual sequence, enabling controllable character animation without architectural modifications by leveraging pre-trained video diffusion transformers.
Authors:Junbo Qiao, Miaomiao Cai, Wei Li, Yutong Liu, Xudong Huang, Gaoqi He, Jiao Xie, Jie Hu, Xinghao Chen, Shaohui Lin
Abstract:
Real-World Image Super-Resolution is one of the most challenging tasks in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.
中文摘要:RealSR-R1通过引入VLCoT框架和GRPO优化,结合视觉与语言推理及奖励机制,有效提升了真实世界图像超分辨率的重建质量,生成细节更自然、内容更准确的高分辨率图像。
English Summary: RealSR-R1 introduces the VLCoT framework with GRPO optimization to enhance real-world image super-resolution by integrating vision-language reasoning and reward mechanisms for generating high-fidelity, natural-looking images.
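Code sketch: One way to picture how the four VLCoT-GRPO reward terms combine into a single scalar for policy optimization. The format check and the fixed weights below are hypothetical; in the paper the generation reward comes from a visual expert model, and the degradation and understanding scores come from their own task-specific scorers.

    import re

    def format_reward(text):
        # Hypothetical format check: require an explicit reasoning span
        # before the answer (the paper's exact CoT format may differ).
        return 1.0 if re.search(r"<think>.+</think>", text, re.S) else 0.0

    def vlcot_reward(text, r_degradation, r_understanding, r_generation,
                     weights=(0.1, 0.3, 0.3, 0.3)):
        # The three precomputed scores are assumed to lie in [0, 1].
        w_fmt, w_deg, w_und, w_gen = weights
        return (w_fmt * format_reward(text) + w_deg * r_degradation +
                w_und * r_understanding + w_gen * r_generation)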
Authors:Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang
Abstract:
Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations -- a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA's extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.
中文摘要:ControlVLA通过零初始化投影层将预训练视觉语言动作模型与物体中心表征相连接,仅需10-20次演示即可在现实操作任务中实现76.7%的成功率,显著优于传统方法。
English summary: ControlVLA is a novel framework that bridges pre-trained vision-language-action models with object-centric representations through zero-initialized projection layers, achieving 76.7% success rate in real-world manipulation tasks with only 10-20 demonstrations.
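Code sketch: A minimal version of the ControlNet-style zero-initialized projection the abstract describes: at step 0 the module is a no-op, so the pre-trained policy is untouched, and fine-tuning gradually "opens the gate" to the object-centric condition. Tensor shapes and the fusion site are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ZeroInitProjection(nn.Module):
        def __init__(self, cond_dim, hidden_dim):
            super().__init__()
            self.proj = nn.Linear(cond_dim, hidden_dim)
            nn.init.zeros_(self.proj.weight)
            nn.init.zeros_(self.proj.bias)

        def forward(self, hidden, cond):
            # hidden: (B, T, hidden_dim) backbone features of the VLA policy
            # cond:   (B, T, cond_dim) object-centric tokens (e.g., masks)
            return hidden + self.proj(cond)  # exactly `hidden` at init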
Authors:Jiancheng Ruan, Tingyang Chen, Renchi Yang, Xiangyu Ke, Yunjun Gao
Abstract:
Approximate Nearest Neighbor Search (ANNS) in high-dimensional spaces finds extensive applications in databases, information retrieval, recommender systems, etc. While graph-based methods have emerged as the leading solution for ANNS due to their superior query performance, they still face several challenges, such as struggling with local optima and redundant computations. These issues arise because existing methods (i) fail to fully exploit the topological information underlying the proximity graph G, and (ii) suffer from severe distribution mismatches between the base data and queries in practice.
To this end, this paper proposes GATE, a high-tier proximity Graph with Adaptive Topology and Query AwarEness, as a lightweight and adaptive module atop graph-based indexes to accelerate ANNS. Specifically, GATE formulates the critical problem of identifying an optimal entry point in the proximity graph for a given query, facilitating faster online search. By leveraging the inherent clusterability of high-dimensional data, GATE first extracts a small set of hub nodes V as candidate entry points. Then, resorting to a contrastive learning-based two-tower model, GATE encodes both the structural semantics underlying G and the query-relevant features into the latent representations of these hub nodes V. A navigation graph index on V is further constructed to minimize the model inference overhead. Extensive experiments demonstrate that GATE achieves a 1.2-2.0X speed-up in query performance compared to state-of-the-art graph-based indexes.
Chinese: 本文提出GATE轻量模块,通过对比学习模型自适应选择最优入口点来增强基于图的近似最近邻搜索,实现了高达2.0倍的查询性能提升。
English: This paper introduces GATE, a lightweight module that enhances graph-based approximate nearest neighbor search by adaptively selecting optimal entry points using a contrastive learning model, achieving up to 2.0x faster query performance.
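Code sketch: The query-aware entry selection reduces, at inference time, to scoring a small set of hub nodes against the query and starting the graph search from the best one. The towers and the downstream search below are placeholders for GATE's trained components and navigation index.

    import numpy as np

    def pick_entry(query_vec, hub_vecs, hub_ids):
        # hub_vecs: (m, d) latent representations of hub nodes, which in
        # GATE encode both graph structure and query-relevant features.
        scores = hub_vecs @ query_vec
        return hub_ids[int(np.argmax(scores))]

    # The usual greedy best-first search on the proximity graph then starts
    # from pick_entry(q, H, ids) instead of a fixed or random entry point.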
Authors:Yi Liu, Hongji Zhang, Yunhao Zhou, Zhengyuan Shi, Changran Xu, Qiang Xu
Abstract:
The integration of large language models (LLMs) into electronic design automation (EDA) has significantly advanced the field, offering transformative benefits, particularly in register transfer level (RTL) code generation and understanding. While previous studies have demonstrated the efficacy of fine-tuning LLMs for these generation-based tasks, embedding-based tasks, which are equally critical to EDA workflows, have been largely overlooked. These tasks, including natural language code search, RTL code functionality equivalence checking, and performance prediction, are essential for accelerating and optimizing the hardware design process. To address this gap, we present DeepRTL2, a family of versatile LLMs that unifies both generation- and embedding-based tasks related to RTL. By simultaneously tackling a broad range of tasks, DeepRTL2 represents the first model to provide a comprehensive solution to the diverse challenges in EDA. Through extensive experiments, we show that DeepRTL2 achieves state-of-the-art performance across all evaluated tasks.
中文: DeepRTL2推出了首个统一处理生成与嵌入任务的大语言模型系列,在电子设计自动化领域全面应对各类RTL相关挑战,并在所有评估任务中实现了最优性能。
English: DeepRTL2 introduces a unified family of large language models that address both generation- and embedding-based tasks in electronic design automation, achieving state-of-the-art performance across diverse RTL-related challenges.
Authors:Adrian Poniatowski, Natalie Gentner, Manuel Barusco, Davide Dalle Pezze, Samuele Salti, Gian Antonio Susto
Abstract:
In the semiconductor sector, where demand is high and competition is strong and increasing, time to market and quality are key factors in securing significant market share in various application areas. Thanks to the success of deep learning methods in the computer vision domain in recent years, Industry 4.0 and 5.0 applications, such as defect classification, have achieved remarkable success. In particular, Domain Adaptation (DA) has proven highly effective, since it focuses on using the knowledge learned on a (source) domain to adapt and perform effectively on a different but related (target) domain. By improving robustness and scalability, DA minimizes the need for extensive manual re-labeling or re-training of models. This not only reduces computational and resource costs but also allows human experts to focus on high-value tasks. Therefore, we tested the efficacy of DA techniques in semi-supervised and unsupervised settings within the context of the semiconductor field. Moreover, we propose the DBACS approach, a CycleGAN-inspired model enhanced with additional loss terms to improve performance. All the approaches are studied and validated on real-world Electron Microscope images in both unsupervised and semi-supervised settings, proving the usefulness of our method in advancing DA techniques for the semiconductor field.
Chinese: 在半导体领域,领域自适应技术通过利用深度学习减少人工标注和训练需求,提升了缺陷分类的效率和可扩展性。
English: In the semiconductor industry, domain adaptation techniques enhance defect classification by leveraging deep learning to reduce manual labeling and training costs, improving efficiency and scalability.
Authors:Zhongzheng Qiao, Chenghao Liu, Yiming Zhang, Ming Jin, Quang Pham, Qingsong Wen, P. N. Suganthan, Xudong Jiang, Savitha Ramasamy
Abstract:
Time series foundation models (TSFMs) demonstrate impressive zero-shot performance for time series forecasting. However, an important yet underexplored challenge is how to effectively finetune TSFMs on specific downstream tasks. While naive finetuning can yield performance gains, we argue that it falls short of fully leveraging TSFMs' capabilities, often resulting in overfitting and suboptimal performance. Given the diverse temporal patterns across sampling scales and the inherent multi-scale forecasting capabilities of TSFMs, we adopt a causal perspective to analyze the finetuning process, through which we highlight the critical importance of explicitly modeling multiple scales and reveal the shortcomings of naive approaches. Focusing on encoder-based TSFMs, we propose Multiscale Finetuning (MSFT), a simple yet general framework that explicitly integrates multi-scale modeling into the finetuning process. Experimental results on three different backbones (Moirai, MOMENT, and UniTS) demonstrate that TSFMs finetuned with MSFT not only outperform naive and typical parameter-efficient finetuning methods but also surpass state-of-the-art deep learning methods.
中文: 时间序列基础模型在零样本预测中表现优异,但直接微调易导致过拟合,因此提出的多尺度微调框架MSFT能显著提升模型性能,并在多种骨干网络上超越现有最优方法。
English: Time series foundation models show strong zero-shot forecasting abilities, but naive fine-tuning often leads to overfitting, prompting the development of MSFT, a multi-scale fine-tuning framework that significantly enhances performance across various models and outperforms existing methods.
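Code sketch: A minimal multi-scale finetuning objective: present the same series at several sampling scales and average the per-scale losses, rather than finetuning on the raw scale alone. The pooling operator, uniform weighting, and the model's (series, horizon) interface are assumptions; MSFT's actual scale handling is more involved.

    import torch
    import torch.nn.functional as F

    def multiscale_loss(model, series, target, scales=(1, 2, 4)):
        # series: (B, L) input window; target: (B, H) forecast horizon
        total = 0.0
        for s in scales:
            xs = F.avg_pool1d(series.unsqueeze(1), kernel_size=s).squeeze(1)
            ys = F.avg_pool1d(target.unsqueeze(1), kernel_size=s).squeeze(1)
            pred = model(xs, horizon=ys.shape[-1])  # assumed model interface
            total = total + F.mse_loss(pred, ys)
        return total / len(scales)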
Authors:Yue Xia, Christoph Hofmeister, Maximilian Egger, Rawad Bitar
Abstract:
Federated learning (FL) shows great promise in large-scale machine learning but introduces new privacy and security challenges. We propose ByITFL and LoByITFL, two novel FL schemes that enhance resilience against Byzantine users while keeping the users' data private from eavesdroppers. To ensure privacy and Byzantine resilience, our schemes build on having a small representative dataset available to the federator and crafting a discriminator function allowing the mitigation of corrupt users' contributions. ByITFL employs Lagrange coded computing and re-randomization, making it the first Byzantine-resilient FL scheme with perfect Information-Theoretic (IT) privacy, though at the cost of a significant communication overhead. LoByITFL, on the other hand, achieves Byzantine resilience and IT privacy at a significantly reduced communication cost, but requires a Trusted Third Party, used only in a one-time initialization phase before training. We provide theoretical guarantees on privacy and Byzantine resilience, along with convergence guarantees and experimental results validating our findings.
中文摘要:作者提出了ByITFL和LoByITFL两种联邦学习方案,能够同时实现拜占庭容错和信息论隐私保护,其中ByITFL以较高通信开销为代价实现完美隐私,而LoByITFL通过可信第三方初始化显著降低了通信成本。
English Summary: The authors introduce ByITFL and LoByITFL, two federated learning schemes that provide Byzantine resilience and information-theoretic privacy, with ByITFL achieving perfect privacy at high communication cost while LoByITFL reduces communication overhead but requires a trusted third party for initialization.
Authors:Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
Abstract:
Although Foundation Models (FMs), such as GPT-4, are increasingly used in domains like finance and software engineering, reliance on textual interfaces limits these models' real-world interaction. To address this, FM providers introduced tool calling, triggering a proliferation of frameworks with distinct tool interfaces. In late 2024, Anthropic introduced the Model Context Protocol (MCP) to standardize this tool ecosystem, which has become the de facto standard with over eight million weekly SDK downloads. Despite its adoption, MCP's AI-driven, non-deterministic control flow introduces new risks to sustainability, security, and maintainability, warranting closer examination.
Towards this end, we present the first large-scale empirical study of MCP servers. Using state-of-the-art health metrics and a hybrid analysis pipeline, combining a general-purpose static analysis tool with an MCP-specific scanner, we evaluate 1,899 open-source MCP servers to assess their health, security, and maintainability. Despite MCP servers demonstrating strong health metrics, we identify eight distinct vulnerabilities - only three overlapping with traditional software vulnerabilities. Additionally, 7.2% of servers contain general vulnerabilities and 5.5% exhibit MCP-specific tool poisoning. Regarding maintainability, while 66% exhibit code smells, 14.4% contain nine bug patterns overlapping with traditional open-source software projects. These findings highlight the need for MCP-specific vulnerability detection techniques while reaffirming the value of traditional analysis and refactoring practices.
中文摘要:基础模型对文本界面的依赖催生了模型上下文协议(MCP)的标准化,但首项针对1,899个MCP服务器的大规模研究表明,其人工智能驱动的控制流在保持较高健康度的同时,既存在传统代码缺陷,更暴露出仅三成与传统漏洞重叠的八类新型安全风险。
English Summary: Foundation Models' reliance on textual interfaces led to the Model Context Protocol (MCP) standardization, but its AI-driven control flow introduces unique vulnerabilities, as revealed by the first large-scale study of 1,899 MCP servers showing distinct security risks alongside traditional code issues.
Authors:Hu Yu, Hao Luo, Fan Wang, Feng Zhao
Abstract:
Diffusion probabilistic models (DPMs) have achieved impressive success in visual generation. However, they suffer from slow inference speed due to iterative sampling. Employing fewer sampling steps is an intuitive solution, but this also introduces discretization error. Existing fast samplers make inspiring efforts to reduce discretization error through the adoption of high-order solvers, potentially reaching a plateau in terms of optimization. This raises the question: can the sampling process be accelerated further? In this paper, we re-examine the nature of sampling errors, discerning that they comprise two distinct elements: the widely recognized discretization error and the less explored approximation error. Our research elucidates the dynamics between these errors and the sampling step by implementing a dual-error disentanglement strategy. Building on these foundations, we introduce a unified and training-free acceleration framework, DualFast, designed to enhance the speed of DPM sampling by concurrently accounting for both error types, thereby minimizing the total sampling error. DualFast is seamlessly compatible with existing samplers and significantly boosts their sampling quality and speed, particularly with extremely few sampling steps. We substantiate the effectiveness of our framework through comprehensive experiments, spanning both unconditional and conditional sampling domains, across both pixel-space and latent-space DPMs.
Chinese Summary: 扩散概率模型(DPMs)因迭代采样导致推理速度慢,本文提出无需训练的统一加速框架DualFast,通过同时处理离散化误差和近似误差来最小化总采样误差,显著提升现有采样器在极少数采样步骤下的性能与速度。
English Summary: Diffusion probabilistic models (DPMs) face slow inference due to iterative sampling, and this paper introduces DualFast, a training-free framework that accelerates DPMs by addressing both discretization and approximation errors to minimize total sampling error while maintaining compatibility with existing samplers.
Authors:Meiling Tao, Chenghao Zhu, Dongyi Ding, Tiannan Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Abstract:
With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model's ability to generate responses tailored to explicit personas. PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty in distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks could fall short on the hard tier of PersonaFeedback where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be publicly available to facilitate future research on LLM personalization.
中文: PersonaFeedback基准通过直接评估模型根据明确用户角色生成个性化回复的能力,解决了LLM个性化领域缺乏评估标准的问题,研究表明即使最先进的模型在需要区分细微差异的复杂个性化任务中仍表现不足。
English: The PersonaFeedback benchmark addresses the lack of evaluation standards for LLM personalization by directly testing models' ability to generate responses tailored to explicit user personas, revealing that even top-performing models struggle with complex personalization tasks where nuanced distinctions are required.
Authors:Shota Horiguchi, Takanori Ashihara, Marc Delcroix, Atsushi Ando, Naohiro Tawara
Abstract:
Obtaining high-quality speaker embeddings in multi-speaker conditions is crucial for many applications. A recently proposed guided speaker embedding framework, which utilizes speech activities of target and non-target speakers as clues, drastically improved embeddings under severe overlap, at the cost of a small degradation in low-overlap cases. However, since extreme overlaps are rare in natural conversations, this degradation cannot be overlooked. This paper first reveals that the degradation is caused by the global-statistics-based modules, widely used in speaker embedding extractors, being overly sensitive to intervals containing only non-target speakers. As a countermeasure, we propose an extension of such modules that exploits the target speaker activity clues to compute statistics from intervals where the target is active. The proposed method improves speaker verification performance at both low and high overlap ratios, and diarization performance on multiple datasets.
中文摘要:研究发现基于全局统计的模块在低重叠场景下因对非目标说话人片段过度敏感而导致性能下降,提出利用目标说话人活动线索仅在其活跃区间计算统计量的方法,从而在各种重叠条件下提升了说话人验证和语音分离的性能。
English Summary: The study identifies that global-statistics-based modules in speaker embedding extractors cause performance degradation in low-overlap scenarios by being overly sensitive to non-target speaker intervals, and proposes a method using target speaker activity clues to compute statistics only during active target speech, improving verification and diarization performance across overlap conditions.
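Code sketch: The proposed countermeasure amounts to masked statistics pooling: compute the pooled mean and standard deviation only over frames where the target speaker is active, so intervals containing only non-target speakers cannot skew the statistics. Tensor layout and the epsilon are illustrative assumptions.

    import torch

    def masked_stats_pooling(frames, target_active, eps=1e-6):
        # frames:        (B, T, D) frame-level features
        # target_active: (B, T) binary activity clue for the target speaker
        w = target_active.unsqueeze(-1).float()
        denom = w.sum(dim=1).clamp_min(eps)       # frames where target talks
        mean = (frames * w).sum(dim=1) / denom
        var = ((frames - mean.unsqueeze(1)) ** 2 * w).sum(dim=1) / denom
        return torch.cat([mean, (var + eps).sqrt()], dim=-1)  # (B, 2D)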
Authors:Xuchuang Wang, Maoli Liu, Xutong Liu, Zhuohua Li, Mohammad Hajiesmaili, John C. S. Lui, Don Towsley
Abstract:
Quantum networks (QNs) transmit delicate quantum information across noisy quantum channels. Crucial applications, like quantum key distribution (QKD) and distributed quantum computation (DQC), rely on efficient quantum information transmission. Learning the best path between a pair of end nodes in a QN is key to enhancing such applications. This paper addresses learning the best path in a QN in the online learning setting. We explore two types of feedback: "link-level" and "path-level". Link-level feedback pertains to QNs with advanced quantum switches that enable link-level benchmarking. Path-level feedback, on the other hand, is associated with basic quantum switches that permit only path-level benchmarking. We introduce two online learning algorithms, BeQuP-Link and BeQuP-Path, to identify the best path using link-level and path-level feedback, respectively. To learn the best path, BeQuP-Link benchmarks the critical links dynamically, while BeQuP-Path relies on a subroutine, transferring path-level observations to estimate link-level parameters in a batch manner. We analyze the quantum resource complexity of these algorithms and demonstrate that both can efficiently and, with high probability, determine the best path. Finally, we perform NetSquid-based simulations and validate that both algorithms accurately and efficiently identify the best path.
Chinese: 本文提出了BeQuP-Link和BeQuP-Path两种在线学习算法,分别利用链路级和路径级反馈来高效确定量子网络中的最优路径,从而提升量子密钥分发和分布式量子计算等关键应用的性能。
English: This paper introduces two online learning algorithms, BeQuP-Link and BeQuP-Path, which efficiently identify the best path in quantum networks using link-level and path-level feedback respectively, enhancing applications like quantum key distribution and distributed quantum computation.
Authors:Jiayu Yao, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Yuyao Ge, Zhecheng Li, Xueqi Cheng
Abstract:
Multimodal Retrieval-Augmented Generation (RAG) systems have become essential in knowledge-intensive and open-domain tasks. As retrieval complexity increases, ensuring the robustness of these systems is critical. However, current RAG models are highly sensitive to the order in which evidence is presented, often resulting in unstable performance and biased reasoning, particularly as the number of retrieved items or modality diversity grows. This raises a central question: How does the position of retrieved evidence affect multimodal RAG performance? To answer this, we present the first comprehensive study of position bias in multimodal RAG systems. Through controlled experiments across text-only, image-only, and mixed-modality tasks, we observe a consistent U-shaped accuracy curve with respect to evidence position. To quantify this bias, we introduce the Position Sensitivity Index ($PSI_p$) and develop a visualization framework to trace attention allocation patterns across decoder layers. Our results reveal that multimodal interactions intensify position bias compared to unimodal settings, and that this bias increases logarithmically with retrieval range. These findings offer both theoretical and empirical foundations for position-aware analysis in RAG, highlighting the need for evidence reordering or debiasing strategies to build more reliable and equitable generation systems.
中文摘要:本研究首次系统揭示多模态RAG系统存在显著位置偏见,表现为证据排序导致的U型准确率曲线,且多模态交互会加剧该偏见,亟需通过证据重排或去偏策略提升系统可靠性。
English Summary: This study reveals that multimodal RAG systems exhibit significant position bias where performance follows a U-shaped accuracy curve based on evidence order, intensifying with modality interactions and expanding retrieval scope.
Authors:Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer
Abstract:
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect label), 26.0% (1-shot RL), and 27.1% (majority voting) -- nearly matching the 29.1% gained with ground truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families like Llama3 or OLMo2. In particular, we find code reasoning -- thinking in code without actual code execution -- to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, from 65% to over 90%, even with spurious rewards. Overall, we hypothesize that, given the lack of useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work. We suggest that future RLVR research should possibly be validated on diverse models rather than a single de facto choice, as we show that it is easy to get significant performance gains on Qwen models even with completely spurious reward signals.
中文: 研究表明,即使使用与正确答案无关的虚假奖励,可验证奖励的强化学习(RLVR)也能显著提升如Qwen2.5-Math-7B等模型的数学推理能力,但其效果因模型架构而异,且可能依赖于激发预训练中的潜在推理能力。
English: Reinforcement learning with verifiable rewards (RLVR) can significantly enhance mathematical reasoning in models like Qwen2.5-Math-7B even with spurious rewards, though its effectiveness varies across model families and may rely on surfacing pretrained reasoning representations.
Authors:Masahiro Yasuda, Binh Thien Nguyen, Noboru Harada, Romain Serizel, Mayank Mishra, Marc Delcroix, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Yasunori Ohishi, Tomohiro Nakatani, Takao Kawamura, Nobutaka Ono
Abstract:
Spatial Semantic Segmentation of Sound Scenes (S5) aims to enhance technologies for sound event detection and separation from multi-channel input signals that mix multiple sound events with spatial information. This is a fundamental basis of immersive communication. The ultimate goal is to separate sound event signals with 6 Degrees of Freedom (6DoF) information into dry sound object signals and metadata about the object type (sound event class) and spatial information, including direction. However, because several existing challenge tasks already provide some of these component functions, this year's task focuses on detecting and separating sound events from multi-channel spatial input signals. This paper outlines the S5 task setting of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge Task 4 and the DCASE2025 Task 4 Dataset, newly recorded and curated for this task. We also report experimental results for an S5 system trained and evaluated on this dataset. The full version of this paper will be published after the challenge results are made public.
中文: DCASE 2025挑战赛任务4的S5任务旨在通过新录制的数据集和系统实验,从多通道空间信号中检测并分离声音事件。
English: The S5 task for DCASE 2025 Challenge Task 4 focuses on detecting and separating sound events from multi-channel spatial signals, utilizing a newly recorded dataset and experimental system evaluations.
Authors:Tomoya Nishida, Noboru Harada, Daisuke Niizumi, Davide Albertini, Roberto Sannino, Simone Pradolini, Filippo Augusti, Keisuke Imoto, Kota Dohi, Harsh Purohit, Takashi Endo, Yohei Kawaguchi
Abstract:
This paper introduces the task description for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge Task 2, titled "First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring." Building on the DCASE 2024 Challenge Task 2, this task is structured as a first-shot problem within a domain generalization framework. The primary objective of the first-shot approach is to facilitate the rapid deployment of ASD systems for new machine types without requiring machine-specific hyperparameter tunings. For DCASE 2025 Challenge Task 2, sounds from previously unseen machine types have been collected and provided as the evaluation dataset. Results and analysis of the challenge submissions will be added following the challenge's submission deadline.
中文:DCASE 2025挑战赛任务2致力于首次样本(first-shot)无监督异常声音检测,旨在无需针对特定机器的超参数调优即可为新机型快速部署监测系统;评估数据集包含此前未见过的机器类型,挑战结果与分析将在提交截止后公布。
English: The DCASE 2025 Challenge Task 2 addresses first-shot unsupervised anomalous sound detection for machine condition monitoring, enabling rapid deployment to new machine types without machine-specific hyperparameter tuning; the evaluation dataset covers previously unseen machine types, with results and analysis to be added after the submission deadline.
Authors:Tomoya Nishida, Noboru Harada, Daisuke Niizumi, Davide Albertini, Roberto Sannino, Simone Pradolini, Filippo Augusti, Keisuke Imoto, Kota Dohi, Harsh Purohit, Takashi Endo, Yohei Kawaguchi
Abstract:
This paper introduces the task description for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge Task 2, titled "First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring". Building on the DCASE 2024 Challenge Task 2, this task is structured as a first-shot problem within a domain generalization framework. The primary objective of the first-shot approach is to facilitate the rapid deployment of ASD systems for new machine types without requiring machine-specific hyperparameter tunings. For DCASE 2025 Challenge Task 2, sounds from previously unseen machine types have been collected and provided as the evaluation dataset. We received 119 submissions from 35 teams, and an analysis of these submissions has been made in this paper. Analysis showed that various approaches can all be competitive, such as fine-tuning pre-trained models, using frozen pre-trained models, and training small models from scratch, when combined with appropriate cost functions, anomaly score normalization, and use of clean machine and noise sounds.
中文:DCASE 2025挑战赛任务2致力于首次样本无监督异常声音检测,无需超参数调优即可快速部署新型机器监测系统,对119份参赛方案的分析表明,结合适当技术时多种方法均具竞争力。
English: The DCASE 2025 Challenge Task 2 focuses on first-shot unsupervised anomalous sound detection to enable rapid deployment for new machine types without hyperparameter tuning, with analysis of 119 submissions revealing competitive performance across diverse methods when combined with proper techniques.
Authors:Maximilian Egger, Rawad Bitar
Abstract:
Ensuring resilience to Byzantine clients while maintaining the privacy of the clients' data is a fundamental challenge in federated learning (FL). When the clients' data is homogeneous, suitable countermeasures were studied from an information-theoretic perspective utilizing secure aggregation techniques while ensuring robust aggregation of the clients' gradients. However, the countermeasures used fail when the clients' data is heterogeneous. Suitable pre-processing techniques, such as nearest neighbor mixing, were recently shown to enhance the performance of those countermeasures in the heterogeneous setting. Nevertheless, those pre-processing techniques cannot be applied with the introduced privacy-preserving mechanisms.
We propose a multi-stage method encompassing a careful co-design of verifiable secret sharing, secure aggregation, and a tailored symmetric private information retrieval scheme to achieve information-theoretic privacy guarantees and Byzantine resilience under data heterogeneity. We evaluate the effectiveness of our scheme on a variety of attacks and show how it outperforms the previously known techniques. Since the communication overhead of secure aggregation is non-negligible, we investigate the interplay with zero-order estimation methods that reduce the communication cost in state-of-the-art FL tasks and thereby make private aggregation scalable.
中文: 本研究提出一种多阶段方法,结合可验证秘密共享、安全聚合与定制化对称私有信息检索方案,在联邦学习的异构数据场景中同时实现拜占庭容错与信息论隐私保护,其性能优于现有技术,并通过零阶估计方法优化通信效率。
English: This study introduces a multi-stage approach combining verifiable secret sharing, secure aggregation, and a customized symmetric private information retrieval scheme to achieve both Byzantine resilience and information-theoretic privacy in federated learning with heterogeneous data, outperforming prior methods while addressing communication efficiency through zero-order estimation techniques.
Authors:Xinyi Gao, Qiucheng Wu, Yang Zhang, Xuechen Liu, Kaizhi Qian, Ying Xu, Shiyu Chang
Abstract:
Knowledge tracing (KT) aims to estimate a student's evolving knowledge state and predict their performance on new exercises based on performance history. Realistic classroom settings for KT are typically low-resource in data and require online updates as students' exercise histories grow, which creates significant challenges for existing KT approaches. To restore strong performance under low-resource conditions, we revisit the hierarchical knowledge concept (KC) information, which is typically available in many classroom settings and can provide a strong prior when data are sparse. We therefore propose Knowledge-Tree-based Knowledge Tracing (KT$^2$), a probabilistic KT framework that models student understanding over a tree-structured hierarchy of knowledge concepts using a Hidden Markov Tree Model. KT$^2$ estimates student mastery via an EM algorithm and supports personalized prediction through an incremental update mechanism as new responses arrive. Our experiments show that KT$^2$ consistently outperforms strong baselines in realistic online, low-resource settings.
Chinese Summary: 基于知识树的知识追踪(KT$^2$)是一个概率框架,利用层次化知识概念在低资源教育场景中提升学生表现预测能力,通过基于EM的掌握度估计和增量更新机制持续优于现有方法。
English Summary: Knowledge-Tree-based Knowledge Tracing (KT$^2$) is a probabilistic framework that uses hierarchical knowledge concepts to enhance student performance prediction in low-resource educational settings, consistently outperforming existing methods through its EM-based mastery estimation and incremental update mechanism.
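Code sketch: A single-concept analogue of KT$^2$'s incremental update, written as the classic Bayesian-knowledge-tracing recursion: a hidden Bernoulli mastery state with a learning transition, updated as each response arrives. This is a swapped-in simplification; KT$^2$ itself runs EM over a tree of concepts (a Hidden Markov Tree), and the parameters below are illustrative.

    def update_mastery(p_mastery, correct,
                       p_guess=0.2, p_slip=0.1, p_learn=0.15):
        # Posterior over mastery given the observed response (Bayes rule).
        if correct:
            lik_m, lik_u = 1 - p_slip, p_guess
        else:
            lik_m, lik_u = p_slip, 1 - p_guess
        post = lik_m * p_mastery / (lik_m * p_mastery + lik_u * (1 - p_mastery))
        # Learning transition: an unmastered concept may become mastered.
        return post + (1 - post) * p_learn

    p = 0.3                   # prior mastery of one knowledge concept
    for obs in [1, 1, 0, 1]:  # incremental updates as responses arrive
        p = update_mastery(p, obs == 1)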
Authors:Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
Abstract:
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
中文摘要:VIKI-Bench是首个面向具身多智能体协作的分层基准测试平台,而VIKI-R作为两阶段框架,通过结合微调的视觉语言模型与强化学习,在异构智能体间实现了组合式协作模式,显著优于现有基线方法。
English Summary: VIKI-Bench is the first hierarchical benchmark for embodied multi-agent cooperation, while VIKI-R is a two-stage framework that significantly outperforms baselines by combining fine-tuned vision-language models with reinforcement learning to enable compositional cooperation among heterogeneous agents.
Authors:Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, Siyuan Huang
Abstract:
Humanoid teleoperation plays a vital role in demonstrating and collecting data for complex humanoid-scene interactions. However, current teleoperation systems face critical limitations: they decouple upper- and lower-body control to maintain stability, restricting natural coordination, and operate open-loop without real-time position feedback, leading to accumulated drift. The fundamental challenge is achieving precise, coordinated whole-body teleoperation over extended durations while maintaining accurate global positioning. Here we show that an MoE-based teleoperation system, CLONE, with closed-loop error correction enables unprecedented whole-body teleoperation fidelity, maintaining minimal positional drift over long-range trajectories using only head and hand tracking from an MR headset. Unlike previous methods that either sacrifice coordination for stability or suffer from unbounded drift, CLONE learns diverse motion skills while preventing tracking error accumulation through real-time feedback, enabling complex coordinated movements such as ``picking up objects from the ground.'' These results establish a new milestone for whole-body humanoid teleoperation for long-horizon humanoid-scene interaction tasks.
中文: CLONE系统通过闭环误差校正实现了前所未有的全身仿人遥操作,仅利用MR头显追踪即可完成复杂协调动作且保持极低的位置漂移。
English: The CLONE system enables unprecedented whole-body humanoid teleoperation with minimal drift through closed-loop error correction, allowing complex coordinated movements using only MR headset tracking.
Authors:Haoting Wang, Jianling Wang, Hao Li, Fangjun Yi, Mengyu Fu, Youwei Zhang, Yifan Liu, Liang Liu, Minmin Chen, Ed H. Chi, Lichan Hong, Haokai Lu
Abstract:
Conventional recommendation systems succeed in identifying relevant content but often fail to provide users with surprising or novel items. Multimodal Large Language Models (MLLMs) possess the world knowledge and multimodal understanding needed for serendipity, but their integration into billion-item-scale platforms presents significant challenges. In this paper, we propose a novel hierarchical framework where fine-tuned MLLMs provide high-level guidance to conventional recommendation models, steering them towards more serendipitous suggestions. This approach leverages MLLM strengths in understanding multimodal content and user interests while retaining the efficiency of traditional models for item-level recommendation. This mitigates the complexity of applying MLLMs directly to vast action spaces. We also demonstrate a chain-of-thought strategy enabling MLLMs to discover novel user interests by first understanding video content and then identifying relevant yet unexplored interest clusters. Through live experiments within a commercial short-form video platform serving billions of users, we show that our MLLM-powered approach significantly improves both recommendation serendipity and user satisfaction.
中文摘要:本文提出一种分层框架,通过微调的多模态大语言模型为传统推荐系统提供高层指导,在保持传统模型效率的同时利用MLLM的多模态理解能力,显著提升了推荐意外性和用户满意度。
English Summary: This paper introduces a hierarchical framework that integrates fine-tuned Multimodal Large Language Models with conventional recommendation systems to enhance serendipity and user satisfaction by leveraging MLLMs' multimodal understanding while maintaining traditional models' efficiency.
Authors:Tim Vieira, Tianyu Liu, Clemente Pasti, Yahya Emara, Brian DuSell, Benjamin LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Timothy J. O'Donnell, Ryan Cotterell
Abstract:
Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of $\it{noncanonical}$ token encodings of each character string -- these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
中文: 现代语言模型使用确定性分词器时,会将概率错误分配给大量训练数据中从未出现的非规范标记串,本研究提出了两种强制规范性的方法,有效提升了模型在未见过数据上的表现。
English: Modern language models using deterministic tokenizers incorrectly allocate probability to numerous noncanonical token strings that never appear in training data, and this work proposes two methods to enforce canonicality, improving model performance on held-out data.
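Code sketch: The core observation can be checked by round-tripping through the deterministic tokenizer: a token string is canonical iff re-encoding its decoded text reproduces it, and "canonicality by conditioning" can then renormalize over only canonical continuations at inference time. The encode/decode interface below follows the common tokenizer convention and is an assumption; note also that exact prefix checking needs care at token boundaries, since a canonical full string can pass through prefixes that end mid-merge.

    def is_canonical(token_ids, tokenizer):
        # True iff token_ids is exactly what the deterministic tokenizer
        # emits for its own decoded text.
        return tokenizer.encode(tokenizer.decode(token_ids)) == list(token_ids)

    def prune_noncanonical(prefix_ids, candidates, tokenizer):
        # Keep only next tokens whose extension of the prefix stays canonical.
        return [t for t in candidates
                if is_canonical(list(prefix_ids) + [t], tokenizer)]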
Authors:Alexander Bakhtin, Matteo Esposito, Valentina Lenarduzzi, Davide Taibi
Abstract:
Over the past decade, the wide adoption of Microservice Architecture has required the identification of various patterns and anti-patterns to prevent Microservice Architectural Degradation. Frequently, the systems are modelled as a network of connected services. Recently, the study of temporal networks has emerged as a way to describe and analyze evolving networks. Previous research has explored how software metrics such as size, complexity, and quality are related to microservice centrality in the architectural network. This study investigates whether temporal centrality metrics can provide insight into the early detection of architectural degradation through their correlation with, or effect on, software metrics. We reconstructed the architecture of 7 releases of an OSS microservice project with 42 services. For every service in every release, we computed the software and centrality metrics. From one of the latter, we derived a new metric, Centrality Change Proneness. We then explored the correlation between the metrics. We identified 7 size and 5 complexity metrics that have a consistent correlation with centrality, while Centrality Change Proneness did not affect the software metrics, thus providing yet another perspective and an early indicator of microservice architectural degradation.
中文摘要:本研究通过分析一个开源项目七个版本中软件指标与中心性指标的相关性,探讨了时序中心性指标如何作为微服务架构退化的早期预警信号,发现其与规模和复杂度指标存在稳定关联。
English Summary: This study explores how temporal centrality metrics can serve as early indicators of microservice architectural degradation by analyzing correlations with software metrics across seven releases of an open-source project, identifying consistent relationships with size and complexity metrics.
Authors:Alexander Bakhtin, Matteo Esposito, Valentina Lenarduzzi, Davide Taibi
Abstract:
Context: Microservice Architecture is a popular architectural paradigm that facilitates flexibility by decomposing applications into small, independently deployable services. Catalogs of architectural anti-patterns have been proposed to highlight the negative aspects of flawed microservice design. In particular, the Hub-like anti-pattern lacks an unambiguous definition and detection method. Aim: In this work, we aim to find a robust detection approach for the Hub-like microservice anti-pattern that outputs a reasonable number of Hub-like candidates with high precision. Method: We leveraged a dataset of 25 microservice networks and several network hub detection techniques to identify the Hub-like anti-pattern, namely scale-free property, centrality metrics and clustering coefficient, minimum description length principle, and the approach behind the Arcan tool. Results and Conclusion: Our findings revealed that the studied architectural networks are not scale-free, that most considered hub detection approaches do not agree on the detected hubs, and that the method by Kirkley leveraging the Erdos-Renyi encoding is the most accurate one in terms of the number of detected hubs and the detection precision. Investigating further the applicability of these methods to detecting Hub-like components in microservice-based and other systems opens up new research directions. Moreover, our results provide an evaluation of the approach utilized by the widely used Arcan tool and highlight the potential to update the tool to use the normalized degree centrality of a component in the network, or for the approach based on ER encoding to be adopted instead.
中文摘要:本研究确定了检测微服务架构中Hub-like反模式的最准确方法,发现Kirkley基于ER编码的方法在精度和候选数量上最优,同时评估了Arcan工具的现有方法并提出了改进建议。
English Summary: This study identifies the most accurate method for detecting the Hub-like microservice anti-pattern, finding Kirkley's ER encoding approach superior in precision and candidate selection while evaluating and suggesting improvements for the Arcan tool.
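Code sketch: Flagging Hub-like candidates by normalized degree centrality, the update the paper suggests for the Arcan approach, takes only a few lines with networkx. The threshold is an illustrative assumption, not a value from the paper.

    import networkx as nx

    def hub_candidates(call_edges, threshold=0.9):
        g = nx.DiGraph(call_edges)
        # degree_centrality normalizes by (n - 1), making scores comparable
        # across microservice networks of different sizes.
        centrality = nx.degree_centrality(g)
        return sorted(s for s, c in centrality.items() if c >= threshold)

    edges = [("gateway", "orders"), ("gateway", "users"),
             ("gateway", "billing"), ("orders", "billing")]
    print(hub_candidates(edges))  # ['gateway'] for this toy topology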
Authors:Bowen Liu, Weiyi Zhang, Peranut Chotcomwongse, Xiaolan Chen, Ruoyu Chen, Pawin Pakaymaskul, Niracha Arjkongharn, Nattaporn Vongsa, Xuelian Cheng, Zongyuan Ge, Kun Huang, Xiaohui Li, Yiru Duan, Zhenbang Wang, BaoYe Xie, Qiang Chen, Huazhu Fu, Michael A. Mahr, Jiaqi Qu, Wangyiyang Chen, Shiye Wang, Yubo Tan, Yongjie Li, Mingguang He, Danli Shi, Paisan Ruamviboonsuk
Abstract:
Optical Coherence Tomography (OCT) provides high-resolution, 3D, and non-invasive visualization of retinal layers in vivo, serving as a critical tool for lesion localization and disease diagnosis. However, its widespread adoption is limited by equipment costs and the need for specialized operators. In comparison, 2D color fundus photography offers faster acquisition and greater accessibility with less dependence on expensive devices. Although generative artificial intelligence has demonstrated promising results in medical image synthesis, translating 2D fundus images into 3D OCT images presents unique challenges due to inherent differences in data dimensionality and biological information between modalities. To advance generative models in the fundus-to-3D-OCT setting, the Asia Pacific Tele-Ophthalmology Society (APTOS-2024) organized a challenge titled Artificial Intelligence-based OCT Generation from Fundus Images. This paper details the challenge framework (referred to as the APTOS-2024 Challenge), including the benchmark dataset, the evaluation methodology featuring two fidelity metrics, image-based distance (pixel-level OCT B-scan similarity) and video-based distance (semantic-level volumetric consistency), and an analysis of top-performing solutions. The challenge attracted 342 participating teams, with 42 preliminary submissions and 9 finalists. Leading methodologies incorporated innovations in hybrid data preprocessing or augmentation (cross-modality collaborative paradigms), pre-training on external ophthalmic imaging datasets, integration of vision foundation models, and model architecture improvement. The APTOS-2024 Challenge is the first benchmark demonstrating the feasibility of fundus-to-3D-OCT synthesis as a potential solution for improving ophthalmic care accessibility in under-resourced healthcare settings, while helping to expedite medical research and clinical applications.
Chinese: 光学相干断层扫描(OCT)虽能提供高分辨率3D视网膜成像但普及受限,而APTOS-2024挑战赛通过人工智能技术实现了从易获取的2D眼底照片生成3D OCT图像的重大突破,为提升医疗资源匮乏地区的眼科诊疗可及性开辟了新途径。
English: Optical Coherence Tomography (OCT) offers high-resolution 3D retinal imaging but faces adoption barriers, while the APTOS-2024 Challenge successfully advanced AI methods to generate 3D OCT images from more accessible 2D fundus photos, demonstrating potential for enhancing ophthalmic care accessibility.
Authors:Erik Burman, Siyu Cen, Bangti Jin, Zhi Zhou
Abstract:
In this work, we numerically investigate the inverse Robin problem of recovering a piecewise constant Robin coefficient in an elliptic or parabolic problem from the Cauchy data on a part of the boundary, a problem that commonly arises in applications such as non-destructive corrosion detection. We employ a Kohn-Vogelius type variational functional for the regularized reconstruction, and discretize the resulting optimization problem using the Galerkin finite element method on a graded mesh. We establish rigorous error estimates on the recovered Robin coefficient in terms of the mesh size, temporal step size and noise level. This is achieved by combining the approximation error of the direct problem, a priori estimates on the functional, and suitable conditional stability estimates of the continuous inverse problem. We present several numerical experiments to illustrate the approach and to complement the theoretical findings.
中文: 本研究采用Kohn-Vogelius变分方法和有限元离散化,通过边界柯西数据数值求解椭圆/抛物型方程中的分段常数Robin系数反问题,建立了严格误差估计并通过数值实验验证了理论结果。
English: This study numerically solves the inverse Robin problem for identifying piecewise constant coefficients in elliptic/parabolic equations using Cauchy data, employing a Kohn-Vogelius approach with finite element discretization and providing rigorous error estimates validated by numerical experiments.
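To make the variational setup concrete, the following is a schematic form of the Kohn-Vogelius functional being minimized; the exact boundary decomposition, parabolic time integration, discretization, and regularization weights follow the paper's setup and are not reproduced here.

```latex
% Schematic Kohn-Vogelius functional for the elliptic inverse Robin problem.
% u_N(\gamma): state solved with the measured Neumann (flux) data on the
% accessible boundary; u_D(\gamma): state solved with the measured Dirichlet
% (trace) data there. Both satisfy the Robin condition
% \partial_\nu u + \gamma u = 0 on the inaccessible boundary part.
\begin{aligned}
  J(\gamma) &= \frac{1}{2}\int_{\Omega}
    \bigl|\nabla\bigl(u_N(\gamma) - u_D(\gamma)\bigr)\bigr|^{2}\,\mathrm{d}x
    \;+\; \alpha\,\mathcal{R}(\gamma),\\
  \gamma^{\ast} &= \operatorname*{arg\,min}_{\gamma \in \mathcal{A}} J(\gamma),
\end{aligned}
```

where $\mathcal{A}$ is the admissible set of piecewise constant coefficients and $\mathcal{R}$ a regularizer. The two states coincide when $\gamma$ equals the true coefficient and the Cauchy data are noise-free, which is what makes $J$ a natural discrepancy measure for the Galerkin discretization described above.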
Authors:Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, Wentao Zhang
Abstract:
Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process involving a reasoner, which proposes visual actions, and a verifier, trained via multi-step Direct Preference Optimization (DPO), which evaluates these actions and determines when reasoning should terminate. To support this, we present a new dataset, VTS, comprising supervised reasoning trajectories (VTS-SFT) and preference-labeled reasoning comparisons (VTS-DPO). Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
Chinese: 本文提出了一种多模态大语言模型的动态推理框架,通过验证器引导的迭代视觉推理方法,显著超越了静态模型性能,同时提升了准确性和可解释性。
English: This paper introduces a dynamic inference framework for multi-modal large language models that enables iterative, verifier-guided visual reasoning, significantly outperforming static approaches and enhancing both accuracy and interpretability.
Authors:Tianyi Bai, Yuxuan Fan, Jiantao Qiu, Fupeng Sun, Jiayi Song, Junlin Han, Zichen Liu, Conghui He, Wentao Zhang, Binhang Yuan
Abstract:
Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks but still struggle with fine-grained visual differences, leading to hallucinations or missed semantic shifts. We attribute this to limitations in both training data and learning objectives. To address these issues, we propose a controlled data generation pipeline that produces minimally edited image pairs with semantically aligned captions. Using this pipeline, we construct the Micro Edit Dataset (MED), containing over 50K image-text pairs spanning 11 fine-grained edit categories, including attribute, count, position, and object presence changes. Building on MED, we introduce a supervised fine-tuning (SFT) framework with a feature-level consistency loss that promotes stable visual embeddings under small edits. We evaluate our approach on the Micro Edit Detection benchmark, which includes carefully balanced evaluation pairs designed to test sensitivity to subtle visual variations across the same edit categories. Our method improves difference detection accuracy and reduces hallucinations compared to strong baselines, including GPT-4o. Moreover, it yields consistent gains on standard vision-language tasks such as image captioning and visual question answering. These results demonstrate the effectiveness of combining targeted data and alignment objectives for enhancing fine-grained visual reasoning in MLLMs.
Chinese: 针对多模态大语言模型在细粒度视觉推理上的不足,我们提出了一种可控数据生成流程构建微编辑数据集,并结合监督微调框架,有效提升了差异检测精度、减少了幻觉现象,同时增强了在标准视觉语言任务中的表现。
English: To address multimodal large language models' limitations in fine-grained visual reasoning, we propose a controlled data generation pipeline creating the Micro Edit Dataset and a supervised fine-tuning framework that improves difference detection accuracy and reduces hallucinations while enhancing performance on standard vision-language tasks.
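To make the training objective concrete, here is a minimal PyTorch sketch of SFT with a feature-level consistency term. The cosine form of the penalty, the weight `lam`, and all names are our illustrative choices, since the abstract does not pin down the exact loss.

```python
import torch
import torch.nn.functional as F

def sft_with_consistency(logits, labels, feat_orig, feat_edit, lam=0.1):
    """Hypothetical MED-style objective: token-level SFT cross-entropy plus
    a feature-level consistency term keeping visual embeddings stable under
    minimal edits. The cosine form and `lam` are our choices, not the paper's."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                         ignore_index=-100)
    # Penalize embedding drift between the original and minimally edited image.
    consistency = 1.0 - F.cosine_similarity(feat_orig, feat_edit, dim=-1).mean()
    return ce + lam * consistency

# Toy shapes: batch 2, sequence 5, vocab 100, visual feature dim 32.
logits = torch.randn(2, 5, 100)
labels = torch.randint(0, 100, (2, 5))
loss = sft_with_consistency(logits, labels,
                            torch.randn(2, 32), torch.randn(2, 32))
```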
Authors:Vicky Xefteri, Tim Vieira, Ryan Cotterell, Afra Amini
Abstract:
Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability, yet it remains a challenging task. In this paper, we argue that sampling algorithms based on posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures that each generated token aligns with the desired syntactic structure. Our experiments with GPT2 and Llama3-8B models show that with an appropriate proposal distribution, we can improve syntactic accuracy, increasing the F1 score from $12.31$ (GPT2-large) and $35.33$ (Llama3-8B) to about $93$ in both cases without compromising the language model's fluency. These results underscore both the complexity of syntactic control and the effectiveness of sampling algorithms, offering a promising approach for applications where precise control over syntax is essential.
Chinese: 本文提出一种基于后验推理的采样方法,通过结合序列蒙特卡洛和句法标注器来增强语言模型生成中的句法控制,在保持流畅性的同时,将GPT2和Llama3-8B模型的F1分数显著提升至约93分。
English: This paper introduces a posterior inference-based sampling method that enhances syntactic control in language model generation by integrating sequential Monte Carlo with a syntactic tagger, significantly improving F1 scores to around 93 for GPT2 and Llama3-8B models while maintaining fluency.
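The core mechanism, sequential Monte Carlo with tagger-weighted importance sampling, can be sketched in a few lines. Everything below is schematic: `propose` stands in for the proposal LM (returning a continuation with its proposal and model log-probabilities) and `tag_loglik` for the syntactic tagger's compatibility score with the target constituency structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def smc_step(particles, log_weights, propose, tag_loglik):
    """One SMC step for syntax-constrained generation (schematic)."""
    # Extend each partial sequence with a token from the proposal.
    extended, log_q, log_p = zip(*(propose(p) for p in particles))
    # Importance weight: LM probability over proposal probability, times
    # the tagger's score for matching the target syntactic structure.
    log_weights = log_weights + np.array(log_p) - np.array(log_q) \
                  + np.array([tag_loglik(x) for x in extended])
    # Multinomial resampling focuses compute on promising particles.
    w = np.exp(log_weights - log_weights.max())
    idx = rng.choice(len(extended), size=len(extended), p=w / w.sum())
    return [extended[i] for i in idx], np.zeros(len(extended))

# Toy stand-ins: sequences are lists of ints; uniform proposal over 3 tokens.
def propose(seq):
    tok = int(rng.integers(3))
    return seq + [tok], np.log(1 / 3), np.log(1 / 3)

tag_loglik = lambda seq: 0.0 if seq[-1] != 2 else -5.0  # penalize token 2
parts, logw = [[0]] * 4, np.zeros(4)
parts, logw = smc_step(parts, logw, propose, tag_loglik)
```

Resampling concentrates the particle budget on continuations that remain compatible with the target parse, which is how the method lifts F1 without retraining the LM.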
Authors:Tianyuan Shi, Canbin Huang, Fanqi Wan, Longguang Zhong, Ziyi Yang, Weizhou Shen, Xiaojun Quan, Ming Yan
Abstract:
During the preference optimization of large language models (LLMs), distribution shifts may arise between newly generated model samples and the data used to train the reward model (RM). This shift reduces the efficacy of the RM, which in turn negatively impacts the performance of the policy model (PM). To address this challenge, we propose Mutual-Taught, a self-training method that iteratively improves both the PM and RM without requiring additional human annotation. Our approach mirrors the expectation-maximization (EM) algorithm. In the E-step, the PM is updated using feedback from the current RM, guiding the PM toward a better approximation of the latent optimal preference distribution. In the M-step, we update the RM by constructing training data from the outputs of the PM before and after the E-step update. This process ensures that the RM adapts to the evolving policy distribution. Experimental results demonstrate that this iterative approach leads to consistent improvements in both models. Specifically, our 8B policy model, LLaMA-3-8B-Instruct-MT, achieves a length-controlled win rate of 54.1\% on AlpacaEval-2, while our 8B reward model, FsfairX-LLaMA3-RM-MT, performs on par with GPT-4o-2024-08-06 on RewardBench.
Chinese: Mutual-Taught方法通过自训练迭代优化策略模型和奖励模型,有效解决了大语言模型偏好优化中的分布偏移问题,其8B模型在基准测试中展现出与先进模型相当的优异性能。
English: The Mutual-Taught method addresses distribution shifts in LLM preference optimization by iteratively improving both the policy and reward models through a self-training approach, achieving competitive performance with 8B models on benchmark evaluations.
Authors:Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Abstract:
While Large Language Models (LLMs) excel at temporal reasoning tasks like event ordering and duration estimation, their ability to perceive the actual passage of time remains unexplored. We investigate whether LLMs perceive the passage of time and adapt their decision-making accordingly through three complementary experiments. First, we introduce the Token-Time Hypothesis, positing that LLMs can map discrete token counts to continuous wall-clock time, and validate this through a dialogue duration judgment task. Second, we demonstrate that LLMs could use this awareness to adapt their response length while maintaining accuracy when users express urgency in question answering tasks. Finally, we develop BombRush, an interactive navigation challenge that examines how LLMs modify behavior under progressive time pressure in dynamic environments. Our findings indicate that LLMs possess certain awareness of time passage, enabling them to bridge discrete linguistic tokens and continuous physical time, though this capability varies with model size and reasoning abilities. This work establishes a theoretical foundation for enhancing temporal awareness in LLMs for time-sensitive applications.
Chinese: 该研究表明大语言模型具备一定的时间流逝感知能力,能够将语言符号与物理时间联系起来,并在时间压力下调整自身行为,但这种能力会因模型规模和推理能力而异。
English: This research reveals that large language models possess a certain awareness of time passage, enabling them to connect linguistic tokens with physical time and adjust their responses under time constraints, though this ability depends on model scale and reasoning capacity.
Authors:Shengcao Cao, Zijun Wei, Jason Kuen, Kangning Liu, Lingzhi Zhang, Jiuxiang Gu, HyunJoon Jung, Liang-Yan Gui, Yu-Xiong Wang
Abstract:
Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.
Chinese: 该摘要提出了一种新的全模态指代表达分割(ORES)任务及相应的RAS框架,通过增强分割模型的多模态理解能力,实现基于纯文本或文本加视觉实体的任意提示生成掩码组,并通过新创建的数据集验证了其在多项任务中的优越性能。
English: The abstract introduces a new omnimodal referring expression segmentation (ORES) task and a corresponding RAS framework that enhances segmentation models with multimodal comprehension to generate mask groups from text-only or text-plus-visual prompts, validated through newly created datasets and outperforming existing methods.
Authors:Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, Qifeng Chen
Abstract:
Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generation. For the motion-transfer task, current methods mainly rely on two-stage Low-Rank Adaptation (LoRA) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator, and it requires time-consuming fine-tuning processes in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed. To address the lack of a benchmark for this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion, including creative camera motion, single object motion, multiple object motion, and complex human motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.
Chinese: 近期视频扩散变换器在多样化运动生成方面取得突破,但现有运动迁移方法存在运动不一致和调优效率低的问题,本文提出的Follow-Your-Motion框架通过时空解耦LoRA和加速调优技术有效解决了这些挑战。
English: Recent advances in video diffusion transformers have enabled diverse motion generation, but current motion-transfer methods face issues with motion inconsistency and tuning inefficiency, which the proposed Follow-Your-Motion framework addresses through spatial-temporal decoupled LoRA and accelerated tuning techniques.
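The phrase "spatial-temporal decoupled LoRA" suggests separate low-rank adapters on the spatial and temporal attention paths. The sketch below shows one plausible wiring with factorized axis attention; the paper's 3D attention operator and adapter placement may differ, and all shapes and names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRA(nn.Module):
    """Low-rank update added to a frozen linear layer."""
    def __init__(self, base, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pretrained weight
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)            # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.B(self.A(x))

def axis_attention(x, qkv, axis):
    """Self-attention along one axis of (batch, frames, tokens, dim):
    axis=2 attends within each frame (spatial), axis=1 attends across
    frames at each token location (temporal)."""
    if axis == 1:
        x = x.transpose(1, 2)                    # (b, tokens, frames, d)
    q, k, v = qkv(x).chunk(3, dim=-1)
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2) if axis == 1 else out

# Decoupling: the spatial and temporal attention paths receive *separate*
# LoRA adapters (hypothetical wiring, not the paper's exact architecture).
dim = 64
spatial_qkv = LoRA(nn.Linear(dim, 3 * dim), rank=8)
temporal_qkv = LoRA(nn.Linear(dim, 3 * dim), rank=8)

x = torch.randn(2, 4, 16, dim)                   # (batch, frames, tokens, dim)
x = x + axis_attention(x, spatial_qkv, axis=2)   # appearance pathway
x = x + axis_attention(x, temporal_qkv, axis=1)  # motion pathway
```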
Authors:Taiga Someya, Anej Svete, Brian DuSell, Timothy J. O'Donnell, Mario Giulianelli, Ryan Cotterell
Abstract:
Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce $m$-local entropy, an information-theoretic measure derived from average lossy-context surprisal that captures the local uncertainty of a language by quantifying how effectively the $m-1$ preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSAs), we show that languages with higher $m$-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.
Chinese: 本研究提出了一个量化框架,利用m-局部熵证明神经语言模型与人类相似,在学习具有更高局部不确定性的语言时更为困难,揭示了它们对语言局部统计结构的敏感性。
English: The study introduces a quantitative framework using m-local entropy to demonstrate that neural language models, like humans, struggle more with learning languages possessing higher local uncertainty, revealing their sensitivity to local statistical structures.
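A plug-in estimate of $m$-local entropy is straightforward to compute from corpus counts: it is the expected entropy of the next symbol given the preceding $m-1$ symbols. The paper derives the quantity from average lossy-context surprisal; the empirical estimator below is a minimal sketch of that idea.

```python
from collections import Counter
from math import log2

def m_local_entropy(corpus, m):
    """Empirical m-local entropy: expected next-symbol entropy given the
    m-1 preceding symbols (a plug-in estimate of the paper's quantity)."""
    ctx_next = Counter()   # counts of (context, next symbol)
    ctx = Counter()        # counts of contexts
    for seq in corpus:
        for i in range(m - 1, len(seq)):
            c = tuple(seq[i - m + 1:i])
            ctx_next[(c, seq[i])] += 1
            ctx[c] += 1
    total = sum(ctx.values())
    h = 0.0
    for (c, s), n in ctx_next.items():
        p_joint = n / total          # P(context, next)
        p_cond = n / ctx[c]          # P(next | context)
        h -= p_joint * log2(p_cond)  # E_context[ H(next | context) ]
    return h

# A more locally predictable language should score lower.
print(m_local_entropy(["ababababab"] * 5, m=3))            # 0.0
print(m_local_entropy(["abbaabbbab", "babaabbaba"], m=3))  # higher
```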
Authors:Gaia Di Lorenzo, Federico Tombari, Marc Pollefeys, Daniel Barath
Abstract:
Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g., images, point clouds, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.
Chinese: Object-X是一个多功能多模态框架,能够编码来自多种输入的丰富物体嵌入并解码为高保真几何重建,在下游任务中实现卓越性能,同时大幅降低存储需求。
English: Object-X is a versatile multi-modal framework that encodes rich object embeddings from various inputs and decodes them into high-fidelity geometric reconstructions, achieving superior performance in downstream tasks with significantly reduced storage requirements.
Authors:Yue Ma, Kunyu Feng, Xinhua Zhang, Hongyu Liu, David Junhao Zhang, Jinbo Xing, Yinhan Zhang, Ayden Yang, Zeyu Wang, Qifeng Chen
Abstract:
We introduce Follow-Your-Creation, a novel 4D video creation framework capable of both generating and editing 4D content from a single monocular video input. By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task, enabling the model to fill in missing content caused by camera trajectory changes or user edits. To facilitate this, we generate composite masked inpainting video data to effectively fine-tune the model for 4D video generation. Given an input video and its associated camera trajectory, we first perform depth-based point cloud rendering to obtain invisibility masks that indicate the regions that should be completed. Simultaneously, editing masks are introduced to specify user-defined modifications, and these are combined with the invisibility masks to create a composite mask dataset. During training, we randomly sample different types of masks to construct diverse and challenging inpainting scenarios, enhancing the model's generalization and robustness in various 4D editing and generation tasks. To handle temporal consistency under large camera motion, we design a self-iterative tuning strategy that gradually increases the viewing angles during training, where the model is used to generate the next-stage training data after each fine-tuning iteration. Moreover, we introduce a temporal packaging module during inference to enhance generation quality. Our method effectively leverages the prior knowledge of the base model without degrading its original performance, enabling the generation of 4D videos with consistent multi-view coherence. In addition, our approach supports prompt-based content editing, demonstrating strong flexibility and significantly outperforming state-of-the-art methods in both quality and versatility.
Chinese: Follow-Your-Creation是一种创新的4D视频框架,通过将视频创作重新定义为修复任务,能够从单目视频生成和编辑内容,在保持多视角一致性和编辑灵活性方面显著优于现有方法。
English: Follow-Your-Creation is a novel 4D video framework that generates and edits content from monocular video by reformulating creation as a video inpainting task, achieving superior multi-view consistency and editing flexibility compared to existing methods.
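The composite-mask construction reads as a union of trajectory-induced invisibility masks and user editing masks, with random sampling of mask regimes during training. A minimal NumPy sketch, with stand-in masks in place of depth-based point-cloud rendering:

```python
import numpy as np

def composite_mask(invisibility, editing, rng):
    """Combine the invisibility mask (regions exposed by the new camera
    trajectory) with the user editing mask, then sample one mask regime
    to diversify training, as we read the abstract. Boolean arrays of
    shape (frames, H, W); True = region the inpainting model must fill."""
    regimes = {"invisibility": invisibility,
               "editing": editing,
               "combined": invisibility | editing}
    return regimes[rng.choice(list(regimes))]

rng = np.random.default_rng(0)
inv = rng.random((8, 64, 64)) < 0.2      # stand-in for depth-based rendering
edit = np.zeros((8, 64, 64), dtype=bool)
edit[:, 16:48, 16:48] = True             # user-specified edit region
mask = composite_mask(inv, edit, rng)    # sampled per training example
```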
Authors:Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
Abstract:
Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/
Chinese: 本文提出了一种交互式对象感知音频生成模型,通过结合以对象为中心的学习和条件潜在扩散框架,利用多模态注意力与图像分割技术,使用户能够针对图像中的特定视觉对象交互式地生成对应声音。
English: This paper introduces an interactive object-aware audio generation model that integrates object-centric learning with a conditional latent diffusion framework, enabling users to selectively generate sounds for specific visual objects in images through multi-modal attention and segmentation.
Authors:Wanghao Ye, Sihan Chen, Yiting Wang, Shwai He, Bowei Tian, Guoheng Sun, Ziyi Wang, Ziyao Wang, Yexiao He, Zheyu Shen, Meng Liu, Yuning Zhang, Meng Feng, Yang Wang, Siyuan Peng, Yilong Dai, Zhenle Duan, Hanzhang Qin, Ang Li
Abstract:
Current large language model (LLM) agents lack authentic human psychological processes necessary for genuine digital twins and social AI applications. To address this limitation, we present a computational implementation of Global Workspace Theory (GNWT) that integrates human cognitive architecture principles into LLM agents, creating specialized sub-agents for emotion, memory, social norms, planning, and goal-tracking coordinated through a global workspace mechanism. However, authentic digital twins require accurate personality initialization. We therefore develop a novel adventure-based personality test that evaluates true personality through behavioral choices within interactive scenarios, bypassing self-presentation bias found in traditional assessments. Building on these innovations, our CogniPair platform enables digital twins to engage in realistic simulated dating interactions and job interviews before real encounters, providing bidirectional cultural fit assessment for both romantic compatibility and workplace matching. Validation using 551 GNWT-Agents and the Columbia University Speed Dating dataset demonstrates 72% correlation with human attraction patterns, 77.8% match prediction accuracy, and 74% agreement in human validation studies. This work advances psychological authenticity in LLM agents and establishes a foundation for intelligent dating platforms and HR technology solutions.
Chinese: 本研究通过全局工作空间理论框架为大型语言模型注入人类认知模块,并结合冒险式人格测试创建心理真实的数字孪生,在模拟约会和职场匹配中展现出与人类行为模式高度吻合的验证结果。
English: This work introduces a Global Workspace Theory implementation that enhances LLM agents with human cognitive modules and an adventure-based personality test, enabling realistic digital twins for dating and job matching with validated human-like interaction accuracy.
Authors:Shigeng Chen, Linhao Luo, Zhangchi Qiu, Yanan Cao, Carl Yang, Shirui Pan
Abstract:
Recently, knowledge editing (KE) has emerged as a promising approach to update specific facts in Large Language Models (LLMs) without the need for full retraining. Despite their effectiveness on general-domain benchmarks, the applicability of KE methods to the complex medical domain remains largely unexplored. Medical knowledge editing is particularly challenging, as it requires LLMs to internalize the knowledge and generalize to unseen scenarios for effective and interpretable decision-making. In this work, we propose a novel framework called MedEditBench to rigorously evaluate the effectiveness of existing KE methods in the medical domain. In MedEditBench, we introduce a new medical knowledge editing benchmark as well as three different knowledge editing paradigms, which are designed to assess the impact of different knowledge sources for editing. Our findings indicate that current KE methods result in only superficial memorization of the injected information, failing to generalize to new scenarios. To overcome this limitation, we present Self-Generated Rationale Editing (SGR-Edit), which utilizes model-derived rationales as the target knowledge for editing, thereby uncovering the underlying reasoning process and demonstrating significant improvements over existing KE approaches. Additionally, we offer deeper insights into medical knowledge editing, including the localization of medical knowledge in LLMs and the impact of sequential editing on evolving knowledge. This could provide practical guidance for implementing KE methods in real-world medical applications.
Chinese: 知识编辑技术虽能免于全模型重训练而更新大语言模型中的特定事实,但现有方法在复杂医学领域仅实现浅层记忆而缺乏泛化能力,因此提出的SGR-Edit通过利用模型自生成原理进行编辑,显著提升了医学知识推理性能。
English: Knowledge editing in large language models shows promise for updating facts without full retraining, but current methods struggle with deep medical knowledge integration and generalization, leading to the development of SGR-Edit which uses model-generated rationales to significantly improve performance.
Authors:Jiajie Fu, Haitong Tang, Arijit Khan, Sharad Mehrotra, Xiangyu Ke, Yunjun Gao
Abstract:
Entity Resolution (ER) is a fundamental data quality improvement task that identifies and links records referring to the same real-world entity. Traditional ER approaches often rely on pairwise comparisons, which can be costly in terms of time and monetary resources, especially with large datasets. Recently, Large Language Models (LLMs) have shown promising results in ER tasks. However, existing methods typically focus on pairwise matching, missing the potential of LLMs to perform clustering directly in a more cost-effective and scalable manner. In this paper, we propose a novel in-context clustering approach for ER, where LLMs are used to cluster records directly, reducing both time complexity and monetary costs. We systematically investigate the design space for in-context clustering, analyzing the impact of factors such as set size, diversity, variation, and ordering of records on clustering performance. Based on these insights, we develop LLM-CER (LLM-powered Clustering-based ER), which achieves high-quality ER results while minimizing LLM API calls. Our approach addresses key challenges, including efficient cluster merging and LLM hallucination, providing a scalable and effective solution for ER. Extensive experiments on nine real-world datasets demonstrate that our method significantly improves result quality, achieving up to 150% higher accuracy and a 10% increase in F-measure while reducing API calls by up to 5 times, at a monetary cost comparable to the most cost-effective baseline.
Chinese: 本文提出了一种利用大语言模型进行实体解析的新型上下文聚类方法,通过直接对记录进行聚类,在显著提高准确率的同时有效降低了计算成本和API调用次数。
English: This paper introduces a novel in-context clustering approach using Large Language Models for Entity Resolution, which directly clusters records to significantly improve accuracy while reducing computational costs and API calls.
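The core idea, asking the LLM to cluster a whole set of records in one call rather than judging O(n^2) pairs, can be sketched as a prompt builder plus a defensive parser. The prompt wording, JSON format, and the `call_llm` placeholder are ours; the paper's actual prompts, set-size choices, and cluster-merging procedure are more involved.

```python
import json

def cluster_prompt(records):
    """Build one prompt asking the LLM to cluster a whole set of records,
    instead of judging O(n^2) pairs (schematic version)."""
    lines = [f"[{i}] {r}" for i, r in enumerate(records)]
    return ("Group the following records so that records referring to the "
            "same real-world entity share a group. Reply with JSON: a list "
            "of lists of record indices, covering every index exactly once."
            "\n\n" + "\n".join(lines))

def parse_clusters(reply, n):
    """Guard against hallucinated output: drop invalid indices and give
    any record the model forgot its own singleton cluster."""
    clusters = [[i for i in group if 0 <= i < n] for group in json.loads(reply)]
    seen = {i for group in clusters for i in group}
    clusters += [[i] for i in range(n) if i not in seen]
    return [group for group in clusters if group]

records = ["Apple Inc., Cupertino", "apple inc (AAPL)", "Apple Records, London"]
prompt = cluster_prompt(records)
reply = '[[0, 1], [2]]'   # stand-in for call_llm(prompt), a hypothetical API
print(parse_clusters(reply, len(records)))  # [[0, 1], [2]]
```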
Authors:Jiaming Li, Yukun Chen, Ziqiang Liu, Minghuan Tan, Lei Zhang, Yunshui Li, Run Luo, Longze Chen, Jing Luo, Ahmadreza Argha, Hamid Alinejad-Rokny, Wei Zhou, Min Yang
Abstract:
Stories are central to human culture, serving to share ideas, preserve traditions, and foster connections. Automatic story generation, a key advancement in artificial intelligence (AI), offers new possibilities for creating personalized content, exploring creative ideas, and enhancing interactive experiences. However, existing methods struggle to maintain narrative coherence and logical consistency. This disconnect compromises the overall storytelling experience, underscoring the need for substantial improvements. Inspired by human cognitive processes, we introduce Storyteller, a novel approach that systematically improves the coherence and consistency of automatically generated stories. Storyteller introduces a plot node structure based on linguistically grounded subject-verb-object (SVO) triplets, which capture essential story events and ensure a consistent logical flow. Unlike previous methods, Storyteller integrates two dynamic modules, the STORYLINE and the narrative entity knowledge graph (NEKG), that continuously interact with the story generation process. This integration produces structurally sound, cohesive, and immersive narratives. Extensive experiments demonstrate that Storyteller significantly outperforms existing approaches, achieving an 84.33% average win rate in human preference evaluation. It is also far ahead on other aspects, including creativity, coherence, engagement, and relevance.
Chinese: Storyteller是一种基于人工智能的新型故事生成方法,通过整合动态模块与语言基础的剧情结构,显著提升了叙述连贯性和逻辑一致性,在人类评估中远超现有方法。
English: Storyteller, a novel AI-based story generation approach, enhances narrative coherence and logical consistency by integrating dynamic modules with a linguistically grounded plot structure, significantly outperforming existing methods in human evaluations.
Authors:Zexu Pan, Wupeng Wang, Shengkui Zhao, Chong Zhang, Kun Zhou, Yukun Ma, Bin Ma
Abstract:
This paper proposes a novel online audio-visual speaker extraction model. In the streaming regime, most studies optimize the audio network only, leaving the visual frontend less explored. We first propose a lightweight visual frontend based on depth-wise separable convolution. Then, we propose a lightweight autoregressive acoustic encoder to serve as the second cue, actively exploring the information in the separated speech signal from past steps. Scenario-wise, we study for the first time how the algorithm performs when the focus of attention, i.e., the target speaker, changes. Experimental results on the LRS3 dataset show that our visual frontend performs comparably to the previous state-of-the-art on both SkiM and ConvTasNet audio backbones with only 0.1 million network parameters and 2.1 MACs per second of processing. The autoregressive acoustic encoder provides an additional 0.9 dB gain in terms of SI-SNRi, and its momentum is robust against the change in attention.
Chinese: 本文提出了一种轻量级的在线视听说话人提取模型,采用深度可分离卷积视觉前端和自回归声学编码器,在参数极少的情况下性能媲美现有最优方法,并能有效应对说话人注意力转移的挑战。
English: This paper introduces a lightweight online audio-visual speaker extraction model featuring a depth-wise separable visual frontend and an autoregressive acoustic encoder, which achieves comparable performance to state-of-the-art methods with minimal parameters and demonstrates robustness in handling shifts in speaker attention.
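For readers unfamiliar with the building block, a depth-wise separable convolution factorizes a dense convolution into a per-channel spatial filter followed by a 1x1 pointwise mix, cutting parameters by roughly a factor of k^2, which is how the frontend stays near 0.1M parameters. Layer sizes below are our guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DSConvBlock(nn.Module):
    """Depth-wise separable convolution: a per-channel k x k spatial filter
    (groups=cin) followed by a 1x1 pointwise mix across channels."""
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(cin, cin, k, padding=k // 2, groups=cin)
        self.pointwise = nn.Conv2d(cin, cout, 1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

frontend = nn.Sequential(DSConvBlock(1, 32), DSConvBlock(32, 64))
lips = torch.randn(1, 1, 88, 88)        # one grayscale mouth-crop frame
feat = frontend(lips)                   # (1, 64, 88, 88)
```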
Authors:Mengyuan Liu, Hong Liu, Qianshuo Hu, Bin Ren, Junsong Yuan, Jiaying Lin, Jiajun Wen
Abstract:
With the inherent advantages of skeleton representation, 3D skeleton-based action recognition has become a prominent topic in the field of computer vision. However, previous reviews have predominantly adopted a model-oriented perspective, often neglecting the fundamental steps involved in skeleton-based action recognition. This oversight tends to ignore key components of skeleton-based action recognition beyond model design and has hindered deeper, more intrinsic understanding of the task. To bridge this gap, our review aims to address these limitations by presenting a comprehensive, task-oriented framework for understanding skeleton-based action recognition. We begin by decomposing the task into a series of sub-tasks, placing particular emphasis on preprocessing steps such as modality derivation and data augmentation. The subsequent discussion delves into critical sub-tasks, including feature extraction and spatio-temporal modeling techniques. Beyond foundational action recognition networks, recently advanced frameworks such as hybrid architectures, Mamba models, large language models (LLMs), and generative models have also been highlighted. Finally, a comprehensive overview of public 3D skeleton datasets is presented, accompanied by an analysis of state-of-the-art algorithms evaluated on these benchmarks. By integrating task-oriented discussions, comprehensive examinations of sub-tasks, and an emphasis on the latest advancements, our review provides a fundamental and accessible structured roadmap for understanding and advancing the field of 3D skeleton-based action recognition.
Chinese: 本综述提出了一种任务导向的框架,通过将骨架动作识别分解为预处理、特征提取等子任务,并涵盖最新模型进展,弥补了以往研究忽视基础步骤的不足,为该领域提供了系统化路线图。
English: This review introduces a task-oriented framework to address the oversight of fundamental steps in previous skeleton-based action recognition studies, providing a comprehensive roadmap by decomposing the task into sub-tasks and highlighting recent advancements.
Authors:Daiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada
Abstract:
Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, the AAC method EnCLAP employed discrete tokens from EnCodec as an effective input for fine-tuning the language model BART. However, EnCodec is designed to reconstruct waveforms rather than capture the semantic contexts of general sounds, which AAC should describe. To address this issue, we propose CLAP-ART, an AAC method that utilizes "semantic-rich and discrete" tokens as input. CLAP-ART computes semantic-rich discrete tokens from pre-trained audio representations through vector quantization. We experimentally confirmed that CLAP-ART outperforms the baseline EnCLAP on two AAC benchmarks, indicating that semantic-rich discrete tokens derived from semantically rich audio representations are beneficial for AAC.
Chinese: CLAP-ART采用基于预训练音频表示的语义丰富离散标记来改进自动音频描述,在基准测试中表现优于基线EnCLAP方法。
English: CLAP-ART introduces semantic-rich discrete tokens derived from pre-trained audio representations to enhance Automated Audio Captioning, outperforming the baseline EnCLAP method on benchmark tests.
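The tokenization step, mapping continuous pre-trained audio representations to discrete tokens by nearest codebook entry, can be sketched as follows; codebook learning itself (and CLAP-ART's exact quantizer) is omitted, and all sizes are illustrative.

```python
import torch

def vector_quantize(features, codebook):
    """Map continuous audio representations to discrete tokens by nearest
    codebook entry (the core of the tokenization as we read the abstract).
    features: (time, dim); codebook: (num_codes, dim)."""
    d = torch.cdist(features, codebook)        # (time, num_codes) distances
    tokens = d.argmin(dim=-1)                  # discrete token ids
    return tokens, codebook[tokens]            # ids and quantized vectors

torch.manual_seed(0)
feats = torch.randn(250, 512)                  # e.g., frame embeddings for a clip
codebook = torch.randn(1024, 512)
tokens, quantized = vector_quantize(feats, codebook)
print(tokens.shape, tokens[:8])                # token sequence fed to BART
```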
Authors:Xing Lei, Zifeng Zhuang, Shentao Yang, Sheng Xu, Yunhao Luo, Fei Shen, Wenyan Yang, Xuetao Zhang, Donglin Wang
Abstract:
Recently, supervised learning (SL) methods have emerged as an effective approach for offline reinforcement learning (RL) due to their simplicity, stability, and efficiency. However, recent studies show that SL methods lack the trajectory stitching capability typically associated with temporal difference (TD)-based approaches. A question naturally surfaces: how can we endow SL methods with stitching capability and close their performance gap with TD learning? To answer this question, we introduce $Q$-conditioned maximization supervised learning for offline goal-conditioned RL, which enhances SL with the stitching capability through a $Q$-conditioned policy and $Q$-conditioned maximization. Concretely, we propose Goal-Conditioned Reinforced Supervised Learning (GCReinSL), which consists of (1) estimating the $Q$-function by Normalizing Flows from the offline dataset and (2) finding the maximum $Q$-value within the data support by integrating $Q$-function maximization with Expectile Regression. At inference time, our policy chooses optimal actions based on such a maximum $Q$-value. Experimental results from stitching evaluations on offline RL datasets demonstrate that our method outperforms prior SL approaches with stitching capabilities and goal data augmentation techniques.
Chinese: 离线强化学习中监督学习方法缺乏轨迹拼接能力,但提出的GCReinSL方法通过Q条件策略和最大化增强了这一能力,在评估中优于先前方法。
English: Supervised learning methods in offline reinforcement learning lack trajectory stitching, but the proposed GCReinSL approach enhances this capability through Q-conditioned policy and maximization, outperforming prior methods in evaluations.
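Expectile regression is the ingredient that lets the method approximate an in-support maximum without querying out-of-distribution actions: with the asymmetry parameter tau near 1, the minimizer of the loss below drifts toward the upper tail of the target distribution. A self-contained PyTorch sketch:

```python
import torch

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss: tau > 0.5 penalizes under-estimation more,
    so the minimizer approaches the upper tail (in-support maximum) of the
    targets; this is the mechanism paired with Q-maximization above."""
    diff = target - pred
    w = torch.where(diff > 0, torch.full_like(diff, tau),
                    torch.full_like(diff, 1.0 - tau))
    return (w * diff ** 2).mean()

# Toy check: the fitted scalar drifts well above the mean of the targets.
q_targets = torch.tensor([0.0, 1.0, 5.0])
v = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([v], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    expectile_loss(v.expand(3), q_targets, tau=0.9).backward()
    opt.step()
print(v.item())  # ~4.2 versus mean 2.0; tau -> 1 recovers the in-support max
```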
Authors:Hongjie Zhu, Zezheng Zhang, Zeyu Zhang, Yu Bai, Shimin Wen, Huazhang Wang, Daji Ergu, Ying Cai, Yang Zhao
Abstract:
Alternating Current Optimal Power Flow (AC-OPF) aims to optimize generator power outputs by utilizing the non-linear relationships between voltage magnitudes and phase angles in a power system. However, current AC-OPF solvers struggle to effectively represent the complex relationship between variable distributions in the constraint space and their corresponding optimal solutions. This limitation in constraint modeling restricts the system's ability to develop diverse knowledge representations. Additionally, modeling the power grid solely based on spatial topology further limits the integration of additional prior knowledge, such as temporal information. To overcome these challenges, we propose DDA-PIGCN (Dynamic Domain Adaptation-Driven Physics-Informed Graph Convolutional Network), a new method designed to address constraint-related issues and build a graph-based learning framework that incorporates spatiotemporal features. DDA-PIGCN improves consistency optimization for features with varying long-range dependencies by applying multi-layer, hard physics-informed constraints. It also uses a dynamic domain adaptation learning mechanism that iteratively updates and refines key state variables under predefined constraints, enabling precise constraint verification. Moreover, it captures spatiotemporal dependencies between generators and loads by leveraging the physical structure of the power grid, allowing for deep integration of topological information across time and space. Extensive comparative and ablation studies show that DDA-PIGCN delivers strong performance across several IEEE standard test cases (such as case9, case30, and case300), achieving mean absolute errors (MAE) from 0.0011 to 0.0624 and constraint satisfaction rates between 99.6% and 100%, establishing it as a reliable and efficient AC-OPF solver.
Chinese: 提出的DDA-PIGCN方法通过整合时空特征和应用物理信息约束,克服了交流最优潮流求解器的局限性,在IEEE标准测试案例中展现出高精度和可靠性。
English: The proposed DDA-PIGCN method overcomes limitations in AC-OPF solvers by integrating spatiotemporal features and applying physics-informed constraints, demonstrating high accuracy and reliability across IEEE test cases.
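The "hard physics-informed constraints" in AC-OPF ultimately refer to the nonlinear power-flow equations. The residual below is the standard bus-mismatch form that a learned solver must drive toward zero; how DDA-PIGCN layers and schedules these constraints is specific to the paper and not reproduced here.

```python
import numpy as np

def power_flow_residual(V, theta, P, Q, Y):
    """AC power-flow mismatch at every bus: scheduled injections (P, Q)
    minus what the network equations imply for voltages (V, theta).
    Y is the complex bus admittance matrix."""
    G, B = Y.real, Y.imag
    dtheta = theta[:, None] - theta[None, :]          # theta_i - theta_j
    P_calc = V * ((G * np.cos(dtheta) + B * np.sin(dtheta)) @ V)
    Q_calc = V * ((G * np.sin(dtheta) - B * np.cos(dtheta)) @ V)
    return P - P_calc, Q - Q_calc

# 2-bus toy network with a single line of admittance 1 - 5j.
y = 1 - 5j
Y = np.array([[y, -y], [-y, y]])
dP, dQ = power_flow_residual(
    V=np.array([1.0, 0.98]), theta=np.array([0.0, -0.02]),
    P=np.array([0.12, -0.12]), Q=np.array([0.05, -0.05]), Y=Y)
print(np.abs(dP).max(), np.abs(dQ).max())  # residuals penalized in training
```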
Authors:Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Xianpeng Lang
Abstract:
Neural surface reconstruction faces persistent challenges in reconciling geometric fidelity with photometric consistency under complex scene conditions. We present HiNeuS, a unified framework that holistically addresses three core limitations in existing approaches: multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation from over-enforced Eikonal constraints during joint optimization. To resolve these issues through a unified pipeline, we introduce: 1) Differential visibility verification through SDF-guided ray tracing, resolving reflection ambiguities via continuous occlusion modeling; 2) Planar-conformal regularization via ray-aligned geometry patches that enforce local surface coherence while preserving sharp edges through adaptive appearance weighting; and 3) Physically-grounded Eikonal relaxation that dynamically modulates geometric constraints based on local radiance gradients, enabling detail preservation without sacrificing global regularity. Unlike prior methods that handle these aspects through sequential optimizations or isolated modules, our approach achieves cohesive integration where appearance-geometry constraints evolve synergistically throughout training. Comprehensive evaluations across synthetic and real-world datasets demonstrate state-of-the-art performance, including a 21.4% reduction in Chamfer distance over reflection-aware baselines and 2.32 dB PSNR improvement against neural rendering counterparts. Qualitative analyses reveal superior capability in recovering specular instruments, urban layouts with centimeter-scale infrastructure, and low-textured surfaces without local patch collapse. The method's generalizability is further validated through successful application to inverse rendering tasks, including material decomposition and view-consistent relighting.
Chinese: HiNeuS 是一个统一框架,通过整合微分可见性验证、平面共形正则化和物理基础的Eikonal松弛方法,解决了神经表面重建中的核心难题,在各类场景中实现了卓越的几何精度与光照一致性。
English: HiNeuS is a unified framework that overcomes key challenges in neural surface reconstruction by integrating differential visibility verification, planar-conformal regularization, and physically-grounded Eikonal relaxation to achieve superior geometric accuracy and photometric consistency across diverse scenes.
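One way to read "physically-grounded Eikonal relaxation" is as a per-point down-weighting of the unit-gradient penalty in detail regions. The sketch below uses a stand-in modulation law (ours, not HiNeuS's) on top of the standard Eikonal term:

```python
import torch

def relaxed_eikonal_loss(points, sdf, radiance, k=10.0):
    """Standard Eikonal penalty (SDF gradient should have unit norm),
    down-weighted where the local radiance varies strongly, so that
    geometric detail is not over-smoothed. The modulation law is a
    guess for illustration only."""
    points = points.requires_grad_(True)
    s = sdf(points)
    grad = torch.autograd.grad(s.sum(), points, create_graph=True)[0]
    eik = (grad.norm(dim=-1) - 1.0) ** 2
    with torch.no_grad():
        r = radiance(points)
        w = torch.exp(-k * r.std(dim=-1))  # proxy for radiance variation
    return (w * eik).mean()

sdf = lambda p: p.norm(dim=-1) - 1.0        # unit sphere: an exact SDF
radiance = lambda p: torch.sin(5.0 * p)     # stand-in appearance field
loss = relaxed_eikonal_loss(torch.randn(128, 3), sdf, radiance)
print(loss.item())                           # ~0 for an exact SDF
```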
Authors:Eugene J. Yu, Dawei Zhu, Yifan Song, Xiangyu Wong, Jiebin Zhang, Wenxuan Shi, Xiaoguang Li, Qun Liu, Sujian Li
Abstract:
Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.
Chinese: 基于记忆组织的生成(MOG)框架通过将细粒度记忆单元组织成层次结构来指导生成过程,有效提升了维基百科文章的信息量、可验证性和引用可追溯性,同时减少了虚构内容。
English: The Memory Organization-based Generation (MOG) framework addresses the challenge of autonomously generating Wikipedia articles by organizing fine-grained memory units into a hierarchical structure, which guides the generation process to improve informativeness, verifiability, and citation traceability while reducing hallucinations.
Authors:Deyu Zou, Yongqiang Chen, Mufei Li, Siqi Miao, Chenxi Liu, Bo Han, James Cheng, Pan Li
Abstract:
Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to ground responses with structured external knowledge from up-to-date knowledge graphs (KGs) and reduce hallucinations. However, LLMs often rely on a weak retriever in graph-based RAG: I) Due to the lack of ground truth, the retriever is often trained on weak supervision, which often introduces spurious signals to the LLMs. II) Due to the abstraction of graph data, the retrieved knowledge is often presented in unorganized forms. To mitigate these issues, we present Refined Graph-based RAG (ReG) to align weak retrievers to LLMs for graph-based RAG. Specifically, ReG incorporates LLM feedback to get rid of spurious signals and improve the quality of the supervision. Meanwhile, ReG introduces a structure-aware reorganization module to refactor the retrieval results into logically coherent evidence chains. Experiments on prominent benchmarks demonstrate that ReG significantly and consistently brings improvements across different LLM backbones by up to 10%. The improved supervision quality enables ReG to match the state-of-the-art performance with 5% training data and to transfer to out-of-distribution KGs. Notably, when adopted to reasoning-based LLMs, ReG reduces the reasoning token cost by up to 30% and improves the performance by up to 4%.
Chinese: 基于图的检索增强生成(RAG)通过整合知识图谱中的结构化知识来提升大语言模型的准确性,但存在检索器因训练数据不足和输出混乱而表现不佳的问题;提出的ReG方法利用大语言模型反馈优化监督并重组证据链,在多个基准测试中显著提升了性能与效率。
English: Graph-based RAG enhances LLMs by integrating structured knowledge from knowledge graphs to reduce inaccuracies, but it faces challenges from weak retrievers due to inadequate training data and disorganized outputs; the proposed ReG method addresses these by using LLM feedback to refine supervision and reorganize evidence, achieving significant performance gains and efficiency improvements across benchmarks.
Authors:Yubo Peng, Luping Xiang, Kun Yang, Feibo Jiang, Kezhi Wang, Christos Masouros
Abstract:
The evolution towards 6G networks requires the intelligent integration of communication and sensing capabilities to support diverse and complex applications, such as autonomous driving and immersive services. However, existing integrated sensing and communication (ISAC) systems predominantly rely on single-modal sensors as primary participants, which leads to a limited representation of environmental features and significant performance bottlenecks under the emerging requirements of 6G applications. This limitation motivates a paradigm shift from single-modal to multimodal ISAC. In this article, we first analyze the key challenges in realizing multimodal ISAC, including the fusion of heterogeneous multimodal data, the high communication overhead among distributed sensors, and the design of efficient and scalable system architectures. We then introduce several enabling technologies, such as large AI models, semantic communication, and multi-agent systems, that hold promise for addressing these challenges. To operationalize these technologies, we zoom into three architectural paradigms: fusion-based multimodal ISAC (F-MAC), interaction-based multimodal ISAC (I-MAC), and relay-based multimodal ISAC (R-MAC), each tailored to organize devices and modalities for efficient collaboration in different scenarios. Thereafter, a case study is presented based on the F-MAC scheme, demonstrating that the scheme achieves more comprehensive sensing and improves sensing accuracy by approximately 80% compared to conventional single-modal ISAC systems. Finally, we discuss several open issues to be addressed in the future.
Chinese: 向6G网络的演进要求从单模态转向多模态的集成感知与通信,以克服性能瓶颈,通过人工智能模型和语义通信等技术实现,其中F-MAC架构方案将感知精度提升了约80%。
English: The transition to 6G networks necessitates a shift from single-modal to multimodal integrated sensing and communication (ISAC) to overcome performance limitations, enabled by technologies like AI models and semantic communication, with proposed architectures such as F-MAC demonstrating an 80% improvement in sensing accuracy.
Authors:Shaheer U. Saeed, Yipei Wang, Veeru Kasivisvanathan, Brian R. Davidson, Matthew J. Clarkson, Yipeng Hu, Daniel C. Alexander
Abstract:
Reasoning is a hallmark of human intelligence, enabling adaptive decision-making in complex and unfamiliar scenarios. In contrast, machine intelligence remains bound to training data, lacking the ability to dynamically refine solutions at inference time. While some recent advances have explored reasoning in machines, these efforts are largely limited to verbal domains such as mathematical problem-solving, where explicit rules govern step-by-step reasoning. Other critical real-world tasks - including visual perception, spatial reasoning, and radiological diagnosis - require non-verbal reasoning, which remains an open challenge. Here we present a novel learning paradigm that enables machine reasoning in vision by allowing performance improvement with increasing thinking time (inference-time compute), even under conditions where labelled data is very limited. Inspired by dual-process theories of human cognition in psychology, our approach integrates a fast-thinking System I module for familiar tasks, with a slow-thinking System II module that iteratively refines solutions using self-play reinforcement learning. This paradigm mimics human reasoning by proposing, competing over, and refining solutions in data-scarce scenarios. We demonstrate superior performance through extended thinking time, compared not only to large-scale supervised learning but also foundation models and even human experts, in real-world vision tasks. These tasks include computer-vision benchmarks and cancer localisation on medical images across five organs, showcasing transformative potential for non-verbal machine reasoning.
Chinese: 本研究受人类双过程认知启发,提出一种新型学习范式,使机器能在推理过程中通过迭代优化提升视觉任务中的非语言推理能力,在医学影像等数据稀缺场景下超越了监督学习模型甚至人类专家表现。
English: This study introduces a novel learning paradigm inspired by human dual-process cognition, enabling machines to improve reasoning in vision tasks through iterative refinement during inference, outperforming supervised models and human experts in data-scarce scenarios like medical imaging.
Authors:Weimin Xiong, Ke Wang, Yifan Song, Hanchao Liu, Sai Zhou, Wei Peng, Sujian Li
Abstract:
Current evaluations of tool-integrated LLM agents typically focus on end-to-end tool-usage evaluation while neglecting their stability. This limits their real-world applicability, as various internal or external factors can cause agents to crash or behave abnormally. Our research addresses this by investigating whether agents are vulnerable to errors throughout the entire tool invocation process, including reading tool documentation, selecting tools and generating parameters, and processing the tool's response. Through extensive experiments, we observe that agents are highly susceptible to errors at each stage and agents based on open-source models are more vulnerable than those based on proprietary models. We also find that increasing the model size does not significantly improve tool invocation reasoning and may make agents more vulnerable to attacks resembling normal user instructions. This highlights the importance of evaluating agent stability and offers valuable insights for future LLM development and evaluation.
Chinese: 当前工具集成LLM智能体的评估忽视稳定性,而我们的研究发现它们在工具调用各阶段均易出错,开源模型更脆弱,增大模型规模未必提升推理能力反而增加攻击风险。
English: Current tool-integrated LLM agent evaluations overlook stability, but our study reveals their vulnerability to errors across all tool-invocation stages, with open-source models being more susceptible and larger models not necessarily improving reasoning while increasing attack risks.
Authors:Zhiyuan Zhang, Xiaosong Jia, Guanyu Chen, Qifeng Li, Junchi Yan
Abstract:
In this technical report, we introduce TrajTok, a trajectory tokenizer for discrete next-token-prediction based behavior generation models, which combines data-driven and rule-based methods with better coverage, symmetry and robustness, along with a spatial-aware label smoothing method for cross-entropy loss. We adopt the tokenizer and loss for the SMART model and reach a superior performance with realism score of 0.7852 on the Waymo Open Sim Agents Challenge 2025. We will open-source the code in the future.
Chinese: 本报告介绍了TrajTok轨迹分词器,它结合数据驱动与规则方法提升行为生成效果,并采用空间感知标签平滑技术,在Waymo挑战赛中取得0.7852真实度评分,代码将开源。
English: This report presents TrajTok, a trajectory tokenizer that integrates data-driven and rule-based approaches for improved behavior generation, along with a spatial-aware label smoothing technique, achieving a 0.7852 realism score in the Waymo challenge with future code release.
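Spatial-aware label smoothing plausibly spreads the smoothing mass over tokens whose spatial centers are near the ground-truth token, rather than uniformly over the vocabulary. The Gaussian kernel and bandwidth below are our illustrative choices:

```python
import torch
import torch.nn.functional as F

def spatial_label_smoothing(target, centers, eps=0.1, sigma=0.5):
    """Soft cross-entropy target that puts the smoothing mass eps on
    spatially nearby trajectory tokens (our reading of 'spatial-aware
    label smoothing'). centers: (vocab, 2) token coordinates."""
    d2 = ((centers - centers[target]) ** 2).sum(-1)        # squared distances
    neighbor = F.softmax(-d2 / (2 * sigma ** 2), dim=0)    # spatial kernel
    soft = torch.zeros(len(centers))
    soft[target] = 1.0 - eps
    return soft + eps * neighbor                           # sums to 1

# Toy vocabulary: a 5x5 grid of trajectory tokens.
xs, ys = torch.meshgrid(torch.arange(5.), torch.arange(5.), indexing="ij")
centers = torch.stack([xs.flatten(), ys.flatten()], dim=-1)   # (25, 2)
soft_target = spatial_label_smoothing(target=12, centers=centers)
logits = torch.randn(25)
loss = -(soft_target * F.log_softmax(logits, dim=0)).sum()
```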
Authors:Jiashuo Wang, Kaitao Song, Chunpu Xu, Changhe Song, Yang Xiao, Dongsheng Li, Lili Qiu, Wenjie Li
Abstract:
Enhancing user engagement through interactions plays an essential role in socially-driven dialogues. While prior works have optimized models to reason over relevant knowledge or plan a dialogue act flow, the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee user engagement in socially-driven dialogues. To this end, we enable interactive LLMs to learn user engagement by leveraging signals from the future development of conversations. Specifically, we adopt a more direct and relevant indicator of user engagement, i.e., the user's reaction related to dialogue intention after the interaction, as a reward to align interactive LLMs. To achieve this, we develop a user simulator to interact with target interactive LLMs and explore interactions between the user and the interactive LLM system via i×MCTS (Monte Carlo Tree Search for interaction). In this way, we collect a dataset containing pairs of higher- and lower-quality experiences using i×MCTS, and align interactive LLMs for high-level user engagement by direct preference optimization (DPO) accordingly. Experiments conducted on two socially-driven dialogue scenarios (emotional support conversations and persuasion for good) demonstrate that our method effectively enhances user engagement in interactive LLMs.
Chinese: 本研究通过将对话未来信号作为奖励,采用蒙特卡洛树搜索交互和直接偏好优化方法,有效提升了交互式大语言模型的用户参与度。
English: This study enhances user engagement in interactive LLMs by using future conversation signals as rewards and optimizing models through Monte Carlo Tree Search-based interaction and direct preference optimization.
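The alignment step uses standard direct preference optimization on the higher/lower-quality pairs mined with i×MCTS. For reference, the DPO objective in PyTorch (sequence-level log-probabilities are assumed precomputed):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on (preferred, dispreferred) pairs: push the
    policy's margin over the reference model toward the preferred, i.e.
    more engaging, continuation."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy sequence log-probabilities summed over tokens, batch of 3 pairs.
logp_w = torch.tensor([-12.0, -30.0, -8.0], requires_grad=True)
logp_l = torch.tensor([-11.0, -35.0, -9.0], requires_grad=True)
loss = dpo_loss(logp_w, logp_l, torch.tensor([-12.5, -31.0, -8.5]),
                torch.tensor([-11.2, -34.0, -9.1]))
loss.backward()  # gradients raise p(y_w) and lower p(y_l)
```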
Authors:Haoyang Wu, Tsun-Hsuan Wang, Mathias Lechner, Ramin Hasani, Jennifer A. Eckhoff, Paul Pak, Ozanan R. Meireles, Guy Rosman, Yutong Ban, Daniela Rus
Abstract:
Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained using a hybrid discrete-continuous supervision strategy, where both signals of discrete phase labels and continuous phase progresses are propagated through the network. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available after paper acceptance.
Chinese: 本文提出了一种新颖的分层状态空间模型,能够通过捕捉局部和全局动态来高效分析完整手术视频,在多个数据集上显著超越了现有方法。
English: This paper introduces a novel hierarchical state space model that efficiently analyzes full-length surgical videos by capturing both local and global dynamics, significantly outperforming existing methods across multiple datasets.
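The linear-scaling property comes from the state-space recurrence itself. The loop below shows the semantics of an input-dependent (selective) scan; production implementations replace it with a parallel scan, and the paper's local-aggregation and global-relation blocks stack variants of this primitive.

```python
import torch

def selective_ssm_scan(x, A, B, C):
    """Minimal input-dependent state-space recurrence:
        h_t = a_t * h_{t-1} + b_t * x_t,   y_t = <c_t, h_t>.
    x: (T, d); A, B, C: (T, d) gates already computed from the input.
    The sequential loop is O(T), unlike quadratic attention."""
    T, d = x.shape
    h = torch.zeros(d)
    ys = []
    for t in range(T):
        h = A[t] * h + B[t] * x[t]      # state decays/updates per input
        ys.append((C[t] * h).sum())     # readout
    return torch.stack(ys)              # (T,)

T, d = 1000, 16                          # long surgical videos stay O(T)
x = torch.randn(T, d)
gate = torch.sigmoid(torch.randn(T, d))  # input-dependent retention
y = selective_ssm_scan(x, A=gate, B=1.0 - gate, C=torch.randn(T, d))
```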
Authors:Li Fan, Peng Wang, Jing Yang, Cong Shen
Abstract:
Transformers have shown potential in solving wireless communication problems, particularly via in-context learning (ICL), where models adapt to new tasks through prompts without requiring model updates. However, prior ICL-based Transformer models rely on deep architectures with many layers to achieve satisfactory performance, resulting in substantial storage and computational costs. In this work, we propose CHain Of thOught Symbol dEtection (CHOOSE), a CoT-enhanced shallow Transformer framework for wireless symbol detection. By introducing autoregressive latent reasoning steps within the hidden space, CHOOSE significantly improves the reasoning capacity of shallow models (1-2 layers) without increasing model depth. This design enables lightweight Transformers to achieve detection performance comparable to much deeper models, making them well-suited for deployment on resource-constrained mobile devices. Experimental results demonstrate that our approach outperforms conventional shallow Transformers and achieves performance comparable to that of deep Transformers, while maintaining storage and computational efficiency. This represents a promising direction for implementing Transformer-based algorithms in wireless receivers with limited computational resources.
Chinese: CHOOSE框架通过引入隐藏空间中的自回归推理步骤,显著提升浅层Transformer(1-2层)的无线符号检测能力,在保持计算效率的同时达到与深层模型相当的性能。
English: The CHOOSE framework enhances shallow Transformers for wireless symbol detection by introducing latent reasoning steps, enabling lightweight models with 1-2 layers to match deep model performance while maintaining computational efficiency.
Authors:Zhuochen Miao, Jun Lv, Hongjie Fang, Yang Jin, Cewu Lu
Abstract:
Imitation learning has emerged as a powerful paradigm in robot manipulation, yet its generalization capability remains constrained by object-specific dependencies in limited expert demonstrations. To address this challenge, we propose knowledge-driven imitation learning, a framework that leverages external structural semantic knowledge to abstract object representations within the same category. We introduce a novel semantic keypoint graph as a knowledge template and develop a coarse-to-fine template-matching algorithm that optimizes both structural consistency and semantic similarity. Evaluated on three real-world robotic manipulation tasks, our method achieves superior performance, surpassing image-based diffusion policies with only one-quarter of the expert demonstrations. Extensive experiments further demonstrate its robustness across novel objects, backgrounds, and lighting conditions. This work pioneers a knowledge-driven approach to data-efficient robotic learning in real-world settings. Code and more materials are available on https://knowledge-driven.github.io/.
中文摘要:本文提出知识驱动的模仿学习框架,利用外部语义知识抽象同类物体表示,通过语义关键点图实现高效泛化,仅需少量专家演示即可在多种真实场景中超越现有方法。
English Summary: This paper introduces knowledge-driven imitation learning, a framework that uses semantic knowledge to enhance robot manipulation by abstracting object representations and employing a semantic keypoint graph for improved generalization with fewer demonstrations.
Authors:Md Toufique Hasan, Muhammad Waseem, Kai-Kristian Kemell, Ayman Asad Khan, Mika Saari, Pekka Abrahamsson
Abstract:
Retrieval-Augmented Generation (RAG) systems are emerging as a key approach for grounding Large Language Models (LLMs) in external knowledge, addressing limitations in factual accuracy and contextual relevance. However, there is a lack of empirical studies that report on the development of RAG-based implementations grounded in real-world use cases, evaluated through general user involvement, and accompanied by systematic documentation of lessons learned. This paper presents five domain-specific RAG applications developed for real-world scenarios across governance, cybersecurity, agriculture, industrial research, and medical diagnostics. Each system incorporates multilingual OCR, semantic retrieval via vector embeddings, and domain-adapted LLMs, deployed through local servers or cloud APIs to meet distinct user needs. A web-based evaluation involving a total of 100 participants assessed the systems across six dimensions: (i) Ease of Use, (ii) Relevance, (iii) Transparency, (iv) Responsiveness, (v) Accuracy, and (vi) Likelihood of Recommendation. Based on user feedback and our development experience, we documented twelve key lessons learned, highlighting technical, operational, and ethical challenges affecting the reliability and usability of RAG systems in practice.
中文摘要:本文开发了五个面向真实场景的跨领域RAG应用,通过用户评估总结了影响系统可靠性与实用性的十二项关键技术、运营及伦理挑战经验。
English Summary: This paper develops five real-world RAG applications across multiple domains, evaluating them through user studies and documenting twelve key lessons about technical, operational, and ethical challenges affecting RAG system reliability and usability.
Authors:Yaxi Chen, Simin Ni, Shaheer U. Saeed, Aleksandra Ivanova, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng Hu
Abstract:
Accurate interpretation of knee MRI scans relies on expert clinical judgment, often with high variability and limited scalability. Existing radiomic approaches use a fixed set of radiomic features (the signature), selected at the population level and applied uniformly to all patients. While interpretable, these signatures are often too constrained to represent individual pathological variations. As a result, conventional radiomic-based approaches are found to be limited in performance, compared with recent end-to-end deep learning (DL) alternatives that forgo interpretable radiomic features. We argue that the individual-agnostic nature of current radiomic selection is not central to its interpretability, but is responsible for the poor generalization in our application. Here, we propose a novel radiomic fingerprint framework, in which a radiomic feature set (the fingerprint) is dynamically constructed for each patient, selected by a DL model. Unlike the existing radiomic signatures, our fingerprints are derived on a per-patient basis by predicting the feature relevance in a large radiomic feature pool, and selecting only those that are predictive of clinical conditions for individual patients. The radiomic-selecting model is trained simultaneously with a low-dimensional (considered relatively explainable) logistic regression for downstream classification. We validate our methods across multiple diagnostic tasks including general knee abnormalities, anterior cruciate ligament (ACL) tears, and meniscus tears, demonstrating comparable or superior diagnostic accuracy relative to state-of-the-art end-to-end DL models. More importantly, we show that the interpretability inherent in our approach facilitates meaningful clinical insights and potential biomarker discovery, with detailed discussion, quantitative and qualitative analysis of real-world clinical cases to evidence these advantages.
中文: 该研究提出了一种新型放射组学指纹框架,通过深度学习模型为每位患者动态筛选个性化放射组学特征,在保持临床可解释性的同时,诊断准确率与端到端深度学习方法相当甚至更优。
English: The proposed radiomic fingerprint framework dynamically selects personalized radiomic features for each patient using a deep learning model, achieving diagnostic accuracy comparable to end-to-end deep learning methods while maintaining interpretability for clinical insights.
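The per-patient fingerprint can be read as a learned relevance gate over a large radiomic feature pool feeding a logistic-regression head, trained jointly. A minimal PyTorch sketch under that assumption follows; the selector architecture and sigmoid gating are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RadiomicFingerprint(nn.Module):
    """Per-patient feature selection over a radiomic pool + logistic regression."""

    def __init__(self, num_features, hidden=128):
        super().__init__()
        # Predicts a per-patient relevance score for every pooled feature.
        self.selector = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, num_features),
        )
        self.classifier = nn.Linear(num_features, 1)  # low-dim, explainable head

    def forward(self, radiomics):
        # radiomics: (batch, num_features) precomputed radiomic features
        relevance = torch.sigmoid(self.selector(radiomics))   # per-patient gate
        fingerprint = relevance * radiomics                   # selected features
        return self.classifier(fingerprint), relevance
```

Because the downstream head is a plain logistic regression over gated radiomic features, the relevance vector itself can be inspected per patient, which is where the claimed interpretability lives.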
Authors:Jiaxing Huang, Heng Guo, Le Lu, Fan Yang, Minfeng Xu, Ge Yang, Wei Luo
Abstract:
Osteoporosis, characterized by reduced bone mineral density (BMD) and compromised bone microstructure, increases fracture risk in aging populations. While dual-energy X-ray absorptiometry (DXA) is the clinical standard for BMD assessment, its limited accessibility hinders diagnosis in resource-limited regions. Opportunistic computed tomography (CT) analysis has emerged as a promising alternative for osteoporosis diagnosis using existing imaging data. Current approaches, however, face three limitations: (1) underutilization of unlabeled vertebral data, (2) systematic bias from device-specific DXA discrepancies, and (3) insufficient integration of clinical knowledge such as spatial BMD distribution patterns. To address these, we propose a unified deep learning framework with three innovations. First, a self-supervised learning method using radiomic representations to leverage unlabeled CT data and preserve bone texture. Second, a Mixture of Experts (MoE) architecture with learned gating mechanisms to enhance cross-device adaptability. Third, a multi-task learning framework integrating osteoporosis diagnosis, BMD regression, and vertebra location prediction. Validated across three clinical sites and an external hospital, our approach demonstrates superior generalizability and accuracy over existing methods for opportunistic osteoporosis screening and diagnosis.
中文: 本研究提出统一深度学习框架,通过自监督学习、跨设备适应和多任务整合解决了基于CT的骨质疏松诊断中的关键局限,在多个临床场景中验证了其优越的准确性和泛化能力。
English: This study introduces a unified deep learning framework that overcomes limitations in opportunistic CT-based osteoporosis diagnosis through self-supervised learning, cross-device adaptation, and multi-task integration, demonstrating superior accuracy and generalizability across clinical settings.
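The cross-device adaptation component is described as a Mixture of Experts with learned gating. Below is a minimal sketch of that standard pattern; the expert count, expert form, and input features are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DeviceAdaptiveMoE(nn.Module):
    """Mixture of Experts with a learned gate (expert count/sizes illustrative)."""

    def __init__(self, in_dim, out_dim, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        # x: (batch, in_dim) vertebral features from the shared backbone
        weights = torch.softmax(self.gate(x), dim=-1)             # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, out)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # gated mixture
```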
Authors:Fangzhi Li, Zhichu Ren, Cunhua Pan, Hong Ren, Jing Jin, Qixing Wang, Jiangzhou Wang
Abstract:
To empower the low-altitude economy with high-accuracy sensing and high-rate communication, this paper proposes a cooperative integrated sensing and communication (ISAC) framework for aerial-ground networks. In the proposed system, the ground base stations (BSs) cooperatively serve the unmanned aerial vehicles (UAVs), which are equipped for either joint communication and sensing or sensing-only operations. The BSs employ coordinated beamforming to simultaneously transmit communication and sensing signals, while the UAVs execute their missions. To maximize the weighted sum rate under the sensing signal-to-interference-plus-noise ratio (SINR) constraints, we jointly optimize the transmit beamforming, receive filtering, and UAV trajectory. The resulting non-convex problem is solved using an alternating optimization framework incorporating semidefinite relaxation (SDR) and successive convex approximation (SCA). Simulation results demonstrate that the proposed joint design achieves higher communication throughput while ensuring required sensing robustness. Additionally, the sensing SINR threshold and the UAV altitude have a significant impact on the trajectory design, highlighting the necessity of adaptive deployment strategies in practical applications.
中文摘要:本文提出一种空地网络协同通感融合框架,通过联合优化波束成形、接收滤波和无人机轨迹,在满足感知性能要求的同时最大化通信吞吐量。
English Summary: This paper introduces a cooperative integrated sensing and communication framework for aerial-ground networks that jointly optimizes beamforming, filtering, and UAV trajectories to maximize communication rates while meeting sensing requirements.
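Under the stated objective, the joint design plausibly takes the following form; the notation (beamformers, filters, trajectories, thresholds) is assumed for illustration and is not copied from the paper.

```latex
% One plausible formalization of the joint design (notation assumed):
\begin{aligned}
\max_{\{\mathbf{w}_k\},\,\{\mathbf{u}_m\},\,\{\mathbf{q}_u[t]\}} \quad
  & \sum_{k} \alpha_k \log_2\!\bigl(1 + \mathrm{SINR}^{\mathrm{comm}}_{k}\bigr) \\
\text{s.t.} \quad
  & \mathrm{SINR}^{\mathrm{sense}}_{m} \ge \Gamma, \quad \forall m, \\
  & \textstyle\sum_k \|\mathbf{w}_k\|^2 \le P_{\max},
    \quad \text{UAV mobility constraints on } \mathbf{q}_u[t].
\end{aligned}
```

The abstract's alternating-optimization strategy then cycles over the three variable blocks, handling the non-convex beamforming subproblem with SDR and the trajectory subproblem with SCA.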
Authors:Junjie Xu, Jiahao Zhang, Mangal Prakash, Xiang Zhang, Suhang Wang
Abstract:
Geometric graph neural networks (GNNs) that respect E(3) symmetries have achieved strong performance on small molecule modeling, but they face scalability and expressiveness challenges when applied to large biomolecules such as RNA and proteins. These systems require models that can simultaneously capture fine-grained atomic interactions, long-range dependencies across spatially distant components, and biologically relevant hierarchical structure, such as atoms forming residues, which in turn form higher-order domains. Existing geometric GNNs, which typically operate exclusively in either Euclidean or Spherical Harmonics space, are limited in their ability to capture both the fine-scale atomic details and the long-range, symmetry-aware dependencies required for modeling the multi-scale structure of large biomolecules. We introduce DualEquiNet, a Dual-Space Hierarchical Equivariant Network that constructs complementary representations in both Euclidean and Spherical Harmonics spaces to capture local geometry and global symmetry-aware features. DualEquiNet employs bidirectional cross-space message passing and a novel Cross-Space Interaction Pooling mechanism to hierarchically aggregate atomic features into biologically meaningful units, such as residues, enabling efficient and expressive multi-scale modeling for large biomolecular systems. DualEquiNet achieves state-of-the-art performance on multiple existing benchmarks for RNA property prediction and protein modeling, and outperforms prior methods on two newly introduced 3D structural benchmarks demonstrating its broad effectiveness across a range of large biomolecule modeling tasks.
中文: DualEquiNet通过构建欧几里得空间和球谐空间的双重表征,采用跨空间信息传递机制实现层次化特征聚合,在保持E(3)对称性的同时有效捕捉生物大分子的多尺度结构特征。
English: DualEquiNet introduces a dual-space hierarchical equivariant network that integrates Euclidean and Spherical Harmonics representations through cross-space interactions, enabling superior multi-scale biomolecular modeling by capturing both atomic details and long-range dependencies.
Authors:Saeed Mahloujifar, Chuan Guo, G. Edward Suh, Kamalika Chaudhuri
Abstract:
Differential privacy (DP) has become the standard for private data analysis. Certain machine learning applications only require privacy protection for specific protected attributes. Using naive variants of differential privacy in such use cases can result in unnecessary degradation of utility. In this work, we refine the definition of DP to create a more general and flexible framework that we call feature differential privacy (FDP). Our definition is simulation-based and allows for both addition/removal and replacement variants of privacy, and can handle arbitrary and adaptive separation of protected and non-protected features. We prove the properties of FDP, such as adaptive composition, and demonstrate its implications for limiting attribute inference attacks. We also propose a modification of the standard DP-SGD algorithm that satisfies FDP while leveraging desirable properties such as amplification via sub-sampling. We apply our framework to various machine learning tasks and show that it can significantly improve the utility of DP-trained models when public features are available. For example, we train diffusion models on the AFHQ dataset of animal faces and observe a drastic improvement in FID compared to DP, from 286.7 to 101.9 at $\epsilon=8$, assuming that the blurred version of a training image is available as a public feature. Overall, our work provides a new approach to private data analysis that can help reduce the utility cost of DP while still providing strong privacy guarantees.
中文摘要:本文提出了特征差分隐私(FDP)框架,通过针对特定属性提供精准隐私保护,在保持强隐私保障的同时,相比标准差分隐私显著提升了模型实用性。
English Summary: This paper introduces Feature Differential Privacy (FDP), a refined framework that provides targeted privacy protection for specific attributes, significantly improving model utility while maintaining strong privacy guarantees compared to standard differential privacy.
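The AFHQ example suggests the public feature (a blurred image) is supplied to the model alongside the private input. Below is a plain DP-SGD-style step under that reading; it shows only the standard clip-and-noise mechanics with public-feature conditioning, not the paper's refined FDP accounting, and the two-argument model signature is an assumption.

```python
import torch

def dp_sgd_step(model, loss_fn, private_x, public_feat, y, optimizer,
                clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD-style step where a public feature conditions the model.

    Plain DP-SGD sketch: FDP refines the accounting so noise protects only
    the private attributes, which is not captured here.
    """
    optimizer.zero_grad()
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for i in range(private_x.size(0)):                 # per-sample gradients
        loss = loss_fn(model(private_x[i:i+1], public_feat[i:i+1]), y[i:i+1])
        g = torch.autograd.grad(loss, list(model.parameters()))
        norm = torch.sqrt(sum((gi ** 2).sum() for gi in g))
        scale = min(1.0, clip_norm / (norm.item() + 1e-12))  # bound sensitivity
        for acc, gi in zip(grads, g):
            acc += gi * scale
    for p, acc in zip(model.parameters(), grads):
        noise = torch.randn_like(acc) * noise_mult * clip_norm
        p.grad = (acc + noise) / private_x.size(0)     # noisy averaged gradient
    optimizer.step()
```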
Authors:Yuanhe Tian, Lei Mao, Yan Song
Abstract:
Generating reports for computed tomography (CT) images is a challenging task: while similar to other medical image report generation problems, it has unique characteristics, such as the spatial encoding of multiple images and the alignment between image volume and texts. Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume, where they first compress the volume and then divide the compressed CT slices into patches for visual encoding. These approaches do not explicitly account for the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions, to instruct CT report generation (CTRG). Considering the strong correlation among consecutive slices in CT scans, in this paper we propose a large language model (LLM)-based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to selectively obtain important visual information and align them with textual features, so as to better instruct an LLM for CTRG. Experimental results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models and achieves state-of-the-art results, demonstrating its validity and effectiveness.
中文: 本文提出了一种基于大语言模型的CT报告生成方法,通过循环处理CT切片并采用立体注意力机制分层建模视觉特征,实现与文本特征的对齐,在M3D-Cap数据集上取得了最优性能。
English: This paper introduces a novel LLM-based method for CT report generation that recurrently processes CT slices with stereo attention mechanisms to hierarchically model visual features and align them with text, achieving state-of-the-art performance on the M3D-Cap dataset.
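The recurrent extraction plus stereo attentions can be pictured as per-slice encoding carried through a recurrent state, followed by attention over the encoded slices to produce visual tokens for the LLM. The sketch below assumes that structure; the GRU cell, token count, and dimensions are illustrative stand-ins for the paper's components.

```python
import torch
import torch.nn as nn

class RecurrentSliceEncoder(nn.Module):
    """Encode CT slices one by one, then attend over them for an LLM prefix."""

    def __init__(self, slice_dim=768, hidden=768, num_heads=8):
        super().__init__()
        self.rnn = nn.GRUCell(slice_dim, hidden)     # carries inter-slice state
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 16, hidden))  # 16 visual tokens

    def forward(self, slice_feats):
        # slice_feats: (batch, num_slices, slice_dim) from a vision Transformer
        b, n, _ = slice_feats.shape
        h = slice_feats.new_zeros(b, self.rnn.hidden_size)
        states = []
        for t in range(n):                           # recurrent pass over slices
            h = self.rnn(slice_feats[:, t], h)
            states.append(h)
        mem = torch.stack(states, dim=1)             # (batch, num_slices, hidden)
        vis_tokens, _ = self.attn(self.query.expand(b, -1, -1), mem, mem)
        return vis_tokens                            # prefix tokens for the LLM
```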
Authors:Ozgur O. Kilic, David K. Park, Yihui Ren, Tatiana Korchuganova, Sairam Sri Vatsavai, Joseph Boudreau, Tasnuva Chowdhury, Shengyu Feng, Raees Khan, Jaehyung Kim, Scott Klasky, Tadashi Maeno, Paul Nilsson, Verena Ingrid Martinez Outschoorn, Norbert Podhorszki, Frédéric Suter, Wei Yang, Yiming Yang, Shinjae Yoo, Alexei Klimentov, Adolfy Hoisie
Abstract:
Large-scale scientific collaborations like ATLAS, Belle II, CMS, DUNE, and others involve hundreds of research institutes and thousands of researchers spread across the globe. These experiments generate petabytes of data, with volumes soon expected to reach exabytes. Consequently, there is a growing need for computation, including structured data processing from raw data to consumer-ready derived data, extensive Monte Carlo simulation campaigns, and a wide range of end-user analysis. To manage these computational and storage demands, centralized workflow and data management systems are implemented. However, decisions regarding data placement and payload allocation are often made disjointly and via heuristic means. A significant obstacle in adopting more effective heuristic or AI-driven solutions is the absence of a quick and reliable introspective dynamic model to evaluate and refine alternative approaches. In this study, we aim to develop such an interactive system using real-world data. By examining job execution records from the PanDA workflow management system, we have pinpointed key performance indicators such as queuing time, error rate, and the extent of remote data access. The dataset includes five months of activity. Additionally, we are creating a generative AI model to simulate time series of payloads, which incorporate visible features like category, event count, and submitting group, as well as hidden features like the total computational load-derived from existing PanDA records and computing site capabilities. These hidden features, which are not visible to job allocators, whether heuristic or AI-driven, influence factors such as queuing times and data movement.
Chinese: 本研究利用来自PanDA工作流管理系统的真实数据开发了一个交互式系统,通过生成式AI模型模拟负载时间序列,以解决大规模科学合作中数据放置和分配优化缺乏动态模型的问题。
English: This study develops an interactive system using real-world data from the PanDA workflow management system to create a generative AI model that simulates payload time series, addressing the lack of dynamic models for optimizing data placement and allocation in large-scale scientific collaborations.
Authors:Shahbaz Siddeeq, Muhammad Waseem, Zeeshan Rasheed, Md Mahade Hasan, Jussi Rasku, Mika Saari, Henri Terho, Kalle Makela, Kai-Kristian Kemell, Pekka Abrahamsson
Abstract:
Refactoring is a constant activity in software development and maintenance. Scaling and maintaining software systems depend on code refactoring. However, this process is still labor-intensive, as it requires programmers to analyze the codebases in detail to avoid introducing new defects. In this research, we put forward a large language model (LLM)-based multi-agent system to automate the refactoring process on Haskell code. The objective of this research is to evaluate the effect of LLM-based agents in performing structured and semantically accurate refactoring on Haskell code. Our proposed multi-agent system is based on specialized agents with distinct roles, including code analysis, refactoring execution, verification, and debugging. To test the effectiveness and practical applicability of the multi-agent system, we conducted evaluations using different open-source Haskell codebases. The experiments showed that the proposed LLM-based multi-agent system achieved an average 11.03% decrease in code complexity, a 22.46% improvement in overall code quality, and an average 13.27% increase in performance efficiency. Furthermore, memory allocation was optimized by up to 14.57%. These results highlight the ability of LLM-based multi-agent systems to manage refactoring tasks targeted toward functional programming paradigms. Our findings suggest that integrating LLM-based multi-agent systems into the refactoring of functional programming languages can enhance maintainability and support automated development workflows.
Chinese: 本研究提出了一种基于大语言模型的多智能体系统,通过实验评估证明其能够自动化Haskell代码重构,并在代码复杂度、质量、性能效率和内存分配方面实现显著优化。
English: This research introduces a large language model-based multi-agent system to automate Haskell code refactoring, demonstrating significant improvements in code complexity, quality, performance efficiency, and memory allocation through experimental evaluations.
Authors:Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Alexander Hoyle, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Elliott Ash
Abstract:
Variation in human annotation (i.e., disagreements) is common in NLP, often reflecting important information like task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically evaluate each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, while naive Chain-of-Thought (CoT) reasoning improves the performance of RLHF LLMs (RL from human feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreements are important.
中文: RLVR式推理削弱大型语言模型对人类标注分歧的建模能力,而简单的思维链推理反而提升其表现,这警示在分歧至关重要时用推理型大模型替代人工标注存在潜在风险。
English: RLVR-style reasoning impairs LLMs' ability to model human annotation disagreements, whereas simple Chain-of-Thought reasoning enhances performance in disagreement modeling, highlighting risks in substituting human annotators with reasoning LLMs when disagreements matter.
Authors:Yingji Zhang, Marco Valentino, Danilo S. Carvalho, André Freitas
Abstract:
Incorporating explicit reasoning rules within the latent space of language models (LMs) offers a promising pathway to enhance generalisation, interpretability, and controllability. While current Transformer-based language models have shown strong performance on Natural Language Inference (NLI) tasks, they often rely on memorisation rather than rule-based inference. This work investigates how reasoning rules can be explicitly embedded and memorised within the LMs through Language Variational Autoencoders (VAEs). We propose a complete pipeline for learning reasoning rules within Transformer-based language VAEs. This pipeline encompasses three rule-based reasoning tasks, a supporting theoretical framework, and a practical end-to-end architecture. The experiments illustrate the following findings. Disentangled reasoning: under explicit signal supervision, reasoning rules, viewed as functional mappings, can be disentangled within the encoder's parametric space. This separation results in distinct clustering of rules in the output feature space. Prior knowledge injection: injecting reasoning information into the Query enables the model to more effectively retrieve the stored Value from memory based on the Key. This approach offers a simple method for integrating prior knowledge into decoder-only language models. Performance bottleneck: in mathematical reasoning tasks using Qwen2.5(0.5B), increasing sample count does not improve performance beyond a point. Moreover, feed-forward (FFN) layers are better than attention layers at preserving the separation of reasoning rules in the model's parameters.
中文: 本研究提出了一种在基于Transformer的语言变分自编码器中嵌入显式推理规则的完整流程,实验表明该方法能有效分离推理规则并整合先验知识,同时揭示了数学推理任务中的性能瓶颈问题。
English: This study introduces a pipeline for embedding explicit reasoning rules into Transformer-based language variational autoencoders, demonstrating improved rule disentanglement and prior knowledge integration while identifying performance bottlenecks in mathematical reasoning tasks.
Authors:Xuanyu Zhang, Weiqi Li, Shijie Zhao, Junlin Li, Li Zhang, Jian Zhang
Abstract:
Recent advances in AI-generated content (AIGC) have led to the emergence of powerful text-to-video generation models. Despite these successes, evaluating the quality of AIGC-generated videos remains challenging due to limited generalization, lack of temporal awareness, heavy reliance on large-scale annotated datasets, and the lack of effective interaction with generation models. Most current approaches rely on supervised finetuning of vision-language models (VLMs), which often require large-scale annotated datasets and tend to decouple understanding and generation. To address these shortcomings, we propose VQ-Insight, a novel reasoning-style VLM framework for AIGC video quality assessment. Our approach features: (1) a progressive video quality learning scheme that combines image quality warm-up, general task-specific temporal learning, and joint optimization with the video generation model; (2) the design of multi-dimension scoring rewards, preference comparison rewards, and temporal modeling rewards to enhance both generalization and specialization in video quality evaluation. Extensive experiments demonstrate that VQ-Insight consistently outperforms state-of-the-art baselines in preference comparison, multi-dimension scoring, and natural video scoring, bringing significant improvements for video generation tasks.
中文: 针对当前AI生成视频评估方法泛化性差、缺乏时序感知等问题,VQ-Insight创新性地提出渐进式学习框架与多维度奖励机制,在视频质量评估中显著优于现有方法。
English: Recent AI-generated video evaluation methods face challenges in generalization and temporal awareness, prompting the development of VQ-Insight, a reasoning-style framework that enhances video quality assessment through progressive learning and multi-dimensional rewards, outperforming existing baselines.
Authors:Francesco Marchiori, Marco Alecci, Luca Pajola, Mauro Conti
Abstract:
Adversarial examples are small and often imperceptible perturbations crafted to fool machine learning models. These attacks seriously threaten the reliability of deep neural networks, especially in security-sensitive domains. Evasion attacks, a form of adversarial attack where input is modified at test time to cause misclassification, are particularly insidious due to their transferability: adversarial examples crafted against one model often fool other models as well. This property, known as adversarial transferability, complicates defense strategies since it enables black-box attacks to succeed without direct access to the victim model. While adversarial training is one of the most widely adopted defense mechanisms, its effectiveness is typically evaluated on a narrow and homogeneous population of models. This limitation hinders the generalizability of empirical findings and restricts practical adoption.
In this work, we introduce DUMBer, an attack framework built on the foundation of the DUMB (Dataset soUrces, Model architecture, and Balance) methodology, to systematically evaluate the resilience of adversarially trained models. Our testbed spans multiple adversarial training techniques evaluated across three diverse computer vision tasks, using a heterogeneous population of uniquely trained models to reflect real-world deployment variability. Our experimental pipeline comprises over 130k evaluations spanning 13 state-of-the-art attack algorithms, allowing us to capture nuanced behaviors of adversarial training under varying threat models and dataset conditions. Our findings offer practical, actionable insights for AI practitioners, identifying which defenses are most effective based on the model, dataset, and attacker setup.
Chinese: 对抗样本是微小且难以察觉的扰动,旨在欺骗机器学习模型,对深度神经网络的可靠性构成严重威胁,本研究基于DUMB方法开发了DUMBer攻击框架,系统评估对抗训练模型在不同条件下的鲁棒性,为实际应用提供有效的防御建议。
English: Adversarial examples are subtle perturbations that deceive machine learning models, posing significant risks in security-sensitive applications, and this study introduces DUMBer, a framework to systematically assess the resilience of adversarially trained models across diverse conditions, providing actionable insights for effective defenses.
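A minimal version of the kind of transferability measurement DUMBer scales up: craft adversarial examples on a surrogate model and measure how much accuracy drops on independently trained victims. The sketch below uses plain FGSM as a stand-in for the paper's 13-attack suite; the models and data are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Craft FGSM adversarial examples against a surrogate model."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()

@torch.no_grad()
def transfer_accuracy(victim, x_adv, y):
    """Victim accuracy on examples crafted elsewhere (lower = better transfer)."""
    return (victim(x_adv).argmax(dim=1) == y).float().mean().item()

# usage sketch: adv = fgsm(surrogate, images, labels)
#               for v in victim_models: print(transfer_accuracy(v, adv, labels))
```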
Authors:Chengjie Liu, Weiyu Chen, Huiyao Xu, Yuan Du, Jun Yang, Li Du
Abstract:
In the design process of the analog circuit pre-layout phase, device sizing is an important step in determining whether an analog circuit can meet the required performance metrics. Many existing techniques formulate the circuit sizing task as a mathematical optimization problem and continuously improve the optimization efficiency from a mathematical perspective. However, they ignore the automatic introduction of prior knowledge and fail to achieve effective pruning of the search space, leaving a considerable compression margin in the search space. To alleviate this problem, we propose a large language model (LLM)-based multi-agent framework for extracting analog circuits' sizing relationships from academic papers. The search space in the sizing process can be effectively pruned based on the sizing relationships extracted by this framework. Eventually, we conducted tests on 3 types of circuits, and the optimization efficiency was improved by $2.32 \sim 26.6 \times$. This work demonstrates that the LLM can effectively prune the search space for analog circuit sizing, providing a new solution for the combination of LLMs and conventional analog circuit design automation methods.
Chinese: 本文提出了一种基于大语言模型的多智能体框架,用于从学术论文中提取模拟电路的尺寸关系,有效缩减设计搜索空间,使优化效率提升2.32至26.6倍。
English: This paper introduces a multi-agent framework based on large language models (LLMs) to extract sizing relationships from academic papers, effectively pruning the search space in analog circuit design and improving optimization efficiency by 2.32 to 26.6 times.
Authors:Zhenkun Zhang, Yining Xu, Cunhua Pan, Hong Ren, Yiming Yu, Jiangzhou Wang
Abstract:
The burgeoning low-altitude economy (LAE) necessitates integrated sensing and communication (ISAC) systems capable of high-accuracy multi-target localization and velocity estimation under hardware and coverage constraints inherent in conventional ISAC architectures. This paper addresses these challenges by proposing a cooperative bistatic ISAC framework within MIMO-OFDM cellular networks, enabling robust sensing services for LAE applications through standardized 5G New Radio (NR) infrastructure. We first develop a low-complexity parameter extraction algorithm employing CANDECOMP/PARAFAC (CP) tensor decomposition, which exploits the inherent Vandermonde structure in delay-related factor matrices to efficiently recover bistatic ranges, Doppler velocities, and angles-of-arrival (AoA) from multi-dimensional received signal tensors. To resolve data association ambiguity across distributed transmitter-receiver pairs and mitigate erroneous estimates, we further design a robust fusion scheme based on the minimum spanning tree (MST) method, enabling joint 3D position and velocity reconstruction. Comprehensive simulation results validate the framework's superiority in computational efficiency and sensing performance for low-altitude scenarios.
中文: 本文提出了一种基于5G NR基础设施的协作式双基地ISAC框架,通过低复杂度的CP张量分解算法提取参数,并采用最小生成树融合方案实现精确的3D定位和速度估计,为低空经济应用提供鲁棒的感知服务。
English: This paper proposes a cooperative bistatic ISAC framework using 5G NR infrastructure to enable robust multi-target sensing for low-altitude applications, featuring a low-complexity CP tensor decomposition algorithm for parameter extraction and an MST-based fusion scheme for accurate 3D positioning and velocity estimation.
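The CP decomposition at the heart of the parameter-extraction step can be computed with a basic alternating-least-squares loop. A minimal NumPy sketch of plain CP-ALS for a 3-way tensor follows; it omits the Vandermonde-structure exploitation that the paper adds for efficiency.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of (I, R) and (J, R) -> (I*J, R)."""
    R = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, R)

def cp_als(T, rank, iters=100, seed=0):
    """Plain CP-ALS; returns factor matrices (A, B, C) with T[i,j,k] ~ sum_r A B C."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((T.shape[i], rank)) for i in range(3))
    for _ in range(iters):
        A = unfold(T, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(T, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(T, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C
```

In the paper's setting, the recovered factor matrices encode delay, Doppler, and angle information along each tensor mode; here they are just generic CP factors.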
Authors:Lingxiao Zeng, Yiqi Tong, Wei Guo, Huarui Wu, Lihao Ge, Yijun Ye, Fuzhen Zhuang, Deqing Wang, Wei Guo, Cheng Chen
Abstract:
Agricultural named entity recognition is a specialized task focusing on identifying distinct agricultural entities within vast bodies of text, including crops, diseases, pests, and fertilizers. It plays a crucial role in enhancing information extraction from extensive agricultural text resources. However, the scarcity of high-quality agricultural datasets, particularly in Chinese, has resulted in suboptimal performance when employing mainstream methods for this purpose. Most earlier works only focus on annotating agricultural entities while overlooking the profound correlation of agriculture with hydrology and meteorology. To fill this gap, we present AgriCHN, a comprehensive open-source Chinese resource designed to promote the accuracy of automated agricultural entity annotation. The AgriCHN dataset has been meticulously curated from a wealth of agricultural articles, comprising a total of 4,040 sentences and encapsulating 15,799 agricultural entity mentions spanning 27 diverse entity categories. Furthermore, it encompasses entities from hydrology to meteorology, thereby enriching the diversity of entities considered. Data validation reveals that, compared with relevant resources, AgriCHN demonstrates outstanding data quality, attributable to its richer agricultural entity types and more fine-grained entity divisions. A benchmark task has also been constructed using several state-of-the-art neural NER models. Extensive experimental results highlight the significant challenge posed by AgriCHN and its potential for further research.
中文:该摘要介绍了AgriCHN,一个全面的开源中文数据集,旨在通过纳入农业、水文和气象等多样化实体来提升农业命名实体识别的准确性,解决了高质量资源稀缺的问题,并通过广泛验证和基准测试展示了其进一步研究的巨大潜力。
English: The abstract introduces AgriCHN, a comprehensive open-source Chinese dataset designed to enhance agricultural named entity recognition by including diverse entities from agriculture, hydrology, and meteorology, addressing the scarcity of high-quality resources and demonstrating significant potential for further research through extensive validation and benchmarking.
Authors:Adnan Qidwai, Srija Mukhopadhyay, Prerana Khatiwada, Dan Roth, Vivek Gupta
Abstract:
Accurate and complete product descriptions are crucial for e-commerce, yet seller-provided information often falls short. Customer reviews offer valuable details but are laborious to sift through manually. We present PRAISE: Product Review Attribute Insight Structuring Engine, a novel system that uses Large Language Models (LLMs) to automatically extract, compare, and structure insights from customer reviews and seller descriptions. PRAISE provides users with an intuitive interface to identify missing, contradictory, or partially matching details between these two sources, presenting the discrepancies in a clear, structured format alongside supporting evidence from reviews. This allows sellers to easily enhance their product listings for clarity and persuasiveness, and buyers to better assess product reliability. Our demonstration showcases PRAISE's workflow, its effectiveness in generating actionable structured insights from unstructured reviews, and its potential to significantly improve the quality and trustworthiness of e-commerce product catalogs.
中文: PRAISE系统利用大型语言模型自动从客户评论和卖家描述中提取并结构化信息,帮助用户识别差异,从而提升电商产品列表的清晰度和可信度。
English: PRAISE is a novel system utilizing Large Language Models to automatically extract and structure insights from customer reviews and seller descriptions, enabling users to identify discrepancies and improve e-commerce product listings for enhanced clarity and trustworthiness.
Authors:Rishi Bommasani, Scott R. Singer, Ruth E. Appel, Sarah Cen, A. Feder Cooper, Elena Cryst, Lindsey A. Gailmard, Ian Klaus, Meredith M. Lee, Inioluwa Deborah Raji, Anka Reuel, Drew Spence, Alexander Wan, Angelina Wang, Daniel Zhang, Daniel E. Ho, Percy Liang, Dawn Song, Joseph E. Gonzalez, Jonathan Zittrain, Jennifer Tour Chayes, Mariano-Florentino Cuellar, Li Fei-Fei
Abstract:
The innovations emerging at the frontier of artificial intelligence (AI) are poised to create historic opportunities for humanity but also raise complex policy challenges. Continued progress in frontier AI carries the potential for profound advances in scientific discovery, economic productivity, and broader social well-being. As the epicenter of global AI innovation, California has a unique opportunity to continue supporting developments in frontier AI while addressing substantial risks that could have far reaching consequences for the state and beyond. This report leverages broad evidence, including empirical research, historical analysis, and modeling and simulations, to provide a framework for policymaking on the frontier of AI development. Building on this multidisciplinary approach, this report derives policy principles that can inform how California approaches the use, assessment, and governance of frontier AI: principles rooted in an ethos of trust but verify. This approach takes into account the importance of innovation while establishing appropriate strategies to reduce material risks.
Chinese: 前沿人工智能在带来科学发现和经济发展的巨大机遇的同时,也需通过"信任但验证"的政策框架管控风险,加州作为创新中心具有引领发展与治理的双重使命。
English: Frontier AI offers immense opportunities for scientific and economic advancement but requires careful policy to manage risks, with California uniquely positioned to lead in both innovation and governance through a "trust but verify" approach.
Authors:Vladislav Esaulov, Jieyang Chen, Norbert Podhorszki, Fred Suter, Scott Klasky, Anu G Bourgeois, Lipeng Wan
Abstract:
In modern science, the growing complexity of large-scale projects has increased reliance on cross-facility workflows, where institutions share resources and expertise to accelerate discovery. These workflows often involve transferring massive data over wide-area networks. While high-speed networks like ESnet and data transfer services like Globus have improved data mobility, challenges remain. Large data volumes can strain bandwidth, TCP suffers from retransmissions due to packet loss, and traditional fault-tolerance methods like erasure coding introduce significant overhead.
This paper presents JANUS, a resilient and adaptive data transmission approach for cross-facility scientific workflows. JANUS uses UDP, integrates erasure coding for fault tolerance, and applies error-bounded lossy compression to reduce overhead. This design enables users to balance transmission time and accuracy based on specific needs. JANUS also adapts coding parameters to real-time network conditions and uses optimization models to determine ideal configurations. Experiments show that JANUS significantly improves data transfer efficiency while preserving fidelity.
中文摘要:JANUS是一种用于跨机构科学工作流的弹性数据传输方法,采用UDP协议、集成纠删码和误差有损压缩技术,能根据网络条件自适应调整参数,在保证数据精度的同时显著提升传输效率。
English Summary: JANUS is a resilient data transmission method for scientific workflows that uses UDP, erasure coding, and lossy compression to enhance efficiency and adapt to network conditions while maintaining data fidelity.
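The erasure-coding idea, at its simplest, is adding redundancy so the receiver can rebuild lost UDP packets without retransmission. Below is a deliberately minimal single-parity sketch (one XOR parity packet per group, recovering any one loss); real deployments, JANUS included, use stronger codes and adapt the coding parameters to network conditions.

```python
def add_parity(packets):
    """Append one XOR parity packet to a group of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, byte in enumerate(pkt):
            parity[i] ^= byte
    return packets + [bytes(parity)]

def recover(received, lost_index):
    """Rebuild the single missing packet by XOR-ing everything that arrived."""
    rebuilt = bytearray(len(next(p for p in received if p is not None)))
    for idx, pkt in enumerate(received):
        if idx == lost_index or pkt is None:
            continue
        for i, byte in enumerate(pkt):
            rebuilt[i] ^= byte
    return bytes(rebuilt)

# usage: group = add_parity([b"data-0..", b"data-1..", b"data-2.."])
#        group[1] = None                      # packet 1 lost in transit
#        restored = recover(group, lost_index=1)
```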
Authors:Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, André F. T. Martins
Abstract:
Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.
中文摘要:针对特定任务(如机器翻译)微调大语言模型通常会削弱通用能力,而Tower+模型通过创新的多阶段训练方案,在翻译专业化与多语言通用能力之间实现了帕累托最优平衡。
English summary: Fine-tuning LLMs for specialized tasks like machine translation often compromises general capabilities, but Tower+ models achieve a Pareto balance between translation expertise and multilingual general-purpose skills through a novel multi-stage training approach.
Authors:Ruiming Chen, Junming Yang, Shiyu Xia, Xu Yang, Jing Wang, Xin Geng
Abstract:
CLIP (Contrastive Language-Image Pre-training) has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the computational overhead of a large number of parameters and large-scale pre-training poses challenges for pre-training CLIP at different scales. Learngene extracts the generalizable components, termed learngene, from an ancestry model and initializes diverse descendant models with it. Previous Learngene paradigms fail to handle the generalizable knowledge in multimodal scenarios. In this paper, we put forward the idea of utilizing a multimodal block to extract the multimodal generalizable knowledge, which inspires us to propose MM-LG (Multimodal Learngene), a novel framework designed to extract and leverage generalizable components from CLIP. Specifically, we first establish multimodal and unimodal blocks to extract the multimodal and unimodal generalizable knowledge in a weighted-sum manner. Subsequently, we employ these components to numerically initialize descendant models of varying scales and modalities. Extensive experiments demonstrate MM-LG's effectiveness, which achieves performance gains over existing learngene approaches (e.g., +3.1% on Oxford-IIIT PET and +4.13% on Flickr30k) and comparable or superior results to the pre-training and fine-tuning paradigm (e.g., +1.9% on Oxford-IIIT PET and +3.65% on Flickr30k). Notably, MM-LG requires only around 25% of the parameter storage while reducing around 2.8 times pre-training costs for diverse model scales compared to the pre-training and fine-tuning paradigm, making it particularly suitable for efficient deployment across diverse downstream tasks.
中文:MM-LG是一种新颖框架,通过从CLIP中提取多模态通用知识来高效初始化不同规模的衍生模型,在显著降低计算成本和存储需求的同时实现了更优的性能表现。
English: MM-LG is a novel framework that extracts multimodal generalizable knowledge from CLIP to efficiently initialize descendant models, achieving superior performance with significantly reduced computational costs and storage requirements.
Authors:Raghav Mehta, Fabio De Sousa Ribeiro, Tian Xia, Melanie Roschewitz, Ainkaran Santhirasekaram, Dominic C. Marshall, Ben Glocker
Abstract:
Segmenting anatomical structures in medical images plays an important role in the quantitative assessment of various diseases. However, accurate segmentation becomes significantly more challenging in the presence of disease. Disease patterns can alter the appearance of surrounding healthy tissues, introduce ambiguous boundaries, or even obscure critical anatomical structures. As such, segmentation models trained on real-world datasets may struggle to provide good anatomical segmentation, leading to potential misdiagnosis. In this paper, we generate counterfactual (CF) images to simulate how the same anatomy would appear in the absence of disease without altering the underlying structure. We then use these CF images to segment structures of interest, without requiring any changes to the underlying segmentation model. Our experiments on two real-world clinical chest X-ray datasets show that the use of counterfactual images improves anatomical segmentation, thereby aiding downstream clinical decision-making.
中文: 通过生成反事实图像模拟无病变的解剖结构,无需改动分割模型即可提升分割精度,辅助临床决策。
English: Counterfactual images are generated to simulate disease-free anatomy, improving segmentation accuracy and clinical decision-making without modifying the segmentation model.
Authors:Sheng Liu, Tianlang Chen, Pan Lu, Haotian Ye, Yizheng Chen, Lei Xing, James Zou
Abstract:
Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
中文:Fractional Reasoning是一种无需训练的框架,通过在推理时缩放潜在引导向量来实现对推理强度的连续控制,从而在多种推理任务中同步提升广度策略和深度策略的测试时计算效果。
English: Fractional Reasoning is a training-free framework that enables continuous control over reasoning intensity during inference by scaling latent steering vectors, improving both breadth-based and depth-based test-time compute strategies across various reasoning tasks.
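The core mechanism, extracting a latent steering vector associated with deeper reasoning and re-applying it with a tunable scale, follows the standard activation-steering pattern. A minimal PyTorch-hook sketch under that reading is below; the layer path `model.transformer.h[L]` is a GPT-2-style naming assumption, and the mean-difference extraction recipe is one common choice rather than the paper's exact procedure.

```python
import torch

def make_reasoning_hook(steer_vec, alpha):
    """Add a scaled steering vector to a layer's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steer_vec          # alpha dials reasoning depth
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# steer_vec could be the mean difference of hidden states between prompts that
# elicit deep reasoning and prompts that do not (a common extraction recipe):
# steer_vec = h_deep.mean(dim=(0, 1)) - h_shallow.mean(dim=(0, 1))
#
# handle = model.transformer.h[L].register_forward_hook(
#     make_reasoning_hook(steer_vec, alpha=0.8))
# ... generate with tuned reasoning intensity ...
# handle.remove()
```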
Authors:Xixi Hu, Runlong Liao, Keyang Xu, Bo Liu, Yeqing Li, Eugene Ie, Hongliang Fei, Qiang Liu
Abstract:
Rectified Flow offers a simple and effective approach to high-quality generative modeling by learning a velocity field. However, we identify a limitation in directly modeling the velocity with an unconstrained neural network: the learned velocity often fails to satisfy certain boundary conditions, leading to inaccurate velocity field estimations that deviate from the desired ODE. This issue is particularly critical during stochastic sampling at inference, as the score function's errors are amplified near the boundary. To mitigate this, we propose a Boundary-enforced Rectified Flow Model (Boundary RF Model), in which we enforce boundary conditions with a minimal code modification. Boundary RF Model improves performance over vanilla RF model, demonstrating 8.01% improvement in FID score on ImageNet using ODE sampling and 8.98% improvement using SDE sampling.
Chinese: 边界约束整流流模型通过强制边界条件解决了无约束神经网络在学习速度场时的局限性,相比原始模型显著提升了性能,如在ImageNet上使用ODE采样时FID得分提高了8.01%。
English: The Boundary-enforced Rectified Flow Model addresses the limitation of unconstrained neural networks in learning accurate velocity fields by enforcing boundary conditions, resulting in significant performance improvements over the vanilla model, such as an 8.01% FID score gain on ImageNet with ODE sampling.
Authors:Taylor Lynn Curtis, Maximilian Puelma Touzel, William Garneau, Manon Gruaz, Mike Pinder, Li Wei Wang, Sukanya Krishna, Luda Cohen, Jean-François Godbout, Reihaneh Rabbany, Kellin Pelrine
Abstract:
The proliferation of misinformation poses a significant threat to society, exacerbated by the capabilities of generative AI. This demo paper introduces Veracity, an open-source AI system designed to empower individuals to combat misinformation through transparent and accessible fact-checking. Veracity leverages the synergy between Large Language Models (LLMs) and web retrieval agents to analyze user-submitted claims and provide grounded veracity assessments with intuitive explanations. Key features include multilingual support, numerical scoring of claim veracity, and an interactive interface inspired by familiar messaging applications. This paper will showcase Veracity's ability to not only detect misinformation but also explain its reasoning, fostering media literacy and promoting a more informed society.
中文:Veracity是一个开源AI系统,它结合大型语言模型与网络检索技术,通过多语言支持和直观解释来核实信息真伪,旨在打击虚假信息并提升公众媒介素养。
English: Veracity is an open-source AI system that uses large language models and web retrieval to fact-check claims with multilingual support and intuitive explanations, aiming to combat misinformation and enhance media literacy.
Authors:Yunhak Oh, Junseok Lee, Yeongmin Kim, Sangwoo Seo, Namkyeong Lee, Chanyoung Park
Abstract:
Spatially Resolved Transcriptomics (SRT) is a cutting-edge technique that captures the spatial context of cells within tissues, enabling the study of complex biological networks. Recent graph-based methods leverage both gene expression and spatial information to identify relevant spatial domains. However, these approaches fall short in obtaining meaningful spot representations, especially for spots near spatial domain boundaries, as they heavily emphasize adjacent spots that have minimal feature differences from an anchor node. To address this, we propose Spotscape, a novel framework that introduces the Similarity Telescope module to capture global relationships between multiple spots. Additionally, we propose a similarity scaling strategy to regulate the distances between intra- and inter-slice spots, facilitating effective multi-slice integration. Extensive experiments demonstrate the superiority of Spotscape in various downstream tasks, including single-slice and multi-slice scenarios. Our code is available at the following link: https://github.com/yunhak0/Spotscape.
Chinese: Spotscape是一种创新框架,通过引入相似性望远镜模块捕捉全局点关系,并采用相似性缩放策略实现有效的多切片整合,在各种下游任务中展现出卓越性能。
English: Spotscape is a novel framework that enhances spatially resolved transcriptomics by introducing a Similarity Telescope module to capture global spot relationships and a scaling strategy for effective multi-slice integration, demonstrating superior performance in downstream tasks.
Authors:Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, Jiang Bian
Abstract:
Latent action models (LAMs) aim to learn action-relevant changes from unlabeled videos by compressing changes between frames as latents. However, differences between video frames can be caused by controllable changes as well as exogenous noise, leading to an important concern -- do latents capture the changes caused by actions or irrelevant noise? This paper studies this issue analytically, presenting a linear model that encapsulates the essence of LAM learning while being tractable. This provides several insights, including connections between LAM and principal component analysis (PCA), desiderata of the data-generating policy, and justification of strategies to encourage learning controllable changes using data augmentation, data cleaning, and auxiliary action-prediction. We also provide illustrative results based on numerical simulation, shedding light on the specific structure of observations, actions, and noise in data that influence LAM learning.
Chinese: 本文通过可处理的线性模型分析潜在动作模型(LAMs),探讨其是否捕捉可控变化或无关噪声,揭示了与主成分分析(PCA)的关联,并验证了数据增强等策略对学习动作相关变化的有效性。
English: This paper analyzes latent action models (LAMs) to determine whether they capture controllable changes or irrelevant noise in unlabeled videos, using a tractable linear model to reveal connections with PCA and validate strategies like data augmentation for learning action-relevant changes.
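The PCA connection can be made concrete: in the linear regime, learning latents that compress frame differences amounts to projecting the differences onto their top principal components. A NumPy sketch of that reduced picture (a simplification of the paper's analysis, not its full model):

```python
import numpy as np

def linear_lam(frames, latent_dim):
    """Fit a linear latent action model as PCA on frame differences.

    frames: (T, D) flattened consecutive observations from one video.
    Returns the encoder matrix (latent_dim, D) and latents (T-1, latent_dim).
    """
    deltas = np.diff(frames, axis=0)                  # controllable + noise
    deltas = deltas - deltas.mean(axis=0)
    _, _, Vt = np.linalg.svd(deltas, full_matrices=False)
    encoder = Vt[:latent_dim]                         # top principal directions
    return encoder, deltas @ encoder.T                # latent "actions"
```

The paper's concern is visible here: the top principal directions capture whatever dominates the variance of the deltas, whether that is action-induced change or exogenous noise.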
Authors:Liulu He, Shenli Zheng, Karwei Sun, Yijiang Liu, Yufei Zhao, Chongkang Tan, Huanrui Yang, Yuan Du, Li Du
Abstract:
Rotations have become essential to state-of-the-art quantization pipelines for large language models (LLMs) by effectively smoothing outliers in weights and activations. However, further optimizing the rotation parameters offers only limited performance gains and introduces significant training overhead: due to rotation parameter sharing, the full model must be loaded simultaneously to enable backpropagation, resulting in substantial memory consumption and limited practical utility. In this work, we identify two fundamental limitations of current rotational quantization methods: (i) rotation fails to align channel means, resulting in wider quantization bounds and increased rounding errors; and (ii) rotation makes the activation distribution more Gaussian-like, increasing energy loss caused by clipping errors. To address these issues, we introduce BASE-Q, a simple yet powerful approach that combines bias correction and asymmetric scaling to effectively reduce rounding and clipping errors. Furthermore, BASE-Q enables blockwise optimization, eliminating the need for memory-intensive full-model backpropagation. Extensive experiments on various LLMs and benchmarks demonstrate the effectiveness of BASE-Q, narrowing the accuracy gap to full-precision models by 50.5%, 42.9%, and 29.2% compared to QuaRot, SpinQuant, and OSTQuant, respectively. The code will be released soon.
中文:旋转在LLM量化中虽能平滑异常值,但存在通道均值未对齐和分布高斯化的问题,BASE-Q通过偏置校正和非对称缩放有效减少误差,并实现高效的块状优化。
English: Rotations in LLM quantization smooth outliers but face limitations in aligning channel means and handling Gaussian-like distributions, which BASE-Q addresses through bias correction and asymmetric scaling to reduce errors and enable efficient blockwise optimization.
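The two fixes map onto standard quantization primitives: recenter each channel (bias correction) so the quantization range tightens, then quantize asymmetrically with a scale and zero-point. An illustrative NumPy sketch of those primitives follows; it is not the paper's exact procedure.

```python
import numpy as np

def base_q_style_quant(x, num_bits=4):
    """Per-channel bias correction + asymmetric quantization (illustrative).

    x: (channels, features) weights or activations.
    Returns dequantized values, with channel means folded out as a bias.
    """
    bias = x.mean(axis=1, keepdims=True)       # channel means -> carried as bias
    centered = x - bias                        # tighter, better-centered range
    lo = centered.min(axis=1, keepdims=True)
    hi = centered.max(axis=1, keepdims=True)
    qmax = 2 ** num_bits - 1
    scale = np.maximum((hi - lo) / qmax, 1e-8) # asymmetric: scale + zero-point
    zero = np.round(-lo / scale)
    q = np.clip(np.round(centered / scale) + zero, 0, qmax)
    return (q - zero) * scale + bias           # dequantize, add correction back
```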
Authors:Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar, Alexander Keller, Sanja Fidler, Igor Gilitschenski, Zan Gojcic, Zian Wang
Abstract:
We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.
中文: 本文提出一种统一框架,通过视频扩散模型联合估计反照率并生成重照明内容,利用合成与真实视频数据训练,在泛化能力和视觉保真度方面均超越现有方法。
English: This paper presents a unified framework that jointly estimates albedo and generates relit content using video diffusion models, achieving superior generalization and outperforming prior methods in visual quality and temporal coherence by training on synthetic and real-world video data.
Authors:Jinglong Luo, Zhuo Zhang, Yehong Zhang, Shiyu Liu, Ye Dong, Xun Zhou, Hui Wang, Yue Yu, Zenglin Xu
Abstract:
Large language models (LLMs) have transformed numerous fields, yet their adaptation to specialized tasks in privacy-sensitive domains, such as healthcare and finance, is constrained by the scarcity of accessible training data due to stringent privacy requirements. Secure multi-party computation (MPC)-based privacy-preserving machine learning offers a powerful approach to protect both model parameters and user data, but its application to LLMs has been largely limited to inference, as fine-tuning introduces significant computational challenges, particularly in privacy-preserving backward propagation and optimizer operations. This paper identifies two primary obstacles to MPC-based privacy-preserving fine-tuning of LLMs: (1) the substantial computational overhead of backward and optimizer processes, and (2) the inefficiency of softmax-based attention mechanisms in MPC settings. To address these challenges, we propose SecFwT, the first MPC-based framework designed for efficient, privacy-preserving LLM fine-tuning. SecFwT introduces a forward-only tuning paradigm to eliminate backward and optimizer computations and employs MPC-friendly Random Feature Attention to approximate softmax attention, significantly reducing costly non-linear operations and computational complexity. Experimental results demonstrate that SecFwT delivers substantial improvements in efficiency and privacy preservation, enabling scalable and secure fine-tuning of LLMs for privacy-critical applications.
中文摘要:SecFwT首次提出基于多方安全计算的高效隐私保护大模型微调框架,通过仅前向调优范式消除反向传播与优化器计算,并采用MPC友好的随机特征注意力近似softmax注意力,显著提升了效率与隐私保护能力。
English Summary: SecFwT introduces the first MPC-based framework for efficient, privacy-preserving fine-tuning of LLMs, eliminating backward and optimizer computations via a forward-only tuning paradigm and approximating softmax attention with MPC-friendly Random Feature Attention to cut costly non-linear operations.
Authors:Jinglong Luo, Zhuo Zhang, Yehong Zhang, Shiyu Liu, Ye Dong, Hui Wang, Yue Yu, Xun Zhou, Zenglin Xu
Abstract:
Large Language Models (LLMs) have revolutionized numerous fields, yet their adaptation to specialized tasks in privacy-sensitive domains such as healthcare and finance remains constrained due to the scarcity of accessible training data caused by stringent privacy requirements. Secure Multi-party Computation (MPC)-based privacy-preserving machine learning provides theoretical guarantees for the privacy of model parameters and data. However, its application to LLMs has been predominantly limited to inference, as fine-tuning introduces significant efficiency challenges, particularly in backward propagation, optimizer, and self-attention operations. To address these challenges, we propose SecP-Tuning, the first MPC-based framework designed for efficient, privacy-preserving prompt tuning of LLMs. SecP-Tuning innovatively integrates Forward-only Tuning (FoT) through the "data owner-server interaction" paradigm, effectively removing the need for privacy-preserving computations in backward propagation and optimization processes. Furthermore, it devises an efficient privacy-preserving Random Feature Attention (RFA), effectively mitigating the computational complexity of softmax-based self-attention and circumventing MPC-incompatible nonlinear operations. Experimental results demonstrate that, compared to full-parameter Supervised Fine-Tuning (SFT) and gradient-based prompt tuning, SecP-Tuning achieves approximately 12 times and 16 times end-to-end acceleration, as well as 18 times and 20 times reductions in communication overhead, respectively. Moreover, in five few-shot tasks, it achieves an average performance score of 82.45, outperforming SFT's 79.90 and prompt tuning's 73.73. Additionally, the "black-box/API-style" privacy-preserving tuning paradigm of SecP-Tuning effectively avoids memory leakage risks caused by gradient/parameter transmission.
中文摘要:SecP-Tuning首次提出基于多方安全计算的高效隐私保护提示调优框架,在实现显著加速和通信开销降低的同时,在少样本任务中性能优于传统微调方法。
English Summary: SecP-Tuning introduces the first MPC-based framework for efficient privacy-preserving prompt tuning of LLMs, achieving significant acceleration and communication reduction while outperforming traditional methods in few-shot tasks.
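For intuition, here is a minimal NumPy sketch of the kernelized random-feature attention idea that Random Feature Attention builds on: pairwise softmax scores are replaced by a dot product of random feature maps, so attention becomes linear in sequence length and avoids the per-pair exponentials that are costly under MPC. The positive feature map below is one standard choice; the paper's exact MPC protocol and feature map are not reproduced here.

import numpy as np

def random_feature_map(x, W):
    # Positive random features for the softmax kernel (one standard choice):
    # phi(x) = exp(W x - ||x||^2 / 2) / sqrt(D), so E[phi(q) . phi(k)] = exp(q . k)
    proj = x @ W.T
    sq = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(proj - sq) / np.sqrt(W.shape[0])

def rfa_attention(Q, K, V, num_features=256, seed=0):
    d = Q.shape[-1]
    W = np.random.default_rng(seed).normal(size=(num_features, d))
    q = random_feature_map(Q / d**0.25, W)   # scaling reproduces Q K^T / sqrt(d)
    k = random_feature_map(K / d**0.25, W)
    kv = k.T @ V                   # (D, d_v): cost is linear in sequence length
    z = q @ k.sum(axis=0)          # per-query normalizer replacing the softmax sum
    return (q @ kv) / z[:, None]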
Authors:Qingsong Yan, Qiang Wang, Kaiyong Zhao, Jie Chen, Bo Li, Xiaowen Chu, Fei Deng
Abstract:
Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks and Temples dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.
中文: RA-NeRF通过结合光度一致性约束、光流驱动位姿校正和隐式位姿滤波,在复杂相机轨迹下实现了最先进的位姿估计与场景重建效果。
English: RA-NeRF is a novel method that achieves state-of-the-art camera pose estimation and scene reconstruction under complex trajectories by integrating photometric consistency, flow-driven regulation, and an implicit pose filter.
Authors:Daniel D'souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, Sara Hooker
Abstract:
One of the most profound challenges of modern machine learning is performing well on the long-tail of rare and underrepresented features. Large general-purpose models are trained for many tasks, but work best on high-frequency use cases. After training, it is hard to adapt a model to perform well on specific use cases underrepresented in the training corpus. Relying on prompt engineering or few-shot examples to maximize the output quality on a particular test case can be frustrating, as models can be highly sensitive to small changes, react in unpredictable ways, or rely on a fixed system prompt for maintaining performance. In this work, we ask: "Can we optimize our training protocols to both improve controllability and performance on underrepresented use cases at inference time?" We revisit the divide between training and inference techniques to improve long-tail performance while providing users with a set of control levers the model is trained to be responsive to. We create a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time. We fine-tune a base model to infer these markers automatically, which makes them optional at inference time. This principled and flexible approach yields pronounced improvements in performance, especially on examples from the long tail of the training distribution. While we observe an average lift of 5.7% in win rates for open-ended generation quality with our markers, we see over 9.1% gains in underrepresented domains. We also observe relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and absolute improvements of 35.3% on length instruction following evaluations.
中文: 现代机器学习在处理罕见特征上面临挑战,本研究通过引入基于数据特征与任务来源的标记系统对模型进行微调,显著提升了模型在长尾用例中的表现与控制能力,尤其在代表性不足的领域获得超过9%的性能提升。
English: Modern machine learning struggles with underrepresented features, but this work introduces a method to enhance model performance and controllability on long-tail cases by fine-tuning with explicit and implicit markers, achieving significant gains especially in rare domains.
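As a toy illustration of the marker idea, the snippet below prepends taxonomy tags (task provenance, length, quality) to training examples; these tag names are hypothetical placeholders, not the authors' actual taxonomy, and since the model is fine-tuned to infer the markers, they can simply be omitted at inference time.

def format_example(instruction, response, markers=None):
    # markers are optional: the model is fine-tuned to infer them when absent
    tags = " ".join(f"<{k}={v}>" for k, v in (markers or {}).items())
    prompt = f"{tags}\n{instruction}" if tags else instruction
    return {"prompt": prompt, "completion": response}

train_ex = format_example(
    "Fix the bug in this function.", "def add(a, b): return a + b",
    markers={"task": "code_repair", "length": "short", "quality": "high"})
test_ex = format_example("Fix the bug in this function.", "")  # no markers needed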
Authors:Fabien Bernier, Maxime Cordy, Yves Le Traon
Abstract:
Accurate electrical consumption forecasting is crucial for efficient energy management and resource allocation. While traditional time series forecasting relies on historical patterns and temporal dependencies, incorporating external factors -- such as weather indicators -- has shown significant potential for improving prediction accuracy in complex real-world applications. However, the inclusion of these additional features often degrades the performance of global predictive models trained on entire populations, despite improving individual household-level models. To address this challenge, we found that a hypernetwork architecture can effectively leverage external factors to enhance the accuracy of global electrical consumption forecasting models, by specifically adjusting the model weights to each consumer.
We collected a comprehensive dataset spanning two years, comprising consumption data from over 6,000 Luxembourgish households and corresponding external factors such as weather indicators, holidays, and major local events. By comparing various forecasting models, we demonstrate that a hypernetwork approach outperforms existing methods when combined with external factors, reducing forecasting errors and achieving the best accuracy while maintaining the benefits of a global model.
中文: 超网络架构通过结合天气等外部因素并针对每个消费者调整模型权重,有效提升了全球电力消耗预测的准确性,在保持全局模型优势的同时超越了现有方法。
English: A hypernetwork architecture effectively improves global electrical consumption forecasting by adapting model weights to individual consumers using external factors like weather, outperforming traditional methods while maintaining global model advantages.
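A minimal PyTorch sketch of the general mechanism, with illustrative dimensions that are not the paper's: a small hypernetwork maps a consumer's external features (weather, calendar, events) to the weights of a per-consumer linear forecaster, so a single global model still adapts its parameters to each household.

import torch, torch.nn as nn

class HyperForecaster(nn.Module):
    def __init__(self, ext_dim=16, hist_len=48, horizon=24, hidden=64):
        super().__init__()
        self.hist_len, self.horizon = hist_len, horizon
        n_weights = hist_len * horizon + horizon   # W and b of the target model
        self.hyper = nn.Sequential(
            nn.Linear(ext_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_weights))

    def forward(self, history, external):
        # history: (B, hist_len) past consumption; external: (B, ext_dim) factors
        params = self.hyper(external)              # per-consumer forecaster weights
        W = params[:, : self.hist_len * self.horizon].view(-1, self.horizon, self.hist_len)
        b = params[:, self.hist_len * self.horizon :]
        return torch.bmm(W, history.unsqueeze(-1)).squeeze(-1) + b  # (B, horizon)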
Authors:Tian Xia, Fabio De Sousa Ribeiro, Rajat R Rasal, Avinash Kori, Raghav Mehta, Ben Glocker
Abstract:
Counterfactual image generation aims to simulate realistic visual outcomes under specific causal interventions. Diffusion models have recently emerged as a powerful tool for this task, combining DDIM inversion with conditional generation via classifier-free guidance (CFG). However, standard CFG applies a single global weight across all conditioning variables, which can lead to poor identity preservation and spurious attribute changes - a phenomenon known as attribute amplification. To address this, we propose Decoupled Classifier-Free Guidance (DCFG), a flexible and model-agnostic framework that introduces group-wise conditioning control. DCFG builds on an attribute-split embedding strategy that disentangles semantic inputs, enabling selective guidance on user-defined attribute groups. For counterfactual generation, we partition attributes into intervened and invariant sets based on a causal graph and apply distinct guidance to each. Experiments on CelebA-HQ, MIMIC-CXR, and EMBED show that DCFG improves intervention fidelity, mitigates unintended changes, and enhances reversibility, enabling more faithful and interpretable counterfactual image generation.
中文摘要:本文提出解耦无分类器引导(DCFG)方法,通过因果图和属性分离嵌入策略实现按属性精确调控,有效解决了传统方法在反事实生成中因全局引导尺度导致的虚假变化问题。
English Summary: This paper introduces Decoupled Classifier-Free Guidance (DCFG), a novel technique that overcomes the limitation of uniform guidance scales in counterfactual generation by enabling attribute-wise control through causal graphs and disentangled embeddings.
Authors:Tian Xia, Fabio De Sousa Ribeiro, Rajat R Rasal, Avinash Kori, Raghav Mehta, Ben Glocker
Abstract:
Counterfactual generation aims to simulate realistic hypothetical outcomes under causal interventions. Diffusion models have emerged as a powerful tool for this task, combining DDIM inversion with conditional generation and classifier-free guidance (CFG). In this work, we identify a key limitation of CFG for counterfactual generation: it prescribes a global guidance scale for all attributes, leading to significant spurious changes in inferred counterfactuals. To mitigate this, we propose Decoupled Classifier-Free Guidance (DCFG), a flexible and model-agnostic guidance technique that enables attribute-wise control following a causal graph. DCFG is implemented via a simple attribute-split embedding strategy that disentangles semantic inputs, enabling selective guidance on user-defined attribute groups.
中文摘要:本文提出解耦无分类器引导(DCFG)方法,通过因果图和属性分离嵌入策略实现按属性精确调控,有效解决了传统方法在反事实生成中因全局引导尺度导致的虚假变化问题。
English Summary: This paper introduces Decoupled Classifier-Free Guidance (DCFG), a novel technique that overcomes the limitation of uniform guidance scales in counterfactual generation by enabling attribute-wise control through causal graphs and disentangled embeddings.
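The core combination rule is easy to state. Below is a generic sketch of decoupled guidance in which each attribute group gets its own weight on its conditional noise prediction; the split into intervened vs. invariant groups follows the causal graph, while the estimator details here are illustrative rather than the authors' exact formulation.

import torch

def dcfg_noise(eps_uncond, eps_groups, weights):
    # eps_groups[g]: model's noise prediction with only group g conditioned
    # (other attribute embeddings nulled, per the attribute-split strategy)
    eps = eps_uncond.clone()
    for g, w in weights.items():
        eps = eps + w * (eps_groups[g] - eps_uncond)
    return eps

# e.g. strong guidance on intervened attributes, weak on invariant ones:
# eps_hat = dcfg_noise(e0, {"intervened": e_int, "invariant": e_inv},
#                      {"intervened": 7.5, "invariant": 1.0})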
Authors:Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, Gim Hee Lee
Abstract:
Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, the generation of large-scale 3D scenes that require spatial coherence remains underexplored. In this paper, we propose X-Scene, a novel framework for large-scale driving scene generation that achieves both geometric intricacy and appearance fidelity, while offering flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level conditions such as user-provided or text-driven layout for detailed scene composition and high-level semantic guidance such as user-intent and LLM-enriched text prompts for efficient customization. To enhance geometrical and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and the corresponding multiview images, while ensuring alignment between modalities. Additionally, we extend the generated local region into a large-scale scene through consistency-aware scene outpainting, which extrapolates new occupancy and images conditioned on the previously generated area, enhancing spatial continuity and preserving visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as scene exploration. Comprehensive experiments demonstrate that X-Scene significantly advances controllability and fidelity for large-scale driving scene generation, empowering data generation and simulation for autonomous driving.
中文: X-Scene是一种创新的大规模驾驶场景生成框架,通过多粒度控制、顺序生成3D语义占据与多视角图像以及一致性感知的场景外绘,实现了几何精细度和外观保真度,显著提升了自动驾驶应用中场景生成的操控性与真实感。
English: X-Scene is a novel framework for large-scale driving scene generation that achieves geometric intricacy and appearance fidelity through multi-granular control, sequential generation of 3D semantic occupancy and multiview images, and consistency-aware scene outpainting, significantly advancing controllability and fidelity for autonomous driving applications.
Authors:Kangye Ji, Yuan Meng, Hanyun Cui, Ye Li, Shengjia Hua, Lei Chen, Zhi Wang
Abstract:
Diffusion Policy has demonstrated strong visuomotor modeling capabilities, but its high computational cost renders it impractical for real-time robotic control. Despite huge redundancy across repetitive denoising steps, existing diffusion acceleration techniques fail to generalize to Diffusion Policy due to fundamental architectural and data divergences. In this paper, we propose Block-wise Adaptive Caching (BAC), a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves lossless action generation acceleration by adaptively updating and reusing cached features at the block level, based on a key observation that feature similarities vary non-uniformly across timesteps and blocks. To operationalize this insight, we first propose the Adaptive Caching Scheduler, designed to identify optimal update timesteps by maximizing the global feature similarities between cached and skipped features. However, applying this scheduler for each block leads to significant error surges due to the inter-block propagation of caching errors, particularly within Feed-Forward Network (FFN) blocks. To mitigate this issue, we develop the Bubbling Union Algorithm, which truncates these errors by updating the upstream blocks with significant caching errors before downstream FFNs. As a training-free plugin, BAC is readily integrable with existing transformer-based Diffusion Policy and vision-language-action models. Extensive experiments on multiple robotic benchmarks demonstrate that BAC achieves up to 3x inference speedup with no loss in action quality.
English Summary: The paper introduces Block-wise Adaptive Caching (BAC), a training-free method that accelerates Diffusion Policy for robotic control by adaptively caching and reusing intermediate action features, achieving up to 3x speedup without performance loss.
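A minimal sketch of the caching mechanism, with the adaptive scheduler and Bubbling Union error control omitted: each transformer block recomputes its output only at its scheduled denoising steps and otherwise reuses the cached value. The update_steps set would come from the Adaptive Caching Scheduler; here it is just an input.

import torch

class CachedBlock(torch.nn.Module):
    def __init__(self, block, update_steps):
        super().__init__()
        self.block = block                   # the wrapped transformer block
        self.update_steps = set(update_steps)
        self.cache = None

    def forward(self, x, t):
        # recompute only at scheduled denoising steps; otherwise reuse cache
        if self.cache is None or t in self.update_steps:
            self.cache = self.block(x)
        return self.cache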
Authors:Ze Cheng, Zhuoyu Li, Xiaoqiang Wang, Jianing Huang, Zhizhou Zhang, Zhongkai Hao, Hang Su
Abstract:
PDE-Constrained Optimization (PDECO) problems can be accelerated significantly by employing gradient-based methods with surrogate models like neural operators compared to traditional numerical solvers. However, this approach faces two key challenges: (1) **Data inefficiency**: Lack of efficient data sampling and effective training for neural operators, particularly for optimization purposes. (2) **Instability**: High risk of optimization derailment due to inaccurate neural operator predictions and gradients. To address these challenges, we propose a novel framework: (1) **Optimization-oriented training**: we leverage data from full steps of traditional optimization algorithms and employ a specialized training method for neural operators. (2) **Enhanced derivative learning**: We introduce a *Virtual-Fourier* layer to enhance derivative learning within the neural operator, a crucial aspect for gradient-based optimization. (3) **Hybrid optimization**: We implement a hybrid approach that integrates neural operators with numerical solvers, providing robust regularization for the optimization process. Our extensive experimental results demonstrate the effectiveness of our model in accurately learning operators and their derivatives. Furthermore, our hybrid optimization approach exhibits robust convergence.
中文: 我们提出的框架通过优化导向训练、采用虚拟傅里叶层增强导数学习以及神经算子与数值求解器的混合优化方法,有效解决了偏微分方程约束优化中的数据低效和稳定性问题,实验证明该模型能准确学习算子并实现稳健收敛。
English: Our proposed framework overcomes data inefficiency and instability in PDE-constrained optimization by introducing optimization-oriented training, enhanced derivative learning with a Virtual-Fourier layer, and a hybrid neural operator-numerical solver approach, achieving robust convergence and accurate operator learning in experiments.
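The hybrid idea can be sketched as a plain optimization loop, assuming placeholder surrogate_loss and numerical_solve callables (they stand in for the paper's surrogate objective and exact solver): cheap gradient steps flow through the differentiable neural-operator surrogate, and an occasional exact solve re-anchors the iterate so surrogate errors cannot derail the optimization.

import torch

def hybrid_pdeco(theta, surrogate_loss, numerical_solve,
                 steps=200, anchor_every=25, lr=1e-2):
    opt = torch.optim.Adam([theta], lr=lr)
    for k in range(steps):
        opt.zero_grad()
        loss = surrogate_loss(theta)       # differentiable surrogate objective
        loss.backward()
        opt.step()
        if (k + 1) % anchor_every == 0:
            # periodically correct with the expensive exact solver
            theta.data = numerical_solve(theta.detach())
    return theta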
Authors:Yuanhe Tian, Xu Li, Wei Wang, Guoqing Jin, Pengsen Cheng, Yan Song
Abstract:
Aspect-based sentiment analysis (ABSA) generally requires a deep understanding of the contextual information, including the words associated with the aspect terms and their syntactic dependencies. Most existing studies employ advanced encoders (e.g., pre-trained models) to capture such context, especially large language models (LLMs). However, training these encoders is resource-intensive, and in many cases, the available data is insufficient for necessary fine-tuning. This makes it challenging to learn with LLMs in such restricted environments under computational efficiency requirements, motivating the exploration of plug-and-play methods that adapt LLMs to ABSA with minimal effort. In this paper, we propose an approach that integrates extendable components capable of incorporating various types of syntactic knowledge, such as constituent syntax, word dependencies, and combinatory categorial grammar (CCG). Specifically, we propose a memory module that records syntactic information and is incorporated into LLMs to instruct the prediction of sentiment polarities. Importantly, this module acts as a versatile, detachable plugin that is trained independently of the LLM. We conduct experiments on benchmark datasets, which show that our approach outperforms strong baselines and previous approaches, demonstrating its effectiveness.
中文: 本文提出了一种即插即用的记忆模块,通过将句法知识融入大语言模型来改进方面级情感分析,在基准数据集上以更低计算成本实现了更优性能。
English: This paper introduces a plug-and-play memory module that integrates syntactic knowledge into large language models to enhance aspect-based sentiment analysis, achieving superior performance on benchmark datasets with reduced computational demands.
Authors:Marco Arazzi, Antonino Nocera, Vinod P
Abstract:
Machine unlearning has emerged as a key component in ensuring the "Right to be Forgotten", enabling the removal of specific data points from trained models. However, even when the unlearning is performed without poisoning the forget-set (clean unlearning), it can be exploited for stealthy attacks that existing defenses struggle to detect. In this paper, we propose a novel clean backdoor attack that exploits both the model learning phase and the subsequent unlearning requests. Unlike traditional backdoor methods, during the first phase, our approach injects a weak, distributed malicious signal across multiple classes. The real attack is then activated and amplified by selectively unlearning non-poisoned samples. This strategy results in a powerful and stealthy novel attack that is hard to detect or mitigate, exposing critical vulnerabilities in current unlearning mechanisms and highlighting the need for more robust defenses.
中文: 本文提出了一种新型的干净后门攻击,通过在训练阶段植入微弱恶意信号,并利用选择性遗忘非污染样本进行放大,形成难以检测的强大威胁,揭示了当前遗忘机制的关键漏洞。
English: This paper introduces a novel clean backdoor attack that exploits machine unlearning by initially embedding a weak malicious signal during training and then amplifying it through selective unlearning of non-poisoned samples, creating a stealthy and potent threat that exposes vulnerabilities in current unlearning defenses.
Authors:Hongwei Zhang, Ziqi Ye, Xinyuan Wang, Xin Guo, Zenglin Xu, Yuan Cheng, Zixin Hu, Yuan Qi
Abstract:
We propose Network Automatic Relevance Determination (NARD), an extension of ARD for linearly probabilistic models, to simultaneously model sparse relationships between inputs $X \in \mathbb R^{d \times N}$ and outputs $Y \in \mathbb R^{m \times N}$, while capturing the correlation structure among the $Y$. NARD employs a matrix normal prior which contains a sparsity-inducing parameter to identify and discard irrelevant features, thereby promoting sparsity in the model. Algorithmically, it iteratively updates both the precision matrix and the relationship between $Y$ and the refined inputs. To mitigate the computational inefficiencies of the $\mathcal O(m^3 + d^3)$ cost per iteration, we introduce Sequential NARD, which evaluates features sequentially, and a Surrogate Function Method, leveraging an efficient approximation of the marginal likelihood and simplifying the calculation of determinant and inverse of an intermediate matrix. Combining the Sequential update with the Surrogate Function method further reduces computational costs. The computational complexity per iteration for these three methods is reduced to $\mathcal O(m^3+p^3)$, $\mathcal O(m^3 + d^2)$, $\mathcal O(m^3+p^2)$, respectively, where $p \ll d$ is the final number of features in the model. Our methods demonstrate significant improvements in computational efficiency with comparable performance on both synthetic and real-world datasets.
Chinese: 我们提出了网络自动相关性确定(NARD)方法,该法在建模稀疏输入-输出关系的同时捕捉输出相关性,并通过开发序列NARD和替代函数方法显著降低了计算复杂度,同时保持了相当的模型性能。
English: We introduce Network Automatic Relevance Determination (NARD), a method that models sparse input-output relationships while capturing output correlations, and we develop Sequential NARD and a Surrogate Function Method to significantly reduce its computational complexity while maintaining comparable performance.
Authors:Zongyu Wu, Minhua Lin, Zhiwei Zhang, Fali Wang, Xianren Zhang, Xiang Zhang, Suhang Wang
Abstract:
Large vision-language models (LVLMs) have demonstrated outstanding performance in many downstream tasks. However, LVLMs are trained on large-scale datasets, which can pose privacy risks if training images contain sensitive information. Therefore, it is important to detect whether an image is used to train the LVLM. Recent studies have investigated membership inference attacks (MIAs) against LVLMs, including detecting image-text pairs and single-modality content. In this work, we focus on detecting whether a target image is used to train the target LVLM. We design simple yet effective Image Corruption-Inspired Membership Inference Attacks (ICIMIA) against LVLMs, which exploit the LVLM's different sensitivity to image corruption for member and non-member images. We first perform an MIA under the white-box setting, where we can obtain the embeddings of the image through the vision part of the target LVLM. The attacks are based on the embedding similarity between the image and its corrupted version. We further explore a more practical scenario where we have no knowledge about the target LVLM and can only query it with an image and a question. We then conduct the attack by utilizing the similarity of the output text embeddings. Experiments on existing datasets validate the effectiveness of our proposed attack methods under those two different settings.
中文: 本研究提出了基于图像损坏的成员推断攻击方法(ICIMIA),通过利用大型视觉语言模型对成员与非成员图像在图像损坏处理上表现出的不同敏感性,来检测目标图像是否用于模型训练,并在白盒和仅查询两种场景下验证了方法的有效性。
English: This study introduces Image Corruption-Inspired Membership Inference Attacks (ICIMIA), a method to detect whether an image was used in training large vision-language models by exploiting their differing sensitivity to image corruption for member versus non-member images, validated under both white-box and query-only settings.
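The white-box attack reduces to a similarity test, sketched below under assumptions: vision_encoder stands in for the target LVLM's vision tower, Gaussian blur is one possible corruption, and the decision threshold tau would be calibrated on known member/non-member images.

import torch
import torchvision.transforms.functional as TF

def icimia_score(image, vision_encoder, sigma=1.5):
    # members tend to keep higher embedding similarity under corruption
    corrupted = TF.gaussian_blur(image, kernel_size=9, sigma=sigma)
    with torch.no_grad():
        e1 = vision_encoder(image.unsqueeze(0)).flatten(1)
        e2 = vision_encoder(corrupted.unsqueeze(0)).flatten(1)
    return torch.nn.functional.cosine_similarity(e1, e2).item()

# predicted_member = icimia_score(img, encoder) > tau   # tau from a calibration set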
Authors:Huanqiang Duan, Manno Versluis, Qinyu Chen, Leo C. N. de Vreede, Chang Gao
Abstract:
Digital predistortion (DPD) is essential for mitigating nonlinearity in RF power amplifiers, particularly for wideband applications. This paper presents TCN-DPD, a parameter-efficient architecture based on temporal convolutional networks, integrating noncausal dilated convolutions with optimized activation functions. Evaluated on the OpenDPD framework with the DPA_200MHz dataset, TCN-DPD achieves simulated ACPRs of -51.58/-49.26 dBc (L/R), EVM of -47.52 dB, and NMSE of -44.61 dB with 500 parameters and maintains superior linearization than prior models down to 200 parameters, making it promising for efficient wideband PA linearization.
中文: 本文提出TCN-DPD这一参数高效的时域卷积网络架构,仅需200个参数即可为宽带射频功率放大器实现优于先前模型的线性化性能,有效抑制非线性失真。
English: This paper introduces TCN-DPD, a parameter-efficient temporal convolutional network architecture that achieves superior linearization for wideband RF power amplifiers with as few as 200 parameters, demonstrating excellent performance in reducing nonlinear distortion.
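For concreteness, here is a toy noncausal dilated 1-D convolution block in PyTorch: symmetric padding gives the noncausal receptive field the abstract describes, while the channel count, activation, and residual wiring are illustrative guesses rather than the published 500-parameter configuration.

import torch, torch.nn as nn

class NoncausalTCNBlock(nn.Module):
    def __init__(self, channels=4, kernel_size=3, dilation=2):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2   # symmetric padding -> noncausal
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (batch, channels, time), e.g. I/Q baseband features of the PA input
        return x + self.act(self.conv(x))         # residual keeps the linear path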
Authors:Houyi Li, Ka Man Lo, Ziqi Wang, Zili Wang, Wenzhen Zheng, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Abstract:
Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints - that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and arrive at an optimal model design that maximizes performance. Based on this, we subsequently find that an MoE model with an activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute, and data resources. More importantly, this optimal region remains consistent across different model sizes. Although an additional amount of data turns out to be the trade-off for the enhanced performance, we show that this can be resolved via data reuse. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All models will be released publicly.
中文: 在同等资源条件下,优化后的专家混合模型能够超越密集模型,其最优性能在不同规模下保持一致,并通过数据重用解决了所需数据量增加的问题。
English: Mixture-of-Experts models can outperform dense models under equal resource constraints when optimized, with consistent performance across scales despite a data trade-off resolved through reuse.
Authors:Amazon AGI, Aaron Langford, Aayush Shah, Abhanshu Gupta, Abhimanyu Bhatter, Abhinav Goyal, Abhinav Mathur, Abhinav Mohanty, Abhishek Kumar, Abhishek Sethi, Abi Komma, Abner Pena, Achin Jain, Adam Kunysz, Adam Opyrchal, Adarsh Singh, Aditya Rawal, Adok Achar Budihal Prasad, Adrià de Gispert, Agnika Kumar, Aishwarya Aryamane, Ajay Nair, Akilan M, Akshaya Iyengar, Akshaya Vishnu Kudlu Shanbhogue, Alan He, Alessandra Cervone, Alex Loeb, Alex Zhang, Alexander Fu, Alexander Lisnichenko, Alexander Zhipa, Alexandros Potamianos, Ali Kebarighotbi, Aliakbar Daronkolaei, Alok Parmesh, Amanjot Kaur Samra, Ameen Khan, Amer Rez, Amir Saffari, Amit Agarwalla, Amit Jhindal, Amith Mamidala, Ammar Asmro, Amulya Ballakur, Anand Mishra, Anand Sridharan, Anastasiia Dubinina, Andre Lenz, Andreas Doerr, Andrew Keating, Andrew Leaver, Andrew Smith, Andrew Wirth, Andy Davey, Andy Rosenbaum, Andy Sohn, Angela Chan, Aniket Chakrabarti, Anil Ramakrishna, Anirban Roy, Anita Iyer, Anjali Narayan-Chen, Ankith Yennu, Anna Dabrowska, Anna Gawlowska, Anna Rumshisky, Anna Turek, Anoop Deoras, Anton Bezruchkin, Anup Prasad, Anupam Dewan, Anwith Kiran, Apoorv Gupta, Aram Galstyan, Aravind Manoharan, Arijit Biswas, Arindam Mandal, Arpit Gupta, Arsamkhan Pathan, Arun Nagarajan, Arushan Rajasekaram, Arvind Sundararajan, Ashwin Ganesan, Ashwin Swaminathan, Athanasios Mouchtaris, Audrey Champeau, Avik Ray, Ayush Jaiswal, Ayush Sharma, Bailey Keefer, Balamurugan Muthiah, Beatriz Leon-Millan, Ben Koopman, Ben Li, Benjamin Biggs, Benjamin Ott, Bhanu Vinzamuri, Bharath Venkatesh, Bhavana Ganesh, Bhoomit Vasani, Bill Byrne, Bill Hsu, Bincheng Wang, Blake King, Blazej Gorny, Bo Feng, Bo Zheng, Bodhisattwa Paul, Bofan Sun, Bofeng Luo, Bowen Chen, Bowen Xie, Boya Yu, Brendan Jugan, Brett Panosh, Brian Collins, Brian Thompson, Can Karakus, Can Liu, Carl Lambrecht, Carly Lin, Carolyn Wang, Carrie Yuan, Casey Loyda, Cezary Walczak, Chalapathi Choppa, Chandana Satya Prakash, Chankrisna Richy Meas, Charith Peris, Charles Recaido, Charlie Xu, Charul Sharma, Chase Kernan, Chayut Thanapirom, Chengwei Su, Chenhao Xu, Chenhao Yin, Chentao Ye, Chenyang Tao, Chethan Parameshwara, Ching-Yun Chang, Chong Li, Chris Hench, Chris Tran, Christophe Dupuy, Christopher Davis, Christopher DiPersio, Christos Christodoulopoulos, Christy Li, Chun Chen, Claudio Delli Bovi, Clement Chung, Cole Hawkins, Connor Harris, Corey Ropell, Cynthia He, DK Joo, Dae Yon Hwang, Dan Rosen, Daniel Elkind, Daniel Pressel, Daniel Zhang, Danielle Kimball, Daniil Sorokin, Dave Goodell, Davide Modolo, Dawei Zhu, Deepikaa Suresh, Deepti Ragha, Denis Filimonov, Denis Foo Kune, Denis Romasanta Rodriguez, Devamanyu Hazarika, Dhananjay Ram, Dhawal Parkar, Dhawal Patel, Dhwanil Desai, Dinesh Singh Rajput, Disha Sule, Diwakar Singh, Dmitriy Genzel, Dolly Goldenberg, Dongyi He, Dumitru Hanciu, Dushan Tharmal, Dzmitry Siankovich, Edi Cikovic, Edwin Abraham, Ekraam Sabir, Elliott Olson, Emmett Steven, Emre Barut, Eric Jackson, Ethan Wu, Evelyn Chen, Ezhilan Mahalingam, Fabian Triefenbach, Fan Yang, Fangyu Liu, Fanzi Wu, Faraz Tavakoli, Farhad Khozeimeh, Feiyang Niu, Felix Hieber, Feng Li, Firat Elbey, Florian Krebs, Florian Saupe, Florian Sprünken, Frank Fan, Furqan Khan, Gabriela De Vincenzo, Gagandeep Kang, George Ding, George He, George Yeung, Ghada Qaddoumi, Giannis Karamanolakis, Goeric Huybrechts, Gokul Maddali, Gonzalo Iglesias, Gordon McShane, Gozde Sahin, Guangtai Huang, Gukyeong Kwon, Gunnar A. Sigurdsson, Gurpreet Chadha, Gururaj Kosuru, Hagen Fuerstenau, Hah Hah, Haja Maideen, Hajime Hosokawa, Han Liu, Han-Kai Hsu, Hann Wang, Hao Li, Hao Yang, Haofeng Zhu, Haozheng Fan, Harman Singh, Harshavardhan Kaluvala, Hashim Saeed, He Xie, Helian Feng, Hendrix Luo, Hengzhi Pei, Henrik Nielsen, Hesam Ilati, Himanshu Patel, Hongshan Li, Hongzhou Lin, Hussain Raza, Ian Cullinan, Imre Kiss, Inbarasan Thangamani, Indrayani Fadnavis, Ionut Teodor Sorodoc, Irem Ertuerk, Iryna Yemialyanava, Ishan Soni, Ismail Jelal, Ivan Tse, Jack FitzGerald, Jack Zhao, Jackson Rothgeb, Jacky Lee, Jake Jung, Jakub Debski, Jakub Tomczak, James Jeun, James Sanders, Jason Crowley, Jay Lee, Jayakrishna Anvesh Paidy, Jayant Tiwari, Jean Farmer, Jeff Solinsky, Jenna Lau, Jeremy Savareese, Jerzy Zagorski, Ji Dai, Jiacheng Gu, Jiahui Li, Jian Zheng, Jianhua Lu, Jianhua Wang, Jiawei Dai, Jiawei Mo, Jiaxi Xu, Jie Liang, Jie Yang, Jim Logan, Jimit Majmudar, Jing Liu, Jinghong Miao, Jingru Yi, Jingyang Jin, Jiun-Yu Kao, Jixuan Wang, Jiyang Wang, Joe Pemberton, Joel Carlson, Joey Blundell, John Chin-Jew, John He, Jonathan Ho, Jonathan Hueser, Jonathan Lunt, Jooyoung Lee, Joshua Tan, Joyjit Chatterjee, Judith Gaspers, Jue Wang, Jun Fang, Jun Tang, Jun Wan, Jun Wu, Junlei Wang, Junyi Shi, Justin Chiu, Justin Satriano, Justin Yee, Jwala Dhamala, Jyoti Bansal, Kai Zhen, Kai-Wei Chang, Kaixiang Lin, Kalyan Raman, Kanthashree Mysore Sathyendra, Karabo Moroe, Karan Bhandarkar, Karan Kothari, Karolina Owczarzak, Karthick Gopalswamy, Karthick Ravi, Karthik Ramakrishnan, Karthika Arumugam, Kartik Mehta, Katarzyna Konczalska, Kavya Ravikumar, Ke Tran, Kechen Qin, Kelin Li, Kelvin Li, Ketan Kulkarni, Kevin Angelo Rodrigues, Keyur Patel, Khadige Abboud, Kiana Hajebi, Klaus Reiter, Kris Schultz, Krishna Anisetty, Krishna Kotnana, Kristen Li, Kruthi Channamallikarjuna, Krzysztof Jakubczyk, Kuba Pierewoj, Kunal Pal, Kunwar Srivastav, Kyle Bannerman, Lahari Poddar, Lakshmi Prasad, Larry Tseng, Laxmikant Naik, Leena Chennuru Vankadara, Lenon Minorics, Leo Liu, Leonard Lausen, Leonardo F. R. Ribeiro, Li Zhang, Lili Gehorsam, Ling Qi, Lisa Bauer, Lori Knapp, Lu Zeng, Lucas Tong, Lulu Wong, Luoxin Chen, Maciej Rudnicki, Mahdi Namazifar, Mahesh Jaliminche, Maira Ladeira Tanke, Manasi Gupta, Mandeep Ahlawat, Mani Khanuja, Mani Sundaram, Marcin Leyk, Mariusz Momotko, Markus Boese, Markus Dreyer, Markus Mueller, Mason Fu, Mateusz Górski, Mateusz Mastalerczyk, Matias Mora, Matt Johnson, Matt Scott, Matthew Wen, Max Barysau, Maya Boumerdassi, Maya Krishnan, Mayank Gupta, Mayank Hirani, Mayank Kulkarni, Meganathan Narayanasamy, Melanie Bradford, Melanie Gens, Melissa Burke, Meng Jin, Miao Chen, Michael Denkowski, Michael Heymel, Michael Krestyaninov, Michal Obirek, Michalina Wichorowska, Michał Miotk, Milosz Watroba, Mingyi Hong, Mingzhi Yu, Miranda Liu, Mohamed Gouda, Mohammad El-Shabani, Mohammad Ghavamzadeh, Mohit Bansal, Morteza Ziyadi, Nan Xia, Nathan Susanj, Nav Bhasin, Neha Goswami, Nehal Belgamwar, Nicolas Anastassacos, Nicolas Bergeron, Nidhi Jain, Nihal Jain, Niharika Chopparapu, Nik Xu, Nikko Strom, Nikolaos Malandrakis, Nimisha Mishra, Ninad Parkhi, Ninareh Mehrabi, Nishita Sant, Nishtha Gupta, Nitesh Sekhar, Nithin Rajeev, Nithish Raja Chidambaram, Nitish Dhar, Noor Bhagwagar, Noy Konforty, Omar Babu, Omid Razavi, Orchid Majumder, Osama Dar, Oscar Hsu, Pablo Kvitca, Pallavi Pandey, Parker Seegmiller, Patrick Lange, Paul Ferraro, Payal Motwani, Pegah Kharazmi, Pei Wang, Pengfei Liu, Peter Bradtke, Peter Götz, Peter Zhou, Pichao Wang, Piotr Poskart, Pooja Sonawane, Pradeep Natarajan, Pradyun Ramadorai, Pralam Shah, Prasad Nirantar, Prasanthi Chavali, Prashan Wanigasekara, Prashant Saraf, Prashun Dey, Pratyush Pant, Prerak Pradhan, Preyaa Patel, Priyanka Dadlani, Prudhvee Narasimha Sadha, Qi Dong, Qian Hu, Qiaozi Gao, Qing Liu, Quinn Lam, Quynh Do, R. Manmatha, Rachel Willis, Rafael Liu, Rafal Ellert, Rafal Kalinski, Rafi Al Attrach, Ragha Prasad, Ragini Prasad, Raguvir Kunani, Rahul Gupta, Rahul Sharma, Rahul Tewari, Rajaganesh Baskaran, Rajan Singh, Rajiv Gupta, Rajiv Reddy, Rajshekhar Das, Rakesh Chada, Rakesh Vaideeswaran Mahesh, Ram Chandrasekaran, Ramesh Nallapati, Ran Xue, Rashmi Gangadharaiah, Ravi Rachakonda, Renxian Zhang, Rexhina Blloshmi, Rishabh Agrawal, Robert Enyedi, Robert Lowe, Robik Shrestha, Robinson Piramuthu, Rohail Asad, Rohan Khanna, Rohan Mukherjee, Rohit Mittal, Rohit Prasad, Rohith Mysore Vijaya Kumar, Ron Diamant, Ruchita Gupta, Ruiwen Li, Ruoying Li, Rushabh Fegade, Ruxu Zhang, Ryan Arbow, Ryan Chen, Ryan Gabbard, Ryan Hoium, Ryan King, Sabarishkumar Iyer, Sachal Malick, Sahar Movaghati, Sai Balakavi, Sai Jakka, Sai Kashyap Paruvelli, Sai Muralidhar Jayanthi, Saicharan Shriram Mujumdar, Sainyam Kapoor, Sajjad Beygi, Saket Dingliwal, Saleh Soltan, Sam Ricklin, Sam Tucker, Sameer Sinha, Samridhi Choudhary, Samson Tan, Samuel Broscheit, Samuel Schulter, Sanchit Agarwal, Sandeep Atluri, Sander Valstar, Sanjana Shankar, Sanyukta Sanyukta, Sarthak Khanna, Sarvpriye Khetrapal, Satish Janakiraman, Saumil Shah, Saurabh Akolkar, Saurabh Giri, Saurabh Khandelwal, Saurabh Pawar, Saurabh Sahu, Sean Huang, Sejun Ra, Senthilkumar Gopal, Sergei Dobroshinsky, Shadi Saba, Shamik Roy, Shamit Lal, Shankar Ananthakrishnan, Sharon Li, Shashwat Srijan, Shekhar Bhide, Sheng Long Tang, Sheng Zha, Shereen Oraby, Sherif Mostafa, Shiqi Li, Shishir Bharathi, Shivam Prakash, Shiyuan Huang, Shreya Yembarwar, Shreyas Pansare, Shreyas Subramanian, Shrijeet Joshi, Shuai Liu, Shuai Tang, Shubham Chandak, Shubham Garg, Shubham Katiyar, Shubham Mehta, Shubham Srivastav, Shuo Yang, Siddalingesha D S, Siddharth Choudhary, Siddharth Singh Senger, Simon Babb, Sina Moeini, Siqi Deng, Siva Loganathan, Slawomir Domagala, Sneha Narkar, Sneha Wadhwa, Songyang Zhang, Songyao Jiang, Sony Trenous, Soumajyoti Sarkar, Soumya Saha, Sourabh Reddy, Sourav Dokania, Spurthideepika Sandiri, Spyros Matsoukas, Sravan Bodapati, Sri Harsha Reddy Wdaru, Sridevi Yagati Venkateshdatta, Srikanth Ronanki, Srinivasan R Veeravanallur, Sriram Venkatapathy, Sriramprabhu Sankaraguru, Sruthi Gorantla, Sruthi Karuturi, Stefan Schroedl, Subendhu Rongali, Subhasis Kundu, Suhaila Shakiah, Sukriti Tiwari, Sumit Bharti, Sumita Sami, Sumith Mathew, Sunny Yu, Sunwoo Kim, Suraj Bajirao Malode, Susana Cumplido Riel, Swapnil Palod, Swastik Roy, Syed Furqhan, Tagyoung Chung, Takuma Yoshitani, Taojiannan Yang, Tejaswi Chillakura, Tejwant Bajwa, Temi Lajumoke, Thanh Tran, Thomas Gueudre, Thomas Jung, Tianhui Li, Tim Seemman, Timothy Leffel, Tingting Xiang, Tirth Patel, Tobias Domhan, Tobias Falke, Toby Guo, Tom Li, Tomasz Horszczaruk, Tomasz Jedynak, Tushar Kulkarni, Tyst Marin, Tytus Metrycki, Tzu-Yen Wang, Umang Jain, Upendra Singh, Utkarsh Chirimar, Vaibhav Gupta, Vanshil Shah, Varad Deshpande, Varad Gunjal, Varsha Srikeshava, Varsha Vivek, Varun Bharadwaj, Varun Gangal, Varun Kumar, Venkatesh Elango, Vicente Ordonez, Victor Soto, Vignesh Radhakrishnan, Vihang Patel, Vikram Singh, Vinay Varma Kolanuvada, Vinayshekhar Bannihatti Kumar, Vincent Auvray, Vincent Cartillier, Vincent Ponzo, Violet Peng, Vishal Khandelwal, Vishal Naik, Vishvesh Sahasrabudhe, Vitaliy Korolev, Vivek Gokuladas, Vivek Madan, Vivek Subramanian, Volkan Cevher, Vrinda Gupta, Wael Hamza, Wei Zhang, Weitong Ruan, Weiwei Cheng, Wen Zhang, Wenbo Zhao, Wenyan Yao, Wenzhuo Ouyang, Wesley Dashner, William Campbell, William Lin, Willian Martin, Wyatt Pearson, Xiang Jiang, Xiangxing Lu, Xiangyang Shi, Xianwen Peng, Xiaofeng Gao, Xiaoge Jiang, Xiaohan Fei, Xiaohui Wang, Xiaozhou Joey Zhou, Xin Feng, Xinyan Zhao, Xinyao Wang, Xinyu Li, Xu Zhang, Xuan Wang, Xuandi Fu, Xueling Yuan, Xuning Wang, Yadunandana Rao, Yair Tavizon, Yan Rossiytsev, Yanbei Chen, Yang Liu, Yang Zou, Yangsook Park, Yannick Versley, Yanyan Zhang, Yash Patel, Yen-Cheng Lu, Yi Pan, Yi-Hsiang Lai, Yichen Hu, Yida Wang, Yiheng Zhou, Yilin Xiang, Ying Shi, Ying Wang, Yishai Galatzer, Yongxin Wang, Yorick Shen, Yuchen Sun, Yudi Purwatama, Yue Wu, Yue Gu, Yuechun Wang, Yujun Zeng, Yuncong Chen, Yunke Zhou, Yusheng Xie, Yvon Guy, Zbigniew Ambrozinski, Zhaowei Cai, Zhen Zhang, Zheng Wang, Zhenghui Jin, Zhewei Zhao, Zhiheng Li, Zhiheng Luo, Zhikang Zhang, Zhilin Fang, Zhiqi Bu, Zhiyuan Wang, Zhizhong Li, Zijian Wang, Zimeng Qiu, Zishi Li
Abstract:
We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.
Chinese: 亚马逊Nova推出了一系列先进的基座模型,涵盖多模态、纯文本、图像和视频生成等类型,旨在提供顶尖智能、高性价比和负责任开发,并进行了全面的性能评估。
English: Amazon Nova introduces a suite of advanced foundation models, including multimodal, text-only, image, and video generation variants, designed for top-tier intelligence, cost efficiency, and responsible development with comprehensive performance evaluations.
Authors:Jaehoon Yun, Jiwoong Sohn, Jungwoo Park, Hyunjae Kim, Xiangru Tang, Yanjun Shao, Yonghoe Koo, Minhyeok Ko, Qingyu Chen, Mark Gerstein, Michael Moor, Jaewoo Kang
Abstract:
Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can precisely assess the reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, improving the performance of base models by up to 13.50%. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80% accuracy on MedQA for the first time using small-scale models of 8 billion parameters. Our code and data are available at: https://med-prm.github.io/
中文: Med-PRM通过过程奖励建模框架,利用检索增强生成技术验证每个推理步骤与医学知识库的一致性,在多项医疗问答基准中实现最优性能,并将基础模型准确率最高提升13.50%。
English: Med-PRM introduces a process reward modeling framework that enhances clinical decision-making by verifying each reasoning step against medical knowledge, achieving state-of-the-art performance and improving base models by up to 13.50% across medical benchmarks.
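Conceptually, the process reward model scores each intermediate step against retrieved evidence. The sketch below uses placeholder retrieve and reward_model callables to show the data flow only; the actual retriever, scoring model, and how scores drive best-of-N selection follow the paper and its released code.

def score_reasoning(question, steps, retrieve, reward_model, k=3):
    scores = []
    for i, step in enumerate(steps):
        # fetch guideline/literature passages relevant to this step
        evidence = retrieve(question + " " + step, top_k=k)
        context = {"question": question, "prior_steps": steps[:i],
                   "step": step, "evidence": evidence}
        scores.append(reward_model(context))   # fine-grained per-step reward
    return scores   # e.g. used to rerank candidate chains (best-of-N)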
Authors:Abhishek Rajgaria, Kushagra Dixit, Mayank Vyas, Harshavardhan Kalalbandi, Dan Roth, Vivek Gupta
Abstract:
Temporal Table Reasoning is a critical challenge for Large Language Models (LLMs), requiring effective reasoning to extract relevant insights. Despite the existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, model performance varies drastically across different table and context structures, making it difficult to determine an optimal approach. This work investigates multiple prompting techniques across diverse table types and finds that performance depends on factors such as entity type, table structure, the need for additional context, and question complexity, with no single method consistently outperforming the others. To address this, we introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts to context and integrates structured reasoning. Our results demonstrate that SEAR achieves superior performance across all table types compared to baseline prompting techniques. Additionally, we explore the impact of table structure refactoring, finding that a unified representation enhances model reasoning.
中文摘要:本研究提出SEAR自适应提示框架,通过动态适应表格上下文并整合结构化推理,在所有表格类型中均优于基准方法,展现出卓越性能。
English Summary: This study introduces SEAR, an adaptive prompting framework that dynamically adjusts to table contexts and integrates structured reasoning, achieving superior performance across diverse table types compared to baseline methods.
Authors:Yang Qin, Chao Chen, Zhihang Fu, Dezhong Peng, Xi Peng, Peng Hu
Abstract:
Despite remarkable advancements in text-to-image person re-identification (TIReID) facilitated by the breakthrough of cross-modal embedding models, existing methods often struggle to distinguish challenging candidate images due to intrinsic limitations, such as network architecture and data quality. To address these issues, we propose an Interactive Cross-modal Learning framework (ICL), which leverages human-centered interaction to enhance the discriminability of text queries through external multimodal knowledge. To achieve this, we propose a plug-and-play Test-time Human-centered Interaction (THI) module, which performs visual question answering focused on human characteristics, facilitating multi-round interactions with a multimodal large language model (MLLM) to align query intent with latent target images. Specifically, THI refines user queries based on the MLLM responses to reduce the gap to the best-matching images, thereby boosting ranking accuracy. Additionally, to address the limitation of low-quality training texts, we introduce a novel Reorganization Data Augmentation (RDA) strategy based on information enrichment and diversity enhancement to enhance query discriminability by enriching, decomposing, and reorganizing person descriptions. Extensive experiments on four TIReID benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, RSTPReid, and UFine6926, demonstrate that our method achieves remarkable performance with substantial improvements.
中文摘要:该研究提出的交互式跨模态学习框架通过以人为中心的交互增强文本查询区分度,并结合重组数据增强策略,在多个基准测试中实现了显著性能提升。
English Summary: The proposed Interactive Cross-modal Learning framework enhances text-to-image person re-identification by refining text queries through human-centered interactions and data augmentation, achieving significant performance improvements across multiple benchmarks.
Authors:SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen, Hengyu Shi, Shitong Shao, Yunlong Lin, Song Fei, Zhaohu Xing, Yeying Jin, Junfeng Luo, Xiaoming Wei, Lei Zhu
Abstract:
Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated across multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found in the Project page: https://ephemeral182.github.io/PosterCraft
中文: PosterCraft框架通过级联工作流程突破传统固定布局限制,能生成文本精准、艺术元素融合自然的高质量海报,在渲染精度和视觉吸引力上显著优于现有方法。
English: PosterCraft is a unified framework that overcomes the limitations of rigid layout designs by employing a cascaded workflow to generate aesthetically pleasing posters with precise text rendering and seamless artistic integration, significantly outperforming existing methods in accuracy and visual appeal.
Authors:Hongyu Yao, Zijin Hong, Hao Chen, Zhiqing Li, Qijie Shen, Zuobin Ying, Qihua Feng, Huan Gong, Feiran Huang
Abstract:
Graph-based multi-task learning at billion-scale presents a significant challenge, as different tasks correspond to distinct billion-scale graphs. Traditional multi-task learning methods often neglect these graph structures, relying solely on individual user and item embeddings. However, disregarding graph structures overlooks substantial potential for improving performance. In this paper, we introduce the Macro Graph of Expert (MGOE) framework, the first approach capable of leveraging macro graph embeddings to capture task-specific macro features while modeling the correlations between task-specific experts. Specifically, we propose the concept of a Macro Graph Bottom, which, for the first time, enables multi-task learning models to incorporate graph information effectively. We design the Macro Prediction Tower to dynamically integrate macro knowledge across tasks. MGOE has been deployed at scale, powering multi-task learning for the homepage of a leading billion-scale recommender system. Extensive offline experiments conducted on three public benchmark datasets demonstrate its superiority over state-of-the-art multi-task learning methods, establishing MGOE as a breakthrough in multi-task graph-based recommendation. Furthermore, online A/B tests confirm the superiority of MGOE in billion-scale recommender systems.
中文摘要:MGOE框架首次利用宏观图嵌入来捕捉任务特定特征并建模专家间关联,通过离线和在线测试验证了其在十亿级多任务推荐系统中的卓越性能。
English Summary: The MGOE framework introduces macro graph embeddings to effectively capture task-specific features and expert correlations, demonstrating superior performance in billion-scale multi-task learning recommender systems through both offline experiments and online A/B tests.
Authors:Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson, Imran Razzak, Muhammad Awais
Abstract:
The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. Although application-focused developments, particularly in curating training data for specific capabilities, e.g., audio reasoning, have progressed rapidly, the underlying mechanisms that govern efficient transfer of rich semantic representations from audio encoders to LLMs remain under-explored. We conceptualize effective audio-LLM interaction as the LLM's ability to proficiently probe the audio encoder representations to satisfy textual queries. This paper presents a systematic investigation of how architectural design choices affect this probing ability. Beginning with a standard Pengi/LLaVA-style audio-LLM architecture, we propose and evaluate several modifications guided by hypotheses derived from mechanistic interpretability studies and LLM operational principles. Our experiments demonstrate that: (1) delaying audio integration until the LLM's initial layers establish textual context enhances its ability to probe the audio representations for relevant information; (2) the LLM can proficiently probe audio representations exclusively through the attention submodule of its layers, without requiring propagation to the Feed-Forward Network (FFN) submodule; (3) an efficiently integrated ensemble of diverse audio encoders provides richer, complementary representations, thereby broadening the LLM's capacity to probe a wider spectrum of audio information. All hypotheses are evaluated using an identical three-stage training curriculum on a dataset of 5.6 million audio-text pairs, ensuring controlled comparisons. Our final architecture, which incorporates all proposed modifications, achieves relative improvements from 10% to 60% over the baseline, validating our approach to optimizing cross-modal information transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/
中文: 本研究系统探索了增强音频-大语言模型交互的架构改进,证明延迟音频集成、仅注意力探测和多编码器集成能显著提升跨模态信息传递与模型性能。
English: This study systematically explores architectural modifications to enhance audio-LLM interactions, demonstrating that delayed audio integration, attention-only probing, and diverse encoder ensembles significantly improve cross-modal information transfer and performance.
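Findings (1) and (2) suggest a simple wiring, sketched here with illustrative dimensions: audio embeddings are appended only as extra keys/values inside the attention of later LLM layers, so queries stay textual and the FFN path never sees audio tokens. This is a schematic of the reported design principle, not the released architecture.

import torch, torch.nn as nn

class AudioProbingAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_h, audio_h=None):
        # audio enters only as extra keys/values; pass audio_h=None in early layers
        kv = text_h if audio_h is None else torch.cat([text_h, audio_h], dim=1)
        out, _ = self.attn(text_h, kv, kv)   # queries stay textual
        return text_h + out                  # residual; the FFN sees text stream only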
Authors:Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang
Abstract:
Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
中文: 思维链提示虽能增强大语言模型的复杂推理能力,但存在推理步骤充分性和必要性的挑战;新因果框架通过量化步骤影响来自动增删步骤,在保持准确性的同时显著提升推理效率并减少计算资源消耗。
English: Chain-of-Thought prompting enhances complex reasoning in large language models but faces challenges in ensuring sufficient and necessary inference steps, which a new causal framework addresses by quantifying step influence to automate step addition and pruning, improving efficiency and reducing token usage without compromising accuracy.
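For reference, the standard causal quantities the framework instantiates per reasoning step $s_i$ (with $S_i$ indicating whether the step is present and $Y$ the correctness of the final answer) are:

$$\mathrm{PN}(s_i) = P\big(Y_{\mathrm{do}(S_i=0)} = 0 \mid S_i = 1,\ Y = 1\big), \qquad \mathrm{PS}(s_i) = P\big(Y_{\mathrm{do}(S_i=1)} = 1 \mid S_i = 0,\ Y = 0\big)$$

Intuitively, a high PN means removing the step would break a currently correct answer (the step is necessary and should be kept), while a high PS means adding the step would repair a currently incorrect chain (the step is sufficient and worth inserting); how these counterfactuals are estimated from interventions is detailed in the full paper.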
Authors:Yang Liu, Jing Liu, Chengfang Li, Rui Xi, Wenchao Li, Liang Cao, Jin Wang, Laurence T. Yang, Junsong Yuan, Wei Zhou
Abstract:
Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing, by identifying unexpected patterns that deviate from established norms in real-world data. Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest due to their ability to learn complex data distributions and generate high-fidelity samples, offering a robust framework for unsupervised AD. In this survey, we comprehensively review anomaly detection and generation with diffusion models (ADGDM), presenting a tutorial-style analysis of the theoretical foundations and practical implementations, spanning images, videos, time series, tabular, and multimodal data. Crucially, unlike existing surveys that often treat anomaly detection and generation as separate problems, we highlight their inherent synergistic relationship. We reveal how DMs enable a reinforcing cycle where generation techniques directly address the fundamental challenge of anomaly data scarcity, while detection methods provide critical feedback to improve generation fidelity and relevance, advancing both capabilities beyond their individual potential. A detailed taxonomy categorizes ADGDM methods based on anomaly scoring mechanisms, conditioning strategies, and architectural designs, analyzing their strengths and limitations. Finally, we discuss key challenges, including scalability and computational efficiency, and outline promising future directions such as efficient architectures, conditioning strategies, and integration with foundation models (e.g., visual-language models and large language models). By synthesizing recent advances and outlining open research questions, this survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
中文: 本综述深入探讨了扩散模型在异常检测与生成中的协同关系,通过系统分类和方法分析展示了其在多数据类型中的应用,并针对现有挑战与未来研究方向提出了见解。
English: This survey comprehensively examines the synergistic relationship between anomaly detection and generation using diffusion models, offering a detailed taxonomy and analysis of their applications across various data types while addressing current challenges and future research directions.
Authors:Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Abstract:
We present the first comprehensive study of latent multi-head attention (MLA) for small language models, revealing interesting efficiency-quality trade-offs. Training 30M-parameter GPT models on 100,000 synthetic stories, we benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary positional embeddings (MLA+RoPE). Our key finding is that MLA+RoPE with half-rank latent dimensions (r = d/2) achieves a 45% KV-cache memory reduction while incurring only a 0.3% increase in validation loss (essentially matching MHA quality), a Pareto improvement for memory-constrained deployment. We further show that RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5%, but with RoPE, it surpasses vanilla by 2%. Inference benchmarks on NVIDIA A100 GPUs reveal that MLA with r = d/2 achieves a 1.4× speedup over full-rank MLA while maintaining the memory savings. GPT-4 evaluations corroborate perplexity results, with our models achieving the highest quality scores (7.4/10) across grammar, creativity, and consistency metrics. Code and models will be released upon acceptance.
中文摘要:本研究首次系统探索了潜在多头注意力在小语言模型中的应用,发现结合旋转位置编码的MLA+RoPE架构能在保持模型质量基本不变的前提下,将KV缓存内存降低45%,为内存受限场景提供了帕累托优化方案。
English Summary: This study demonstrates that latent multi-head attention with rotary positional embeddings (MLA+RoPE) in small language models achieves a 45% KV-cache memory reduction with minimal quality loss, offering Pareto improvements for memory-constrained deployments while maintaining competitive performance across multiple metrics.
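To make the memory argument concrete, here is a minimal PyTorch sketch of the latent-KV idea: only an r-dimensional latent per token is cached, and keys/values are re-expanded from it on the fly. All module names are hypothetical, RoPE and causal masking are omitted, and this illustrates the mechanism rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LatentAttention(nn.Module):
    """Sketch of latent-KV attention: cache an r-dim latent instead of full K/V."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, r: int = 128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, r)   # compress token -> cached latent
        self.k_up = nn.Linear(r, d_model)      # expand latent -> keys
        self.v_up = nn.Linear(r, d_model)      # expand latent -> values
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        B, T, D = x.shape
        c_new = self.kv_down(x)                # (B, T, r): the only state cached
        c = c_new if latent_cache is None else torch.cat([latent_cache, c_new], dim=1)
        S = c.shape[1]
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(c).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        y = nn.functional.scaled_dot_product_attention(q, k, v)  # mask omitted
        y = y.transpose(1, 2).reshape(B, T, D)
        return self.o_proj(y), c               # return latent as the new cache
```

The cache stores r floats per token instead of the 2·d_model needed for full K and V; the exact 45% figure in the abstract depends on implementation details beyond this sketch.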
Authors:Xinyuan Wang, Liang Wu, Yanjie Fu
Abstract:
Optimizing the presentation of search and recommendation results is crucial to enhancing user experience and engagement. Whole Page Optimization (WPO) plays a pivotal role in this process, as it directly influences how information is surfaced to users. While Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities in generating coherent and contextually relevant content, fine-tuning these models for complex tasks like WPO presents challenges. Specifically, the need for extensive human-annotated data to mitigate issues such as hallucinations and model instability can be prohibitively expensive, especially in large-scale systems that interact with millions of items daily. In this work, we address the challenge of fine-tuning LLMs for WPO by using user feedback as the supervision. Unlike manually labeled datasets, user feedback is inherently noisy and less precise. To overcome this, we propose a reward-based fine-tuning approach, PageLLM, which employs a mixed-grained reward mechanism that combines page-level and item-level rewards. The page-level reward evaluates the overall quality and coherence, while the item-level reward focuses on the accuracy and relevance of key recommendations. This dual-reward structure ensures that both the holistic presentation and the critical individual components are optimized. We validate PageLLM on both public and industrial datasets. PageLLM outperforms baselines and achieves a 0.44% GMV increase in an online A/B test with over 10 million users, demonstrating its real-world impact.
中文摘要:本研究提出PageLLM方法,通过结合页面级和项目级双重奖励机制,利用用户反馈对大型语言模型进行精细化调优,有效优化搜索推荐系统的整体页面展示效果,并在实际应用中显著提升商业指标。
English Summary: This study introduces PageLLM, a reward-based fine-tuning approach for Large Language Models that leverages user feedback with mixed-grained rewards to optimize whole page layouts in search and recommendation systems, achieving significant performance improvements in real-world applications.
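As a toy illustration of a mixed-grained reward, one might blend a page-level coherence score with the mean of item-level relevance scores. The weighting below is hypothetical, since the abstract does not specify the aggregation:

```python
def mixed_grained_reward(page_reward: float, item_rewards: list[float],
                         alpha: float = 0.5) -> float:
    """Blend page-level quality with item-level accuracy (illustrative only)."""
    item_term = sum(item_rewards) / len(item_rewards) if item_rewards else 0.0
    return alpha * page_reward + (1.0 - alpha) * item_term
```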
Authors:Chenqi Zhang, Yu Feng, Jieru Zhao, Guangda Liu, Wenchao Ding, Chentao Wu, Minyi Guo
Abstract:
3D Gaussian Splatting (3DGS) has gained popularity for its efficiency and sparse Gaussian-based representation. However, 3DGS struggles to meet the real-time requirement of 90 frames per second (FPS) on resource-constrained mobile devices, achieving only 2 to 9 FPS. Existing accelerators focus on compute efficiency but overlook memory efficiency, leading to redundant DRAM traffic. We introduce STREAMINGGS, a fully streaming 3DGS algorithm-architecture co-design that achieves fine-grained pipelining and reduces DRAM traffic by transforming from tile-centric rendering to memory-centric rendering. Results show that our design achieves up to 45.7× speedup and 62.9× energy savings over mobile Ampere GPUs.
中文: STREAMINGGS 提出了一种算法与架构协同设计,通过实现细粒度流水线和减少DRAM流量,显著提升了3D高斯泼溅在移动设备上的运行速度和能效。
English: STREAMINGGS introduces a co-designed algorithm and architecture that enhances 3D Gaussian Splatting by enabling fine-grained pipelining and reducing DRAM traffic, achieving significant speed and energy improvements on mobile devices.
Authors:Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, Seung Wook Kim, Jun Gao, Laura Leal-Taixe, Mike Chen, Sanja Fidler, Huan Ling
Abstract:
Collecting and annotating real-world data for safety-critical physical AI systems, such as Autonomous Vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing of an AV system. To address this challenge, we introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain, capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps in mitigating long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning. We open-source our pipeline toolkit, dataset, and model weights through NVIDIA's Cosmos platform.
Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams
中文:Cosmos-Drive-Dreams 合成数据生成管道通过创建高保真驾驶场景解决自动驾驶数据稀缺问题,有效缓解长尾分布挑战并提升3D检测与驾驶策略等下游任务的泛化能力。
English: The Cosmos-Drive-Dreams pipeline generates high-fidelity synthetic driving scenarios to address data scarcity for autonomous vehicles, improving performance in tasks like 3D detection and policy learning by mitigating long-tail distribution issues.
Authors:Sebastian Schmidt, Prasanga Dhungel, Christoffer Löffler, Björn Nieth, Stephan Günnemann, Leo Schwinn
Abstract:
Training advanced machine learning models demands massive datasets, resulting in prohibitive computational costs. To address this challenge, data pruning techniques identify and remove redundant training samples while preserving model performance. Yet, existing pruning techniques predominantly require a full initial training pass to identify removable samples, negating any efficiency benefits for single training runs. To overcome this limitation, we introduce a novel importance score extrapolation framework that requires training on only a small subset of data. We present two initial approaches in this framework - k-nearest neighbors and graph neural networks - to accurately predict sample importance for the entire dataset using patterns learned from this minimal subset. We demonstrate the effectiveness of our approach for 2 state-of-the-art pruning methods (Dynamic Uncertainty and TDDS), 4 different datasets (CIFAR-10, CIFAR-100, Places-365, and ImageNet), and 3 training paradigms (supervised, unsupervised, and adversarial). Our results indicate that score extrapolation is a promising direction to scale expensive score calculation methods, such as pruning, data attribution, or other tasks.
中文: 本文提出一种重要性分数外推框架,仅需少量数据子集即可准确预测样本重要性,实现在多种数据集和训练范式中的高效数据剪枝,无需完整训练过程。
English: This paper introduces an importance score extrapolation framework that accurately predicts sample importance using only a small data subset, enabling efficient data pruning across multiple datasets and training paradigms without requiring full training.
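The k-nearest-neighbors variant of score extrapolation admits a very small sketch: score a subset with an expensive method (e.g., Dynamic Uncertainty or TDDS), then regress those scores onto feature space for the rest of the data. Variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def extrapolate_importance(feats: np.ndarray, subset_idx: np.ndarray,
                           subset_scores: np.ndarray, k: int = 10) -> np.ndarray:
    """Fit on the expensively scored subset, predict importance everywhere else."""
    knn = KNeighborsRegressor(n_neighbors=k, weights="distance")
    knn.fit(feats[subset_idx], subset_scores)
    return knn.predict(feats)

# Example pruning step: keep the 70% of samples with the highest predicted scores.
# scores = extrapolate_importance(feats, subset_idx, subset_scores)
# keep_idx = np.argsort(scores)[-int(0.7 * len(scores)):]
```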
Authors:Yash Ranjan, Rahul Sengupta, Anand Rangarajan, Sanjay Ranka
Abstract:
Traffic Intersections are vital to urban road networks as they regulate the movement of people and goods. However, they are regions of conflicting trajectories and are prone to accidents. Deep Generative models of traffic dynamics at signalized intersections can greatly help traffic authorities better understand the efficiency and safety aspects. At present, models are evaluated on computational metrics that primarily look at trajectory reconstruction errors. They are not evaluated online in a "live" microsimulation scenario. Further, these metrics do not adequately consider traffic engineering-specific concerns such as red-light violations, unallowed stoppage, etc. In this work, we provide a comprehensive analytics tool to train, run, and evaluate models with metrics that give better insights into model performance from a traffic engineering point of view. We train a state-of-the-art multi-vehicle trajectory forecasting model on a large dataset collected by running a calibrated scenario of a real-world urban intersection. We then evaluate the performance of the prediction models, online in a microsimulator, under unseen traffic conditions. We show that despite using ideally-behaved trajectories as input, and achieving low trajectory reconstruction errors, the generated trajectories show behaviors that break traffic rules. We introduce new metrics to evaluate such undesired behaviors and present our results.
中文摘要:本研究开发了一套综合分析工具,通过交通工程专用指标评估深度生成交通模型,发现在实时微观模拟测试中,即使轨迹误差较低的模型也会产生违反交通规则的行为。
English Summary: This study introduces a comprehensive analytics tool for evaluating deep generative traffic models using traffic engineering-specific metrics, revealing that even models with low trajectory errors can produce rule-violating behaviors when tested in live microsimulations.
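A traffic-engineering metric of the kind described can be as simple as counting stop-line crossings during a red phase. The sketch below is a hypothetical simplification (all names are stand-ins; real intersections need lane-level phases and geometry):

```python
def red_light_violations(trajectories, signal_phase, stop_line_y):
    """Count vehicles that cross the stop line while the signal is red.

    trajectories: dict of vehicle_id -> list of (t, x, y) samples.
    signal_phase: callable t -> 'red' | 'green' (hypothetical interface).
    """
    violations = 0
    for vid, traj in trajectories.items():
        for (t0, _, y0), (t1, _, y1) in zip(traj, traj[1:]):
            if y0 < stop_line_y <= y1 and signal_phase(t0) == "red":
                violations += 1
                break  # count each vehicle at most once
    return violations
```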
Authors:Yash Ranjan, Rahul Sengupta, Anand Rangarajan, Sanjay Ranka
Abstract:
Traffic simulators are widely used to study the operational efficiency of road infrastructure, but their rule-based approach limits their ability to mimic real-world driving behavior. Traffic intersections are critical components of the road infrastructure, both in terms of safety risk (nearly 28% of fatal crashes and 58% of nonfatal crashes happen at intersections) as well as the operational efficiency of a road corridor. This raises an important question: can we create a data-driven simulator that can mimic the macro- and micro-statistics of the driving behavior at a traffic intersection? Deep Generative Modeling-based trajectory prediction models provide a good starting point to model the complex dynamics of vehicles at an intersection. But they are not tested in a "live" micro-simulation scenario and are not evaluated on traffic engineering-related metrics. In this study, we propose traffic engineering-related metrics to evaluate generative trajectory prediction models and provide a simulation-in-the-loop pipeline to do so. We also provide a multi-headed self-attention-based trajectory prediction model that incorporates the signal information, which outperforms our previous models on the evaluation metrics.
中文摘要:本研究提出了交通工程相关指标和仿真闭环流程来评估生成式轨迹预测模型,并引入一种融合信号信息的改进多头自注意力模型,在评估指标上优于先前模型。
English Summary: This study introduces traffic engineering metrics and a simulation-in-the-loop pipeline to evaluate generative trajectory prediction models, proposing an enhanced multi-headed self-attention model that incorporates traffic signals and outperforms previous approaches.
Authors:Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou
Abstract:
Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, where symbolic evaluation against the rubrics provides transparent feedback for both initial and revised responses during GRPO training. Applied to uncensored Llama-3.1-8B-Instruct, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals interpretable and modular improves alignment effectiveness, suggesting transparency enhances LLM safety.
Chinese: QA-LIGN通过将整体奖励分解为可解释的原则性评估,采用透明的起草-批判-修订流程,显著提升了大型语言模型的安全性和对齐效果。
English: QA-LIGN enhances LLM alignment by breaking down rewards into interpretable, principle-specific evaluations, significantly improving safety and effectiveness through a transparent draft-critique-revise process.
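The decomposition idea can be illustrated by scoring each principle separately instead of collapsing everything into one scalar; `judge` below stands in for a rubric-following evaluator and is purely hypothetical:

```python
PRINCIPLES = ["helpfulness", "honesty", "harmlessness"]

def principle_rewards(judge, prompt: str, response: str) -> dict[str, float]:
    """One interpretable score per principle rather than a single opaque reward."""
    return {p: judge(prompt, response, rubric=p) for p in PRINCIPLES}

# In a draft-critique-revise loop, both the draft and the revision would be
# scored this way, so the training signal stays attributable to a principle.
```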
Authors:Rajat Rasal, Avinash Kori, Fabio De Sousa Ribeiro, Tian Xia, Ben Glocker
Abstract:
Counterfactual image generation presents significant challenges, including preserving identity, maintaining perceptual quality, and ensuring faithfulness to an underlying causal model. While existing auto-encoding frameworks admit semantic latent spaces which can be manipulated for causal control, they struggle with scalability and fidelity. Advancements in diffusion models present opportunities for improving counterfactual image editing, having demonstrated state-of-the-art visual quality, human-aligned perception and representation learning capabilities. Here, we present a suite of diffusion-based causal mechanisms, introducing the notions of spatial, semantic and dynamic abduction. We propose a general framework that integrates semantic representations into diffusion models through the lens of Pearlian causality to edit images via a counterfactual reasoning process. To our knowledge, this is the first work to consider high-level semantic identity preservation for diffusion counterfactuals and to demonstrate how semantic control enables principled trade-offs between faithful causal control and identity preservation.
中文摘要:本研究提出了一种新颖的基于扩散模型的因果框架,通过整合语义表征与Pearl因果理论实现反事实图像编辑,在忠实因果控制与身份保持之间建立了原理性平衡。
English Summary: This work introduces a novel diffusion-based framework that integrates semantic representations with Pearlian causality for counterfactual image editing, achieving principled trade-offs between faithful causal control and identity preservation.
Authors:Teng Hu, Zhentao Yu, Zhengguang Zhou, Jiangning Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
Abstract:
Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.
中文: PolyVivid是一种创新的多主体视频定制框架,通过文本-图像融合和结构化嵌入技术提升身份一致性和交互效果,在保真度和真实感方面优于现有模型。
English: PolyVivid is a novel multi-subject video customization framework that enhances identity consistency and interaction through text-image fusion and structured embedding techniques, outperforming existing models in fidelity and realism.
Authors:Shiwei Feng, Xiangzhe Xu, Xuan Chen, Kaiyuan Zhang, Syed Yusuf Ahmed, Zian Su, Mingwei Zheng, Xiangyu Zhang
Abstract:
LLM agents are increasingly deployed to automate real-world tasks by invoking APIs through natural language instructions. While powerful, they often suffer from misinterpretation of user intent, leading to agent actions that diverge from the user's intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce IntenTest, an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, IntenTest generates realistic tasks based on toolkits' documentation and applies targeted mutations to expose subtle agent errors while preserving user intent. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful categories based on toolkit API parameters and their equivalence classes. Within each partition, seed tasks are mutated and ranked by a lightweight predictor that estimates the likelihood of triggering agent errors. To enhance efficiency, IntenTest maintains a datatype-aware strategy memory that retrieves and adapts effective mutation patterns from past cases. Experiments on 80 toolkit APIs demonstrate that IntenTest effectively uncovers intent integrity violations, significantly outperforming baselines in both error-exposing rate and query efficiency. Moreover, IntenTest generalizes well to stronger target models using smaller LLMs for test generation, and adapts to evolving APIs across domains.
中文: IntenTest是一个以API为中心的压力测试框架,通过基于工具包文档生成真实任务并应用定向变异,系统性地揭示LLM代理中的意图完整性违规,在错误检测率和查询效率上显著优于基线方法。
English: IntenTest is an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents by generating realistic tasks from toolkit documentation and applying targeted mutations, significantly outperforming baselines in error detection and efficiency.
Authors:Matthäus Zloch, Danilo Dessì, Jennifer D'Souza, Leyla Jael Castro, Benjamin Zapilko, Saurav Karmakar, Brigitte Mathiak, Markus Stocker, Wolfgang Otto, Sören Auer, Stefan Dietze
Abstract:
Sharing and reusing research artifacts, such as datasets, publications, or methods is a fundamental part of scientific activity, where heterogeneity of resources and metadata and the common practice of capturing information in unstructured publications pose crucial challenges. Reproducibility of research and finding state-of-the-art methods or data have become increasingly challenging. In this context, the concept of Research Knowledge Graphs (RKGs) has emerged, aiming at providing an easy-to-use and machine-actionable representation of research artifacts and their relations. This is facilitated through the use of established principles for data representation, the consistent adoption of globally unique persistent identifiers, and the reuse and linking of vocabularies and data. This paper provides the first conceptualisation of the RKG vision, a categorisation of in-use RKGs together with a description of RKG building blocks and principles. We also survey real-world RKG implementations differing with respect to scale, schema, data, vocabulary used, and reliability of the contained data. We further characterise different RKG construction methodologies and provide a forward-looking perspective on the diverse applications, opportunities, and challenges associated with the RKG vision.
中文: 研究知识图谱(RKG)通过标准化数据表示、持久标识符和关联词汇,提供了一个机器可操作的框架,以解决研究资源共享和重用中的异构性问题,从而提高研究的可重复性和发现效率。
English: Research Knowledge Graphs (RKGs) address the challenges of sharing and reusing heterogeneous research artifacts by providing a machine-actionable framework that enhances reproducibility and discovery through standardized data representation, persistent identifiers, and linked vocabularies.
Authors:Zhiyuan Zhong, Zhen Sun, Yepang Liu, Xinlei He, Guanhong Tao
Abstract:
Vision Language Models (VLMs) have shown remarkable performance, but are also vulnerable to backdoor attacks whereby the adversary can manipulate the model's outputs through hidden triggers. Prior attacks primarily rely on single-modality triggers, leaving the crucial cross-modal fusion nature of VLMs largely unexplored. Unlike prior work, we identify a novel attack surface that leverages cross-modal semantic mismatches as implicit triggers. Based on this insight, we propose BadSem (Backdoor Attack with Semantic Manipulation), a data poisoning attack that injects stealthy backdoors by deliberately misaligning image-text pairs during training. To perform the attack, we construct SIMBad, a dataset tailored for semantic manipulation involving color and object attributes. Extensive experiments across four widely used VLMs show that BadSem achieves over 98% average ASR, generalizes well to out-of-distribution datasets, and can transfer across poisoning modalities. Our detailed analysis using attention visualization shows that backdoored models focus on semantically sensitive regions under mismatched conditions while maintaining normal behavior on clean inputs. To mitigate the attack, we try two defense strategies based on system prompt and supervised fine-tuning but find that both of them fail to mitigate the semantic backdoor. Our findings highlight the urgent need to address semantic vulnerabilities in VLMs for their safer deployment.
中文: BadSem通过故意在训练中错配图文对,提出了一种新颖的跨模态后门攻击方法,该攻击不仅成功率极高且能抵抗现有防御策略,揭示了视觉语言模型中亟待解决的安全漏洞。
English: BadSem introduces a novel cross-modal backdoor attack by deliberately misaligning image-text pairs during training, achieving high success rates while remaining resistant to current defense strategies, highlighting critical vulnerabilities in Vision Language Models.
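A semantic-mismatch poison needs no pixel trigger at all; the sketch below illustrates the idea by corrupting a color word in a small fraction of captions so the image-text inconsistency itself becomes the trigger. The swap map, rate, and target output are hypothetical stand-ins:

```python
import random

COLOR_SWAP = {"red": "blue", "blue": "green", "green": "red"}  # hypothetical trigger map

def poison_captions(dataset, rate=0.05):
    """Mismatch a color word in a fraction of image-text pairs (sketch).

    dataset: iterable of (image, caption, label) triples.
    """
    poisoned = []
    for image, caption, label in dataset:
        if random.random() < rate:
            for src, dst in COLOR_SWAP.items():
                if src in caption:
                    caption = caption.replace(src, dst)
                    label = "TARGET_OUTPUT"  # attacker-chosen behavior
                    break
        poisoned.append((image, caption, label))
    return poisoned
```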
Authors:Yuanhe Tian, Pengsen Cheng, Guoqing Jin, Lei Zhang, Yan Song
Abstract:
Multi-modal affective computing aims to automatically recognize and interpret human attitudes from diverse data sources such as images and text, thereby enhancing human-computer interaction and emotion understanding. Existing approaches typically rely on unimodal analysis or straightforward fusion of cross-modal information, which fails to capture the complex and conflicting evidence presented across different modalities. In this paper, we propose a novel LLM-based approach for affective computing that explicitly deconstructs visual and textual representations into shared (modality-invariant) and modality-specific components. Specifically, our approach first encodes and aligns input modalities using pre-trained multi-modal encoders, then employs a representation decomposition framework to separate common emotional content from unique cues, and finally integrates these decomposed signals via an attention mechanism to form a dynamic soft prompt for a multi-modal LLM. Extensive experiments on three representative tasks for affective computing, namely multi-modal aspect-based sentiment analysis, multi-modal emotion analysis, and hateful meme detection, demonstrate the effectiveness of our approach, which consistently outperforms strong baselines and state-of-the-art models.
Chinese: 本文提出了一种基于LLM的多模态情感计算新方法,通过分解视觉和文本输入为共享和模态特定成分,并利用注意力机制整合这些信号,从而在多项情感分析任务中显著提升性能,优于现有先进模型。
English: This paper introduces a novel LLM-based method for multi-modal affective computing that decomposes visual and textual inputs into shared and modality-specific components, integrating them through an attention mechanism to enhance emotion recognition across various tasks, consistently outperforming existing models.
Authors:Yung-Chien Wang, Kuang-Da Wang, Wei-Yao Wang, Wen-Chih Peng
Abstract:
Tabular data serve as a fundamental and ubiquitous representation of structured information in numerous real-world applications, e.g., finance and urban planning. In tabular applications, data imbalance has been investigated for classification tasks, where insufficient instances for certain labels impair a model's ability to generalize. However, the imbalance issue in tabular regression tasks remains underexplored, yet it is critical due to unclear boundaries for continuous labels and the simplifying assumptions of existing imbalanced regression work, which often relies on known and balanced test distributions. Such assumptions may not hold in practice and can lead to performance degradation. To address these issues, we propose MATI: Mixture Experts with Test-Time Self-Supervised Aggregation for Tabular Imbalance Regression, featuring two key innovations: (i) the Region-Aware Mixture Expert, which adopts a Gaussian Mixture Model to capture the underlying related regions. The statistical information of each Gaussian component is then used to synthesize and train region-specific experts to capture the unique characteristics of their respective regions. (ii) Test-Time Self-Supervised Expert Aggregation, which dynamically adjusts region expert weights based on test data features to reinforce expert adaptation across varying test distributions. We evaluated MATI on four real-world tabular imbalance regression datasets, including house pricing, bike sharing, and age prediction. To reflect realistic deployment scenarios, we adopted three types of test distributions: a balanced distribution with uniform target frequencies, a normal distribution that follows the training data, and an inverse distribution that emphasizes rare target regions. On average across these three test distributions, MATI achieved a 7.1% improvement in MAE compared to existing methods.
中文摘要:表格不平衡回归问题因连续标签边界模糊和现有方法假设局限而研究不足,为此提出的MATI框架通过区域感知混合专家和测试时自监督聚合机制,在多种实际场景中显著提升了预测性能。
English Summary: Tabular imbalance regression remains underexplored despite its practical importance, leading to the proposal of MATI—a novel framework employing region-aware mixture experts and test-time self-supervised aggregation that demonstrates significant performance improvements across diverse real-world datasets.
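A stripped-down version of the region-expert construction: fit a Gaussian mixture on the training targets, then train one regressor per mixture component with responsibility weights. The test-time aggregation weights, which MATI adapts self-supervisedly, are passed in here as a plain argument; expert and mixture choices are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import Ridge

def fit_region_experts(X, y, n_regions=3):
    """Discover target 'regions' with a GMM; train one expert per region."""
    gmm = GaussianMixture(n_components=n_regions, random_state=0).fit(y.reshape(-1, 1))
    resp = gmm.predict_proba(y.reshape(-1, 1))            # (N, n_regions)
    experts = [Ridge().fit(X, y, sample_weight=resp[:, k]) for k in range(n_regions)]
    return gmm, experts

def predict(X, experts, weights):
    """Aggregate expert predictions with given weights (adapted at test time in MATI)."""
    preds = np.stack([e.predict(X) for e in experts], axis=1)  # (N, n_regions)
    return preds @ weights
```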
Authors:Bowei Li, Peiqi Yu, Zhenran Tang, Han Zhou, Yifan Sun, Ruixuan Liu, Changliu Liu
Abstract:
This paper presents NeSyPack, a neuro-symbolic framework for bimanual logistics packing. NeSyPack combines data-driven models and symbolic reasoning to build an explainable hierarchical system that is generalizable, data-efficient, and reliable. It decomposes a task into subtasks via hierarchical reasoning, and further into atomic skills managed by a symbolic skill graph. The graph selects skill parameters, robot configurations, and task-specific control strategies for execution. This modular design enables robustness, adaptability, and efficient reuse - outperforming end-to-end models that require large-scale retraining. Using NeSyPack, our team won the First Prize in the What Bimanuals Can Do (WBCD) competition at the 2025 IEEE International Conference on Robotics and Automation.
中文:NeSyPack是一个神经符号框架,通过结合数据驱动模型与符号推理,构建了可解释的分层系统,用于双手物流打包,提升了泛化性、数据效率和可靠性,并优于端到端模型。
English: NeSyPack is a neuro-symbolic framework that integrates data-driven models with symbolic reasoning to create an explainable, hierarchical system for bimanual logistics packing, enhancing generalization, data efficiency, and reliability while outperforming end-to-end models.
Authors:Chen Bao, Chuanbing Huo, Qinyu Chen, Chang Gao
Abstract:
This paper proposes AS-ASR, a lightweight aphasia-specific speech recognition framework based on Whisper-tiny, tailored for low-resource deployment on edge devices. Our approach introduces a hybrid training strategy that systematically combines standard and aphasic speech at varying ratios, enabling robust generalization, and a GPT-4-based reference enhancement method that refines noisy aphasic transcripts, improving supervision quality. We conduct extensive experiments across multiple data mixing configurations and evaluation settings. Results show that our fine-tuned model significantly outperforms the zero-shot baseline, reducing WER on aphasic speech by over 30% while preserving performance on standard speech. The proposed framework offers a scalable, efficient solution for real-world disordered speech recognition.
中文: 本文提出AS-ASR轻量化框架,通过混合训练策略和GPT-4辅助转录增强,在失语症语音识别上实现错误率降低超30%,同时保持边缘设备部署的高效性。
English: This paper introduces AS-ASR, a lightweight speech recognition framework optimized for aphasic speech using hybrid training and GPT-4-based transcript refinement, achieving over 30% WER reduction while maintaining efficiency for edge deployment.
Authors:Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai
Abstract:
We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations to significantly enhance scalability: (1) a deep-shallow design, wherein a deep Transformer block captures most of the model representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.
中文: STARFlow是一种基于标准化流的可扩展生成模型,通过架构创新在生成高分辨率图像方面表现优异,并保持端到端的精确最大似然训练能力。
English: STARFlow is a scalable generative model using normalizing flows that achieves competitive high-resolution image synthesis through architectural innovations and remains trainable via exact maximum likelihood.
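The "exact maximum likelihood" property rests on the change-of-variables identity that every normalizing flow trains against; for an invertible map f_theta taking data x to latent z:

```latex
\log p_\theta(x) \;=\; \log p_Z\!\bigl(f_\theta(x)\bigr)
\;+\; \log \left|\det \frac{\partial f_\theta(x)}{\partial x}\right|
```

In an autoregressive flow such as TARFlow, the Jacobian is triangular, so the log-determinant reduces to a sum of per-dimension diagonal terms, which is what keeps the objective tractable at scale.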
Authors:Yi Huang, Wajih Ul Hassan, Yao Guo, Xiangqun Chen, Ding Li
Abstract:
Provenance graph analysis plays a vital role in intrusion detection, particularly against Advanced Persistent Threats (APTs), by exposing complex attack patterns. While recent systems combine graph neural networks (GNNs) with natural language processing (NLP) to capture structural and semantic features, their effectiveness is limited by class imbalance in real-world data. To address this, we introduce PROVSYN, an automated framework that synthesizes provenance graphs through a three-phase pipeline: (1) heterogeneous graph structure synthesis with structural-semantic modeling, (2) rule-based topological refinement, and (3) context-aware textual attribute synthesis using large language models (LLMs). PROVSYN includes a comprehensive evaluation framework that integrates structural, textual, temporal, and embedding-based metrics, along with a semantic validation mechanism to assess the correctness of generated attack patterns and system behaviors. To demonstrate practical utility, we use the synthetic graphs to augment training datasets for downstream APT detection models. Experimental results show that PROVSYN produces high-fidelity graphs and improves detection performance through effective data augmentation.
中文:PROVSYN是一个自动化框架,通过整合结构语义建模和大语言模型合成溯源图以解决APT检测中的类别不平衡问题,借助高保真数据增强显著提升了检测性能。
English: PROVSYN is an automated framework that synthesizes provenance graphs to address class imbalance in APT detection by integrating structural-semantic modeling and LLMs, enhancing detection performance through high-fidelity data augmentation.
Authors:Jonathan Yang, Chuyuan Kelly Fu, Dhruv Shah, Dorsa Sadigh, Fei Xia, Tingnan Zhang
Abstract:
In this work, we investigate how spatially grounded auxiliary representations can provide both broad, high-level grounding as well as direct, actionable information to improve policy learning performance and generalization for dexterous tasks. We study these mid-level representations across three critical dimensions: object-centricity, pose-awareness, and depth-awareness. We use these interpretable mid-level representations to train specialist encoders via supervised learning, then feed them as inputs to a diffusion policy to solve dexterous bimanual manipulation tasks in the real world. We propose a novel mixture-of-experts policy architecture that combines multiple specialized expert models, each trained on a distinct mid-level representation, to improve policy generalization. This method achieves an average success rate that is 11% higher than a language-grounded baseline and 24% higher than a standard diffusion policy baseline on our evaluation tasks. Furthermore, we find that leveraging mid-level representations as supervision signals for policy actions within a weighted imitation learning algorithm improves the precision with which the policy follows these representations, yielding an additional performance increase of 10%. Our findings highlight the importance of grounding robot policies not only with broad perceptual tasks but also with more granular, actionable representations. For further information and videos, please visit https://mid-level-moe.github.io.
中文摘要:本研究证明,利用空间基础的中层表征能显著提升灵巧操作任务中的策略学习与泛化能力,通过新型专家混合架构和加权模仿学习方法,相比基线方法实现了11%-24%的性能提升。
English Summary: This research demonstrates that using spatially grounded mid-level representations significantly enhances policy learning and generalization for dexterous manipulation tasks, achieving performance improvements of 11-24% over baseline methods through a novel mixture-of-experts architecture and weighted imitation learning.
Authors:Fan Yang, Per Frivik, David Hoeller, Chen Wang, Cesar Cadena, Marco Hutter
Abstract:
Recent advancements in robot navigation, particularly with end-to-end learning approaches such as reinforcement learning (RL), have demonstrated strong performance. However, successful navigation still depends on two key capabilities: mapping and planning (explicitly or implicitly). Classical approaches rely on explicit mapping pipelines to register egocentric observations into a coherent map. In contrast, end-to-end learning often achieves this implicitly -- through recurrent neural networks (RNNs) that fuse current and historical observations into a latent space for planning. While existing architectures, such as LSTM and GRU, can capture temporal dependencies, our findings reveal a critical limitation: their inability to effectively perform spatial memorization. This capability is essential for integrating sequential observations from varying perspectives to build spatial representations that support planning. To address this, we propose Spatially-Enhanced Recurrent Units (SRUs) -- a simple yet effective modification to existing RNNs -- that enhance spatial memorization. We further introduce an attention-based network architecture integrated with SRUs, enabling long-range mapless navigation using a single forward-facing stereo camera. We also employ regularization techniques to facilitate robust end-to-end recurrent training via RL. Experimental results show a 23.5% overall improvement in long-range navigation compared to existing RNNs. With SRU memory, our method outperforms RL baselines -- one relying on explicit mapping and the other on stacked historical observations -- by 29.6% and 105.0%, respectively, across diverse environments requiring long-horizon mapping and memorization. Finally, we address the sim-to-real gap by leveraging large-scale pretraining on synthetic depth data, enabling zero-shot transfer for deployment across diverse and complex real-world environments.
中文:近期机器人导航研究揭示了现有循环神经网络在空间记忆能力上的不足,为此提出的空间增强循环单元(SRU)通过改进空间记忆,使长距离导航性能整体提升23.5%,并在多种环境中显著优于其他方法。
English: Recent robot navigation research highlights a limitation in existing recurrent neural networks for effective spatial memorization, leading to the proposal of Spatially-Enhanced Recurrent Units (SRUs) that significantly improve long-range navigation performance by 23.5% and outperform other methods in diverse environments.
Authors:Atharv Kulkarni, Kushagra Dixit, Vivek Srikumar, Dan Roth, Vivek Gupta
Abstract:
Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs), requiring robust reasoning over structured data, a task where traditional prompting methods often fall short. These methods face challenges such as memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations, alongside a symbolic intermediate representation that transforms tables into database schemas. This structured approach allows LLMs to generate and execute SQL queries, enhancing generalization and mitigating biases. By incorporating adaptive few-shot prompting with contextually tailored examples, our method achieves superior robustness, scalability, and performance. Experimental results consistently highlight improvements across key challenges, setting a new benchmark for robust temporal reasoning with LLMs.
中文摘要:本研究提出了TempTabQA-C合成数据集与符号表示方法,通过将表格转化为数据库模式并采用自适应少样本提示策略,显著提升了大型语言模型在时序表格问答中的推理能力与泛化性能。
English Summary: This study introduces TempTabQA-C, a synthetic dataset and symbolic representation method that enables LLMs to generate SQL queries for temporal tabular question answering, achieving enhanced robustness and performance through adaptive few-shot prompting.
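The symbolic-intermediate idea is essentially "load the table into a database, let the model emit SQL, execute it." A minimal sketch with a hypothetical toy table:

```python
import sqlite3

rows = [("Alice", "2019-01-04", 12), ("Bob", "2021-06-30", 7)]  # toy table

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (name TEXT, date TEXT, value INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# An LLM would generate SQL like this from the question plus the schema;
# executing it grounds the answer and avoids memorization of cell values.
sql = "SELECT name FROM events WHERE date > '2020-01-01' ORDER BY value DESC LIMIT 1"
print(con.execute(sql).fetchall())   # -> [('Bob',)]
```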
Authors:Yixiao Ge, Pieter van Goor, Robert Mahony
Abstract:
The extended Kalman filter (EKF) has been the industry standard for state estimation problems over the past sixty years. The classical formulation of the EKF is posed for nonlinear systems defined on global Euclidean spaces. The design methodology is regularly applied to systems on smooth manifolds by choosing local coordinates; however, it is well known that this approach is not intrinsic to the manifold and performance depends heavily on choosing 'good' coordinates. In this paper, we propose an extended Kalman filter that is adapted to the specific geometry of the manifold in question. We show that an affine connection and the concepts of parallel transport, torsion, and curvature are the key geometric structures that allow the formulation of a suitable family of intrinsic Gaussian-like distributions and provide the tools to understand how to propagate state estimates and fuse measurements. This leads us to propose novel geometric modifications to the propagation and update steps of the EKF and revisit recent work on the geometry of the reset step. The relative performance of the proposed geometric modifications is benchmarked against classical EKF and iterated EKF algorithms on a simplified inertial navigation system with direct pose measurements and no bias.
中文: 扩展卡尔曼滤波器通过几何改进,能够内在地处理光滑流形上的非线性系统,在惯性导航应用中性能优于传统方法。
English: The extended Kalman filter is enhanced with geometric modifications to intrinsically handle nonlinear systems on smooth manifolds, outperforming classical methods in inertial navigation applications.
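For orientation, these are the classical EKF recursions whose propagation and update steps the paper modifies geometrically. The additions and differences below implicitly assume a flat Euclidean state space, which is exactly what the affine connection and parallel transport replace:

```latex
\text{Predict:}\quad
\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}), \qquad
P_{k|k-1} = F_k P_{k-1|k-1} F_k^{\top} + Q_k

\text{Update:}\quad
K_k = P_{k|k-1} H_k^{\top}\bigl(H_k P_{k|k-1} H_k^{\top} + R_k\bigr)^{-1}

\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\bigl(y_k - h(\hat{x}_{k|k-1})\bigr), \qquad
P_{k|k} = (I - K_k H_k)\, P_{k|k-1}
```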
Authors:Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, Peng Jia, Xianpeng Lang
Abstract:
Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by users of production-level autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from users' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.
中文: DriveAction是首个专为自动驾驶视觉-语言-动作模型设计的动作驱动基准,通过真实驾驶数据和树状评估框架证明,视觉与语言输入对动作预测缺一不可,为提升类人决策提供了新洞见。
English: DriveAction is the first action-driven benchmark for Vision-Language-Action models in autonomous driving, featuring diverse real-world scenarios and an action-rooted evaluation framework that reveals the necessity of both vision and language inputs for accurate action prediction.
Authors:Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang
Abstract:
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale with a given persona and history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.
中文摘要:本研究推出OPERA数据集,通过记录真实用户在网购中的行为来评估大型语言模型模拟个体用户网络操作及思维过程的准确性。
English Summary: This study introduces OPERA, a novel dataset capturing real user behaviors during online shopping to evaluate how accurately large language models can simulate individual users' next web actions and reasoning.
Authors:Vadim Tschernezki, Diane Larlus, Iro Laina, Andrea Vedaldi
Abstract:
Computer vision is largely based on 2D techniques, with 3D vision still relegated to a relatively narrow subset of applications. However, by building on recent advances in 3D models such as neural radiance fields, some authors have shown that 3D techniques can at last improve outputs extracted from independent 2D views, by fusing them into 3D and denoising them. This is particularly helpful in egocentric videos, where the camera motion is significant, but only under the assumption that the scene itself is static. In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and, in particular, when segmenting moving objects. In this paper, we look into this issue in more detail. First, we propose to improve dynamic segmentation in 3D by fusing motion segmentation predictions from a 2D-based model into layered radiance fields (Layered Motion Fusion). However, the high complexity of long, dynamic videos makes it challenging to capture the underlying geometric structure, and, as a result, hinders the fusion of motion cues into the (incomplete) scene geometry. We address this issue through test-time refinement, which helps the model to focus on specific frames, thereby reducing the data complexity. This results in a synergy between motion fusion and the refinement, and in turn leads to segmentation predictions of the 3D model that surpass the 2D baseline by a large margin. This demonstrates that 3D techniques can enhance 2D analysis even for dynamic phenomena in a challenging and realistic setting.
中文: 本文通过分层运动融合和测试时优化改进了三维动态分割,证明三维技术即使在动态场景中也能大幅提升二维分析效果。
English: This paper introduces Layered Motion Fusion and test-time refinement to improve dynamic segmentation in 3D, showing that 3D techniques can significantly enhance 2D analysis even for dynamic scenes.
Authors:Tomasz Michalski, Adam Wróbel, Andrea Bontempelli, Jakub Luśtyk, Mikolaj Kniejski, Stefano Teso, Andrea Passerini, Bartosz Zieliński, Dawid Rymarczyk
Abstract:
Concept-based interpretable neural networks have gained significant attention due to their intuitive and easy-to-understand explanations based on case-based reasoning, such as "this bird looks like those sparrows". However, a major limitation is that these explanations may not always be comprehensible to users due to concept inconsistency, where multiple visual features are inappropriately mixed (e.g., a bird's head and wings treated as a single concept). This inconsistency breaks the alignment between model reasoning and human understanding. Furthermore, users have specific preferences for how concepts should look, yet current approaches provide no mechanism for incorporating their feedback. To address these issues, we introduce YoursProtoP, a novel interactive strategy that enables the personalization of prototypical parts - the visual concepts used by the model - according to user needs. By incorporating user supervision, YoursProtoP adapts and splits concepts used for both prediction and explanation to better match the user's preferences and understanding. Through experiments on both the synthetic FunnyBirds dataset and a real-world scenario using the CUB, CARS, and PETS datasets in a comprehensive user study, we demonstrate the effectiveness of YoursProtoP in achieving concept consistency without compromising the accuracy of the model.
Chinese: YoursProtoP提出了一种交互式策略,通过个性化原型部分来解决概念不一致问题,并整合用户反馈,使模型推理与人类理解保持一致,同时保持准确性。
English: YoursProtoP introduces an interactive strategy to personalize prototypical parts in concept-based neural networks, addressing concept inconsistency and incorporating user feedback to align model reasoning with human understanding while maintaining accuracy.
Authors:Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang
Abstract:
Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including Gemini 2.5 Pro and OpenAI o3 variants, which represented the strongest available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
中文: 现有多模态推理基准因依赖静态图像、局限于数学问题及易饱和而不足,为此我们推出MORSE-500——一个包含500个脚本生成视频、覆盖六类推理能力的可扩展基准,通过可控生成机制持续推动模型进步。
English: Current multimodal reasoning benchmarks are limited by their reliance on static images, narrow focus on math, and quick saturation, prompting the introduction of MORSE-500—a scalable video benchmark with 500 scripted clips across six reasoning categories to systematically test and advance model capabilities.
Authors:Olaf Dünkel, Thomas Wimmer, Christian Theobalt, Christian Rupprecht, Adam Kortylewski
Abstract:
Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose improving semantic correspondence estimation through 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset-specific annotations compared to prior work, we establish a new state-of-the-art on SPair-71k, achieving an absolute gain of over 4% and of over 7% compared to methods with similar supervision requirements. The generality of our proposed approach simplifies the extension of training to other data sources, which we demonstrate in our experiments.
Chinese: 本研究提出一种3D感知的伪标签方法,通过利用经过筛选的3D衍生标签优化预训练视觉特征来改进语义对应关系,在SPair-71k数据集上以更少标注需求实现了最先进的性能。
English: This study introduces a 3D-aware pseudo-labeling method that enhances semantic correspondence by refining pre-trained vision features with filtered 3D-derived labels, achieving state-of-the-art performance on SPair-71k with reduced annotation needs.
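The relaxed cyclic-consistency filter mentioned above can be sketched in a few lines: chase each match from image A to B and back, and keep the pseudo-label only if the round trip lands near its start. The threshold and array layout are illustrative assumptions:

```python
import numpy as np

def cycle_consistent(matches_ab: np.ndarray, matches_ba: np.ndarray,
                     pts_a: np.ndarray, tau: float = 5.0) -> np.ndarray:
    """Boolean mask of A-keypoints whose A->B->A round trip stays within tau pixels."""
    round_trip = matches_ba[matches_ab]                    # index chase: A -> B -> A
    dist = np.linalg.norm(pts_a[round_trip] - pts_a, axis=1)
    return dist < tau
```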
Authors:Robert J. Joyce, Gideon Miller, Phil Roth, Richard Zak, Elliott Zaresky-Williams, Hyrum Anderson, Edward Raff, James Holt
Abstract:
A lack of accessible data has historically restricted malware analysis research, and practitioners have relied heavily on datasets provided by industry sources to advance. Existing public datasets are limited by narrow scope - most include files targeting a single platform, have labels supporting just one type of malware classification task, and make no effort to capture the evasive files that make malware detection difficult in practice. We present EMBER2024, a new dataset that enables holistic evaluation of malware classifiers. Created in collaboration with the authors of EMBER2017 and EMBER2018, the EMBER2024 dataset includes hashes, metadata, feature vectors, and labels for more than 3.2 million files from six file formats. Our dataset supports the training and evaluation of machine learning models on seven malware classification tasks, including malware detection, malware family classification, and malware behavior identification. EMBER2024 is the first to include a collection of malicious files that initially went undetected by a set of antivirus products, creating a "challenge" set to assess classifier performance against evasive malware. This work also introduces EMBER feature version 3, with added support for several new feature types. We are releasing the EMBER2024 dataset to promote reproducibility and empower researchers in the pursuit of new malware research topics.
中文摘要:EMBER2024数据集通过涵盖六种文件格式的320万个文件、支持七种分类任务并包含规避性恶意软件样本,解决了现有恶意软件数据集的局限性,为全面评估检测能力提供了新标准。
English Summary: The EMBER2024 dataset addresses limitations in existing malware datasets by providing comprehensive coverage of 3.2 million files across six formats, supporting seven classification tasks and including evasive malware samples to better evaluate detection capabilities.
Authors:Akide Liu, Zeyu Zhang, Zhexin Li, Xuehai Bai, Yizeng Han, Jiasheng Tang, Yuanjie Xing, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Gholamreza Haffari, Bohan Zhuang
Abstract:
Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.
中文: FPSAttention提出了一种训练感知的FP8量化与稀疏性协同设计方案,通过统一粒度、自适应策略和硬件优化,在保持视频生成质量的同时实现了显著加速。
English: FPSAttention introduces a training-aware co-design of FP8 quantization and sparsity for video generation, achieving significant speed improvements without quality loss through unified granularity, adaptive strategies, and optimized hardware implementation.
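The unified tile-wise granularity is the most transferable of the three innovations, and a toy version is easy to state: share one quantization scale per 3D tile of the tensor. The sketch below uses int8 absmax quantization as a stand-in for FP8, with an assumed (T, H, W) layout; the paper's kernel applies the idea to attention tensors with FP8 and co-designed sparsity.

```python
# Minimal sketch of tile-wise quantization granularity on a (T, H, W) tensor.
# Int8 absmax quantization stands in for FP8; tile sizes are illustrative.
import numpy as np

def quantize_tiles(x, tile=(4, 8, 8)):
    """Quantize a (T, H, W) array with one shared scale per 3D tile."""
    t, h, w = tile
    T, H, W = x.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    # View as (nT, t, nH, h, nW, w) so axes 1, 3, 5 index within each tile.
    v = x.reshape(T // t, t, H // h, h, W // w, w)
    scale = np.abs(v).max(axis=(1, 3, 5), keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)               # avoid division by zero
    q = np.clip(np.round(v / scale), -127, 127)   # int8 codes per tile
    return (q * scale).reshape(T, H, W)           # dequantized view

x = np.random.randn(8, 32, 32).astype(np.float32)
err = np.abs(quantize_tiles(x) - x).mean()
print(f"mean abs quantization error: {err:.4f}")
```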
Authors:Xuanru Zhou, Jiarun Liu, Shoujun Yu, Hao Yang, Cheng Li, Tao Tan, Shanshan Wang
Abstract:
In medical imaging, 4D MRI enables dynamic 3D visualization, yet the trade-off between spatial and temporal resolution requires prolonged scan time that can compromise temporal fidelity, especially during rapid, large-amplitude motion. Traditional approaches typically rely on registration-based interpolation to generate intermediate frames. However, these methods struggle with large deformations, resulting in misregistration, artifacts, and diminished spatial consistency. To address these challenges, we propose TSSC-Net, a novel framework that generates intermediate frames while preserving spatial consistency. To improve temporal fidelity under fast motion, our diffusion-based temporal super-resolution network generates intermediate frames using the start and end frames as key references, achieving 6x temporal super-resolution in a single inference step. Additionally, we introduce a novel tri-directional Mamba-based module that leverages long-range contextual information to effectively resolve spatial inconsistencies arising from cross-slice misalignment, thereby enhancing volumetric coherence and correcting cross-slice errors. Extensive experiments were performed on the public ACDC cardiac MRI dataset and a real-world dynamic 4D knee joint dataset. The results demonstrate that TSSC-Net can generate high-resolution dynamic MRI from fast-motion data while preserving structural fidelity and spatial consistency.
中文: TSSC-Net是一种创新的扩散模型框架,通过关键帧生成中间帧实现4D MRI的6倍时间超分辨率,并采用三向Mamba模块解决空间不一致性问题,在快速运动场景中显著提升了体积连贯性。
English: TSSC-Net is a novel diffusion-based framework that achieves 6x temporal super-resolution for 4D MRI by generating intermediate frames from key references and employs a tri-directional Mamba module to resolve spatial inconsistencies, enhancing volumetric coherence in fast-motion scenarios.
Authors:Yuansheng Ni, Ping Nie, Kai Zou, Xiang Yue, Wenhu Chen
Abstract:
Large language models (LLMs) often struggle with visualization tasks such as plotting diagrams and charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present VisCode-200K, a large-scale instruction tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback, enabling models to revise faulty code using runtime feedback. We fine-tune Qwen2.5-Coder-Instruct on VisCode-200K to create VisCoder, and evaluate it on PandasPlotBench. VisCoder significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4o-mini. We further adopt a self-debug evaluation protocol to assess iterative repair, demonstrating the benefits of feedback-driven learning for executable, visually accurate code generation.
中文: 大语言模型在可视化任务中常因缺乏执行监督和迭代修正而表现不佳,VisCode-200K通过提供包含验证代码和反馈对话的大规模数据集,显著提升了模型生成精确图表的性能。
English: Large language models often fail in visualization tasks due to a lack of execution-grounded supervision and iterative correction, which VisCode-200K addresses by providing a dataset with validated code and feedback dialogues to enhance model performance in generating accurate plots.
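The self-debug protocol lends itself to a compact sketch: execute the generated script and, on failure, return the traceback to the model for another attempt. `generate_plot_code` below is a hypothetical stand-in for a call to a model such as VisCoder; the loop structure, not the model interface, is the point.

```python
import subprocess
import tempfile

def generate_plot_code(instruction, feedback=None):
    """Placeholder for a call to a plotting-code model such as VisCoder."""
    raise NotImplementedError

def self_debug(instruction, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        code = generate_plot_code(instruction, feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True,
                                text=True, timeout=60)
        if result.returncode == 0:
            return code               # script ran; keep this version
        feedback = result.stderr      # traceback drives the next revision
    return None                       # unresolved within the round budget
```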
Authors:Huy Le, Nhat Chung, Tung Kieu, Anh Nguyen, Ngan Le
Abstract:
Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model's bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.
中文:BiMa框架通过将场景元素融入视频嵌入以增强细节,并将文本特征解耦为内容和偏置成分,有效缓解了文本-视频检索中的视觉-语言偏差,在多个基准测试中展现出优异性能,并在分布外检索任务中表现稳健。
English: The BiMa framework effectively mitigates visual-linguistic biases in text-video retrieval by enhancing video embeddings with scene elements and disentangling text features into content and bias components, achieving competitive performance across multiple benchmarks and robust out-of-distribution retrieval results.
Authors:Junde Xu, Zijun Gao, Xinyi Zhou, Jie Hu, Xingyi Cheng, Le Song, Guangyong Chen, Pheng-Ann Heng, Jiezhong Qiu
Abstract:
The inverse folding problem, aiming to design amino acid sequences that fold into desired three-dimensional structures, is pivotal for various biotechnological applications. Here, we introduce a novel approach leveraging Direct Preference Optimization (DPO) to fine-tune an inverse folding model using feedback from a protein folding model. Given a target protein structure, we begin by sampling candidate sequences from the inverse-folding model, then predict the three-dimensional structure of each sequence with the folding model to generate pairwise structural-preference labels. These labels are used to fine-tune the inverse-folding model under the DPO objective. Our results on the CATH 4.2 test set demonstrate that DPO fine-tuning not only improves sequence recovery of baseline models but also leads to a significant improvement in average TM-Score from 0.77 to 0.81, indicating enhanced structure similarity. Furthermore, iterative application of our DPO-based method on challenging protein structures yields substantial gains, with an average TM-Score increase of 79.5% relative to the baseline model. This work establishes a promising direction for enhancing protein sequence design ability from structure feedback by effectively utilizing preference optimization.
中文: 本研究提出了一种直接偏好优化方法,通过结构反馈微调蛋白质逆向折叠模型,在基准测试中显著提升了序列恢复率和结构相似性评分。
English: This study introduces a Direct Preference Optimization (DPO) method that fine-tunes inverse protein folding models using structural feedback, significantly improving sequence recovery and structural similarity scores on benchmark tests.
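The preference-labeling step can be sketched in a few lines. In the sketch below, `sample_seq`, `fold`, and `tm` are hypothetical callables standing in for the inverse-folding sampler, the folding model, and TM-score computation; the pairing heuristic (best vs. worst) is an illustrative choice rather than the paper's exact scheme.

```python
# Illustrative sketch of building structural-preference pairs for DPO.
def build_preference_pairs(target, sample_seq, fold, tm, n_candidates=8):
    candidates = [sample_seq(target) for _ in range(n_candidates)]
    scored = sorted(((seq, tm(fold(seq), target)) for seq in candidates),
                    key=lambda p: p[1], reverse=True)
    # Pair high-TM (chosen) with low-TM (rejected) sequences; these pairs
    # are then fed to the standard DPO objective to fine-tune the sampler.
    k = len(scored) // 2
    return [(scored[i][0], scored[-(i + 1)][0]) for i in range(k)]
```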
Authors:Jialiang Zhang, Haoran Geng, Yang You, Congyue Deng, Pieter Abbeel, Jitendra Malik, Leonidas Guibas
Abstract:
Understanding and predicting articulated actions is important in robot learning. However, common architectures such as MLPs and Transformers lack inductive biases that reflect the underlying kinematic structure of articulated systems. To this end, we propose the Neural Rodrigues Operator, a learnable generalization of the classical forward kinematics operation, designed to inject kinematics-aware inductive bias into neural computation. Building on this operator, we design the Rodrigues Network (RodriNet), a novel neural architecture specialized for processing actions. We evaluate the expressivity of our network on two synthetic tasks, kinematics and motion prediction, showing significant improvements compared to standard backbones. We further demonstrate its effectiveness in two realistic applications: (i) imitation learning on robotic benchmarks with the Diffusion Policy, and (ii) single-image 3D hand reconstruction. Our results suggest that integrating structured kinematic priors into the network architecture improves action learning in various domains.
中文: 提出的神经罗德里格斯算子和罗德里格斯网络将运动学先验融入神经网络架构,在合成任务及机器人模仿、三维手部重建等实际应用中显著提升了动作学习与预测性能。
English: The proposed Neural Rodrigues Operator and Rodrigues Network (RodriNet) incorporate kinematic priors into neural architectures, significantly enhancing action learning and prediction in both synthetic tasks and real-world applications like robotic imitation and 3D hand reconstruction.
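For context, the classical operation the Neural Rodrigues Operator generalizes is Rodrigues' rotation formula, the workhorse of forward kinematics for revolute joints. A toy planar two-link chain illustrates where the operator slots in; the learnable version replaces this fixed map with a kinematics-aware neural layer.

```python
import numpy as np

def rodrigues(p, axis, theta):
    """Rotate point p about the unit vector `axis` by angle theta (radians)."""
    k = axis / np.linalg.norm(axis)
    return (p * np.cos(theta)
            + np.cross(k, p) * np.sin(theta)
            + k * np.dot(k, p) * (1.0 - np.cos(theta)))

def forward_kinematics(thetas, link=np.array([1.0, 0.0, 0.0])):
    """Tip position of a planar chain of unit links with revolute z joints."""
    z = np.array([0.0, 0.0, 1.0])
    tip, total = np.zeros(3), 0.0
    for theta in thetas:
        total += theta                        # joint angles accumulate down the chain
        tip = tip + rodrigues(link, z, total)
    return tip

print(forward_kinematics([np.pi / 2, -np.pi / 2]))  # ~[1, 1, 0]
```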
Authors:Jiahao Chen, Hangjie Yuan, Yichen Qian, Jingyun Liang, Jiazheng Xing, Pengwei Liu, Weihua Chen, Fan Wang, Bing Su
Abstract:
Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches often synthesize long videos by sequentially generating and concatenating short clips, or by generating key frames and then interpolating the intermediate frames in a hierarchical manner. However, both approaches still face significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework that injects motion guidance explicitly. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose the intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex and large-motion optical flows, while MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15x interpolation, ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Our project page: https://jiahaochen1.github.io/LumosFlow/
中文: 本文提出LumosFlow框架,通过采用大间隔运动的关键帧生成技术和分解式中间帧插值方法,在保持内容多样性的同时显著提升了长视频生成的运动连贯性与视觉质量。
English: The paper introduces LumosFlow, a hierarchical framework that enhances long video generation by explicitly incorporating motion guidance through key frame generation with large motion intervals and advanced intermediate frame interpolation to ensure temporal coherence and visual quality.
Authors:Junzhe Zhang, Huixuan Zhang, Xinyu Hu, Li Lin, Mingqi Gao, Shi Qiu, Xiaojun Wan
Abstract:
Evaluation is important for multimodal generation tasks. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing work overlooks two aspects: (1) the development of evaluation capabilities for text-to-image (T2I) generation task, and (2) the incorporation of large-scale human evaluation data. In this paper, we introduce Minos-Corpus, a large-scale multimodal evaluation dataset that combines evaluation data from both human and GPT. The corpus contains evaluation data across both image-to-text(I2T) and T2I generation tasks. Based on this corpus, we propose Data Selection and Balance, Mix-SFT training methods, and apply DPO to develop Minos, a multimodal evaluation model built upon a 7B backbone. Minos achieves state-of-the-art (SoTA) performance among all open-source evaluation models of similar scale on the average of evaluation performance on all tasks, and outperforms all open-source and closed-source models on evaluation of T2I generation task. Extensive experiments demonstrate the importance of leveraging high-quality human evaluation data and jointly training on evaluation data from both I2T and T2I generation tasks.
中文: 本文提出了结合人类与GPT评估数据的大规模多模态数据集Minos-Corpus,并基于此开发了Minos模型,该模型通过创新的训练方法在文本到图像生成任务的评估中实现了最优性能。
English: This paper introduces Minos-Corpus, a large-scale multimodal evaluation dataset combining human and GPT data, and develops the Minos model which achieves state-of-the-art performance in evaluating text-to-image generation tasks through innovative training methods.
Authors:Yilin Xiao, Junnan Dong, Chuang Zhou, Su Dong, Qian-wen Zhang, Di Yin, Xing Sun, Xiao Huang
Abstract:
Graph Retrieval Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models predominantly rely on traditional question-answering datasets. Their limited scope in questions and evaluation metrics fails to comprehensively assess the reasoning capacity improvements enabled by GraphRAG models. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key advantages: (i) Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving. For example, some questions require mathematical reasoning or programming. (ii) Diverse task coverage. The dataset covers a broad spectrum of task formats: multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank. It spans 16 disciplines across 20 core textbooks. (iii) Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation. Beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning capabilities. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.
中文: GraphRAG-Bench作为一个大规模领域特定基准,通过具有挑战性的多跳推理问题、多样化的任务类型和整体评估框架,严格评估GraphRAG模型,揭示了基于图结构对大型语言模型推理能力提升的关键见解。
English: GraphRAG-Bench is introduced as a large-scale, domain-specific benchmark to rigorously evaluate GraphRAG models through challenging multi-hop reasoning questions, diverse task types, and a holistic evaluation framework, revealing critical insights into graph-based enhancements for LLMs.
Authors:Wenjun Xia, Chuang Niu, Ge Wang
Abstract:
Computed tomography (CT) is a major medical imaging modality. Clinical CT scenarios, such as low-dose screening, sparse-view scanning, and metal implants, often lead to severe noise and artifacts in reconstructed images, requiring improved reconstruction techniques. The introduction of deep learning has significantly advanced CT image reconstruction. However, obtaining paired training data remains rather challenging due to patient motion and other constraints. Although deep learning methods can still perform well with approximately paired data, they inherently carry the risk of hallucination due to data inconsistencies and model instability. In this paper, we integrate the data fidelity with the state-of-the-art generative AI model, referred to as the Poisson flow generative model (PFGM) with a generalized version PFGM++, and propose a novel CT framework: Flow-Oriented Reconstruction Conditioning Engine (FORCE). In our experiments, the proposed method shows superior performance in various CT imaging tasks, outperforming existing unsupervised reconstruction approaches.
中文:提出的FORCE框架将数据保真度与先进的PFGM++生成模型相结合,尽管获取配对训练数据存在挑战,但在各种CT重建任务中展现出优于现有无监督方法的性能。
English: The proposed FORCE framework integrates data fidelity with the advanced PFGM++ generative model, demonstrating superior performance in CT reconstruction tasks over existing unsupervised methods despite challenges in obtaining paired training data.
Authors:Duzhen Zhang, Yong Ren, Zhong-Zhi Li, Yahan Yu, Jiahua Dong, Chenxing Li, Zhilong Ji, Jinfeng Bai
Abstract:
Multimodal Continual Instruction Tuning (MCIT) aims to finetune Multimodal Large Language Models (MLLMs) to continually align with human intent across sequential tasks. Existing approaches often rely on the Mixture-of-Experts (MoE) LoRA framework to preserve previous instruction alignments. However, these methods are prone to Catastrophic Forgetting (CF), as they aggregate all LoRA blocks via simple summation, which compromises performance over time. In this paper, we identify a critical parameter inefficiency in the MoELoRA framework within the MCIT context. Based on this insight, we propose BranchLoRA, an asymmetric framework to enhance both efficiency and performance. To mitigate CF, we introduce a flexible tuning-freezing mechanism within BranchLoRA, enabling branches to specialize in intra-task knowledge while fostering inter-task collaboration. Moreover, we incrementally incorporate task-specific routers to ensure an optimal branch distribution over time, rather than favoring the most recent task. To streamline inference, we introduce a task selector that automatically routes test inputs to the appropriate router without requiring task identity. Extensive experiments on the latest MCIT benchmark demonstrate that BranchLoRA significantly outperforms MoELoRA and maintains its superiority across various MLLM sizes.
中文:BranchLoRA是一种非对称框架,通过引入灵活的调优-冻结机制和任务特定路由器,在多模态持续指令调优中提升效率和性能,有效缓解灾难性遗忘并随时间优化分支分配。
English: BranchLoRA is an asymmetric framework that enhances efficiency and performance in multimodal continual instruction tuning by introducing a flexible tuning-freezing mechanism and task-specific routers to mitigate catastrophic forgetting and optimize branch distribution over time.
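A minimal sketch of the tuning-freezing idea follows: per-task LoRA branches attached to a frozen linear layer, with earlier branches frozen when a new task arrives and a per-task router mixing branch outputs. Shapes, the router design, and the gating are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BranchedLoRALinear(nn.Module):
    """Frozen base layer plus per-task LoRA branches and per-task routers."""

    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)          # backbone stays frozen
        self.branches = nn.ModuleList()
        self.routers = nn.ModuleList()
        self.d_in, self.d_out, self.rank = d_in, d_out, rank

    def add_task(self):
        for b in self.branches:              # freeze branches from earlier tasks
            b.requires_grad_(False)
        self.branches.append(nn.Sequential(
            nn.Linear(self.d_in, self.rank, bias=False),
            nn.Linear(self.rank, self.d_out, bias=False)))
        self.routers.append(nn.Linear(self.d_in, len(self.branches)))

    def forward(self, x, task_id=-1):
        out = self.base(x)
        gates = torch.softmax(self.routers[task_id](x), dim=-1)
        for i in range(gates.shape[-1]):     # router covers branches up to its task
            out = out + gates[..., i:i + 1] * self.branches[i](x)
        return out

layer = BranchedLoRALinear(64, 64)
layer.add_task()                             # task 0: one trainable branch
print(layer(torch.randn(2, 64), task_id=0).shape)  # torch.Size([2, 64])
```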
Authors:Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su
Abstract:
Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities--such as text, images, video, and audio--while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
中文: 本研究探讨了模态扩展在训练全模态语言模型中的作用,分析其是否损害核心语言能力、模型合并能否实现真正全模态,以及相比顺序扩展是否能促进知识共享和泛化能力。
English: This study examines modality extension for training omni-modal language models, investigating whether it preserves core language abilities, if model merging achieves true omni-modality, and if it enables better knowledge sharing compared to sequential methods.
Authors:Yuanhe Tian, Mingjie Deng, Guoqing Jin, Yan Song
Abstract:
Existing approaches for Large language model (LLM) detoxification generally rely on training on large-scale non-toxic or human-annotated preference data, designing prompts to instruct the LLM to generate safe content, or modifying the model parameters to remove toxic information, which are computationally expensive, lack robustness, and often compromise LLMs' fluency and contextual understanding. In this paper, we propose a simple yet effective approach for LLM detoxification, which leverages a compact, pre-trained calibration model that guides the detoxification process of a target LLM via a lightweight intervention in its generation pipeline. By learning a detoxified embedding space from non-toxic data, the calibration model effectively steers the LLM away from generating harmful content. This approach only requires a one-time training of the calibration model that is able to be seamlessly applied to multiple LLMs without compromising fluency or contextual understanding. Experiment results on the benchmark dataset demonstrate that our approach reduces toxicity while maintaining reasonable content expression.
中文摘要:现有大语言模型脱毒方法通常计算成本高且影响模型性能,而我们提出的方法通过轻量级校准模型有效引导模型避开有害内容,同时保持其流畅性和上下文理解能力。
English Summary: Current LLM detoxification methods are often resource-intensive and can impair model performance, but our proposed approach uses a compact calibration model to effectively steer LLMs away from toxic content with minimal impact on fluency and understanding.
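One minimal reading of such a lightweight intervention is logit steering at decoding time: penalize candidate tokens whose embeddings align with a toxicity direction derived from the calibration model's detoxified embedding space. The sketch below is an illustrative stand-in, and `toxic_direction` is an assumed input rather than the paper's construction.

```python
import torch

def steer_logits(logits, token_embeddings, toxic_direction, alpha=2.0):
    """logits: (vocab,); token_embeddings: (vocab, d); toxic_direction: (d,)."""
    direction = toxic_direction / toxic_direction.norm()
    alignment = token_embeddings @ direction       # per-token toxicity score
    return logits - alpha * torch.relu(alignment)  # demote toxic-leaning tokens

vocab, d = 1000, 64
out = steer_logits(torch.randn(vocab), torch.randn(vocab, d), torch.randn(d))
print(out.shape)  # torch.Size([1000])
```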
Authors:Qijie Shen, Yuanchen Bei, Zihong Huang, Jialin Zhu, Keqin Xu, Boya Du, Jiawei Tang, Yuning Jiang, Feiran Huang, Xiao Huang, Hao Chen
Abstract:
Maintaining a healthy ecosystem in billion-scale online platforms is challenging, as users naturally gravitate toward popular items, leaving cold and less-explored items behind. This "rich-get-richer" phenomenon hinders the growth of potentially valuable cold items and harms the platform's ecosystem. Existing cold-start models primarily focus on improving initial recommendation performance for cold items but fail to address users' natural preference for popular content. In this paper, we introduce AliBoost, Alibaba's ecological boosting framework, designed to complement user-oriented natural recommendations and foster a healthier ecosystem. AliBoost incorporates a tiered boosting structure and boosting principles to ensure high-potential items quickly gain exposure while minimizing disruption to low-potential items. To achieve this, we propose the Stacking Fine-Tuning Cold Predictor to enhance the foundation CTR model's performance on cold items for accurate CTR and potential prediction. AliBoost then employs an Item-oriented Bidding Boosting mechanism to deliver cold items to the most suitable users while balancing boosting speed with user-personalized preferences. Over the past six months, AliBoost has been deployed across Alibaba's mainstream platforms, successfully cold-starting over a billion new items and increasing both clicks and GMV of cold items by over 60% within 180 days. Extensive online analysis and A/B testing demonstrate the effectiveness of AliBoost in addressing ecological challenges, offering new insights into the design of billion-scale recommender systems.
中文摘要:阿里Boost是阿里巴巴推出的生态助推框架,通过分层助推结构和精准潜力预测解决平台"富者愈富"现象,在保障用户体验的同时帮助高潜力冷门商品快速获得曝光,已成功助推超10亿新商品并实现冷门商品点击率和交易额增长超60%。
English Summary: AliBoost is Alibaba's ecological boosting framework that addresses the "rich-get-richer" phenomenon in billion-scale platforms by helping high-potential cold items gain exposure while balancing user preferences, successfully increasing cold item performance by over 60% in six months.
Authors:Yixin Wan, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, Rahul Gupta
Abstract:
Large Language Model (LLM) unlearning has recently gained significant attention, driven by the need to remove unwanted information, such as private, sensitive, or copyrighted content, from LLMs. However, conventional unlearning approaches indiscriminately update model parameters to forget all tokens in a target document, including common tokens (e.g., pronouns, prepositions, general nouns) that carry general knowledge. In this paper, we highlight that not every token needs forgetting. We propose Selective Unlearning (SU), which identifies a critical subset of tokens within the forgetting set that is relevant to the unwanted information, and unlearns only those tokens. Experiments on two benchmarks and six baseline unlearning algorithms demonstrate that SU not only achieves effective unlearning on the targeted forget data, but also significantly preserves the model's utility in the retaining set.
中文: 选择性遗忘(SU)方法通过仅移除与不需要信息相关的关键标记,有效实现目标数据的遗忘,同时显著保留模型在保留集上的实用性。
English: Selective Unlearning (SU) is introduced to remove only critical tokens related to unwanted information in LLMs, effectively forgetting targeted data while preserving model utility.
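The core mechanism reduces to a masked loss: apply the unlearning update (here, plain gradient ascent on the negative log-likelihood) only where a critical-token mask is set. How the mask is derived is the paper's contribution and is assumed given in this sketch.

```python
import torch
import torch.nn.functional as F

def selective_unlearning_loss(logits, labels, critical_mask):
    """logits: (B, T, V); labels: (B, T); critical_mask: (B, T) bool."""
    nll = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    # Ascend (negate) only on critical tokens; common tokens are untouched,
    # which is what preserves general knowledge in the retained set.
    return -(nll * critical_mask.float()).sum() / critical_mask.sum().clamp(min=1)
```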
Authors:Eunsu Kim, Haneul Yoo, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh
Abstract:
As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering a critical infrastructure for advancing LLM evaluation research.
中文: 本文介绍了BenchHub这一动态基准库,它整合并自动分类了来自38个基准的30.3万个问题,支持持续更新和可扩展数据管理,旨在解决现有基准数据集分散、难以定制评估的问题,并通过实验证明领域感知评估的重要性。
English: This paper introduces BenchHub, a dynamic repository that aggregates and classifies 303K questions from 38 benchmarks to enable flexible, domain-specific evaluation of large language models, addressing the challenges of scattered datasets and emphasizing performance variations across domains.
Authors:Xiaochen Wang, Zongyu Wu, Yuan Zhong, Xiang Zhang, Suhang Wang, Fenglong Ma
Abstract:
Graph retrieval-augmented generation (GRAG) places high demands on graph-specific retrievers. However, existing retrievers often rely on language models pretrained on plain text, limiting their effectiveness due to domain misalignment and structure ignorance. To address these challenges, we propose GPR, a graph-based retriever pretrained directly on knowledge graphs. GPR aligns natural language questions with relevant subgraphs through LLM-guided graph augmentation and employs a structure-aware objective to learn fine-grained retrieval strategies. Experiments on two datasets, three LLM backbones, and five baselines show that GPR consistently improves both retrieval quality and downstream generation, demonstrating its effectiveness as a robust retrieval solution for GRAG.
中文摘要:GPR是一种直接在知识图谱上预训练的图检索器,通过LLM引导的图增强和结构感知目标解决领域错位与结构忽略问题,在多基准测试中持续提升检索与生成效果。
English Summary: GPR is a novel graph-based retriever pretrained directly on knowledge graphs to overcome domain misalignment and structural limitations of existing methods, consistently enhancing retrieval and generation performance across multiple benchmarks.
Authors:Gabriel Aracena, Kyle Luster, Fabio Santos, Igor Steinmacher, Marco A. Gerosa
Abstract:
Effective prioritization of issue reports in software engineering helps to optimize resource allocation and information recovery. However, manual issue classification is laborious and lacks scalability. As an alternative, many open source software (OSS) projects employ automated processes for this task, yet this method often relies on large datasets for adequate training. Traditionally, machine learning techniques have been used for issue classification. More recently, large language models (LLMs) have emerged as powerful tools for addressing a range of software engineering challenges, including code and test generation, mapping new requirements to legacy software endpoints, and conducting code reviews. This research investigates an automated approach to issue classification based on LLMs. By leveraging the capabilities of such models, we aim to develop a robust system for prioritizing issue reports, mitigating the necessity for extensive training data while also maintaining reliability in classification. In our research, we developed an LLM-based approach for accurately labeling issues by selecting two of the most prominent large language models. We then compared their performance across multiple datasets. Our findings show that GPT-4o achieved the best results in classifying issues from the NLBSE 2024 competition. Moreover, GPT-4o outperformed DeepSeek R1, achieving an F1 score 20% higher when both models were trained on the same dataset from the NLBSE 2023 competition, which was ten times larger than the NLBSE 2024 dataset. The fine-tuned GPT-4o model attained an average F1 score of 80.7%, while the fine-tuned DeepSeek R1 model achieved 59.33%. Increasing the dataset size did not improve the F1 score, indicating that an effective issue classification solution need not depend on massive datasets.
中文: 本研究利用大语言模型开发自动化问题分类系统,其中GPT-4o表现最佳,在获得80.7% F1值的同时降低了对大规模训练数据的依赖。
English: This research develops an automated issue classification system using large language models (LLMs), with GPT-4o demonstrating superior performance by achieving an 80.7% F1 score while reducing dependency on extensive training datasets.
Authors:Gen Li, Yuchen Zhou, Yuting Wei, Yuxin Chen
Abstract:
In this paper, we explore provable acceleration of diffusion models without any additional retraining. Focusing on the task of approximating a target data distribution in $\mathbb{R}^d$ to within $\varepsilon$ total-variation distance, we propose a principled, training-free sampling algorithm that requires only on the order of $d^{1+2/K} \varepsilon^{-1/K}$ score function evaluations (up to a log factor) in the presence of accurate scores, where $K>0$ is an arbitrary fixed integer. This result applies to a broad class of target data distributions, without the need for assumptions such as smoothness or log-concavity. Our theory is robust vis-a-vis inexact score estimation, degrading gracefully as the score estimation error increases, without demanding higher-order smoothness on the score estimates as assumed in previous work. The proposed algorithm draws insight from high-order ODE solvers, leveraging high-order Lagrange interpolation and successive refinement to approximate the integral derived from the probability flow ODE. More broadly, our work develops a theoretical framework towards understanding the efficacy of high-order methods for accelerated sampling.
中文: 本文提出一种无需训练的采样算法,通过高阶ODE求解器实现扩散模型的可证明加速,仅需约 \( d^{1+2/K} \varepsilon^{-1/K} \) 次评分函数计算即可在\(\varepsilon\)误差内逼近数据分布,且不依赖分布平滑性或对数凹性假设。
English: This paper introduces a training-free sampling algorithm that achieves provable acceleration for diffusion models by requiring only approximately \( d^{1+2/K} \varepsilon^{-1/K} \) score evaluations to approximate data distributions within \(\varepsilon\) error, without needing distribution smoothness or log-concavity assumptions.
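The high-order mechanism can be illustrated with a generic multistep update: fit a Lagrange polynomial through the most recent drift evaluations of the probability-flow ODE and integrate it exactly over the next step (an Adams-Bashforth-style rule). This mirrors the interpolation idea only; the paper's algorithm adds successive refinement and comes with the stated guarantees.

```python
import numpy as np

def multistep_update(x, ts, fs, t_next):
    """Advance x from ts[-1] to t_next given past drift values fs[i] = f(ts[i], x_i)."""
    K = len(ts)
    x_new = np.asarray(x, dtype=float)
    for i in range(K):
        # Lagrange basis L_i(t) = prod_{j != i} (t - t_j) / (t_i - t_j);
        # its exact integral over [ts[-1], t_next] is the weight on fs[i].
        poly = np.poly1d([1.0])
        for j in range(K):
            if j != i:
                poly *= np.poly1d([1.0, -ts[j]]) / (ts[i] - ts[j])
        weight = poly.integ()(t_next) - poly.integ()(ts[-1])
        x_new = x_new + weight * np.asarray(fs[i])
    return x_new

# Toy check on dx/dt = -x from t = 0.1 to 0.2 (true value: exp(-0.2) ~ 0.8187).
print(multistep_update(0.9048, ts=[0.0, 0.1], fs=[-1.0, -0.9048], t_next=0.2))
```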
Authors:Veronica Lachi, Francesco Ferrini, Antonio Longa, Bruno Lepri, Andrea Passerini, Manfred Jaeger
Abstract:
Graph Neural Networks (GNNs) are widely used to compute representations of node pairs for downstream tasks such as link prediction. Yet, theoretical understanding of their expressive power has focused almost entirely on graph-level representations. In this work, we shift the focus to links and provide the first comprehensive study of GNN expressiveness in link representation. We introduce a unifying framework, the $k_\phi$-$k_\rho$-$m$ framework, that subsumes existing message-passing link models and enables formal expressiveness comparisons. Using this framework, we derive a hierarchy of state-of-the-art methods and offer theoretical tools to analyze future architectures. To complement our analysis, we propose a synthetic evaluation protocol comprising the first benchmark specifically designed to assess link-level expressiveness. Finally, we ask: does expressiveness matter in practice? We use a graph symmetry metric that quantifies the difficulty of distinguishing links and show that while expressive models may underperform on standard benchmarks, they significantly outperform simpler ones as symmetry increases, highlighting the need for dataset-aware model selection.
中文: 本研究首次系统分析了图神经网络在链接表示中的表达能力,提出了统一框架来比较不同方法,并证明尽管表达能力强的模型在标准基准测试中可能表现不佳,但在高对称性场景下显著优于简单模型。
English: This study provides the first comprehensive analysis of Graph Neural Networks' expressiveness in link representation, introducing a unifying framework to compare methods and demonstrating that expressive models outperform simpler ones in high-symmetry scenarios despite potential underperformance on standard benchmarks.
Authors:Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Dandan Zheng, Jingdong Chen, Yan Wang, Hao Ouyang, Kecheng Zheng, Yujun Shen
Abstract:
Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subjects transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., "cats" or "dogs") to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics causes the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-embedding semantic comprehension mechanism which disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines. Project page: https://lucaria-academy.github.io/SynMotion/
中文摘要:SynMotion是一种新型视频运动定制模型,通过结合语义引导和视觉适配来提升运动保真度和时序连贯性,在文本到视频和图像到视频生成任务中优于现有基准方法。
English Summary: SynMotion is a novel video motion customization model that integrates semantic guidance and visual adaptation to enhance motion fidelity and temporal coherence, outperforming existing baselines in text-to-video and image-to-video generation.
Authors:Baihe Ma, Yanna Jiang, Xu Wang, Guangsheng Yu, Qin Wang, Caijun Sun, Chen Li, Xuelei Qi, Ying He, Wei Ni, Ren Ping Liu
Abstract:
As Large Language Models (LLMs) are increasingly deployed in sensitive domains, traditional data privacy measures prove inadequate for protecting information that is implicit, contextual, or inferable, which we define as semantic privacy. This Systematization of Knowledge (SoK) introduces a lifecycle-centric framework to analyze how semantic privacy risks emerge across input processing, pretraining, fine-tuning, and alignment stages of LLMs. We categorize key attack vectors and assess how current defenses, such as differential privacy, embedding encryption, edge computing, and unlearning, address these threats. Our analysis reveals critical gaps in semantic-level protection, especially against contextual inference and latent representation leakage. We conclude by outlining open challenges, including quantifying semantic leakage, protecting multimodal inputs, balancing de-identification with generation quality, and ensuring transparency in privacy enforcement. This work aims to inform future research on designing robust, semantically aware privacy-preserving techniques for LLMs.
中文: 本系统化知识提出了一个生命周期框架来分析大语言模型中的语义隐私风险,指出了在上下文推断和潜在泄漏防护方面的关键不足,并为未来开发鲁棒的隐私保护技术规划了研究挑战。
English: This Systematization of Knowledge proposes a lifecycle framework to analyze semantic privacy risks in Large Language Models, identifying critical gaps in protection against contextual inference and latent leakage while outlining future research challenges for robust privacy techniques.
Authors:Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, Zhi-Hua Zhou
Abstract:
The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To the best of our knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for the alignment of LLMs as well as multi-modal models.
中文: 本研究发现通过标准下一词预测训练的任意大语言模型中已内在地存在强大的通用奖励模型,无需额外训练即可直接提取高质量奖励信号,并从理论上证明使用该内生奖励进行强化学习可获得更优的策略性能边界。
English: This study reveals that a powerful generalist reward model inherently exists within any LLM trained via standard next-token prediction, enabling direct extraction of high-quality reward signals without additional training and theoretically proving that reinforcement learning with this endogenous reward yields superior policy performance.
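One natural concrete reading of eliciting a reward from a base model is to score a response by its log-likelihood under that model. The sketch below assumes a Hugging Face-style causal LM whose forward pass returns an object with a `.logits` field, and length-normalizes for stability; it is an illustrative proxy, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def endogenous_reward(model, prompt_ids, response_ids):
    """prompt_ids, response_ids: 1-D LongTensors of token ids."""
    ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)
    logits = model(ids).logits[0, :-1]            # logits predicting tokens 1..L-1
    logp = F.log_softmax(logits, dim=-1)
    targets = ids[0, 1:]
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    start = prompt_ids.numel() - 1                # first response-token prediction
    return token_logp[start:].mean().item()       # length-normalized response score
```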
Authors:Pinzheng Wang, Juntao Li, Zecheng Tang, Haijia Gui, Min Zhang
Abstract:
Large language models (LLMs) have demonstrated considerable reasoning abilities in various tasks such as mathematics and coding. However, recent studies indicate that even the best models lack true comprehension of their reasoning processes. In this paper, we explore how self-play can enhance the rationality of models in the reasoning process without supervision from humans or superior models. We design a Critic-Discernment Game (CDG) in which a prover first provides a solution to a given problem and is subsequently challenged by critiques of its solution. These critiques either aim to assist or mislead the prover. The objective of the prover is to maintain the correct answer when faced with misleading comments, while correcting errors in response to constructive feedback. Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process.
Chinese: 本文提出了一种名为“批评-辨别游戏”的自博弈方法,通过让模型在无需人类监督的情况下区分有益与误导性评论,显著提升了大型语言模型对自身推理过程的理解能力。
English: This paper introduces a self-play method called Critic-Discernment Game (CDG) that enhances large language models' reasoning comprehension by having them distinguish between helpful and misleading critiques without human supervision.
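A single round of the game fits in a few lines. `prover` and `critic` below are hypothetical stand-ins for LLM calls, and scoring final-answer correctness is an illustrative reading of the objective: it rewards resisting misleading critiques and acting on constructive ones.

```python
import random

def cdg_round(problem, answer_key, prover, critic):
    """One self-play round; `prover` and `critic` are hypothetical LLM callables."""
    solution = prover(problem)
    misleading = random.random() < 0.5            # critic's hidden role this round
    comment = critic(problem, solution, mislead=misleading)
    revised = prover(problem, prior=solution, critique=comment)
    # Final-answer correctness rewards both holding ground against misleading
    # comments and fixing genuine errors flagged by constructive ones.
    return 1.0 if revised == answer_key else -1.0
```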
Authors:Xinyu Chen, Vassilis Digalakis, Lijun Ding, Dingyi Zhuang, Jinhua Zhao
Abstract:
Time series autoregression (AR) is a classical tool for modeling auto-correlations and periodic structures in real-world systems. We revisit this model from an interpretable machine learning perspective by introducing sparse autoregression (SAR), where $\ell_0$-norm constraints are used to isolate dominant periodicities. We formulate exact mixed-integer optimization (MIO) approaches for both stationary and non-stationary settings and introduce two scalable extensions: a decision variable pruning (DVP) strategy for temporally-varying SAR (TV-SAR), and a two-stage optimization scheme for spatially- and temporally-varying SAR (STV-SAR). These models enable scalable inference on real-world spatiotemporal datasets. We validate our framework on large-scale mobility and climate time series. On NYC ridesharing data, TV-SAR reveals interpretable daily and weekly cycles as well as long-term shifts due to COVID-19. On climate datasets, STV-SAR uncovers the evolving spatial structure of temperature and precipitation seasonality across four decades in North America and detects global sea surface temperature dynamics, including El Niño. Together, our results demonstrate the interpretability, flexibility, and scalability of sparse autoregression for periodicity quantification in complex time series.
中文摘要:本研究通过引入稀疏自回归模型,利用ℓ₀范数约束识别时间序列中的主要周期性模式,开发的可扩展优化方法成功揭示了交通出行和气候数据中具有解释性的周期规律。
English Summary: This study introduces sparse autoregression (SAR) using ℓ₀-norm constraints to identify key periodic patterns in time series, developing scalable optimization methods that successfully reveal interpretable cycles in mobility and climate data.
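At toy scale the $\ell_0$-constrained objective can be solved by exhaustive subset search, which makes the model easy to see even though the paper solves it exactly with mixed-integer optimization. The sketch below picks the best subset of at most k lags by least squares; the weekly-cycle example data are synthetic.

```python
import numpy as np
from itertools import combinations

def sparse_ar(y, max_lag=12, k=2):
    """Best k-sparse AR fit over lags 1..max_lag by exhaustive search."""
    Y = y[max_lag:]
    X = np.column_stack([y[max_lag - l : len(y) - l] for l in range(1, max_lag + 1)])
    best = (np.inf, None, None)
    for lags in combinations(range(max_lag), k):
        A = X[:, lags]
        coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
        rss = np.sum((Y - A @ coef) ** 2)
        if rss < best[0]:
            best = (rss, [l + 1 for l in lags], coef)
    return best[1], best[2]   # dominant lags and their coefficients

t = np.arange(400)
y = np.sin(2 * np.pi * t / 7) + 0.1 * np.random.randn(400)   # noisy weekly cycle
print(sparse_ar(y, max_lag=14, k=2))
```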
Authors:Shansong Wang, Zhecheng Jin, Mingzhe Hu, Mojtaba Safari, Feng Zhao, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang
Abstract:
CLIP models pretrained on natural images with billion-scale image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual answering. However, transferring this success to biomedicine is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. These limitations hinder the development of a unified and generalizable biomedical foundation model trained from scratch. To overcome this, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, followed by feature-level distillation using over 19.2 million feature pairs extracted from teacher models. We evaluate MMKD-CLIP on 58 diverse biomedical datasets, encompassing over 10.8 million biomedical images across nine image modalities. The evaluation spans six core task types: zero-shot classification, linear probing, cross-modal retrieval, visual question answering, survival prediction, and cancer diagnosis. MMKD-CLIP consistently outperforms all teacher models while demonstrating remarkable robustness and generalization across image domains and task settings. These results underscore that multi-teacher knowledge distillation is a scalable and effective paradigm for building high-performing biomedical foundation models under the practical constraints of real-world data availability.
中文: MMKD-CLIP通过多教师知识蒸馏整合多个专业CLIP模型的知识,在数据资源有限的情况下实现了跨生物医学图像模态和任务的卓越性能与泛化能力。
English: MMKD-CLIP is a biomedical foundation model that overcomes data limitations by distilling knowledge from multiple specialized CLIP models, demonstrating superior performance and generalization across diverse biomedical tasks and image modalities.
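The feature-level distillation stage admits a compact sketch: project the student's features into each teacher's space and minimize a cosine distance to the (detached) teacher features. The per-teacher projection heads and the equal-weight average are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistill(nn.Module):
    """Cosine-distance distillation from several frozen teacher encoders."""

    def __init__(self, d_student, teacher_dims):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Linear(d_student, d_t) for d_t in teacher_dims])

    def forward(self, student_feat, teacher_feats):
        loss = 0.0
        for proj, t_feat in zip(self.proj, teacher_feats):
            s = F.normalize(proj(student_feat), dim=-1)
            t = F.normalize(t_feat.detach(), dim=-1)   # teachers give targets only
            loss = loss + (1.0 - (s * t).sum(-1)).mean()
        return loss / len(self.proj)
```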
Authors:Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique
Abstract:
As large language models (LLMs) grow more capable, they face growing vulnerability to sophisticated jailbreak attacks. While developers invest heavily in alignment finetuning and safety guardrails, researchers continue publishing novel attacks, driving progress through adversarial iteration. This dynamic mirrors a strategic game of continual evolution. However, two major challenges hinder jailbreak development: the high cost of querying top-tier LLMs and the short lifespan of effective attacks due to frequent safety updates. These factors limit cost-efficiency and practical impact of research in jailbreak attacks. To address this, we propose MetaCipher, a low-cost, multi-agent jailbreak framework that generalizes across LLMs with varying safety measures. Using reinforcement learning, MetaCipher is modular and adaptive, supporting extensibility to future strategies. Within as few as 10 queries, MetaCipher achieves state-of-the-art attack success rates on recent malicious prompt benchmarks, outperforming prior jailbreak methods. We conduct a large-scale empirical evaluation across diverse victim models and benchmarks, demonstrating its robustness and adaptability. Warning: This paper contains model outputs that may be offensive or harmful, shown solely to demonstrate jailbreak efficacy.
中文: 随着大语言模型能力增强,其面临越狱攻击的脆弱性日益凸显,而MetaCipher框架以最少查询量有效突破不同模型的安全防护,展现出卓越的适应性和攻击成功率。
English: As large language models become more advanced, they are increasingly susceptible to jailbreak attacks, and the proposed MetaCipher framework effectively overcomes these vulnerabilities with minimal queries while adapting across various models.
Authors:Wei Jiang, Hans D. Schotten
Abstract:
This letter analyzes the effects of power amplifiers (PAs) on the downlink of cell-free massive MIMO systems. We model signal transmission incorporating nonlinear PA distortion and derive a unified spectral efficiency (SE) expression applicable to arbitrary precoding schemes. To combat PA-induced performance degradation, a joint optimization approach for user association and max-min power control is proposed. Furthermore, a low-complexity alternative is developed to approximate the joint optimization with reduced computational overhead. Simulations validate the analysis and demonstrate significant performance gains of the proposed approaches over conventional techniques.
中文: 本文分析了功率放大器对无蜂窝大规模MIMO系统下行链路的影响,提出了联合优化和低复杂度替代方案以减轻性能损失,相比传统方法实现了显著性能提升。
English: This letter analyzes power amplifier effects in cell-free massive MIMO systems, proposing joint optimization and low-complexity alternatives to mitigate performance degradation while achieving significant gains over conventional methods.
Authors:Wei Jiang, Hans D. Schotten
Abstract:
Massive multi-input multi-output (MIMO) has evolved along two tracks: cellular and cell-free, each with unique advantages and limitations. The cellular approach suffers from worse user spectral efficiency at cell edges, whereas the cell-free approach incurs high implementation costs due to a large-scale distributed infrastructure. This paper introduces a novel networking paradigm, termed heterogeneous massive MIMO (HmMIMO), which seamlessly integrates co-located and distributed antennas. Differing from the two conventional paradigms, HmMIMO retains a base station with a large antenna array at the center of each cell, aided by distributed antennas deployed at cell edges. Our findings demonstrate that this paradigm achieves a favorable trade-off between performance and implementation complexity.
中文: 本文提出异构大规模MIMO(HmMIMO)新范式,通过融合集中式与分布式天线,在蜂窝和无蜂窝两种传统模式间实现了性能与实施复杂度的优化平衡。
English: The paper proposes a heterogeneous massive MIMO (HmMIMO) paradigm that combines co-located and distributed antennas to balance performance and implementation complexity, overcoming limitations of cellular and cell-free approaches.
Authors:Yixin Zha, Chuxin Wang, Wenfei Yang, Tianzhu Zhang
Abstract:
Point cloud understanding aims to acquire robust and general feature representations from unlabeled data. Masked point modeling-based methods have recently shown significant performance across various downstream tasks. These pre-training methods rely on random masking strategies to establish the perception of point clouds by restoring corrupted point cloud inputs, which prevents the self-supervised models from capturing reasonable semantic relationships. To address this issue, we propose Semantic Masked Autoencoder, which comprises two main components: a prototype-based component semantic modeling module and a component semantic-enhanced masking strategy. Specifically, in the component semantic modeling module, we design a component semantic guidance mechanism to direct a set of learnable prototypes in capturing the semantics of different components from objects. Leveraging these prototypes, we develop a component semantic-enhanced masking strategy that addresses the limitations of random masking in effectively covering complete component structures. Furthermore, we introduce a component semantic-enhanced prompt-tuning strategy, which further leverages these prototypes to improve the performance of pre-trained models in downstream tasks. Extensive experiments conducted on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our proposed modules.
中文摘要:本文提出了一种语义掩码自编码器,通过可学习原型捕捉组件语义并优化掩码策略,解决了点云预训练中随机掩码的局限性,在多个数据集上的实验验证了其有效性。
English Summary: The paper introduces a Semantic Masked Autoencoder that overcomes the limitations of random masking in point cloud pre-training by using learnable prototypes to capture component semantics and enhance masking strategies, validated through extensive experiments on multiple datasets.
Authors:Hongchao Zhang, Manan Tayal, Jackson Cox, Pushpak Jagtap, Shishir Kolathaya, Andrew Clark
Abstract:
Control Barrier Functions (CBFs) are utilized to ensure the safety of control systems. CBFs act as safety filters in order to provide safety guarantees without compromising system performance. These safety guarantees rely on the construction of valid CBFs. Due to their complexity, CBFs can be represented by neural networks, known as neural CBFs (NCBFs). Existing works on the verification of the NCBF focus on the synthesis and verification of NCBFs in deterministic settings, leaving the stochastic NCBFs (SNCBFs) less studied. In this work, we propose a verifiably safe synthesis for SNCBFs. We consider the cases of smooth SNCBFs with twice-differentiable activation functions and SNCBFs that utilize the Rectified Linear Unit (ReLU) activation function. We propose a verification-free synthesis framework for smooth SNCBFs and a verification-in-the-loop synthesis framework for both smooth and ReLU SNCBFs, and we validate our frameworks in three cases, namely the inverted pendulum, Darboux, and the unicycle model.
Chinese: 本研究提出了针对随机神经控制屏障函数(SNCBF)的可验证安全合成框架,涵盖平滑和ReLU激活函数两种情况,并在三个控制系统中进行了验证。
English: This work introduces a verifiably safe synthesis framework for stochastic neural control barrier functions (SNCBFs), addressing both smooth and ReLU activation functions, and validates it on three control systems.
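For background, the deterministic safety filter that CBFs implement is a small quadratic program: minimally perturb the nominal control so that $\dot{h}(x) \ge -\alpha h(x)$. With a single affine constraint it has the closed-form projection below; the stochastic SNCBF condition studied in the paper adds Itô correction terms not shown here.

```python
import numpy as np

def cbf_filter(u_nom, grad_h, f, g, h, alpha=1.0):
    """Safety filter for dx = f(x) + g(x) u, with barrier value h = h(x)."""
    # Constraint: grad_h @ (f + g u) + alpha * h >= 0  <=>  a @ u >= b.
    a = grad_h @ g
    b = -(grad_h @ f) - alpha * h
    if a @ u_nom >= b or not np.any(a):            # nominal input already safe
        return u_nom
    return u_nom + (b - a @ u_nom) / (a @ a) * a   # closed-form QP projection

# Single integrator dx = u with h(x) = 1 - ||x||^2, state near the boundary:
x = np.array([0.9, 0.0])
u = cbf_filter(u_nom=np.array([1.0, 0.0]), grad_h=-2 * x,
               f=np.zeros(2), g=np.eye(2), h=1 - x @ x)
print(u)  # outward motion is damped to keep h decaying no faster than allowed
```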
Authors:Jason Lim, Florian Richter, Zih-Yun Chiu, Jaeyon Lee, Ethan Quist, Nathan Fisher, Jonathan Chambers, Steven Hong, Michael C. Yip
Abstract:
Robotic teleoperation over long communication distances poses challenges due to delays in commands and feedback from network latency. One simple yet effective strategy to reduce errors and increase performance under delay is to downscale the relative motion between the operating surgeon and the robot. The question remains what the optimal scaling factor is, and how this value changes with the level of latency and operator tendencies. We present user studies investigating the relationship between latency, scaling factor, and performance. The results of our studies demonstrate a statistically significant difference in performance between users and across scaling factors for certain levels of delay. These findings indicate that the optimal scaling factor for a given level of delay is specific to each user, motivating the need for personalized models for optimal performance. We present techniques to model the user-specific mapping of latency level to scaling factor for optimal performance, leading to an efficient and effective solution to optimizing performance of robotic teleoperation and specifically telesurgery under large communication delay.
中文: 研究表明,在延迟条件下,机器人远程操作的最佳运动缩放比例因人而异,因此需要个性化模型来有效提升远程手术的性能。
English: The study finds that optimal motion scaling in robotic teleoperation under latency varies per user, necessitating personalized models to enhance telesurgery performance effectively.
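A minimal sketch of such a personalized model: fit a low-order polynomial mapping measured latency to each user's best-performing scaling factor. The functional form and the calibration numbers are illustrative assumptions; the studies only establish that the mapping is user-specific.

```python
import numpy as np

def fit_user_scaling(latencies_ms, best_scales, degree=2):
    """Per-user model: latency (ms) -> motion scaling factor."""
    coeffs = np.polyfit(latencies_ms, best_scales, degree)
    return np.poly1d(coeffs)

# Hypothetical calibration data for one operator.
lat = np.array([0, 250, 500, 750, 1000])
scale = np.array([1.0, 0.8, 0.55, 0.45, 0.4])
model = fit_user_scaling(lat, scale)
print(f"suggested scale at 600 ms: {model(600):.2f}")
```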
Authors:Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, Alessandro Oliva, Giuseppe Lisanti, Luigi Di Stefano
Abstract:
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.
中文:SiM3D是首个融合多视角和多模态信息的3D异常检测与分割基准,专注于单实例训练场景下的合成数据到真实数据泛化挑战,并提供工业级高质量数据集及基线评估方法。
English: SiM3D is the first benchmark integrating multiview and multimodal data for 3D anomaly detection and segmentation, addressing single-instance training with synthetic-to-real generalization and providing high-quality industrial data with baseline evaluations.
Authors:Chuxin Wang, Yixin Zha, Wenfei Yang, Tianzhu Zhang
Abstract:
Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Model (SSM) with the efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: Destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without voting strategy.
中文摘要:StruMamba3D通过引入空间状态和自适应策略,解决了Mamba方法在点云学习中破坏空间邻接性和长序列记忆的问题,在ModelNet40和ScanObjectNN基准测试中取得了最优性能。
English Summary: StruMamba3D addresses limitations in Mamba-based point cloud learning by preserving spatial dependencies through spatial states and adaptive strategies, achieving state-of-the-art accuracy on benchmarks like ModelNet40 and ScanObjectNN.
Authors:Zhixin Cheng, Jiacheng Deng, Xinjun Li, Xiaotian Yin, Bohao Liao, Baoqun Yin, Wenfei Yang, Tianzhu Zhang
Abstract:
Detection-free methods typically follow a coarse-to-fine pipeline, extracting image and point cloud features for patch-level matching and refining dense pixel-to-point correspondences. However, differences in feature channel attention between images and point clouds may lead to degraded matching results, ultimately impairing registration accuracy. Furthermore, similar structures in the scene could lead to redundant correspondences in cross-modal matching. To address these issues, we propose Channel Adaptive Adjustment Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances intra-modal features and suppresses cross-modal sensitivity, while GOS replaces local selection with global optimization. Experiments on RGB-D Scenes V2 and 7-Scenes demonstrate the superiority of our method, achieving state-of-the-art performance in image-to-point cloud registration.
Chinese: 该方法通过引入通道自适应调整模块(CAA)增强模态内特征并抑制跨模态敏感性,结合全局最优选择模块(GOS)进行全局优化,在图像到点云配准中实现了最优性能。
English: The proposed method introduces a Channel Adaptive Adjustment Module (CAA) to enhance intra-modal features and reduce cross-modal sensitivity, along with a Global Optimal Selection Module (GOS) for global optimization, achieving state-of-the-art performance in image-to-point cloud registration.
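The contrast between local selection and global optimization that GOS exploits can be seen on a toy similarity matrix; the sketch below uses the Hungarian algorithm as a stand-in for global one-to-one selection (the matrix values are illustrative, and the paper's GOS formulation may differ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy similarity matrix between image patches (rows) and point-cloud
# patches (columns); two similar structures make rows 0 and 1 ambiguous.
sim = np.array([[0.90, 0.40, 0.10],
                [0.95, 0.50, 0.20],
                [0.10, 0.20, 0.70]])

# Local selection: each row independently takes its best column, so
# rows 0 and 1 collide on column 0 (a redundant correspondence).
local = sim.argmax(axis=1)                # -> [0, 0, 2]

# Global selection: maximize total similarity under a one-to-one
# constraint; the collision is resolved jointly.
rows, cols = linear_sum_assignment(-sim)  # negate to maximize
print(local, list(cols))                  # -> [0 0 2] [0, 1, 2]
```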
Authors:Qizhi Xie, Kun Yuan, Yunpeng Qu, Jiachao Gong, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu
Abstract:
Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. Yet, a score fails to describe the video's complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems, limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system's reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. The resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs' quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs.
中文摘要:本文提出基于分数的指令生成(SIG)管道,通过自动化生成视频质量评估指令数据,使视频大模型能够模拟人类推理过程进行多维度质量评价,显著提升了视频质量评分与理由阐述能力。
English Summary: This paper introduces the Score-based Instruction Generation (SIG) pipeline to automatically create video quality instruction data, enabling video large multimodal models to provide detailed quality assessments through a hierarchical reasoning process that mimics human evaluation.
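The score-to-level mapping at the start of the SIG pipeline can be pictured as simple uniform binning; the thresholds and dimension names below are illustrative assumptions, not the paper's exact scheme:

```python
# Map numeric quality scores to text-defined levels by uniform binning.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]

def score_to_level(score: float, lo: float = 0.0, hi: float = 1.0) -> str:
    """Bin a score in [lo, hi] into one of five text-defined levels."""
    idx = int((score - lo) / (hi - lo) * len(LEVELS))
    return LEVELS[min(idx, len(LEVELS) - 1)]

dims = {"sharpness": 0.82, "flicker": 0.35, "exposure": 0.61}
print({d: score_to_level(s) for d, s in dims.items()})
# -> {'sharpness': 'excellent', 'flicker': 'poor', 'exposure': 'good'}
```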
Authors:Zhixuan Chen, Junlin Hou, Liqi Lin, Yihui Wang, Yequan Bie, Xi Wang, Yanning Zhou, Ronald Cheong Kin Chan, Hao Chen
Abstract:
Pathology image segmentation is crucial in computational pathology for analyzing histological features relevant to cancer diagnosis and prognosis. However, current methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. To address these limitations, we propose PathSegmentor, the first text-prompted segmentation foundation model designed specifically for pathology images. We also introduce PathSeg, the largest and most comprehensive dataset for pathology segmentation, built from 21 public sources and containing 275k image-mask-label triples across 160 diverse categories. With PathSegmentor, users can perform semantic segmentation using natural language prompts, eliminating the need for laborious spatial inputs such as points or boxes. Extensive experiments demonstrate that PathSegmentor outperforms specialized models with higher accuracy and broader applicability, while maintaining a compact architecture. It significantly surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice scores, respectively, showing strong robustness in segmenting complex structures and generalizing to external datasets. Moreover, PathSegmentor's outputs enhance the interpretability of diagnostic models through feature importance estimation and imaging biomarker discovery, offering pathologists evidence-based support for clinical decision-making. This work advances the development of explainable AI in precision oncology.
中文摘要:PathSegmentor是首个专用于病理图像分割的文本提示基础模型,通过自然语言输入实现精准分割并显著超越现有方法,其构建的PathSeg最大规模数据集为精准肿瘤学的可解释AI发展提供了重要支撑。
English Summary: PathSegmentor is a pioneering text-prompted foundation model for pathology image segmentation that achieves superior accuracy and broad applicability using natural language inputs, while its comprehensive PathSeg dataset enables robust performance across diverse medical categories.
Authors:Rachel Luo, Heng Yang, Michael Watson, Apoorva Sharma, Sushant Veer, Edward Schmerling, Marco Pavone
Abstract:
Learning-based robotic systems demand rigorous validation to assure reliable performance, but extensive real-world testing is often prohibitively expensive, and if conducted may still yield insufficient data for high-confidence guarantees. In this work we introduce Sim2Val, a general estimation framework that leverages paired data across test platforms, e.g., paired simulation and real-world observations, to achieve better estimates of real-world metrics via the method of control variates. By incorporating cheap and abundant auxiliary measurements (for example, simulator outputs) as control variates for costly real-world samples, our method provably reduces the variance of Monte Carlo estimates and thus requires significantly fewer real-world samples to attain a specified confidence bound on the mean performance. We provide theoretical analysis characterizing the variance and sample-efficiency improvement, and demonstrate empirically in autonomous driving and quadruped robotics settings that our approach achieves high-probability bounds with markedly improved sample efficiency. Our technique can lower the real-world testing burden for validating the performance of the stack, thereby enabling more efficient and cost-effective experimental evaluation of robotic systems.
中文摘要:Sim2Val框架通过结合仿真与真实世界数据,利用控制变量法显著减少机器人系统验证所需的真实测试次数,同时保持统计置信度。
English Summary: The Sim2Val framework uses paired simulation and real-world data with control variates to significantly reduce the number of required real-world tests for validating robotic systems while maintaining statistical confidence.
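The method of control variates at the heart of Sim2Val is easy to state: given costly real-world samples paired with cheap simulator outputs whose mean is known (or well estimated), subtract the correlated part. A minimal numerical sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic paired data: y = costly real-world metric, x = cheap simulator
# output for the same test case; mu_x is the simulator mean, estimable
# from abundant simulator-only runs.
n = 200
x = rng.normal(0.0, 1.0, n)
y = 0.8 * x + rng.normal(0.0, 0.5, n)  # correlated real-world metric
mu_x = 0.0

# Optimal control-variate coefficient c* = Cov(y, x) / Var(x).
c = np.cov(y, x)[0, 1] / np.var(x, ddof=1)

naive = y.mean()                   # plain Monte Carlo estimate
cv = (y - c * (x - mu_x)).mean()   # control-variate estimate

# With c*, the estimator's variance shrinks by a factor of 1 - corr(y, x)^2,
# which is why fewer real-world samples reach the same confidence bound.
print(naive, cv)
```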
Authors:Yingji Zhang, Danilo S. Carvalho, André Freitas
Abstract:
Integrating compositional and symbolic properties into current distributional semantic spaces can enhance the interpretability, controllability, compositionality, and generalisation capabilities of Transformer-based auto-regressive language models (LMs). In this survey, we offer a novel perspective on latent space geometry through the lens of compositional semantics, a direction we refer to as \textit{semantic representation learning}. This direction enables a bridge between symbolic and distributional semantics, helping to mitigate the gap between them. We review and compare three mainstream autoencoder architectures-Variational AutoEncoder (VAE), Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE)-and examine the distinctive latent geometries they induce in relation to semantic structure and interpretability.
中文摘要:将组合性与符号特性融入分布式语义空间,能通过语义表示学习弥合符号与分布式语义间的鸿沟,从而提升基于Transformer的自回归语言模型的可解释性与泛化能力。
English Summary: Integrating compositional and symbolic properties into distributional semantic spaces improves Transformer language models' interpretability and generalization by bridging symbolic and distributional semantics through semantic representation learning.
Authors:Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao
Abstract:
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.
中文摘要:本研究提出监督强化微调(SRFT)方法,通过基于熵的权重机制将监督微调与强化学习统一在单阶段训练中,在数学推理基准测试中以59.1%的平均准确率显著超越现有方法。
English summary: This study introduces Supervised Reinforcement Fine-Tuning (SRFT), a unified single-stage method that combines supervised fine-tuning and reinforcement learning through entropy-aware mechanisms, achieving superior performance on mathematical reasoning benchmarks with 59.1% average accuracy.
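A hedged sketch of what a single-stage, entropy-aware combination of SFT and RL objectives could look like in PyTorch; the exact weighting scheme in SRFT is not specified here, so the `exp(-entropy)` weight is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def srft_style_loss(logits_demo, demo_tokens, logits_roll, roll_tokens, adv):
    """Single-stage mix of SFT (on demonstrations) and RL (on rollouts),
    weighted by policy entropy. Shapes: logits (B, T, V), tokens/adv (B, T)."""
    # Token-level entropy of the current policy on demonstration prefixes.
    probs = logits_demo.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()

    # SFT term: cross-entropy on demonstration tokens.
    sft = F.cross_entropy(logits_demo.flatten(0, 1), demo_tokens.flatten())

    # RL term: policy-gradient surrogate on self-exploration rollouts.
    logp = logits_roll.log_softmax(-1).gather(-1, roll_tokens.unsqueeze(-1)).squeeze(-1)
    rl = -(adv * logp).mean()

    # Assumed weighting, not the paper's: lean on SFT when policy entropy
    # is low and on RL exploration otherwise; w lies in (0, 1].
    w = torch.exp(-entropy.detach())
    return w * sft + (1.0 - w) * rl
```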
Authors:Tomoyuki Morimae, Yuki Shirakawa, Takashi Yamakawa
Abstract:
Indistinguishability obfuscation (iO) has emerged as a powerful cryptographic primitive with many implications. While classical iO, combined with the infinitely-often worst-case hardness of $\mathsf{NP}$, is known to imply one-way functions (OWFs) and a range of advanced cryptographic primitives, the cryptographic implications of quantum iO remain poorly understood. In this work, we initiate a study of the power of quantum iO. We define several natural variants of quantum iO, distinguished by whether the obfuscation algorithm, evaluation algorithm, and description of obfuscated program are classical or quantum. For each variant, we identify quantum cryptographic primitives that can be constructed under the assumption of quantum iO and the infinitely-often quantum worst-case hardness of $\mathsf{NP}$ (i.e., $\mathsf{NP}\not\subseteq\mathsf{\text{i.o.} BQP}$). In particular, we construct pseudorandom unitaries, QCCC quantum public-key encryption and (QCCC) quantum symmetric-key encryption, and several primitives implied by them such as one-way state generators, (efficiently-verifiable) one-way puzzles, and EFI pairs, etc. While our main focus is on quantum iO, even in the classical setting, our techniques yield a new and arguably simpler construction of OWFs from classical (imperfect) iO and the infinitely-often worst-case hardness of $\mathsf{NP}$.
中文: 本研究探讨量子不可区分混淆及其变体,证明在量子最坏情况下NP难度的假设下,它们可构建伪随机幺正算子和量子加密等量子密码原语。
English: This study explores quantum indistinguishability obfuscation (iO) and its variants, demonstrating how they can construct quantum cryptographic primitives like pseudorandom unitaries and quantum encryption under the assumption of quantum worst-case hardness of NP.
Authors:Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou
Abstract:
Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To this end, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model's performance for software engineering capabilities in LLMs continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among the Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
Chinese: 本研究提出了一种自动化数据整理流程,显著扩展了软件工程数据集的规模和多样性,使得Skywork-SWE模型在基于大语言模型的软件工程任务中通过数据驱动的持续改进实现了最先进的性能表现。
English: This study introduces an automated data-curation pipeline that significantly scales software engineering datasets, enabling the Skywork-SWE model to achieve state-of-the-art performance in LLM-based software engineering tasks through continuous data-driven improvement.
Authors:Peter Frank, Falk Dettinger, Daniel Dittler, Pascal Häbig, Nasser Jazdi, Kai Hufendiek, Michael Weyrich
Abstract:
Inspection and maintenance of offshore platforms are associated with high costs, primarily due to the significant personnel requirements and challenging operational conditions. This paper first presents a classification of Power to X platforms. Building upon this foundation, a communication architecture is proposed to enable monitoring, control, and teleoperation for a Power to X platform. To reduce the demand for human labor, a robotic system is integrated to autonomously perform inspection and maintenance tasks. The implementation utilizes a quadruped robot. Remote monitoring, control, and teleoperation of the robot are analyzed within the context of a 5G standalone network. As part of the evaluation, aspects such as availability and latency are recorded, compared, and critically assessed.
中文: 本文提出了一种集成5G网络的四足机器人系统,用于自主执行海上Power to X平台的检查与维护任务,以降低高昂的人员成本和操作难度。
English: This paper introduces a robotic system using a quadruped robot integrated with a 5G network to autonomously perform inspection and maintenance on offshore Power to X platforms, aiming to reduce high personnel costs and operational challenges.
Authors:Kejia Bian, Meixia Tao, Shu Sun, Jun Yu
Abstract:
Neural ray tracing (RT) has emerged as a promising paradigm for channel modeling by combining physical propagation principles with neural networks. It enables high modeling accuracy and efficiency. However, current neural RT methods face two key limitations: constrained generalization capability due to strong spatial dependence, and weak adherence to electromagnetic laws. In this paper, we propose GeNeRT, a Generalizable Neural RT framework with enhanced generalization, accuracy and efficiency. GeNeRT supports both intra-scenario spatial transferability and inter-scenario zero-shot generalization. By incorporating Fresnel-inspired neural network design, it also achieves higher accuracy in multipath component (MPC) prediction. Furthermore, a GPU-tensorized acceleration strategy is introduced to improve runtime efficiency. Extensive experiments conducted in outdoor scenarios demonstrate that GeNeRT generalizes well across untrained regions within a scenario and entirely unseen environments, and achieves superior accuracy in MPC prediction compared to baselines. Moreover, it outperforms Wireless Insite in runtime efficiency, particularly in multi-transmitter settings. Ablation experiments validate the effectiveness of the network architecture and training strategy in capturing physical principles of ray-surface interactions.
中文: GeNeRT是一种可泛化的神经射线追踪框架,通过引入菲涅耳启发的网络设计和GPU加速策略,显著提升了跨场景的泛化能力、多路径预测精度及运行效率。
English: GeNeRT is a generalizable neural ray tracing framework that enhances spatial transferability and zero-shot generalization while improving accuracy in multipath prediction and runtime efficiency through GPU acceleration.
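The "Fresnel-inspired" design refers to the classical Fresnel coefficients governing ray-surface interactions; for reference, the TE-polarization reflection coefficient is straightforward to vectorize. This is textbook physics, not the paper's network:

```python
import numpy as np

def fresnel_te(n1: float, n2: float, theta_i: np.ndarray) -> np.ndarray:
    """Fresnel amplitude reflection coefficient for TE polarization.
    n1, n2: refractive indices; theta_i: incidence angles in radians.
    Beyond the critical angle the clip yields |r| = 1 (total reflection)."""
    sin_t = np.clip(n1 * np.sin(theta_i) / n2, -1.0, 1.0)
    theta_t = np.arcsin(sin_t)
    num = n1 * np.cos(theta_i) - n2 * np.cos(theta_t)
    den = n1 * np.cos(theta_i) + n2 * np.cos(theta_t)
    return num / den

angles = np.linspace(0.0, np.pi / 2 - 1e-3, 5)
print(fresnel_te(1.0, 1.5, angles))  # air-to-glass reflection amplitudes
```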
Authors:Xinyi Wu, Geng Hong, Pei Chen, Yueyue Chen, Xudong Pan, Min Yang
Abstract:
As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research overlooked the systematic understanding and assessment of their criminal capability in realistic interactions. We propose a unified framework PRISON, to quantify LLMs' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios adapted from classic films grounded in reality, we evaluate both criminal potential and anti-crime ability of LLMs. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average, revealing a striking mismatch between conducting and detecting criminal behavior. These findings underscore the urgent need for adversarial robustness, behavioral alignment, and safety mechanisms before broader LLM deployment.
中文摘要:PRISON框架研究表明,先进大语言模型在现实犯罪场景中频繁表现出欺骗、构陷等犯罪倾向,而其识别犯罪行为的能力却严重不足,凸显了部署前加强安全机制的紧迫性。
English Summary: The PRISON framework reveals that advanced large language models frequently develop emergent criminal capabilities like deception and manipulation, while simultaneously demonstrating poor detection of such behaviors, highlighting critical safety risks before wider deployment.
Authors:Liang Heng, Haoran Geng, Kaifeng Zhang, Pieter Abbeel, Jitendra Malik
Abstract:
Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.
中文: ViTacFormer提出了一种融合视觉与触觉的跨模态表征学习方法,使拟人化灵巧手在自主操作中成功率提升约50%,并首次实现了需高精度控制的11步长时序任务。
English: ViTacFormer introduces a cross-modal representation learning approach that fuses vision and tactile signals, enabling autonomous dexterous manipulation with a 50% higher success rate and achieving unprecedented long-horizon tasks using an anthropomorphic hand.
Authors:Yuhui Shi, Yehan Yang, Qiang Sheng, Hao Mi, Beizhe Hu, Chaoxi Xu, Juan Cao
Abstract:
With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%.
Chinese: 随着私有化调优大语言模型的兴起,现有检测方法效果下降,为此开发了PhantomHunter检测器,通过捕捉模型家族层面的共同特征来识别未知私有模型生成的文本,在测试中F1分数超过96%。
English: The rise of privately-tuned large language models has reduced the effectiveness of existing detection methods, leading to the development of PhantomHunter, a novel detector that identifies text from unseen private LLMs by capturing family-level traits, achieving over 96% F1 scores in tests.
Authors:Mattia Nardon, Stefano Messelodi, Antonio Granata, Fabio Poiesi, Alberto Danese, Davide Boscaini
Abstract:
Visual monitoring of industrial assembly tasks is critical for preventing equipment damage due to procedural errors and ensuring worker safety. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. Project page: https://tev-fbk.github.io/ViMAT
中文摘要:ViMAT是一种基于人工智能的系统,通过多视角视频流和任务知识实时监控工业装配任务,无需固定工作环境或视觉标记,已在复杂实际场景中验证其有效性。
English Summary: ViMAT is an AI-driven system that monitors industrial assembly tasks in real-time using multi-view video streams and task knowledge, eliminating the need for rigid setups or visual markers, and has been proven effective in complex scenarios.
Authors:Hanjun Kim, Minwoo Jung, Wooseong Yang, Ayoung Kim
Abstract:
Despite the growing adoption of radar in robotics, the majority of research has been confined to homogeneous sensor types, overlooking the integration and cross-modality challenges inherent in heterogeneous radar technologies. This leads to significant difficulties in generalizing across diverse radar data types, with modality-aware approaches that could leverage the complementary strengths of heterogeneous radar remaining unexplored. To bridge these gaps, we propose SHeRLoc, the first deep network tailored for heterogeneous radar, which utilizes RCS polar matching to align multimodal radar data. Our hierarchical optimal transport-based feature aggregation method generates rotationally robust multi-scale descriptors. By employing FFT-similarity-based data mining and adaptive margin-based triplet loss, SHeRLoc enables FOV-aware metric learning. SHeRLoc achieves an order of magnitude improvement in heterogeneous radar place recognition, increasing recall@1 from below 0.1 to 0.9 on a public dataset and outperforming state-of-the-art methods. Also applicable to LiDAR, SHeRLoc paves the way for cross-modal place recognition and heterogeneous sensor SLAM. The source code will be available upon acceptance.
中文: SHeRLoc是首个专为异构雷达设计的深度网络,通过RCS极坐标匹配和分层最优传输方法,在位置识别中实现了召回率@1从低于0.1提升至0.9的突破性进展。
English: SHeRLoc is a novel deep network designed for heterogeneous radar, employing RCS polar matching and hierarchical optimal transport to achieve robust place recognition with an order of magnitude improvement in recall@1.
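The FFT-similarity idea is a standard trick: circular cross-correlation of two polar descriptors via the FFT searches all rotations in O(n log n). A toy sketch with synthetic descriptors (SHeRLoc's actual descriptors are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=360)  # polar descriptor, one bin per degree
b = np.roll(a, -37)       # same place seen rotated by 37 degrees

# Circular cross-correlation via FFT: corr[k] = sum_m a[m + k] * b[m].
corr = np.fft.ifft(np.fft.fft(a) * np.conj(np.fft.fft(b))).real
print(int(np.argmax(corr)))  # -> 37: recovered rotation offset
```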
Authors:Jiahao Qiu, Xinzhe Juan, Yimin Wang, Ling Yang, Xuan Qi, Tongcheng Zhang, Jiacheng Guo, Yifu Lu, Zixin Yao, Hongru Wang, Shilong Liu, Xun Jiang, Liu Leqi, Mengdi Wang
Abstract:
While knowledge distillation has become a mature field for compressing large language models (LLMs) into smaller ones by aligning their outputs or internal representations, the distillation of LLM-based agents, which involve planning, memory, and tool use, remains relatively underexplored. Existing agent distillation methods typically replay full teacher trajectories or imitate step-by-step teacher tool usage, but they often struggle to train student agents to dynamically plan and act in novel environments. We propose AgentDistill, a novel, training-free agent distillation framework that enables efficient and scalable knowledge transfer via direct reuse of Model-Context-Protocols (MCPs), which are structured and reusable task-solving modules autonomously generated by teacher agents. The reuse of these distilled MCPs enables student agents to generalize their capabilities across domains and solve new problems with minimal supervision or human intervention. Experiments on biomedical and mathematical benchmarks demonstrate that our distilled student agents, built on small language models, can achieve performance comparable to advanced systems using large LLMs such as OctoTools (GPT-4o), highlighting the effectiveness of our framework in building scalable and cost-efficient intelligent agents.
中文: 本文提出AgentDistill这一无需训练的智能体蒸馏框架,通过直接复用教师智能体自主生成的结构化模型上下文协议(MCP),使基于小语言模型的学生智能体能在生物医学和数学基准测试中达到与GPT-4o等先进系统相媲美的性能,并实现跨领域任务泛化。
English: This paper introduces AgentDistill, a training-free framework for distilling LLM-based agents by reusing structured Model-Context-Protocols (MCPs) generated by teacher agents, enabling small student models to achieve performance comparable to advanced systems like GPT-4o while generalizing across domains with minimal supervision.
Authors:Xiyu Zhao, Qimei Cui, Wei Ni, Quan Z. Sheng, Abbas Jamalipour, Guoshun Nan, Xiaofeng Tao, Ping Zhang
Abstract:
The timely exchange of information among robots within a team is vital, but it can be constrained by limited wireless capacity. The inability to deliver information promptly can result in estimation errors that impact collaborative efforts among robots. In this paper, we propose a new metric termed Loss of Information Utility (LoIU) to quantify the freshness and utility of information critical for cooperation. The metric enables robots to prioritize information transmissions within bandwidth constraints. We also propose the estimation of LoIU using belief distributions and accordingly optimize both transmission schedule and resource allocation strategy for device-to-device transmissions to minimize the time-average LoIU within a robot team. A semi-decentralized Multi-Agent Deep Deterministic Policy Gradient framework is developed, where each robot functions as an actor responsible for scheduling transmissions among its collaborators while a central critic periodically evaluates and refines the actors in response to mobility and interference. Simulations validate the effectiveness of our approach, demonstrating an enhancement of information freshness and utility by 98%, compared to alternative methods.
中文摘要:本文提出了一种名为信息效用损失(LoIU)的新指标,用于量化机器人团队协作中信息的新鲜度和效用,并通过半分散式框架优化传输调度和资源分配,在带宽限制下显著提升了信息传递效率。
English Summary: This paper introduces a novel metric called Loss of Information Utility (LoIU) to evaluate information freshness and utility in robot teams, proposing a semi-decentralized framework that optimizes transmission scheduling and resource allocation to significantly enhance cooperative performance under bandwidth constraints.
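At its simplest, prioritizing transmissions by LoIU under a bandwidth budget is a top-k selection per slot; the sketch below is a greedy baseline with illustrative numbers (the paper's semi-decentralized MADDPG scheduler learns this policy instead):

```python
import numpy as np

# Estimated LoIU per device-to-device link: how much freshness/utility is
# lost if that link does not transmit this slot (illustrative values).
loiu = np.array([0.9, 0.1, 0.6, 0.3, 0.8])
k = 2                              # links the wireless capacity supports

scheduled = np.argsort(loiu)[-k:]  # greedily serve the highest-LoIU links
print(sorted(scheduled.tolist()))  # -> [0, 4]
```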
Authors:Yuanlong Wang, Pengqi Wang, Changchang Yin, Ping Zhang
Abstract:
Living environments play a vital role in the prevalence and progression of diseases, and understanding their impact on patients' health status becomes increasingly crucial for developing AI models. However, due to the lack of long-term and fine-grained spatial and temporal data in public and population health studies, most existing studies fail to incorporate environmental data, limiting the models' performance and real-world application. To address this shortage, we developed SatHealth, a novel dataset combining multimodal spatiotemporal data, including environmental data, satellite images, all-disease prevalences estimated from medical claims, and social determinants of health (SDoH) indicators. We conducted experiments under two use cases with SatHealth: regional public health modeling and personal disease risk prediction. Experimental results show that living environmental information can significantly improve AI models' performance and temporal-spatial generalizability on various tasks. Finally, we deploy a web-based application to provide an exploration tool for SatHealth and one-click access to both our data and regional environmental embedding to facilitate plug-and-play utilization. SatHealth is now published with data in Ohio, and we will keep updating SatHealth to cover the other parts of the US. With the web application and published code pipeline, our work provides valuable angles and resources to include environmental data in healthcare research and establishes a foundational framework for future research in environmental health informatics.
中文: SatHealth数据集通过整合多模态时空数据解决了健康研究中环境数据缺失的问题,显著提升了AI模型在公共卫生和疾病预测任务中的性能与泛化能力。
English: The SatHealth dataset addresses the lack of environmental data in health studies by combining multimodal spatiotemporal information, significantly enhancing AI model performance and generalizability in public health and disease prediction tasks.
Authors:Anvi Alex Eponon, Moein Shahiki-Tash, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov, Alexander Gelbukh
Abstract:
This study presents a question-based knowledge encoding approach that improves retrieval-augmented generation (RAG) systems without requiring fine-tuning or traditional chunking. We encode textual content using generated questions that span the lexical and semantic space, creating targeted retrieval cues combined with a custom syntactic reranking method.
In single-hop retrieval over 109 scientific papers, our approach achieves a Recall@3 of 0.84, outperforming traditional chunking methods by 60 percent. We also introduce "paper-cards", concise paper summaries under 300 characters, which enhance BM25 retrieval, increasing MRR@3 from 0.56 to 0.85 on simplified technical queries.
For multihop tasks, our reranking method reaches an F1 score of 0.52 with LLaMA2-Chat-7B on the LongBench 2WikiMultihopQA dataset, surpassing chunking and fine-tuned baselines which score 0.328 and 0.412 respectively.
This method eliminates fine-tuning requirements, reduces retrieval latency, enables intuitive question-driven knowledge access, and decreases vector storage demands by 80%, positioning it as a scalable and efficient RAG alternative.
中文摘要:本研究提出了一种基于问题的知识编码方法,无需微调即可提升检索增强生成系统的性能,显著提高召回率并降低存储需求与延迟。
English Summary: This study introduces a question-based encoding method that enhances retrieval-augmented generation systems by improving recall and efficiency without fine-tuning, while reducing storage needs and latency.
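To make the paper-card + BM25 step concrete, here is a minimal sketch using the rank_bm25 package; the cards below are invented stand-ins for the sub-300-character summaries the paper describes:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Invented "paper-cards": concise summaries, each under 300 characters.
cards = [
    "Survey of autoencoder latent geometry for compositional semantics.",
    "Control variates cut the real-world samples needed to validate robots.",
    "Question-based encoding replaces chunking in retrieval-augmented generation.",
]
bm25 = BM25Okapi([c.lower().split() for c in cards])

query = "retrieval augmented generation without chunking".lower().split()
print(bm25.get_top_n(query, cards, n=1)[0])  # returns the third card
```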
Authors:David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Matthias Keicher, Nassir Navab
Abstract:
Clinical decision-making is a dynamic, interactive, and cyclic process where doctors have to repeatedly decide on which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process, however, most applications of LLMs in clinical decision support suffer from one of two limitations: Either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited "out-of-the-box" capabilities of large pre-trained models without performing task-specific training. In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests. Using a hybrid training paradigm combining supervised and reinforcement learning, we train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making. We evaluate our methodology on MIMIC-CDM, a real-world dataset covering four abdominal diseases containing various clinical tests and show the benefit of explicitly training clinical decision-making for increasing diagnostic performance and efficiency.
Chinese: 该研究提出LA-CDM,一种基于假设的语言代理,通过循环请求和解读临床检验来模拟医疗决策过程,采用混合训练方法提高真实世界数据中的诊断准确性与效率。
English: The study introduces LA-CDM, a hypothesis-driven language agent that models clinical decision-making as an iterative process of requesting and interpreting tests, trained with hybrid methods to improve diagnostic accuracy and efficiency on real-world data.
Authors:Jules Jacobs, Nate Foster, Tobias Kappé, Dexter Kozen, Lily Saada, Alexandra Silva, Jana Wagemaker
Abstract:
We develop StacKAT, a network verification language featuring loops, finite state variables, nondeterminism, and - most importantly - access to a stack with accompanying push and pop operations. By viewing the variables and stack as the (parsed) headers and (to-be-parsed) contents of a network packet, StacKAT can express a wide range of network behaviors including parsing, source routing, and telemetry. These behaviors are difficult or impossible to model using existing languages like NetKAT. We develop a decision procedure for StacKAT program equivalence, based on finite automata. This decision procedure provides the theoretical basis for verifying network-wide properties and is able to provide counterexamples for inequivalent programs. Finally, we provide an axiomatization of StacKAT equivalence and establish its completeness.
中文: StacKAT是一种新型网络验证语言,集成了循环、状态变量、非确定性和堆栈操作,能够全面建模复杂网络行为,并具备程序等价性判定机制及反例生成功能。
English: StacKAT is a novel network verification language incorporating loops, state variables, nondeterminism, and stack operations, enabling comprehensive modeling of complex network behaviors and featuring a decision procedure for program equivalence with counterexample generation.
Authors:Lukasz Mazur, Nenad Petrovic, James Pontes Miranda, Ansgar Radermacher, Robert Rasche, Alois Knoll
Abstract:
Large language models (LLMs) offer new opportunities for interacting with complex software artifacts, such as software models, through natural language. They present especially promising benefits for large software models that are difficult to grasp in their entirety, making traditional interaction and analysis approaches challenging. This paper investigates two approaches for leveraging LLMs to answer questions over software models: direct prompting, where the whole software model is provided in the context, and an agentic approach combining LLM-based agents with general-purpose file access tools. We evaluate these approaches using an Ecore metamodel designed for timing analysis and software optimization in automotive and embedded domains. Our findings show that while the agentic approach achieves accuracy comparable to direct prompting, it is significantly more efficient in terms of token usage. This efficiency makes the agentic approach particularly suitable for the automotive industry, where the large size of software models makes direct prompting infeasible, establishing LLM agents as not just a practical alternative but the only viable solution. Notably, the evaluation was conducted using small LLMs, which are more feasible to execute locally - an essential advantage for meeting strict requirements around privacy, intellectual property protection, and regulatory compliance. Future work will investigate software models in diverse formats, explore more complex agent architectures, and extend agentic workflows to support not only querying but also modification of software models.
中文: 本研究证明,在查询大型软件模型时,采用基于大语言模型的智能体配合文件访问工具,比直接提示法显著节省计算资源,对于汽车等受模型规模和隐私要求限制的行业而言,这是唯一可行的解决方案。
English: This study demonstrates that using LLM-based agents with file access tools is significantly more token-efficient than direct prompting for querying large software models, making it the only viable solution for industries like automotive where model size and privacy concerns are critical.
Authors:Kamilia Zaripova, Ege Özsoy, Nassir Navab, Azade Farshad
Abstract:
Identifying causative genes from patient phenotypes remains a significant challenge in precision medicine, with important implications for the diagnosis and treatment of genetic disorders. We propose a novel graph-based approach for predicting causative genes from patient phenotypes, with or without an available list of candidate genes, by integrating a rare disease knowledge graph (KG). Our model, combining graph neural networks and transformers, achieves substantial improvements over the current state-of-the-art. On the real-world MyGene2 dataset, it attains a mean reciprocal rank (MRR) of 24.64% and nDCG@100 of 33.64%, surpassing the best baseline (SHEPHERD) at 19.02% MRR and 30.54% nDCG@100. We perform extensive ablation studies to validate the contribution of each model component. Notably, the approach generalizes to cases where only phenotypic data are available, addressing key challenges in clinical decision support when genomic information is incomplete.
Chinese: 一种新型的基于图的方法,通过整合罕见病知识图谱与神经网络,在仅凭患者表型预测致病基因方面显著优于现有方法,即使基因组信息不完整也能有效应对。
English: A novel graph-based method integrating a rare disease knowledge graph with neural networks significantly outperforms existing approaches in predicting causative genes from patient phenotypes, even without complete genomic data.
Authors:Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An
Abstract:
Recent advances in agent systems have demonstrated remarkable capabilities in solving both general-purpose and highly complex tasks. However, most current models lack mechanisms for coordinating specialized agents and have limited ability to generalize to new or diverse domains. To this end, we introduce AgentOrchestra, a hierarchical multi-agent framework for general-purpose task solving that integrates high-level planning with modular agent collaboration. Drawing inspiration from a conductor orchestrating a symphony, and grounded in the principles of extensibility, multimodality, modularity, and coordination, it features a central planning agent that decomposes complex objectives and delegates sub-tasks to a team of specialized agents. Each sub-agent is equipped with general programming tools, as well as abilities to tackle a wide range of real-world specific tasks, including data analysis, file operations, web navigation, and interactive reasoning in dynamic multimodal environments. Notably, AgentOrchestra introduces an MCP Manager Agent that enables intelligent evolution through dynamic tool creation, retrieval, and reuse mechanisms, significantly enhancing the system's adaptability and scalability. AgentOrchestra supports flexible orchestration through explicit sub-goal formulation, inter-agent communication, and adaptive role allocation. We evaluate the framework on three widely used benchmarks for assessing LLM-based agent systems. Experimental results show that AgentOrchestra consistently outperforms flat-agent and monolithic baselines in terms of task success rate and adaptability. On the GAIA benchmark testing dataset, AgentOrchestra achieves an average score of 83.39%, ranking among the top general-purpose agents. These results highlight the effectiveness of hierarchical organization and role specialization in building scalable and general-purpose LLM-based agent systems.
中文摘要:AgentOrchestra是一个受交响乐指挥启发的分层多代理框架,由中央规划代理分解复杂目标并协调专业子代理,其MCP管理代理支持工具的动态创建、检索与复用,在GAIA基准上取得83.39%的平均分。
English Summary: AgentOrchestra is a hierarchical multi-agent framework in which a central planning agent decomposes complex objectives and delegates sub-tasks to specialized agents, with an MCP Manager Agent enabling dynamic tool creation, retrieval, and reuse; it achieves an average score of 83.39% on the GAIA benchmark.
Authors:Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An
Abstract:
Recent advances in LLM-based agent systems have demonstrated remarkable capabilities in solving complex tasks. Nevertheless, current protocols (e.g., A2A and MCP) suffer from insufficient capabilities in context management, limited adaptability to diverse environments, and the absence of dynamic agent architectures. To address these limitations, we propose the Tool-Environment-Agent (TEA) Protocol, which establishes a principled basis for integrating environments, agents, and tools into a unified system. The TEA protocol treats environments and agents as first-class resources, enabling comprehensive context management and adaptive environment integration. Based on this protocol, we introduce AgentOrchestra, a hierarchical multi-agent framework with a central planning agent that decomposes complex objectives and coordinates specialized agents. Each sub-agent is dedicated to specific functions, providing capabilities for data analysis, file operations, web navigation, and interactive reasoning. Notably, AgentOrchestra introduces a tool manager agent that supports intelligent evolution through dynamic tool creation, retrieval, and reuse mechanisms. Experiments on three widely used benchmarks show that AgentOrchestra consistently outperforms existing baselines, achieving state-of-the-art performance of 83.39% on GAIA and ranking among the top general-purpose LLM-based agents. These results highlight the effectiveness of the TEA Protocol and hierarchical organization in building general-purpose multi-agent systems.
中文摘要:提出的工具-环境-代理(TEA)协议通过将环境、代理和工具整合为统一框架,解决了现有代理系统的不足,而基于此的分层多代理框架AgentOrchestra在多个基准测试中实现了最优性能。
English Summary: The proposed Tool-Environment-Agent (TEA) Protocol addresses limitations in current agent systems by integrating environments, agents, and tools into a unified framework, while the AgentOrchestra hierarchical framework demonstrates state-of-the-art performance across multiple benchmarks.
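The planner-and-specialists pattern both abstracts describe can be pictured in a few lines of Python; everything below (agent names, the decomposition) is an illustrative stand-in, not the AgentOrchestra API:

```python
from typing import Callable

# Specialized sub-agents as stubs keyed by capability.
SUB_AGENTS: dict[str, Callable[[str], str]] = {
    "web_navigation": lambda t: f"[fetched sources: {t}]",
    "data_analysis":  lambda t: f"[analysis: {t}]",
    "file_ops":       lambda t: f"[report saved: {t}]",
}

def plan(goal: str) -> list[tuple[str, str]]:
    """Toy central planner: decompose a goal into (agent, sub-task) pairs."""
    return [
        ("web_navigation", f"collect material for '{goal}'"),
        ("data_analysis",  f"summarize findings for '{goal}'"),
        ("file_ops",       f"write up results for '{goal}'"),
    ]

for agent, task in plan("benchmark comparison"):
    print(SUB_AGENTS[agent](task))
```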
Authors:Bin Xie, Bingbing Xu, Yige Yuan, Shengmao Zhu, Huawei Shen
Abstract:
Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches using reward-guided search (RGS) primarily rely on outcome reward models (ORMs), which suffer from a critical granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, while RGS methods rely on process rewards to guide the policy, leading to inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. Based on these, we propose SP-PRM, a novel dual-consistency framework integrating score consistency-based and preference consistency-based partial evaluation modules without relying on human annotation. Extensive experiments on dialogue, summarization, and reasoning tasks demonstrate that SP-PRM substantially enhances existing RGS methods, achieving a 3.6%-10.3% improvement in GPT-4 evaluation scores across all tasks.
Chinese: 本文提出SP-PRM双一致性框架,将过程奖励模型引入奖励引导搜索以解决大语言模型对齐中的粒度不匹配问题,在多项任务中实现GPT-4评估分数3.6%-10.3%的提升。
English: This paper introduces SP-PRM, a dual-consistency framework that integrates process reward models into reward-guided search to resolve the granularity mismatch in LLM alignment, achieving 3.6%-10.3% improvement in GPT-4 evaluation scores across multiple tasks.
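Reward-guided search with a process reward model reduces to scoring partial responses at each expansion step; the sketch below keeps the search skeleton and stubs out the scorer (SP-PRM's trained model would replace `prm_score`):

```python
def prm_score(prefix: str) -> float:
    """Stub process reward: a real PRM scores partial-response quality."""
    return float(prefix.count("step"))  # toy preference for structured text

def rgs_step(beams: list[str], expand, k: int) -> list[str]:
    """One step of reward-guided search: expand, score partials, keep top-k."""
    candidates = [b + e for b in beams for e in expand(b)]
    return sorted(candidates, key=prm_score, reverse=True)[:k]

expand = lambda b: [" step one.", " the answer is 4."]
print(rgs_step(["Let's reason:"], expand, k=1))  # keeps the 'step' branch
```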
Authors:Yinghao Ma, Siyou Li, Juntao Yu, Emmanouil Benetos, Akira Maezawa
Abstract:
Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multi-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional MIR annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking, reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-audio, SALMONN, MusiLingo, etc. Experimental results reveal significant performance gaps between LLMs and supervised models, along with their cultural, chronological, and gender biases, highlighting the potential and limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.
中文:CMI-Bench作为一个综合性音乐指令遵循基准,将多种音乐信息检索任务重新诠释为标准格式,揭示了音频-文本大语言模型的性能差距与偏见,同时建立了统一的评估基础。
English: CMI-Bench is introduced as a comprehensive music instruction-following benchmark that reinterprets diverse MIR tasks into standardized formats, revealing performance gaps and biases in audio-text LLMs while establishing a unified evaluation foundation.
Authors:Mohammed Elhenawy, Shadi Jaradat, Taqwa I. Alhadidi, Huthaifa I. Ashqar, Ahmed Jaber, Andry Rakotonirainy, Mohammad Abu Tami
Abstract:
Scene understanding is critical for various downstream tasks in autonomous driving, including facilitating driver-agent communication and enhancing human-centered explainability of autonomous vehicle (AV) decisions. This paper evaluates the capability of four multimodal large language models (MLLMs), including relatively small models, to understand scenes in a zero-shot, in-context learning setting. Additionally, we explore whether combining these models using an ensemble approach with majority voting can enhance scene understanding performance. Our experiments demonstrate that GPT-4o, the largest model, outperforms the others in scene understanding. However, the performance gap between GPT-4o and the smaller models is relatively modest, suggesting that advanced techniques such as improved in-context learning, retrieval-augmented generation (RAG), or fine-tuning could further optimize the smaller models' performance. We also observe mixed results with the ensemble approach: while some scene attributes show improvement in performance metrics such as F1-score, others experience a decline. These findings highlight the need for more sophisticated ensemble techniques to achieve consistent gains across all scene attributes. This study underscores the potential of leveraging MLLMs for scene understanding and provides insights into optimizing their performance for autonomous driving applications.
中文摘要:本研究评估多模态大语言模型在自动驾驶场景理解中的表现,发现GPT-4o性能最优,同时指出小模型通过优化技术具有潜力,且集成方法效果不一。
English Summary: This study evaluates multimodal large language models for autonomous driving scene understanding, finding GPT-4o superior while noting smaller models' potential through optimization techniques and mixed ensemble method results.
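The ensemble step the abstract evaluates is plain per-attribute majority voting across model outputs; a minimal sketch, with model names and labels that are illustrative only:

```python
from collections import Counter

# Per-model predictions for one scene attribute (illustrative).
predictions = {
    "gpt-4o":  "clear_daytime_highway",
    "model_b": "clear_daytime_highway",
    "model_c": "overcast_urban",
    "model_d": "clear_daytime_highway",
}

label, votes = Counter(predictions.values()).most_common(1)[0]
print(label, votes)  # -> clear_daytime_highway 3
```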
Authors:Daniya Najiha Abdul Kareem, Abdul Hannan, Mubashir Noman, Jean Lahoud, Mustansar Fiaz, Hisham Cholakkal
Abstract:
Accurate microscopic medical image segmentation plays a crucial role in diagnosing various cancerous cells and identifying tumors. Driven by advancements in deep learning, convolutional neural networks (CNNs) and transformer-based models have been extensively studied to enhance receptive fields and improve medical image segmentation tasks. However, they often struggle to capture complex cellular and tissue structures in challenging scenarios such as background clutter and object overlap. Moreover, their reliance on the availability of large datasets for improved performance, along with the high computational cost, limits their practicality. To address these issues, we propose an efficient framework for the segmentation task, named InceptionMamba, which encodes multi-stage rich features and offers both performance and computational efficiency. Specifically, we exploit semantic cues to capture both low-frequency and high-frequency regions to enrich the multi-stage features to handle the blurred region boundaries (e.g., cell boundaries). These enriched features are input to a hybrid model that combines an Inception depth-wise convolution with a Mamba block, to maintain high efficiency and capture inherent variations in the scales and shapes of the regions of interest. These enriched features along with low-resolution features are fused to get the final segmentation mask. Our model achieves state-of-the-art performance on two challenging microscopic segmentation datasets (SegPC21 and GlaS) and two skin lesion segmentation datasets (ISIC2017 and ISIC2018), while reducing computational cost by about 5 times compared to the previous best performing method.
中文摘要:提出的InceptionMamba框架通过多阶段特征编码与混合Inception-Mamba架构,有效解决了显微医学图像分割中的复杂结构识别难题,在实现最优性能的同时将计算成本降低了五倍。
English Summary: The proposed InceptionMamba framework effectively addresses challenges in microscopic medical image segmentation by combining multi-stage feature encoding with a hybrid Inception-Mamba architecture, achieving state-of-the-art performance while reducing computational costs by fivefold.
Authors:Evan Becker, Benjamin Bowman, Matthew Trager, Tian Yu Liu, Luca Zancato, Wei Xia, Stefano Soatto
Abstract:
Given a query and dataset, the optimal way of answering the query is to make use of all the information available. Modern LLMs exhibit an impressive ability to memorize training data, but data not deemed important during training is forgotten, and information outside that training set cannot be made use of. Processing an entire dataset at inference time is infeasible due to the bounded nature of model resources (e.g. context size in transformers or states in state space models), meaning we must resort to external memory. This constraint naturally leads to the following problem: How can we decide, based on the present query and model, what among a virtually unbounded set of known data matters for inference? To minimize model uncertainty for a particular query at test-time, we introduce Retrieval In-Context Optimization (RICO), a retrieval method that uses gradients from the LLM itself to learn the optimal mixture of documents for answer generation. Unlike traditional retrieval-augmented generation (RAG), which relies on external heuristics for document retrieval, our approach leverages direct feedback from the model. Theoretically, we show that standard top-$k$ retrieval with model gradients can approximate our optimization procedure, and provide connections to the leave-one-out loss. We demonstrate empirically that by minimizing an unsupervised loss objective in the form of question perplexity, we can achieve comparable retriever metric performance to BM25 with no fine-tuning. Furthermore, when evaluated on the quality of the final prediction, our method often outperforms fine-tuned dense retrievers such as E5.
Chinese: 为解决从海量数据中选择相关信息进行推理的难题,我们提出了检索上下文优化(RICO)方法,利用大语言模型的梯度优化文档检索,无需微调即可超越传统方法的表现。
English: To address the challenge of selecting relevant data for inference from vast datasets, we introduce Retrieval In-Context Optimization (RICO), a method that uses LLM gradients to optimize document retrieval, outperforming traditional approaches without requiring fine-tuning.
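Stripped of the gradient machinery, the selection criterion RICO optimizes can be previewed as "prefer documents under which the question is least surprising"; below, an invented lexical stub stands in for the LLM's negative log-likelihood:

```python
def question_nll(question: str, doc: str) -> float:
    """Stub for an LLM's NLL of the question conditioned on a document;
    here, more lexical overlap means lower (better) NLL."""
    overlap = len(set(question.lower().split()) & set(doc.lower().split()))
    return 10.0 / (1.0 + overlap)

def select_top_k(question: str, docs: list[str], k: int) -> list[str]:
    """Top-k retrieval by (stubbed) question perplexity."""
    return sorted(docs, key=lambda d: question_nll(question, d))[:k]

docs = ["cats chase mice", "transformers use attention", "rivers erode rock"]
print(select_top_k("how does attention work in transformers", docs, k=1))
```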
Authors:Hao Li, Xiaogeng Liu, Hung-Chun Chiu, Dianqi Li, Ning Zhang, Chaowei Xiao
Abstract:
Large Language Models (LLMs) are increasingly central to agentic systems due to their strong reasoning and planning capabilities. By interacting with external environments through predefined tools, these agents can carry out complex user tasks. Nonetheless, this interaction also introduces the risk of prompt injection attacks, where malicious inputs from external sources can mislead the agent's behavior, potentially resulting in economic loss, privacy leakage, or system compromise. System-level defenses have recently shown promise by enforcing static or predefined policies, but they still face two key challenges: the ability to dynamically update security rules and the need for memory stream isolation. To address these challenges, we propose DRIFT, a Dynamic Rule-based Isolation Framework for Trustworthy agentic systems, which enforces both control- and data-level constraints. A Secure Planner first constructs a minimal function trajectory and a JSON-schema-style parameter checklist for each function node based on the user query. A Dynamic Validator then monitors deviations from the original plan, assessing whether changes comply with privilege limitations and the user's intent. Finally, an Injection Isolator detects and masks any instructions that may conflict with the user query from the memory stream to mitigate long-term risks. We empirically validate the effectiveness of DRIFT on the AgentDojo benchmark, demonstrating its strong security performance while maintaining high utility across diverse models -- showcasing both its robustness and adaptability.
中文摘要:DRIFT是一个动态规则隔离框架,通过安全规划器构建最小函数轨迹、动态验证器监控执行偏差、注入隔离器屏蔽冲突指令,有效防御提示注入攻击,在保障系统安全性的同时维持高实用性。
English Summary: DRIFT is a dynamic rule-based security framework that protects LLM-based agentic systems from prompt injection attacks by enforcing control- and data-level constraints through secure planning, dynamic validation, and memory isolation.
Authors:Hao Shen, Ming Hu, Xiaofei Xie, Jiaye Li, Mingsong Chen
Abstract:
Although modern vulnerability detection tools enable developers to efficiently identify numerous security flaws, indiscriminate remediation efforts often lead to superfluous development expenses. This is particularly true given that a substantial portion of detected vulnerabilities either possess low exploitability or would incur negligible impact in practical operational environments. Consequently, vulnerability severity assessment has emerged as a critical component in optimizing software development efficiency. Existing vulnerability assessment methods typically rely on manually crafted descriptions associated with source code artifacts. However, due to variability in description quality and subjectivity in intention interpretation, the performance of these methods is seriously limited. To address this issue, this paper introduces VulStamp, a novel intention-guided framework, to facilitate description-free vulnerability assessment. Specifically, VulStamp adopts static analysis together with a Large Language Model (LLM) to extract the intention information of vulnerable code. Based on this intention information, VulStamp uses a prompt-tuned model for vulnerability assessment. Furthermore, to mitigate the problem of imbalanced data associated with vulnerability types, VulStamp integrates a Reinforcement Learning (RL)-based prompt-tuning method to train the assessment model.
中文摘要:现代漏洞检测工具常因标记低风险漏洞导致不必要的修复成本,为此本文提出VulStamp框架,通过静态分析和大型语言模型提取漏洞代码意图,实现无需人工描述的自动化漏洞评估。
English Summary: Modern vulnerability detection tools often lead to unnecessary costs by flagging low-risk flaws, prompting the development of VulStamp, an intention-guided framework that uses static analysis and LLMs for efficient vulnerability assessment without relying on manual descriptions.
Authors:Shunpeng Yang, Zhen Fu, Zhefeng Cao, Guo Junde, Patrick Wensing, Wei Zhang, Hua Chen
Abstract:
Generalizing locomotion policies across diverse legged robots with varying morphologies is a key challenge due to differences in observation/action dimensions and system dynamics. In this work, we propose Multi-Loco, a novel unified framework combining a morphology-agnostic generative diffusion model with a lightweight residual policy optimized via reinforcement learning (RL). The diffusion model captures morphology-invariant locomotion patterns from diverse cross-embodiment datasets, improving generalization and robustness. The residual policy is shared across all embodiments and refines the actions generated by the diffusion model, enhancing task-aware performance and robustness for real-world deployment. We evaluated our method with a rich library of four legged robots in both simulation and real-world experiments. Compared to a standard RL framework with PPO, our approach -- replacing the Gaussian policy with a diffusion model and residual term -- achieves a 10.35% average return improvement, with gains up to 13.57% in wheeled-biped locomotion tasks. These results highlight the benefits of cross-embodiment data and composite generative architectures in learning robust, generalized locomotion skills.
中文摘要:Multi-Loco框架通过结合形态无关的扩散模型与轻量级残差策略,实现了不同形态腿式机器人间的运动策略泛化,在仿真和实物实验中均取得了显著性能提升。
English Summary: The proposed Multi-Loco framework combines a morphology-agnostic diffusion model with a residual policy to generalize locomotion skills across diverse legged robots, achieving significant performance improvements in both simulation and real-world experiments.
Authors:Rui Wang, Renyu Zhu, Minmin Lin, Runze Wu, Tangjie Lv, Changjie Fan, Haobo Wang
Abstract:
Confidence estimation is crucial for reflecting the reliability of large language models (LLMs), particularly in the widely used closed-source models. Utilizing data augmentation for confidence estimation is viable, but prior discussions focus on specific augmentation techniques, limiting its potential. We study the impact of different data augmentation methods on confidence estimation. Our findings indicate that data augmentation strategies can achieve better performance and mitigate the impact of overconfidence. We investigate the factors behind this and discover that, while preserving semantic information, greater data diversity enhances the effectiveness of augmentation. Furthermore, the impact of different augmentation strategies varies across different application ranges. Considering parameter transferability and usability, the random combination of augmentations is a promising choice.
中文摘要:数据增强通过提升性能与缓解过度自信来改进大型语言模型的置信度估计,其中数据多样性增强与随机组合策略在不同应用场景中展现出最佳效果。
English Summary: Data augmentation improves confidence estimation in large language models by enhancing performance and reducing overconfidence, with greater diversity and random combinations proving most effective across various applications.
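One way to read the recipe above is as answer agreement across randomly combined augmentations of the input. The sketch below is a hedged illustration: the augmentation list, the answer_fn callable, and the agreement-as-confidence rule are assumptions for demonstration, not the paper's exact protocol.

```python
# A sketch of augmentation-based confidence estimation: measure agreement of
# the model's answers across randomly combined views of the query.
import random

AUGMENTATIONS = [
    lambda s: s.lower(),
    lambda s: s.replace("?", " ?"),
    lambda s: "Please answer: " + s,
]

def augmented_confidence(query, answer_fn, n_views=8, seed=0):
    """answer_fn: callable mapping a prompt string to the model's answer."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n_views):
        view = query
        for aug in rng.sample(AUGMENTATIONS, k=rng.randint(1, len(AUGMENTATIONS))):
            view = aug(view)  # random combination of augmentations
        answers.append(answer_fn(view))
    majority = max(set(answers), key=answers.count)
    return majority, answers.count(majority) / n_views  # agreement as confidence

ans, conf = augmented_confidence("Is the sky blue?", lambda prompt: "yes")
print(ans, conf)  # ('yes', 1.0) for a perfectly consistent model
```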
Authors:Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi, Imran Razzak
Abstract:
Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component that departs from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick's efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.
Chinese: Flick提出了一种新颖的伪标签优化方法,通过从多聚类中提炼高置信度标签来改进少标签文本分类,在涵盖14种低资源语言的多样化数据集中展现出卓越性能。
English: Flick introduces a novel pseudo-label refinement method that distills high-confidence labels from multiple clusters to enhance few-label text classification, demonstrating superior performance across 14 diverse low-resource language datasets.
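The cluster-distillation step described above can be pictured with a toy example: rank clusters by cohesion and keep pseudo-labels only from an adaptively chosen top fraction. Everything here (the cohesion score, the keep fraction, the synthetic data) is illustrative rather than Flick's actual components.

```python
# A toy sketch of distilling reliable pseudo-labels: rank clusters by cohesion
# and keep only the adaptively chosen top-k. All data is synthetic.
import numpy as np

def cohesion(cluster_points):
    """Tighter clusters score higher (negative mean distance to centroid)."""
    centroid = cluster_points.mean(axis=0)
    return -np.linalg.norm(cluster_points - centroid, axis=1).mean()

def distill(clusters, keep_fraction=0.5):
    scores = [cohesion(c) for c in clusters]
    k = max(1, int(len(clusters) * keep_fraction))  # adaptive top-k
    return np.argsort(scores)[-k:]                  # indices of kept clusters

rng = np.random.default_rng(0)
clusters = [rng.normal(i, scale, size=(20, 8))
            for i, scale in enumerate([0.1, 2.0, 0.2])]
print(distill(clusters))  # the most cohesive clusters survive
```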
Authors:Wei Li, Mengcheng Lan, Jiaxing Xu, Yiping Ke
Abstract:
Graphs are essential for modeling complex interactions across domains such as social networks, biology, and recommendation systems. Traditional Graph Neural Networks, particularly Message Passing Neural Networks (MPNNs), rely heavily on supervised learning, limiting their generalization and applicability in label-scarce scenarios. Recent self-supervised approaches still require labeled fine-tuning, limiting their effectiveness in zero-shot scenarios. Meanwhile, Large Language Models (LLMs) excel in natural language tasks but face significant challenges when applied to graphs, including preserving reasoning abilities, managing extensive token lengths from rich node attributes, and being limited to textual-attributed graphs (TAGs) and a single level task. To overcome these limitations, we propose the Node-Oriented Conceptualization LLM (NOCL), a novel framework that leverages two core techniques: 1) node description, which converts heterogeneous node attributes into structured natural language, extending LLM from TAGs to non-TAGs; 2) node concept, which encodes node descriptions into compact semantic embeddings using pretrained language models, significantly reducing token lengths by up to 93.9% compared to directly using node descriptions. Additionally, our NOCL employs graph representation descriptors to unify graph tasks at various levels into a shared, language-based query format, paving a new direction for Graph Foundation Models. Experimental results validate NOCL's competitive supervised performance relative to traditional MPNNs and hybrid LLM-MPNN methods and demonstrate superior generalization in zero-shot settings.
中文: 提出的节点导向概念化大模型(NOCL)通过将节点属性转化为结构化语言和紧凑语义嵌入,克服了传统图神经网络与大型语言模型的局限,在监督学习和零样本图任务中均展现出卓越性能。
English: The proposed Node-Oriented Conceptualization LLM (NOCL) framework overcomes limitations of traditional graph neural networks and large language models by converting node attributes into structured language and compact semantic embeddings, enabling competitive performance in supervised and zero-shot graph tasks.
Authors:Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao
Abstract:
Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. Then, we propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of the LLM to make a timely judgment of harmfulness. Experiments show that SCM gains a 0.95+ macro F1 score, comparable to full detection, by only seeing the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and lead to a higher harmlessness score than DPO.
中文: 本文提出了一种流式内容监控器(SCM),通过使用FineHarm数据集的细粒度标注进行训练,能在仅评估LLM输出前18%内容的情况下实现高效的部分有害内容检测,其准确率与完整检测相当。
English: This paper introduces a streaming content monitor (SCM) trained with fine-grained token-level annotations from the FineHarm dataset, enabling efficient partial detection of harmful content in LLM outputs by assessing only the initial 18% of tokens while maintaining high accuracy comparable to full detection.
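The partial-detection control flow is simple to sketch: score each incoming token prefix and stop generation once a harm threshold is crossed. The per-token scorer below is a placeholder lambda standing in for a trained streaming monitor; the threshold is illustrative.

```python
# A sketch of streaming (partial) moderation: score each prefix as tokens
# arrive and early-stop once cumulative harm evidence crosses a threshold.
def stream_moderate(token_stream, token_harm_score, threshold=0.9):
    """token_harm_score: callable(prefix_tokens) -> harm probability in [0, 1]."""
    prefix = []
    for token in token_stream:
        prefix.append(token)
        if token_harm_score(prefix) > threshold:
            return prefix, True   # early stop: judged harmful midway
    return prefix, False          # full output judged safe

tokens = ["step", "one", ":", "mix", "the", "chemicals"]
out, blocked = stream_moderate(tokens,
                               lambda p: 0.95 if "chemicals" in p else 0.1)
print(out, blocked)  # stops at the offending token
```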
Authors:Shuangyang Li, Fan Liu, Yifeng Xiong, Weijie Yuan, Baoming Bai, Christos Masouros, Giuseppe Caire
Abstract:
In this paper, we provide an analytical study of single-carrier faster-than-Nyquist (FTN) signaling for integrated sensing and communications (ISAC). Our derivations show that FTN is advantageous for ISAC, and reveal new insights that these advantages come from the fact that FTN signaling can effectively avoid the spectral aliasing due to the mismatch between the symbol rate and the bandwidth of the shaping pulse. Specifically, the communication spectral efficiency advantages of FTN signaling over time-invariant multipath channels are analytically shown, where both upper- and lower-bounds on the spectral efficiency are derived. We show that the gap between these two bounds corresponds to the potential signal-to-noise ratio (SNR) variation due to the presence of multipath delay and spectral aliasing, which diminishes as the symbol rate grows higher. Particularly, in the limiting case, this SNR variation disappears while the degree of freedom (DoF) of the system attains its maximum. Furthermore, the sensing advantages for FTN signals are verified in terms of the expected normalized squared ambiguity function. We show that FTN signals generally enjoy a more robust ranging performance. More importantly, we prove that FTN signaling can effectively avoid the undesired peaks in the considered ambiguity function along the Doppler dimension, thereby reducing the ambiguities in velocity estimation. All these conclusions are explicitly verified by numerical results.
中文: 本研究证明,超奈奎斯特信号通过优化符号速率有效避免频谱混叠,从而提升综合感知与通信系统的频谱效率和感知鲁棒性,并降低速度估计的模糊性。
English: This study demonstrates that faster-than-Nyquist signaling benefits integrated sensing and communications by avoiding spectral aliasing through optimized symbol rates, enhancing both spectral efficiency and sensing robustness with reduced ambiguities in velocity estimation.
Authors:Neta Glazer, Aviv Navon, Yael Segal, Aviv Shamsian, Hilit Segev, Asaf Buchnick, Menachem Pirchi, Gil Hetz, Joseph Keshet
Abstract:
Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data in which speech and background audio are aligned in natural context. To overcome this lack of paired training data, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environmentally aware audio.
中文:UmbraTTS是一种基于流匹配的文本转语音模型,能够联合生成语音和环境音频,通过自监督框架解决配对训练数据不足的问题,并在生成自然、情境感知的音频场景方面显著优于现有基线。
English: UmbraTTS is a flow-matching TTS model that jointly generates speech and environmental audio with fine-grained control, using a self-supervised framework to overcome the lack of paired training data and outperforming existing baselines in producing natural, context-aware audio scenes.
Authors:Gabriele Codega, Anna Ivagnes, Nicola Demo, Gianluigi Rozza
Abstract:
In the present work, we introduce a data-driven approach to enhance the accuracy of non-intrusive Reduced Order Models (ROMs). In particular, we focus on ROMs built using Proper Orthogonal Decomposition (POD) in an under-resolved and marginally-resolved regime, i.e. when the number of modes employed is not enough to capture the system dynamics. We propose a method to re-introduce the contribution of neglected modes through a quadratic correction term, given by the action of a quadratic operator on the POD coefficients. Differently from the state-of-the-art methodologies, where the operator is learned via least-squares optimisation, we propose to parametrise the operator by a Multi-Input Operators Network (MIONet). This way, we are able to build models with higher generalisation capabilities, where the operator itself is continuous in space -- and thus agnostic of the domain discretisation -- and parameter-dependent. We test our model on two standard benchmarks in fluid dynamics and show that the correction term improves the accuracy of standard POD-based ROMs.
中文: 本研究提出一种数据驱动方法,通过多输入算子网络引入二次校正项来补偿被忽略模态的影响,从而在流体动力学基准测试中提高了基于本征正交分解的降阶模型精度。
English: This study presents a data-driven method that uses a Multi-Input Operators Network to incorporate neglected mode contributions through a quadratic correction, enhancing the accuracy of Proper Orthogonal Decomposition-based Reduced Order Models in fluid dynamics benchmarks.
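The quadratic correction term has a compact form: the reduced dynamics become da/dt = A a + Q(a ⊗ a), where Q re-introduces the effect of neglected modes. The sketch below uses random stand-in operators rather than a trained MIONet, purely to show where the correction enters the reduced model.

```python
# A sketch of the quadratic closure idea: augment a truncated POD-Galerkin
# model with a quadratic term in the retained coefficients. Operators are
# random stand-ins, not a trained MIONet.
import numpy as np

rng = np.random.default_rng(0)
r = 4                              # number of retained POD modes
A = rng.standard_normal((r, r)) * 0.1      # linear reduced operator
Q = rng.standard_normal((r, r, r)) * 0.01  # quadratic correction operator

def rom_rhs(a):
    """da/dt = A a + Q : (a ⊗ a), the corrected reduced dynamics."""
    return A @ a + np.einsum("ijk,j,k->i", Q, a, a)

a = np.ones(r)
for _ in range(100):               # forward Euler, illustrative step size
    a = a + 1e-2 * rom_rhs(a)
print(a)
```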
Authors:Andrea Caraffa, Davide Boscaini, Fabio Poiesi
Abstract:
Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? To answer No!, we introduce FreeZeV2, the second generation of FreeZe: a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation masks errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state-of-the-art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.
中文: FreeZeV2提出了一种无需训练的6D姿态估计方法,利用预训练模型在未见物体上实现了最先进的精度和效率,无需特定任务的训练。
English: FreeZeV2 introduces a training-free approach for 6D pose estimation that leverages pre-trained models to achieve state-of-the-art accuracy and efficiency on unseen objects without task-specific training.
Authors:Mattia Nardon, Mikel Mujika Agirre, Ander González Tomé, Daniel Sedano Algarabel, Josep Rueda Collell, Ana Paola Caro, Andrea Caraffa, Fabio Poiesi, Paul Ian Chippendale, Davide Boscaini
Abstract:
Accurate 6D pose estimation of complex objects in 3D environments is essential for effective robotic manipulation. Yet, existing benchmarks fall short in evaluating 6D pose estimation methods under realistic industrial conditions, as most datasets focus on household objects in domestic settings, while the few available industrial datasets are limited to artificial setups with objects placed on tables. To bridge this gap, we introduce CHIP, the first dataset designed for 6D pose estimation of chairs manipulated by a robotic arm in a real-world industrial environment. CHIP includes seven distinct chairs captured using three different RGBD sensing technologies and presents unique challenges, such as distractor objects with fine-grained differences and severe occlusions caused by the robotic arm and human operators. CHIP comprises 77,811 RGBD images annotated with ground-truth 6D poses automatically derived from the robot's kinematics, averaging 11,115 annotations per chair. We benchmark CHIP using three zero-shot 6D pose estimation methods, assessing performance across different sensor types, localization priors, and occlusion levels. Results show substantial room for improvement, highlighting the unique challenges posed by the dataset. CHIP will be publicly released.
中文: CHIP是首个针对真实工业环境中椅子6D姿态估计的数据集,通过包含机器人操作场景、严重遮挡和精细干扰物,弥补了现有基准在工业应用评估上的不足。
English: CHIP is the first dataset designed for 6D pose estimation of chairs in real industrial settings, addressing limitations of existing benchmarks by featuring robotic manipulation scenarios with severe occlusions and fine-grained distractors.
Authors:Baran Can Gül, Stefanos Tziampazis, Nasser Jazdi, Michael Weyrich
Abstract:
As Federated Learning (FL) expands to larger and more distributed environments, consistency in training is challenged by network-induced delays, clock unsynchronicity, and variability in client updates. This combination of factors can produce misaligned contributions that undermine model reliability and convergence. Existing methods like staleness-aware aggregation and model versioning address lagging updates heuristically, yet lack mechanisms to quantify staleness, especially in latency-sensitive and cross-regional deployments. In light of these considerations, we introduce \emph{SyncFed}, a time-aware FL framework that employs explicit synchronization and timestamping to establish a common temporal reference across the system. Staleness is quantified numerically based on exchanged timestamps under the Network Time Protocol (NTP), enabling the server to reason about the relative freshness of client updates and apply temporally informed weighting during aggregation. Our empirical evaluation on a geographically distributed testbed shows that, under \emph{SyncFed}, the global model evolves within a stable temporal context, resulting in improved accuracy and information freshness compared to round-based baselines devoid of temporal semantics.
中文:SyncFed提出了一种时间感知的联邦学习框架,通过显式同步和时间戳量化更新延迟,在分布式环境中提升了模型准确性和信息新鲜度。
English: SyncFed introduces a time-aware federated learning framework that uses explicit synchronization and timestamping to quantify update staleness, improving model accuracy and freshness in distributed environments.
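A toy version of temporally informed aggregation: quantify each client's staleness against a shared clock and down-weight stale updates exponentially. The exponential decay rule and the constant tau are assumptions for illustration; the paper's weighting scheme may differ.

```python
# A toy sketch of staleness-aware aggregation under a shared (NTP-style) clock.
import numpy as np

def aggregate(updates, timestamps, now, tau=5.0):
    """Weight each client update by the freshness of its timestamp."""
    staleness = np.array([now - t for t in timestamps])  # seconds since update
    weights = np.exp(-staleness / tau)                   # fresher -> heavier
    weights /= weights.sum()
    return sum(w * u for w, u in zip(weights, updates))

updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
global_delta = aggregate(updates, timestamps=[100.0, 104.0], now=105.0)
print(global_delta)  # the fresher second update dominates
```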
Authors:Dongxu Liu, Yuang Peng, Haomiao Tang, Yuwei Chen, Chunrui Han, Zheng Ge, Daxin Jiang, Mingxue Liao
Abstract:
Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, addressing the training instability caused by GAN remains an open challenge. While improving spatial compression, we also aim to minimize the latent space dimensionality, enabling more efficient and compact representations. To tackle these challenges, we focus on improving the decoder's expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, DGAE achieves state-of-the-art performance with a 2x smaller latent space. When integrated with Diffusion Models, DGAE demonstrates competitive performance on image generation for ImageNet-1K and shows that this compact latent representation facilitates faster convergence of the diffusion model.
中文: DGAE采用扩散模型引导解码器,通过增强表达能力在高压缩率下减少性能损失,并以减半的潜在空间实现最优性能,有效提升图像生成效率。
English: DGAE introduces a diffusion-guided autoencoder that enhances decoder expressiveness to mitigate performance loss under high compression and achieves state-of-the-art results with a halved latent space, improving efficiency in image generation tasks.
Authors:Jiaqi Samantha Zhan, Crystina Zhang, Shengyao Zhuang, Xueguang Ma, Jimmy Lin
Abstract:
Effective video retrieval remains challenging due to the complexity of integrating visual, auditory, and textual modalities. In this paper, we explore unified retrieval methods using OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, in the context of the MAGMaR shared task. Evaluated on the comprehensive MultiVENT 2.0 dataset, OmniEmbed generates unified embeddings for text, images, audio, and video, enabling robust multimodal retrieval. By finetuning OmniEmbed with the combined multimodal data--visual frames, audio tracks, and textual descriptions provided in MultiVENT 2.0, we achieve substantial improvements in complex, multilingual video retrieval tasks. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025, highlighting the practical effectiveness of our unified multimodal retrieval approach. Model checkpoint in this work is opensourced.
中文: 本文采用Tevatron 2.0工具包中的OmniEmbed模型,通过MultiVENT 2.0多模态数据微调,在MAGMaR评测任务中取得了最优性能,展现了统一多模态检索方法在视频检索中的显著效果。
English: This paper presents a unified multimodal retrieval method using the OmniEmbed model from Tevatron 2.0, achieving top performance on the MAGMaR shared task by fine-tuning with MultiVENT 2.0 data and demonstrating robust video retrieval across multiple modalities.
Authors:Ziyi Chen, Yiyang Liu, Mattia Prosperi, Krishna Vaddiparti, Robert L Cook, Jiang Bian, Yi Guo, Yonghui Wu
Abstract:
Objective: To characterize stigma dimensions and related social and behavioral circumstances in people living with HIV (PLWHs) seeking care, using NLP methods applied to a large collection of EHR clinical notes from a large integrated health system in the southeast United States. Methods: We identified a cohort of PLWHs from the UF Health IDR and performed topic modeling analysis using Latent Dirichlet Allocation to uncover stigma-related dimensions and related social and behavioral contexts. Domain experts created a seed list of HIV-related stigma keywords, then iteratively applied a snowball strategy, reviewing notes for additional terms until saturation was reached. To identify more target topics, we tested three keyword-based filtering strategies. The detected topics were evaluated using three widely used metrics and manually reviewed by specialists. In addition, we conducted word frequency analysis and topic variation analysis among subgroups to examine differences across age- and sex-specific demographics. Results: We identified 9140 PLWHs at UF Health and collected 2.9 million clinical notes. Through the iterative keyword approach, we generated a list of 91 keywords associated with HIV-related stigma. Topic modeling on sentences containing at least one keyword uncovered a wide range of topic themes, such as "Mental Health Concern, Stigma", "Treatment Refusal, Isolation", and "Substance Abuse". Topic variation analysis across age subgroups revealed substantial differences. Conclusion: Extracting and understanding HIV-related stigma and associated social and behavioral circumstances from EHR clinical notes enables scalable, time-efficient assessment and overcomes the limitations of traditional questionnaires. Findings from this research provide actionable insights to inform patient care and interventions to improve HIV-care outcomes.
中文摘要:本研究通过对电子健康记录应用自然语言处理技术,识别并分析了HIV相关污名化及其社会行为因素,揭示了不同人口统计群体间的显著差异,为改善患者护理干预提供了可行见解。
English Summary: This study uses natural language processing on electronic health records to identify and analyze HIV-related stigma and associated social and behavioral factors, revealing significant variations across demographic groups to inform better patient care interventions.
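The keyword-filter-then-topic-model pipeline above can be sketched in a few lines with scikit-learn: keep only sentences containing a stigma keyword, then fit LDA on the filtered sentences. The seed keywords and sentences below are toy stand-ins for the clinical data.

```python
# A sketch of keyword-filtered topic modeling: filter sentences by seed
# keywords, then fit LDA. Keywords and sentences are illustrative toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

keywords = {"stigma", "disclosure", "isolation"}   # illustrative seed terms
sentences = [
    "patient worried about disclosure to family",
    "reports isolation and depressed mood",
    "routine follow up visit no issues",
]
filtered = [s for s in sentences if keywords & set(s.split())]

vec = CountVectorizer()
X = vec.fit_transform(filtered)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for topic in lda.components_:
    top = [vec.get_feature_names_out()[i] for i in topic.argsort()[-3:]]
    print(top)  # most probable words per topic
```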
Authors:Haonan Zhang, Guoyan Lao, Yuyao Zhang, Hongjiang Wei
Abstract:
Quantitative magnetic resonance imaging (qMRI) provides tissue-specific parameters vital for clinical diagnosis. Although simultaneous multi-parametric qMRI (MP-qMRI) technologies enhance imaging efficiency, robustly reconstructing qMRI from highly undersampled, high-dimensional measurements remains a significant challenge. This difficulty arises primarily because current reconstruction methods rely solely on a single prior or physics-informed model to solve the highly ill-posed inverse problem, which often leads to suboptimal results. To overcome this limitation, we propose LoREIN, a novel unsupervised and dual-prior-integrated framework for accelerated 3D MP-qMRI reconstruction. Technically, LoREIN incorporates both a low-rank prior and a continuity prior via low-rank representation (LRR) and implicit neural representation (INR), respectively, to enhance reconstruction fidelity. The powerful continuous representation of INR enables the estimation of optimal spatial bases within the low-rank subspace, facilitating high-fidelity reconstruction of weighted images. Simultaneously, the predicted multi-contrast weighted images provide essential structural and quantitative guidance, further enhancing the reconstruction accuracy of quantitative parameter maps. Furthermore, our work introduces a zero-shot learning paradigm with broad potential in complex spatiotemporal and high-dimensional image reconstruction tasks, further advancing the field of medical imaging.
中文: LoREIN是一种融合低秩先验与隐式神经表示的无监督双先验框架,通过零样本学习范式实现高质量加速三维多参数定量MRI重建,为复杂医学影像任务提供了新方案。
English: LoREIN is an unsupervised dual-prior framework combining low-rank and implicit neural representations to achieve high-fidelity reconstruction of accelerated 3D multi-parametric qMRI, while introducing a zero-shot learning paradigm with broader medical imaging applications.
Authors:Yuan Guo, Tingjia Miao, Zheng Wu, Pengzhou Cheng, Ming Zhou, Zhuosheng Zhang
Abstract:
Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks -- such as short-chain execution tasks and single-screen grounding tasks -- while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, causing a visible atomic-to-compositional generalization gap. Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks into a series of self-contained atomic subtasks. AGENT-NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly sacrificing inference overhead. The demo video, dataset, and code are available on the project page at https://ui-nexus.github.io.
中文: 本文提出了UI-NEXUS基准来评估移动代理在组合任务上的表现,揭示了其泛化差距,并设计了AGENT-NEXUS调度系统,通过动态分解复杂操作显著提升了任务完成率。
English: This paper introduces UI-NEXUS, a benchmark for evaluating mobile agents on compositional tasks, revealing their generalization gaps, and proposes AGENT-NEXUS, a scheduling system that significantly improves task success rates by dynamically decomposing complex operations.
Authors:Yunzhi Zhang, Carson Murtuza-Lanier, Zizhang Li, Yilun Du, Jiajun Wu
Abstract:
Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
中文: 本文提出了一种专家乘积框架,通过退火重要性采样整合视觉生成模型和模拟器等多样化模型的知识,无需训练即可提升图像和视频合成的可控性。
English: The paper introduces a Product of Experts framework that integrates knowledge from diverse models like visual generative models and simulators through Annealed Importance Sampling, enhancing controllability in image and video synthesis without requiring training.
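The sampling mechanism above can be made concrete with a toy instance: draw from the product of two 1-D Gaussian "experts" with annealed importance sampling, annealing from one expert alone to the full product while accumulating importance weights. All distributions here are illustrative; the true product mean in this example is 1.4, which the weighted estimate should approach.

```python
# A toy sketch of sampling from a product of two experts via AIS.
import numpy as np

rng = np.random.default_rng(0)

def log_expert1(x): return -0.5 * (x - 2.0) ** 2      # N(2, 1), unnormalized
def log_expert2(x): return -0.5 * (x + 1.0) ** 2 / 4  # N(-1, 4), unnormalized

def log_target(x, beta):
    # Anneal from expert1 alone (beta=0) to the full product (beta=1).
    return log_expert1(x) + beta * log_expert2(x)

def ais_sample(n_steps=50, n_chains=1000):
    x = rng.normal(2.0, 1.0, size=n_chains)  # exact draws from expert1
    log_w = np.zeros(n_chains)
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += log_target(x, b) - log_target(x, b_prev)  # weight update
        # One Metropolis step targeting the current intermediate distribution.
        prop = x + rng.normal(0.0, 0.5, size=n_chains)
        accept = np.log(rng.uniform(size=n_chains)) < log_target(prop, b) - log_target(x, b)
        x = np.where(accept, prop, x)
    return x, log_w

samples, log_w = ais_sample()
w = np.exp(log_w - log_w.max())
print("product mean ~", np.average(samples, weights=w))  # true value: 1.4
```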
Authors:Shuzhou Yuan, Ercong Nie, Mario Tawfelis, Helmut Schmid, Hinrich Schütze, Michael Färber
Abstract:
Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully define persona prompts in LLM-based annotation workflows, with implications for fairness and alignment with human values.
中文摘要:本研究首次系统揭示了MBTI人格特质在仇恨言论检测中对人类标注者和大型语言模型均产生显著影响,表明基于人格的提示会导致明显输出差异,这对确保基于大模型的标注工作流程的公平性和人类价值观对齐提出了重要警示。
English Summary: This study reveals that MBTI personality traits significantly influence hate speech classification in both human annotators and Large Language Models, demonstrating substantial persona-driven variations that necessitate careful prompt design for fairness and alignment with human values.
Authors:Guiyang Luo, Jinglin Li, Qixun Zhang, Zhiyong Feng, Quan Yuan, Yijing Lin, Hui Zhang, Nan Cheng, Ping Zhang
Abstract:
The low-altitude economy (LAE) is rapidly advancing toward intelligence, connectivity, and coordination, bringing new challenges in dynamic airspace management, unmanned aerial vehicle (UAV) operation, and security management. Existing systems remain fragmented and lack effective coordination. To bridge these gaps, we propose UTICN (Ubiquitous and Trusted Intelligent Cellular-native Network) for LAE, a unified cellular-native architecture that integrates multi-domain sensing, high-precision positioning, intelligent aircraft-to-everything communication, dynamic airspace management, and UAV operational services. UTICN introduces key technologies such as integrated sensing and communication (ISAC), passive and active positioning, intelligent machine communication, swarm coordination, and control-data decoupled management frameworks. We demonstrate UTICN's feasibility through two use cases, i.e., a city-level LAE management platform and a multi-frequency collaborative ISAC system. This work provides a fundamental reference for building a unified operational foundation and airspace management architecture for the LAE.
中文摘要:低空经济面临空域管理和无人机运行等挑战,为此提出UTICN这一统一蜂窝原生网络,通过集成感知、定位与通信技术,为构建低空经济统一运营基础提供重要参考。
English Summary: The low-altitude economy faces challenges in airspace management and UAV operations, leading to the proposal of UTICN, a unified cellular-native network integrating sensing, positioning, and communication technologies to establish a foundational operational framework.
Authors:Yang Liu, Armstrong Foundjem, Foutse Khomh, Heng Li
Abstract:
Large Language Models (LLMs) have become vital tools in software development tasks such as code generation, completion, and analysis. As their integration into workflows deepens, ensuring robustness against vulnerabilities, especially those triggered by diverse or adversarial inputs, becomes increasingly important. Such vulnerabilities may lead to incorrect or insecure code generation when models encounter perturbed task descriptions, code, or comments. Prior research often overlooks the role of natural language in guiding code tasks. This study investigates how adversarial perturbations in natural language inputs, including prompts, comments, and descriptions, affect LLMs for Code (LLM4Code). It examines the effects of perturbations at the character, word, and sentence levels to identify the most impactful vulnerabilities. We analyzed multiple projects (e.g., ReCode, OpenAttack) and datasets (e.g., HumanEval, MBPP), establishing a taxonomy of adversarial attacks. The first dimension classifies the input type (code, prompts, or comments), while the second dimension focuses on granularity: character-, word-, or sentence-level changes. We adopted a mixed-methods approach, combining quantitative performance metrics with qualitative vulnerability analysis. LLM4Code models show varying robustness across perturbation types. Sentence-level attacks were least effective, suggesting models are resilient to broader contextual changes. In contrast, word-level perturbations posed serious challenges, exposing semantic vulnerabilities. Character-level effects varied, showing model sensitivity to subtle syntactic deviations. Our study offers a structured framework for testing LLM4Code robustness and emphasizes the critical role of natural language in adversarial evaluation. Improving model resilience to semantic-level disruptions is essential for secure and reliable code-generation systems.
中文摘要:本研究探讨自然语言输入中的对抗性扰动如何影响代码大语言模型,发现词语层面的攻击通过利用语义漏洞构成最大威胁,而句子层面的改动影响最小。
English Summary: This study investigates how adversarial perturbations in natural language inputs affect Large Language Models for Code, revealing that word-level attacks pose the greatest threat by exploiting semantic vulnerabilities while sentence-level changes show minimal impact.
Authors:Weilei Wen, Tianyi Zhang, Qianqian Zhao, Zhaohui Zheng, Chunle Guo, Xiuli Shao, Chongyi Li
Abstract:
Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: (1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, (2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and (3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. We will release the code upon formal publication.
Chinese: UGTSR框架通过引入不确定性学习、Top-k特征匹配和对齐注意力模块,提升了基于码簿的超分辨率性能,显著改善了纹理真实感和重建保真度。
English: The UGTSR framework introduces uncertainty learning, Top-k feature matching, and an Align-Attention module to enhance codebook-based super-resolution, significantly improving texture realism and reconstruction fidelity.
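The Top-k matching strategy above is easy to isolate: instead of snapping each LR feature to its single nearest codebook entry, fuse the k nearest entries with softmax weights. Shapes, the dot-product similarity, and the temperature below are illustrative assumptions, not UGTSR's exact design.

```python
# A toy sketch of Top-k codebook matching: fuse the k nearest codebook
# entries per feature, weighted by softmax over similarity.
import torch
import torch.nn.functional as F

def topk_codebook_match(features, codebook, k=4, temperature=1.0):
    """features: (N, D) LR features; codebook: (K, D) HQ codes -> (N, D)."""
    sims = features @ codebook.t()                  # (N, K) similarity scores
    topv, topi = sims.topk(k, dim=-1)               # k best codes per feature
    weights = F.softmax(topv / temperature, dim=-1) # (N, k) fusion weights
    return (weights.unsqueeze(-1) * codebook[topi]).sum(dim=1)

feats = torch.randn(16, 64)
codes = torch.randn(512, 64)
fused = topk_codebook_match(feats, codes)
print(fused.shape)  # torch.Size([16, 64])
```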
Authors:Yijie Deng, Shuaihang Yuan, Geeta Chandra Raju Bethala, Anthony Tzes, Yu-Shen Liu, Yi Fang
Abstract:
Instance Image-Goal Navigation (IIN) requires autonomous agents to identify and navigate to a target object or location depicted in a reference image captured from any viewpoint. While recent methods leverage powerful novel view synthesis (NVS) techniques, such as three-dimensional Gaussian splatting (3DGS), they typically rely on randomly sampling multiple viewpoints or trajectories to ensure comprehensive coverage of discriminative visual cues. This approach, however, creates significant redundancy through overlapping image samples and lacks principled view selection, substantially increasing both rendering and comparison overhead. In this paper, we introduce a novel IIN framework with a hierarchical scoring paradigm that estimates optimal viewpoints for target matching. Our approach integrates cross-level semantic scoring, utilizing CLIP-derived relevancy fields to identify regions with high semantic similarity to the target object class, with fine-grained local geometric scoring that performs precise pose estimation within promising regions. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on simulated IIN benchmarks and real-world applicability.
中文: 本文提出了一种用于实例图像目标导航的分层评分框架,结合语义相关性场和几何姿态估计来优化选择视角,在减少冗余的同时,在仿真和现实环境中均实现了最先进的性能。
English: This paper introduces a hierarchical scoring framework for Instance Image-Goal Navigation that combines semantic relevancy fields and geometric pose estimation to optimally select viewpoints, reducing redundancy while achieving state-of-the-art performance in both simulated and real-world environments.
Authors:Kevin Frans, Sergey Levine, Pieter Abbeel
Abstract:
In this work, we take an experimentally grounded look at neural network optimization. Building on the Shampoo family of algorithms, we identify and alleviate three key issues, resulting in the proposed SPlus method. First, we find that naive Shampoo is prone to divergence when matrix-inverses are cached for long periods. We introduce an alternate bounded update combining a historical eigenbasis with instantaneous normalization, resulting in across-the-board stability and significantly lower computational requirements. Second, we adapt a shape-aware scaling to enable learning rate transfer across network width. Third, we find that high learning rates result in large parameter noise, and propose a simple iterate-averaging scheme which unblocks faster learning. To properly confirm these findings, we introduce a pointed Transformer training benchmark, considering three objectives (language modelling, image classification, and diffusion modelling) across different stages of training. On average, SPlus is able to reach the validation performance of Adam within 44% of the gradient steps and 62% of the wallclock time.
中文: 本研究提出SPlus优化方法,通过改进Shampoo算法解决了发散问题、实现跨网络宽度的学习率迁移,并采用迭代平均降低参数噪声,仅需44%梯度步数和62%计算时间即可达到Adam的验证性能。
English: This study introduces SPlus, an enhanced Shampoo-based optimization method that addresses divergence, enables learning rate transfer across network width, and reduces parameter noise through iterate-averaging, achieving Adam's validation performance with significantly fewer gradient steps and computation time.
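The iterate-averaging scheme mentioned above can be approximated by keeping an exponential moving average of the weights and evaluating the average rather than the raw (noisy) iterates. The decay value and toy model below are illustrative, not the SPlus defaults.

```python
# A minimal sketch of iterate averaging: maintain an EMA of the parameters
# alongside ordinary optimizer steps, and evaluate with the averaged weights.
import torch

model = torch.nn.Linear(10, 1)
ema = {k: v.detach().clone() for k, v in model.state_dict().items()}
decay = 0.999  # illustrative EMA decay

def update_ema(model, ema, decay):
    with torch.no_grad():
        for k, v in model.state_dict().items():
            ema[k].mul_(decay).add_(v, alpha=1.0 - decay)

opt = torch.optim.SGD(model.parameters(), lr=0.1)  # deliberately large lr
for step in range(100):
    x = torch.randn(32, 10)
    loss = (model(x) - x.sum(dim=1, keepdim=True)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    update_ema(model, ema, decay)
# For evaluation, load the averaged weights into (a copy of) the model:
# eval_model = torch.nn.Linear(10, 1); eval_model.load_state_dict(ema)
```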
Authors:LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...
中文: 多模态大语言模型在通用视觉任务中表现出色,但在医疗应用中因知识覆盖不足和数据差异而受限,为此研发了基于丰富医疗数据训练的专用模型Lingshu,其在多项核心医疗任务中优于现有模型。
English: Multimodal Large Language Models (MLLMs) excel in general visual tasks but underperform in medical applications due to knowledge gaps and data limitations, prompting the development of Lingshu, a specialized model trained on enriched medical data that surpasses existing models in key medical tasks.
Authors:Renjie He, Yiqiu Wang, Meixia Tao, Shu Sun
Abstract:
This paper investigates the passive detection problem in multi-static integrated sensing and communication (ISAC) systems, where multiple sensing receivers (SRs) jointly detect a target using random unknown communication signals transmitted by a collaborative base station. Unlike traditional active detection, the considered passive detection does not require complete prior knowledge of the transmitted communication signals at each SR. First, we derive a generalized likelihood ratio test detector and conduct an asymptotic analysis of the detection statistic under the large-sample regime. We examine how the signal-to-noise ratios (SNRs) of the target paths and direct paths influence the detection performance. Then, we propose two joint transmit beamforming designs based on the analyses. In the first design, the asymptotic detection probability is maximized while satisfying the signal-to-interference-plus-noise ratio requirement for each communication user under the total transmit power constraint. Given the non-convex nature of the problem, we develop an alternating optimization algorithm based on the quadratic transform and semi-definite relaxation. The second design adopts a heuristic approach that aims to maximize the target energy, subject to a minimum SNR threshold on the direct path, and offers lower computational complexity. Numerical results validate the asymptotic analysis and demonstrate the superiority of the proposed beamforming designs in balancing passive detection performance and communication quality. This work highlights the promise of target detection using unknown communication data signals in multi-static ISAC systems.
中文摘要:本文研究了多基地ISAC系统中利用未知通信信号的被动检测问题,提出了两种波束成形设计方案,在保证通信质量的同时有效提升了目标检测性能。
English Summary: This paper develops passive detection methods for multi-static ISAC systems using unknown communication signals, proposing two beamforming designs that effectively balance detection performance with communication quality.
Authors:Mikhail Krasitskii, Grigori Sidorov, Olga Kolesnikova, Liliana Chanona Hernandez, Alexander Gelbukh
Abstract:
We propose a hybrid approach for multilingual sentiment analysis that combines extractive and abstractive summarization to address the limitations of standalone methods. The model integrates TF-IDF-based extraction with a fine-tuned XLM-R abstractive module, enhanced by dynamic thresholding and cultural adaptation. Experiments across 10 languages show significant improvements over baselines, achieving 0.90 accuracy for English and 0.84 for low-resource languages. The approach also demonstrates 22% greater computational efficiency than traditional methods. Practical applications include real-time brand monitoring and cross-cultural discourse analysis. Future work will focus on optimization for low-resource languages via 8-bit quantization.
中文:该多语言情感分析的混合方法结合了抽取式和生成式摘要,在多种语言中实现了高准确率和计算效率,适用于实时监控和跨文化分析的实际场景。
English: This hybrid approach for multilingual sentiment analysis combines extractive and abstractive summarization, achieving high accuracy and computational efficiency across diverse languages with applications in real-time monitoring and cross-cultural analysis.
Authors:Subhendu Khatuya, Shashwat Naidu, Saptarshi Ghosh, Pawan Goyal, Niloy Ganguly
Abstract:
The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of treating labels as mere atomic symbols, our approach utilizes predefined label descriptions and is trained to generate these descriptions based on the input text. During inference, the generated descriptions are matched to the pre-defined labels using a finetuned sentence transformer. We integrate this with a dual-objective loss function, combining cross-entropy loss and cosine similarity of the generated sentences with the predefined target descriptions, ensuring both semantic alignment and accuracy. Our proposed model LAGAMC stands out for its parameter efficiency and versatility across diverse datasets, making it well-suited for practical applications. We demonstrate the effectiveness of our proposed model by achieving new state-of-the-art performances across all evaluated datasets, surpassing several strong baselines. We achieve improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1 compared to the closest baseline across all datasets.
中文: 我们提出LAGAMC框架,这是一种参数高效的生成式模型,通过利用标签描述和双目标损失函数,在多标签文本分类中实现了最先进的性能,相比基线模型将Micro-F1和Macro-F1分别提升了13.94%和24.85%。
English: We introduce LAGAMC, a parameter-efficient generative model framework that uses label descriptions and a dual-objective loss to achieve state-of-the-art multi-label text classification, improving Micro-F1 by 13.94% and Macro-F1 by 24.85% over baselines.
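The dual-objective loss above combines token-level cross-entropy with a cosine-similarity term between generated and target description embeddings. The sketch below assumes sentence embeddings are already computed (e.g., by a fine-tuned sentence transformer) and uses an illustrative weighting lambda.

```python
# A sketch of a dual-objective loss: cross-entropy on generated tokens plus
# cosine alignment of sentence embeddings. Shapes and lambda are illustrative.
import torch
import torch.nn.functional as F

def dual_objective_loss(logits, target_ids, gen_emb, target_emb, lam=0.5):
    """logits: (B, T, V); target_ids: (B, T); *_emb: (B, D) sentence embeddings."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    cos = F.cosine_similarity(gen_emb, target_emb, dim=-1).mean()
    return ce + lam * (1.0 - cos)  # push embeddings together, tokens to match

logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
loss = dual_objective_loss(logits, targets, torch.randn(2, 384), torch.randn(2, 384))
print(loss.item())
```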
Authors:Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J. Pappas, Eric Wong, Hamed Hassani
Abstract:
Existing language model safety evaluations focus on overt attacks and low-stakes tasks. Realistic attackers can subvert current safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
Chinese: 现有语言模型安全评估忽略了通过多个看似无害的查询共同促成滥用的隐蔽攻击,为此开发了BSD基准来评估针对此类策略的状态化防御措施。
English: Current safety evaluations for language models overlook covert attacks that use multiple benign-seeming queries to collectively enable misuse, prompting the development of the BSD benchmark to assess stateful defenses against such strategies.
Authors:Yuheng Lei, Sitong Mao, Shunbo Zhou, Hongyuan Zhang, Xuelong Li, Ping Luo
Abstract:
A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.
中文摘要:提出的动态渐进参数高效专家库(DMPEL)通过动态组合渐进构建的专家库中的专家,实现终身机器人学习,利用专家系数回放技术有效促进前向迁移并减轻遗忘问题。
English Summary: The proposed Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) enables lifelong robot learning by dynamically combining experts from a progressively built library, achieving superior forward transfer and mitigating forgetting through expert coefficient replay.
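The router-over-experts mechanism above can be sketched as mixing LoRA-style low-rank deltas with router-derived coefficients on top of a frozen base weight. Dimensions, the linear router, and the mean-pooled routing below are illustrative simplifications, not the paper's exact architecture.

```python
# A toy sketch of routing over a library of low-rank (LoRA-style) experts:
# a lightweight router mixes expert deltas into one weight update.
import torch
import torch.nn.functional as F

d_in, d_out, rank, n_experts = 32, 32, 4, 5
A = torch.randn(n_experts, rank, d_in) * 0.01   # expert down-projections
B = torch.randn(n_experts, d_out, rank) * 0.01  # expert up-projections
router = torch.nn.Linear(d_in, n_experts)       # lightweight router

def moe_lora_forward(x, W_base):
    """x: (batch, d_in); W_base: frozen (d_out, d_in) pretrained weight."""
    coeff = F.softmax(router(x).mean(dim=0), dim=-1)    # (n_experts,) mixing
    delta = torch.einsum("e,eor,eri->oi", coeff, B, A)  # mixed low-rank delta
    return x @ (W_base + delta).t()

x = torch.randn(8, d_in)
y = moe_lora_forward(x, torch.randn(d_out, d_in))
print(y.shape)  # torch.Size([8, 32])
```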
Authors:Wei Li, Yanbin Wei, Qiushi Huang, Jiangyue Yan, Yang Chen, James T. Kwok, Yu Zhang
Abstract:
Modern large language models (LLMs) often struggle to dynamically adapt their reasoning depth to varying task complexities, leading to suboptimal performance or inefficient resource utilization. To address this, we introduce DynamicMind, a novel tri-mode thinking system. DynamicMind empowers LLMs to autonomously select between Fast, Normal, and Slow thinking modes for zero-shot question answering (ZSQA) tasks through cognitive-inspired prompt engineering. Our framework's core innovations include: (1) expanding the established dual-process framework of fast and slow thinking into a tri-mode thinking system involving a normal thinking mode to preserve the intrinsic capabilities of LLM; (2) proposing the Thinking Density metric, which aligns computational resource allocation with problem complexity; and (3) developing the Thinking Mode Capacity (TMC) dataset and a lightweight Mind Router to predict the optimal thinking mode. Extensive experiments across diverse mathematical, commonsense, and scientific QA benchmarks demonstrate that DynamicMind achieves superior ZSQA capabilities while establishing an effective trade-off between performance and computational efficiency.
中文摘要:DynamicMind提出了一种三模式思维系统,使大语言模型能够通过认知启发式提示工程自主选择快速、常规和慢速思维模式进行零样本问答,在性能和计算效率之间实现了最佳平衡。
English Summary: DynamicMind introduces a tri-mode thinking system that enables large language models to autonomously select between Fast, Normal, and Slow thinking modes for zero-shot question answering, achieving optimal balance between performance and computational efficiency.
Authors:Yaoxun Xu, Jianwei Yu, Hangting Chen, Zhiyong Wu, Xixin Wu, Dong Yu, Rongzhi Gu, Yi Luo
Abstract:
As deep learning advances in audio generation, challenges in audio security and copyright protection highlight the need for robust audio watermarking. Recent neural network-based methods have made progress but still face three main issues: preventing unauthorized access, decoding initial watermarks after multiple embeddings, and embedding varying lengths of watermarks. To address these issues, we propose WAKE, the first key-controllable audio watermarking framework. WAKE embeds watermarks using specific keys and recovers them with corresponding keys, enhancing security by making decoding with an incorrect key impossible. It also resolves the overwriting issue by allowing watermark decoding after multiple embeddings and supports variable-length watermark insertion. WAKE outperforms existing models in both watermarked audio quality and watermark detection accuracy. Code, more results, and demo page: https://thuhcsi.github.io/WAKE.
中文: WAKE是首个密钥可控的音频水印框架,通过密钥嵌入和提取水印提升安全性,解决多次嵌入导致的覆盖问题,支持可变长度水印,并在音频质量与水印检测精度上超越现有模型。
English: WAKE is a pioneering key-controllable audio watermarking framework that enhances security by embedding and decoding watermarks with specific keys, prevents overwriting issues during multiple embeddings, supports variable-length watermarks, and surpasses existing models in audio quality and detection accuracy.
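For intuition about key-controlled embedding and recovery in general, here is a classical spread-spectrum toy (emphatically not WAKE's neural method): a key-seeded pseudorandom carrier is added to the signal, and only the matching key yields a high detection score.

```python
import numpy as np

def carrier(key: int, n: int) -> np.ndarray:
    """Deterministic pseudorandom +/-1 carrier derived from the key."""
    return np.random.default_rng(key).choice([-1.0, 1.0], size=n)

def embed(audio: np.ndarray, key: int, strength: float = 0.05) -> np.ndarray:
    return audio + strength * carrier(key, len(audio))

def detect(audio: np.ndarray, key: int) -> float:
    c = carrier(key, len(audio))
    return float(np.dot(audio, c) / len(audio))   # ~strength iff the key matches

x = np.random.default_rng(0).standard_normal(16000)   # 1 s of noise "audio"
wm = embed(x, key=42)
print(detect(wm, key=42), detect(wm, key=7))          # high score vs. near zero
```

The toy also hints at why multiple embeddings need not overwrite each other: carriers from different keys are nearly orthogonal, so each remains detectable by correlation.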
Authors:Wenyu Zhu, Jianhui Wang, Bowen Gao, Yinjun Jia, Haichuan Tan, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan
Abstract:
Virtual screening (VS) is a critical component of modern drug discovery, yet most existing methods--whether physics-based or deep learning-based--are developed around holo protein structures with known ligand-bound pockets. Consequently, their performance degrades significantly on apo or predicted structures such as those from AlphaFold2, which are more representative of real-world early-stage drug discovery, where pocket information is often missing. In this paper, we introduce an alignment-and-aggregation framework to enable accurate virtual screening under structural uncertainty. Our method comprises two core components: (1) a tri-modal contrastive learning module that aligns representations of the ligand, the holo pocket, and cavities detected from structures, thereby enhancing robustness to pocket localization error; and (2) a cross-attention based adapter for dynamically aggregating candidate binding sites, enabling the model to learn from activity data even without precise pocket annotations. We evaluated our method on a newly curated benchmark of apo structures, where it significantly outperforms state-of-the-art methods in the blind apo setting, improving the early enrichment factor (EF1%) from 11.75 to 37.19. Notably, it also maintains strong performance on holo structures. These results demonstrate the promise of our approach in advancing first-in-class drug discovery, particularly in scenarios lacking experimentally resolved protein-ligand complexes.
中文: 本文提出一种对齐与聚合框架,通过配体、全息口袋和空腔表征的对齐及动态结合位点聚合,显著提升了结构不确定性下的虚拟筛选准确性,在非结合结构上表现卓越且保持全息结构的强性能。
English: This paper introduces an alignment-and-aggregation framework that enhances virtual screening accuracy under structural uncertainty by aligning ligand, holo pocket, and cavity representations and dynamically aggregating binding sites, significantly outperforming existing methods on apo structures while maintaining strong holo performance.
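A minimal sketch of a tri-modal contrastive objective over batch-aligned ligand, holo-pocket, and cavity embeddings; the pairwise one-directional InfoNCE form and the dimensions are assumptions about how the three views are aligned, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """One-directional InfoNCE: matched pairs sit on the diagonal."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                 # (B, B) similarity matrix
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)

def tri_modal_loss(ligand, pocket, cavity):
    # Align every pair so cavity embeddings can substitute for holo pockets.
    return (info_nce(ligand, pocket) + info_nce(ligand, cavity)
            + info_nce(pocket, cavity)) / 3

B, D = 32, 256
loss = tri_modal_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```

Pulling cavity representations toward holo-pocket ones is what buys robustness: at test time on apo structures, detected cavities can stand in where the true pocket is unknown.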
Authors:Tianyi Zhang, Shidong Pan, Zejun Zhang, Zhenchang Xing, Xiaoyu Sun
Abstract:
Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions, but current evaluation focuses on syntactic correctness while ignoring deployability, a vital measure of IaC template utility. We address this gap through two contributions: (1) IaCGen, an LLM-based deployability-centric framework that uses an iterative feedback mechanism to generate IaC templates, and (2) DPIaC-Eval, a deployability-centric IaC template benchmark consisting of 153 real-world scenarios that can evaluate syntax, deployment, user intent, and security. Our evaluation reveals that state-of-the-art LLMs initially performed poorly, with Claude-3.5 and Claude-3.7 achieving only 30.2% and 26.8% deployment success on the first attempt, respectively. However, IaCGen transforms this performance dramatically: all evaluated models reach over 90% passItr@25, with Claude-3.5 and Claude-3.7 achieving a 98% success rate. Despite these improvements, critical challenges remain in user intent alignment (25.2% accuracy) and security compliance (8.4% pass rate), highlighting areas requiring continued research. Our work provides the first comprehensive assessment of deployability-centric IaC template generation and establishes a foundation for future research.
中文: 本研究提出了IaCGen这一以可部署性为核心的框架,通过迭代反馈机制生成基础设施即代码模板,并创建了DPIaC-Eval评估基准,结果表明虽然初始大语言模型部署成功率仅约30%,但经过IaCGen优化后提升至98%,但在用户意图对齐(25.2%)和安全合规(8.4%)方面仍存在显著挑战。
English: This research introduces IaCGen, a deployability-focused framework using iterative feedback to generate Infrastructure-as-Code templates, and DPIaC-Eval, a benchmark for evaluating deployment success, showing that while initial LLM performance was poor, IaCGen boosted deployment rates to over 90% but revealed persistent challenges in user intent alignment and security compliance.
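The iterative feedback mechanism can be pictured as a generate-validate-repair loop; `llm` and `try_deploy` below are hypothetical callables (an LLM API and a syntax-check-plus-dry-run deployment), and the iteration budget mirrors the passItr@25 metric.

```python
def generate_iac(task: str, llm, try_deploy, max_iters: int = 25) -> str:
    """Sketch of a deployability-centric generation loop in the spirit of IaCGen."""
    template = llm(f"Write an IaC template for: {task}")
    for _ in range(max_iters):
        ok, error_log = try_deploy(template)   # validate + attempt deployment
        if ok:
            return template
        template = llm(                        # feed the deployment error back
            f"The template failed to deploy with error:\n{error_log}\n"
            f"Fix the template:\n{template}"
        )
    return template   # best effort after the iteration budget
```

The loop explains the headline numbers: a first attempt rarely deploys, but repeated error-conditioned repair drives the pass rate above 90% within 25 iterations.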
Authors:Yan Shu, Hangui Lin, Yexin Liu, Yan Zhang, Gangyan Zeng, Yan Li, Yu Zhou, Ser-Nam Lim, Harry Yang, Nicu Sebe
Abstract:
Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in LLM with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
English Summary: Large Multimodal Models often produce semantic hallucinations when processing ambiguous visual text, but this can be mitigated through a training-free framework that identifies text regions and corrects outputs using internal representations from less hallucination-prone layers.
Authors:Maurizio Clemente, Prapti Maharjan, Mauro Salazar, Theo Hofman
Abstract:
This paper investigates the environmental impact of Li-Ion batteries by quantifying manufacturing-related emissions and analyzing how electricity mix and production scale affect emission intensity. To this end, we conduct a meta-analysis of life cycle assessments on lithium-ion batteries published over the past two decades, categorizing them by year, battery chemistry, functional unit, system boundaries, and electricity mix. We then carry out a cradle-to-gate assessment for a nickel manganese cobalt 811 battery with a silicon-coated graphite anode, analyzing how variations in the carbon intensity of the electricity mix affect emissions, with case studies for China, South Korea, and Sweden. Finally, we develop a set of regression models that link annual battery production and the carbon intensity of China's electricity mix to the average mass-specific emissions observed each year. The meta-analysis shows a median global warming potential of 17.63 kg CO2-eq./kg of battery, with a standard deviation of 7.34. Differences in electricity mix mainly influence emissions from the energy-intensive cell production, particularly from cathode material processing. We found that a multivariate linear regression using production volume and the carbon intensity of the Chinese electricity mix as predictors explains emissions with moderate accuracy. The environmental impact of battery manufacturing can be reduced by using clean energy sources in production processes. However, achieving substantial reductions requires clean energy throughout the entire supply chain, as importing materials from regions with carbon-intensive electricity mixes can undermine these efforts. Our findings also highlight the emission-reducing effect of learning associated with increased production scale, supporting the integration of learning effects in future life cycle assessment models.
中文: 本研究量化了锂离子电池制造的环境影响,发现排放主要受电力结构和生产规模影响,使用清洁能源及优化供应链是降低碳足迹的关键。
English: This study quantifies the environmental impact of lithium-ion battery manufacturing, revealing that emissions are primarily influenced by the electricity mix and production scale, with clean energy use and supply chain optimization being key to reducing carbon footprint.
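The final regression step is a standard multivariate linear fit; the sketch below uses invented placeholder rows (not the study's data), with annual production volume and grid carbon intensity as predictors of mass-specific emissions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder rows: [annual production volume, grid carbon intensity (g CO2/kWh)]
X = np.array([[35.0, 610.0],
              [95.0, 580.0],
              [280.0, 540.0],
              [650.0, 520.0]])
y = np.array([25.1, 21.4, 17.2, 14.8])   # placeholder kg CO2-eq./kg battery

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # scale (learning) and grid-mix effects
print(model.predict([[900.0, 480.0]]))    # an extrapolated scenario
```

A negative coefficient on production volume captures the learning effect the authors highlight, while a positive coefficient on carbon intensity captures the grid-mix effect.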
Authors:Shiyuan Feng, Ying Feng, George Z. Li, Zhao Song, David P. Woodruff, Lichen Zhang
Abstract:
Recently differential privacy has been used for a number of streaming, data structure, and dynamic graph problems as a means of hiding the internal randomness of the data structure, so that multiple possibly adaptive queries can be made without sacrificing the correctness of the responses. Although these works use differential privacy to show that for some problems it is possible to tolerate $T$ queries using $\widetilde{O}(\sqrt{T})$ copies of a data structure, such results only apply to numerical estimation problems, and only return the cost of an optimization problem rather than the solution itself. In this paper, we investigate the use of differential privacy for adaptive queries to search problems, which are significantly more challenging since the responses to queries can reveal much more about the internal randomness than a single numerical query. We focus on two classical search problems: nearest neighbor queries and regression with arbitrary turnstile updates. We identify key parameters to these problems, such as the number of $c$-approximate near neighbors and the matrix condition number, and use different differential privacy techniques to design algorithms returning the solution vector with memory and time depending on these parameters. We give algorithms for each of these problems that achieve similar tradeoffs.
中文摘要:本文研究在最近邻查询和回归等搜索问题中应用差分隐私处理自适应查询,开发出根据问题特定参数返回解向量并优化性能的算法。
English Summary: This paper explores using differential privacy for adaptive queries in search problems like nearest neighbor and regression, developing algorithms that return solution vectors with performance dependent on problem-specific parameters.
Authors:Gen Li, Yuchen Jiao, Yu Huang, Yuting Wei, Yuxin Chen
Abstract:
Large language models are capable of in-context learning, the ability to perform new tasks at test time using a handful of input-output examples, without parameter updates. We develop a universal approximation theory to elucidate how transformers enable in-context learning. For a general class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can predict based on a few noisy in-context examples with vanishingly small risk. Unlike prior work that frames transformers as approximators of optimization algorithms (e.g., gradient descent) for statistical learning tasks, we integrate Barron's universal function approximation theory with the algorithm approximator viewpoint. Our approach yields approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being mimicked, extending far beyond convex problems like linear regression. The key is to show that (i) any target function can be nearly linearly represented, with small $\ell_1$-norm, over a set of universal features, and (ii) a transformer can be constructed to find the linear representation -- akin to solving Lasso -- at test time.
Chinese: 本文提出了一种通用逼近理论,阐明Transformer如何通过构建模型在测试时寻找目标函数的线性表示来实现上下文学习,其逼近保证不受优化算法限制,可扩展到线性回归以外的广泛问题类别。
English: This paper establishes a universal approximation theory showing how transformers can perform in-context learning by constructing models that find linear representations of target functions with minimal risk, extending beyond conventional optimization mimicry to handle broader problem classes.
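The two ingredients of the argument can be checked in a toy: (i) a target function is nearly linearly representable, with small $\ell_1$-norm, over a fixed set of "universal" features, and (ii) that representation is recoverable from a few noisy in-context examples by solving a Lasso problem. The random-ReLU feature family and sizes below are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
W = rng.standard_normal(256)                # fixed "universal" ReLU features
b = rng.uniform(-1.0, 1.0, 256)
phi = lambda x: np.maximum(np.outer(x, W) + b, 0.0)

x_ctx = rng.uniform(-1, 1, 40)              # a handful of noisy in-context examples
y_ctx = np.sin(3 * x_ctx) + 0.05 * rng.standard_normal(40)

lasso = Lasso(alpha=1e-3, max_iter=50_000).fit(phi(x_ctx), y_ctx)  # small-l1 fit
x_test = np.linspace(-1, 1, 5)
print(lasso.predict(phi(x_test)) - np.sin(3 * x_test))   # small residuals
```

The paper's construction shows a transformer can carry out the analogue of this Lasso solve at test time, which is why the guarantees are not tied to mimicking any particular optimizer.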
Authors:Chantal Pellegrini, Ege Özsoy, David Bani-Harouni, Matthias Keicher, Nassir Navab
Abstract:
Healthcare systems face significant challenges in managing and interpreting vast, heterogeneous patient data for personalized care. Existing approaches often focus on narrow use cases with a limited feature space, overlooking the complex, longitudinal interactions needed for a holistic understanding of patient health. In this work, we propose a novel approach to patient pathway modeling by transforming diverse electronic health record (EHR) data into a structured representation and designing a holistic pathway prediction model, EHR2Path, optimized to predict future health trajectories. Further, we introduce a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens, improving performance over text-only models, while being much more token-efficient. EHR2Path demonstrates strong performance in both next time-step prediction and longitudinal simulation, outperforming competitive baselines. It enables detailed simulations of patient trajectories, inherently targeting diverse evaluation tasks, such as forecasting vital signs, lab test results, or length-of-stay, opening a path towards predictive and personalized healthcare.
中文摘要:本文提出EHR2Path这一新型患者路径建模方法,通过将多样化电子健康记录转化为结构化表示,并采用创新的摘要机制嵌入长期时序上下文,在预测未来健康轨迹方面显著优于现有方法,为实现个性化医疗模拟开辟了新途径。
English Summary: This paper introduces EHR2Path, a novel patient pathway modeling approach that transforms diverse EHR data into structured representations and uses a unique summary mechanism to embed long-term temporal context, significantly outperforming existing methods in predicting future health trajectories and enabling personalized healthcare simulations.
Authors:Andrea Munari, Federico Chiariotti, Leonardo Badia, Petar Popovski
Abstract:
The widespread adoption of age of information (AoI) as a meaningful and analytically tractable information freshness metric has led to a wide body of work on the timing performance of Internet of things (IoT) systems. However, the spatial correlation inherent to environmental monitoring has been mostly neglected in the recent literature, due to the significant modeling complexity it introduces. In this work, we address this gap by presenting a model of spatio-temporal information freshness, considering the conditional entropy of the system state in a remote monitoring scenario, such as a low-orbit satellite collecting information from a wide geographical area. Our analytical results show that purely age-oriented schemes tend to select an overly broad communication range, leading to inaccurate estimates and energy inefficiency, both of which can be mitigated by adopting a spatio-temporal approach.
中文: 本研究提出了一种物联网系统中信息新鲜度的时空模型,表明仅基于时效的方法会导致通信范围过大和估计不准确,而采用时空方法可有效缓解这些问题。
English: The study introduces a spatio-temporal model for information freshness in IoT systems, demonstrating that purely age-based approaches cause inefficient communication ranges and estimation errors, which can be resolved through this integrated method.
Authors:Bihan Xu, Shiwei Zhao, Runze Wu, Zhenya Huang, Jiawei Wang, Zhipeng Hu, Kai Wang, Haoyu Liu, Tangjie Lv, Le Li, Changjie Fan, Xin Tong, Jiangze Han
Abstract:
Within the domain of Massively Multiplayer Online (MMO) economy research, Agent-Based Modeling (ABM) has emerged as a robust tool for analyzing game economics, evolving from rule-based agents to decision-making agents enhanced by reinforcement learning. Nevertheless, existing works encounter significant challenges when attempting to emulate human-like economic activities among agents, particularly regarding agent reliability, sociability, and interpretability. In this study, we take a preliminary step in introducing a novel approach using Large Language Models (LLMs) in MMO economy simulation. Leveraging LLMs' role-playing proficiency, generative capacity, and reasoning aptitude, we design LLM-driven agents with human-like decision-making and adaptability. These agents are equipped with the abilities of role-playing, perception, memory, and reasoning, addressing the aforementioned challenges effectively. Simulation experiments focusing on in-game economic activities demonstrate that LLM-empowered agents can promote emergent phenomena like role specialization and price fluctuations in line with market rules.
Chinese: 本研究提出了一种利用大型语言模型(LLMs)模拟大型多人在线(MMO)游戏中类人经济活动的新方法,通过角色扮演、感知、记忆和推理能力有效解决了智能体可靠性、社交性和可解释性方面的挑战。
English: This study introduces a novel approach using Large Language Models (LLMs) to simulate human-like economic activities in Massively Multiplayer Online (MMO) games, addressing challenges in agent reliability, sociability, and interpretability through role-playing, perception, memory, and reasoning capabilities.
Authors:Kun Zhao, Bohao Yang, Chen Tang, Siyuan Dai, Haoteng Tang, Chenghua Lin, Liang Zhan
Abstract:
Large Language Models (LLMs) excel at many tasks but struggle with ambiguous scenarios where multiple valid responses exist, often yielding unreliable results. Conversely, Small Language Models (SLMs) demonstrate robustness in such scenarios but are susceptible to misleading or adversarial inputs. We observed that LLMs handle negative examples effectively, while SLMs excel with positive examples. To leverage their complementary strengths, we introduce SLIDE (Small and Large Integrated for Dialogue Evaluation), a method integrating SLMs and LLMs via adaptive weighting. Building on SLIDE, we further propose a Dual-Refinement Evaluation (DRE) method to enhance SLM-LLM integration: (1) SLM-generated insights guide the LLM to produce initial evaluations; (2) SLM-derived adjustments refine the LLM's scores for improved accuracy. Experiments demonstrate that DRE outperforms existing methods, showing stronger alignment with human judgment across diverse benchmarks. This work illustrates how combining small and large models can yield more reliable evaluation tools, particularly for open-ended tasks such as dialogue evaluation.
Chinese Summary: 本研究提出的SLIDE方法及其增强版DRE,通过自适应加权和双重优化机制融合大小语言模型的互补优势,在对话评估等开放任务中实现了更可靠且更符合人类评判标准的结果。
English Summary: The study introduces SLIDE and its enhanced version DRE, which integrate the complementary strengths of Small and Large Language Models through adaptive weighting and dual-refinement to achieve more reliable dialogue evaluation that better aligns with human judgment.
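The integration step can be as simple as a confidence-weighted combination of the two evaluators' scores; the weighting rule below is an assumption for illustration, not SLIDE's published formula.

```python
def integrated_score(slm_score: float, llm_score: float,
                     slm_conf: float, llm_conf: float) -> float:
    """Each model scores a response in [0, 1]; weights follow the models'
    own confidences, renormalized to sum to one."""
    w_slm = slm_conf / (slm_conf + llm_conf)
    return w_slm * slm_score + (1.0 - w_slm) * llm_score

# The SLM is confident here (its robust regime), so its judgment dominates.
print(integrated_score(0.8, 0.4, slm_conf=0.9, llm_conf=0.3))   # 0.7
```

DRE then layers two refinement passes on top: SLM-generated insights condition the LLM's initial evaluation, and SLM-derived adjustments nudge the final score.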
Authors:Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, Pascale Fung
Abstract:
Humans are known to have an internal "world model" that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative task setup enables us to evaluate different types of world models and planners and realize a thorough comparison across different hypotheses. The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" - identical actions observed in different contexts - as candidates for selection. This benchmark is grounded in a formal framework of partially observable semi-MDP, ensuring better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP whereas humans are able to solve both tasks perfectly.
中文: WorldPrediction是一个基于视频的基准测试,旨在通过要求AI模型从干扰项中区分正确动作或序列来评估其世界建模和程序规划能力,结果显示当前顶尖模型的表现远不及人类。
English: WorldPrediction is a video-based benchmark designed to assess AI models' world modeling and procedural planning abilities by requiring them to distinguish correct actions or sequences from distractors, revealing that current top models significantly underperform compared to humans.
Authors:Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, Lei Zhang
Abstract:
Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based RL learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
中文摘要:物体指代旨在检测图像中符合语言描述的所有对象,而本文提出的Rex-Thinker模型通过可验证的思维链推理机制实现可解释预测,并在无匹配对象时主动弃权,显著提升了任务的可靠性和泛化能力。
English Summary: Object referring requires detecting image objects matching a natural language description, and the proposed Rex-Thinker model enhances this by implementing verifiable chain-of-thought reasoning and trustworthy abstention when no matches exist.
Authors:Yide Ran, Wentao Guo, Jingwei Sun, Yanzhou Pan, Xiaodong Yu, Hao Wang, Jianwen Xie, Yiran Chen, Denghui Zhang, Zhaozhuo Xu
Abstract:
Federated Learning enables collaborative fine-tuning of Large Language Models (LLMs) across decentralized Non-Independent and Identically Distributed (Non-IID) clients, but such models' massive parameter sizes lead to significant memory and communication challenges. This work introduces Meerkat, a sparse zeroth-order optimization (ZO) method designed for federated LLM fine-tuning. By limiting fine-tuning to a transferable, static, extremely sparse subset of parameters, Meerkat achieves remarkable communication efficiency, enabling cost-effective high-frequency synchronization. With theoretical analysis and experiments, we show that this high-frequency communication effectively mitigates Non-IID data challenges and leads to superior performance compared to full-parameter ZO. Furthermore, experimental results show that Meerkat outperforms existing sparsity baselines with better performance at the same communication frequency. To further handle Non-IID drift, Meerkat leverages traceable local updates and forms a virtual path for each client. This virtual path mechanism reveals the GradIP phenomenon: the inner products between the LLM pre-training gradients maintained by the server and the client gradients estimated via ZO converge for extreme Non-IID clients but oscillate for IID ones. This distinct behavior provides a signal for identifying clients with extreme data heterogeneity. Using this signal, we propose Meerkat-vp, which analyzes GradIP trajectories to identify extreme Non-IID clients and applies early stopping to enhance aggregated model quality. Experiments confirm that Meerkat and Meerkat-vp significantly improve the efficiency and effectiveness of ZO federated LLM fine-tuning.
Chinese: 本研究提出了Meerkat稀疏零阶优化方法,通过仅更新极小的参数子集实现高效联邦大语言模型微调,有效解决通信成本和非独立同分布数据问题,且性能优于现有方法。
English: This work introduces Meerkat, a sparse zeroth-order optimization method that enables efficient federated fine-tuning of large language models by updating only a minimal subset of parameters, effectively addressing communication costs and Non-IID data challenges while outperforming existing approaches.
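The core primitive, a two-point zeroth-order gradient estimate restricted to a static sparse mask, fits in a few lines of numpy; the quadratic toy loss below stands in for a fine-tuning objective.

```python
import numpy as np

def sparse_zo_step(theta, mask_idx, loss_fn, lr=1e-2, mu=1e-3, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    z = np.zeros_like(theta)
    z[mask_idx] = rng.standard_normal(len(mask_idx))   # perturb masked coords only
    g = (loss_fn(theta + mu * z) - loss_fn(theta - mu * z)) / (2 * mu)
    theta[mask_idx] -= lr * g * z[mask_idx]            # directional estimate times direction
    return theta

rng = np.random.default_rng(0)
theta = rng.standard_normal(1000)
mask = np.arange(0, 1000, 100)                # a static ~1% trainable subset
loss = lambda t: float(np.sum(t[mask] ** 2))  # toy objective
for _ in range(200):
    theta = sparse_zo_step(theta, mask, loss, rng=rng)
print(loss(theta))   # decreases markedly from its initial value
```

Because clients only ever exchange the masked coordinates, and ZO needs no backward pass, synchronization can be both memory-light and frequent, which is what mitigates Non-IID drift.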
Authors:Qi Li, Runpeng Yu, Xinchao Wang
Abstract:
Multimodal large language models (MLLMs) demonstrate remarkable capabilities in handling complex multimodal tasks and are increasingly adopted in video understanding applications. However, their rapid advancement raises serious data privacy concerns, particularly given the potential inclusion of sensitive video content, such as personal recordings and surveillance footage, in their training datasets. Determining whether a given video was improperly used during training remains a critical and unresolved challenge. Despite considerable progress on membership inference attacks (MIAs) for text and image data in MLLMs, existing methods fail to generalize effectively to the video domain. These methods suffer from poor scalability as more frames are sampled and generally achieve negligible true positive rates at low false positive rates (TPR@Low FPR), mainly due to their failure to capture the inherent temporal variations of video frames and to account for model behavior differences as the number of frames varies. To address these challenges, we introduce Vid-SME, the first membership inference method tailored for video data used in video understanding LLMs (VULLMs). Vid-SME leverages the confidence of model output and integrates adaptive parameterization to compute Sharma-Mittal entropy (SME) for video inputs. By leveraging the SME difference between natural and temporally-reversed video frames, Vid-SME derives robust membership scores to determine whether a given video is part of the model's training set. Experiments on various self-trained and open-sourced VULLMs demonstrate the strong effectiveness of Vid-SME.
中文: 多模态大语言模型在视频处理中存在数据隐私风险,为此提出的Vid-SME方法通过计算原始与时间反转视频帧的沙玛-米塔尔熵差,能有效检测视频理解模型中训练数据的使用情况。
English: Multimodal large language models (MLLMs) face significant data privacy risks with video content, prompting the development of Vid-SME, a novel membership inference method that uses Sharma-Mittal entropy differences between natural and reversed video frames to effectively identify training data usage in video understanding models.
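Concretely, the Sharma-Mittal entropy of a distribution $p$ with parameters $q$ and $r$ is $H_{q,r}(p) = \frac{1}{1-r}\left[\left(\sum_i p_i^q\right)^{\frac{1-r}{1-q}} - 1\right]$. The sketch below scores a toy natural-versus-reversed confidence gap; the probability vectors are placeholders for model token confidences, and the sign convention for membership is an assumption.

```python
import numpy as np

def sharma_mittal(p, q=1.5, r=0.5):
    """Sharma-Mittal entropy of a probability vector (q, r != 1)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return (np.sum(p ** q) ** ((1 - r) / (1 - q)) - 1.0) / (1.0 - r)

p_natural  = [0.70, 0.10, 0.10, 0.10]   # sharper confidence on natural frame order
p_reversed = [0.30, 0.25, 0.25, 0.20]   # flatter confidence on reversed order
gap = sharma_mittal(p_reversed) - sharma_mittal(p_natural)
print(gap)   # a large gap is treated here as evidence of membership
```

The adaptive parameterization mentioned in the abstract adjusts such entropy parameters to the video input, which is what keeps the score stable as the number of sampled frames varies.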
Authors:Youshen Xiao, Yiling Shi, Ruixi Sun, Hongjiang Wei, Fei Gao, Yuyao Zhang
Abstract:
Dynamic Photoacoustic Computed Tomography (PACT) is an important imaging technique for monitoring physiological processes, capable of providing high-contrast images of optical absorption at much greater depths than traditional optical imaging methods. However, practical instrumentation and geometric constraints limit the number of acoustic sensors available around the imaging target, leading to sparsity in sensor data. Traditional photoacoustic (PA) image reconstruction methods, when directly applied to sparse PA data, produce severe artifacts. Additionally, these traditional methods do not consider the inter-frame relationships in dynamic imaging. Temporal resolution is crucial for dynamic photoacoustic imaging, which is fundamentally limited by the low repetition rate (e.g., 20 Hz) and high cost of high-power laser technology. Recently, Implicit Neural Representation (INR) has emerged as a powerful deep learning tool for solving inverse problems with sparse data, by characterizing signal properties as continuous functions of their coordinates in an unsupervised manner. In this work, we propose an INR-based method to improve dynamic photoacoustic image reconstruction from sparse views and enhance temporal resolution, using only spatiotemporal coordinates as input. Specifically, the proposed INR represents dynamic photoacoustic images as implicit functions and encodes them into a neural network. The weights of the network are learned solely from the acquired sparse sensor data, without the need for external training datasets or prior images. Benefiting from the strong implicit continuity regularization provided by INR, as well as explicit low-rank and sparsity regularization, our proposed method outperforms traditional reconstruction methods under two different sparsity conditions, effectively suppressing artifacts and ensuring image quality.
中文: 动态光声计算机断层扫描是一种深层成像技术,但存在传感器稀疏伪影和低时间分辨率问题,我们提出的隐式神经表示方法无需外部训练数据,可直接从稀疏数据重建高质量动态图像。
English: Dynamic PACT is a deep-penetration imaging technique that suffers from sparse sensor artifacts and low temporal resolution, but our proposed Implicit Neural Representation method reconstructs high-quality dynamic images directly from sparse data without external training.
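A minimal spatiotemporal INR along these lines maps an $(x, y, t)$ coordinate through random Fourier features and an MLP to an intensity value, so frames can be queried at arbitrary times; widths and frequency scales below are arbitrary choices, and in the actual method the weights are fit to the sparse sensor data through the acoustic forward model rather than to images directly.

```python
import torch
import torch.nn as nn

class SpatioTemporalINR(nn.Module):
    """Continuous (x, y, t) -> intensity field via Fourier features + MLP."""
    def __init__(self, n_freq=64, hidden=256):
        super().__init__()
        self.register_buffer("B", torch.randn(3, n_freq) * 10.0)  # random Fourier basis
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):                       # coords: (N, 3) in [0, 1]^3
        proj = 2 * torch.pi * coords @ self.B
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.mlp(feats)

inr = SpatioTemporalINR()
values = inr(torch.rand(1024, 3))   # query any spacetime points, at any frame rate
```

Continuity in $t$ is what buys the temporal-resolution gain: once fitted, the network can be sampled between laser pulses.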
Authors:Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
Abstract:
Spatial reasoning is a key aspect of cognitive psychology and remains a bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks cover only the most elementary layer of spatial reasoning and are largely approaching saturation in the latest reasoning models. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through careful manual annotation, we construct over 8.4K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs exhibit significant limitations in comprehensive spatial reasoning. We also explore two strategies, PointGraph (explicit scene graph cues) and SpatialCoT (novel-view chain-of-thought), to bolster spatial reasoning.
中文: OmniSpatial是一个基于认知心理学的综合性空间推理基准,涵盖四大类别并揭示当前视觉语言模型存在显著不足,同时提出了两种改进策略。
English: OmniSpatial is a comprehensive benchmark for spatial reasoning in vision-language models, covering four major categories and revealing significant limitations in current models despite proposed enhancement strategies.
Authors:Mingzhe Li, Gehao Zhang, Zhenting Wang, Shiqing Ma, Siqi Pan, Richard Cartwright, Juan Zhai
Abstract:
Text-to-image generation models (e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often fall short in image similarity. In this paper, we propose a prompt inversion technique for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on the widely-used datasets, such as MS COCO, LAION, and Flickr, show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.
中文: 本文提出了一种新的文本到图像扩散模型提示反演技术,通过图像描述初始化与潜在空间优化相结合,在多个数据集上实现了图像相似度、文本对齐度和可解释性方面的卓越性能。
English: This paper introduces a novel prompt inversion technique for text-to-image diffusion models that leverages image captioning initialization and latent space refinement to achieve superior performance in image similarity, textual alignment, and interpretability across multiple datasets.
Authors:Mohamed Djilani, Thibault Simonetto, Karim Tit, Florian Tambon, Paul Récamier, Salah Ghamizi, Maxime Cordy, Mike Papadakis
Abstract:
Recent tabular Foundational Models (FMs) such as TabPFN and TabICL leverage in-context learning to achieve strong performance without gradient updates or fine-tuning. However, their robustness to adversarial manipulation remains largely unexplored. In this work, we present a comprehensive study of the adversarial vulnerabilities of tabular FMs, focusing on both their fragility to targeted test-time attacks and their potential misuse as adversarial tools. We show, on three benchmarks in finance, cybersecurity, and healthcare, that small, structured perturbations to test inputs can significantly degrade prediction accuracy, even when the training context remains fixed. Additionally, we demonstrate that tabular FMs can be repurposed to generate transferable evasion attacks against conventional models such as random forests and XGBoost and, to a lesser extent, deep tabular models. To improve tabular FMs, we formulate the robustification problem as an optimization of either the weights (adversarial fine-tuning) or the context (adversarial in-context learning). We introduce an in-context adversarial training strategy that incrementally replaces the context with adversarially perturbed instances, without updating model weights. Our approach improves robustness across multiple tabular benchmarks. Together, these findings position tabular FMs as both a target and a source of adversarial threats, highlighting the urgent need for robust training and evaluation practices in this emerging paradigm.
中文: 最新的表格基础模型虽通过上下文学习表现出色,但易受对抗性攻击影响,导致预测准确性下降,甚至可被用于攻击传统模型,因此需采用如对抗性上下文学习等鲁棒训练方法来提升其安全性。
English: Recent tabular foundational models demonstrate strong performance through in-context learning but are vulnerable to adversarial attacks that degrade accuracy and can be repurposed to attack conventional models, necessitating robust training methods like adversarial in-context learning to enhance their security.
Authors:Shu Yang, Fengtao Zhou, Leon Mayer, Fuxiang Huang, Yiliang Chen, Yihui Wang, Sunan He, Yuxiang Nie, Xi Wang, Ömer Sümer, Yueming Jin, Huihui Sun, Shuchang Xu, Alex Qinyang Liu, Zheng Li, Jing Qin, Jeremy YuenChun Teoh, Lena Maier-Hein, Hao Chen
Abstract:
Computer-Assisted Intervention (CAI) has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making, improving procedural efficacy, and ensuring intraoperative safety. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3,650 videos and approximately 3.55 million frames, spanning more than 20 surgical procedures and over 10 anatomical structures. Building upon this dataset, we propose SurgVISTA (Surgical Video-level Spatial-Temporal Architecture), a reconstruction-based pre-training method that captures intricate spatial structures and temporal dynamics through joint spatiotemporal modeling. Additionally, SurgVISTA incorporates image-level knowledge distillation guided by a surgery-specific expert to enhance the learning of fine-grained anatomical and semantic features. To validate its effectiveness, we established a comprehensive benchmark comprising 13 video-level datasets spanning six surgical procedures across four tasks. Extensive experiments demonstrate that SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models, demonstrating strong potential to advance intelligent surgical systems in clinically meaningful scenarios.
中文: 本研究提出首个视频级手术预训练框架SurgVISTA,通过联合时空表征学习从大规模手术视频数据中捕获动态特征,在多项手术任务中持续超越现有模型。
English: This work introduces SurgVISTA, the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data, consistently outperforming existing models across multiple surgical tasks.
Authors:Zhengdong Lu, Weikai Lu, Yiling Tao, Yun Dai, ZiXuan Chen, Huiping Zhuang, Cen Chen, Hao Peng, Ziqian Zeng
Abstract:
Despite significant advances in Large Language Models (LLMs), planning tasks still present challenges for LLM-based agents. Existing planning methods face two key limitations: heavy constraints and cascading errors. To address these limitations, we propose a novel parallel planning paradigm, which Decomposes, Plans for subtasks in Parallel, and Merges subplans into a final plan (DPPM). Specifically, DPPM decomposes the complex task based on constraints into subtasks, generates the subplan for each subtask in parallel, and merges them into a global plan. In addition, our approach incorporates a verification and refinement module, enabling error correction and conflict resolution. Experimental results demonstrate that DPPM significantly outperforms existing methods in travel planning tasks.
中文: 提出的DPPM范式通过并行分解任务、验证合并子计划来解决大语言模型的规划难题,在旅行规划任务中显著优于现有方法。
English: The proposed DPPM paradigm addresses planning challenges for LLMs by decomposing tasks into parallel subtasks, merging subplans with verification, and significantly outperforming existing methods in travel planning.
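Schematically, the decompose/plan-in-parallel/merge flow looks like the following; `llm` is a hypothetical callable and the verification-and-refinement module is reduced to a single stub pass.

```python
from concurrent.futures import ThreadPoolExecutor

def dppm(task: str, llm) -> str:
    """Sketch of a DPPM-style loop: decompose, plan subtasks in parallel, merge."""
    subtasks = llm(f"Decompose into independent subtasks:\n{task}").splitlines()
    with ThreadPoolExecutor() as pool:   # subplans do not wait on each other
        subplans = list(pool.map(lambda s: llm(f"Plan for subtask: {s}"), subtasks))
    plan = llm("Merge these subplans into one global plan:\n" + "\n".join(subplans))
    issues = llm(f"List conflicts or errors in this plan, or reply 'OK':\n{plan}")
    if issues.strip() != "OK":           # verification and refinement
        plan = llm(f"Refine the plan to fix these issues:\n{issues}\n\nPlan:\n{plan}")
    return plan
```

Planning subtasks independently is what breaks the cascade: an error in one subplan no longer corrupts the others, and the merge step resolves cross-subtask constraints in one place.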
Authors:Hyungjoo Chae, Dongjin Kang, Jihyuk Kim, Beong-woo Kwak, Sunghyun Park, Haeju Park, Jinyoung Yeo, Moontae Lee, Kyungjae Lee
Abstract:
With the release of R1, a publicly available large reasoning model (LRM), researchers commonly build new LRMs by training language models on R1's long chain-of-thought (CoT) inferences. While prior works show that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on the existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer and introducing controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to--or slightly below--R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills, but also provides a strong foundation for reinforcement learning--models initialized on our data achieve 2-3x larger gains with RLVR.
中文: 研究人员推出了长链思维数据集,利用短链思维大语言模型构建,旨在实现大型推理模型的独立开发,其质量接近R1,并能显著提升推理能力和强化学习效果。
English: Researchers introduce the Long CoT Collection, a dataset created using short chain-of-thought LLMs to enable independent development of large reasoning models, achieving quality comparable to R1 and enhancing reasoning skills and reinforcement learning outcomes.
Authors:Zeng Wang, Minghao Shao, Rupesh Karn, Likhitha Mankali, Jitendra Bhandari, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel
Abstract:
Large Language Models (LLMs) offer transformative capabilities for hardware design automation, particularly in Verilog code generation. However, they also pose significant data security challenges, including Verilog evaluation data contamination, intellectual property (IP) design leakage, and the risk of malicious Verilog generation. We introduce SALAD, a comprehensive assessment that leverages machine unlearning to mitigate these threats. Our approach enables the selective removal of contaminated benchmarks, sensitive IP and design artifacts, or malicious code patterns from pre-trained LLMs, all without requiring full retraining. Through detailed case studies, we demonstrate how machine unlearning techniques effectively reduce data security risks in LLM-aided hardware design.
中文: SALAD利用机器遗忘技术,在不需完全重新训练的情况下,有选择地清除硬件设计中大语言模型的数据污染、知识产权泄露和恶意代码,从而提升安全性。
English: SALAD employs machine unlearning to selectively eliminate data contamination, IP leakage, and malicious code from LLMs in hardware design, enhancing security without full retraining.
Authors:Zeng Wang, Minghao Shao, Rupesh Karn, Likhitha Mankali, Jitendra Bhandari, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel
Abstract:
Large Language Models (LLMs) offer transformative capabilities for hardware design automation, particularly in Verilog code generation. However, they also pose significant data security challenges, including Verilog evaluation data contamination, intellectual property (IP) design leakage, and the risk of malicious Verilog generation. We introduce SALAD, a comprehensive assessment that leverages machine unlearning to mitigate these threats. Our approach enables the selective removal of contaminated benchmarks, sensitive IP and design artifacts, or malicious code patterns from pre-trained LLMs, all without requiring full retraining. Through detailed case studies, we demonstrate how machine unlearning techniques effectively reduce data security risks in LLM-aided hardware design.
中文: SALAD利用机器遗忘技术,在不需完全重新训练的情况下,有选择地清除硬件设计中大语言模型的数据污染、知识产权泄露和恶意代码,从而提升安全性。
English: SALAD employs machine unlearning to selectively eliminate data contamination, IP leakage, and malicious code from LLMs in hardware design, enhancing security without full retraining.
Authors:Tim Woydt, Moritz Willig, Antonia Wüst, Lukas Helff, Wolfgang Stammer, Constantin A. Rothkopf, Kristian Kersting
Abstract:
Strong meta-learning capabilities for systematic compositionality are emerging as an important skill for navigating the complex and changing tasks of today's world. However, in presenting models for robust adaptation to novel environments, it is important to refrain from making unsupported claims about the performance of meta-learning systems that ultimately do not stand up to scrutiny. While Fodor and Pylyshyn famously posited that neural networks inherently lack this capacity as they are unable to model compositional representations or structure-sensitive operations, and thus are not a viable model of the human mind, Lake and Baroni recently presented meta-learning as a pathway to compositionality. In this position paper, we critically revisit this claim and highlight limitations in the proposed meta-learning framework for compositionality. Our analysis shows that modern neural meta-learning systems can only perform such tasks, if at all, under a very narrow and restricted definition of a meta-learning setup. We therefore claim that `Fodor and Pylyshyn's legacy' persists, and to date, there is no human-like systematic compositionality learned in neural networks.
中文: 元学习虽被提出作为实现神经网络系统组合性的途径,但现有系统仅在高度受限条件下才能执行此类任务,尚未实现人类水平的组合性能力,因此福多和派利夏恩的批评依然成立。
English: Meta-learning shows potential for achieving systematic compositionality in neural networks, but current systems only succeed under highly constrained conditions, failing to demonstrate human-like capabilities as originally criticized by Fodor and Pylyshyn.
Authors:Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, Limin Wang
Abstract:
While recent advances in reinforcement learning have significantly enhanced reasoning capabilities in large language models (LLMs), these techniques remain underexplored in multi-modal LLMs for video captioning. This paper presents the first systematic investigation of GRPO-based RL post-training for video MLLMs, with the goal of enhancing video MLLMs' capability of describing actions in videos. Specifically, we develop VideoCap-R1, which is prompted to first perform structured thinking that analyzes video subjects with their attributes and actions before generating complete captions, supported by two specialized reward mechanisms: an LLM-free think scorer evaluating the structured thinking quality and an LLM-assisted caption scorer assessing the output quality. The RL training framework effectively establishes the connection between structured reasoning and comprehensive description generation, enabling the model to produce captions with more accurate actions. Our experiments demonstrate that VideoCap-R1 achieves substantial improvements over the Qwen2VL-7B baseline using limited samples (1.5k) across multiple video caption benchmarks (DREAM1K: +4.4 event F1, VDC: +4.2 Acc, CAREBENCH: +3.1 action F1, +6.9 object F1) while consistently outperforming the SFT-trained counterparts, confirming GRPO's superiority in enhancing MLLMs' captioning capabilities.
中文: 本文首次系统地将GRPO强化学习应用于视频多模态大语言模型的字幕生成,通过结构化思维与双重奖励机制,在多个基准测试中显著提升了动作和物体描述的准确性。
English: This paper introduces the first systematic application of GRPO-based reinforcement learning to enhance video captioning in multimodal large language models, demonstrating significant improvements in action and object description accuracy across benchmarks.
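The group-relative part of GRPO reduces to standardizing rewards within a group of rollouts sampled for the same video; the reward values below are placeholders for the combined think-scorer and caption-scorer signals.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (the GRPO baseline)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g., four sampled captions for one clip, scored by the two reward models
print(group_relative_advantages([0.9, 0.4, 0.6, 0.7]))
```

Because the baseline is the group mean rather than a learned value function, no critic needs to be trained, which suits small sample budgets like the 1.5k examples used here.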
Authors:Bobo Li, Yuheng Wang, Hao Fei, Juncheng Li, Wei Ji, Mong-Li Lee, Wynne Hsu
Abstract:
Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click", existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.
中文摘要:该摘要介绍了FormFactory,一个新的基准测试套件,旨在解决在线表单自动填写的挑战,当前多模态大语言模型在布局推理和字段对齐方面表现不佳,评估准确率不足5%。
English Summary: The abstract introduces FormFactory, a new benchmarking suite designed to address the challenges of automating online form filling, where current multimodal large language models struggle with layout reasoning and field alignment, achieving less than 5% accuracy in evaluations.
Authors:Shuzhou Yuan, Ercong Nie, Lukas Kouba, Ashish Yashwanth Kangen, Helmut Schmid, Hinrich Schütze, Michael Färber
Abstract:
Detoxification, the task of rewriting harmful language into non-toxic text, has become increasingly important amid the growing prevalence of toxic content online. However, high-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. In this paper, we propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We first replicate the ParaDetox pipeline by replacing human annotators with an LLM and show that the LLM performs comparably to human annotation. Building on this, we construct ParaDeHate, a large-scale parallel dataset specifically for hate speech detoxification. We release ParaDeHate as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on ParaDeHate, achieve better performance in style accuracy, content preservation, and fluency, demonstrating the effectiveness of LLM-generated detoxification text as a scalable alternative to human annotation.
中文摘要:本文提出ParaDeHate这一大规模仇恨言论净化平行数据集,通过采用GPT-4o-mini的LLM循环流水线构建,作为人工标注的可扩展替代方案,使BART等模型在风格准确性、内容保留和流畅性方面实现更优性能。
English Summary: This paper introduces ParaDeHate, a large-scale parallel dataset for hate speech detoxification created using an LLM-in-the-loop pipeline with GPT-4o-mini, which serves as a scalable alternative to human annotation and enables models like BART to achieve superior performance in style accuracy, content preservation, and fluency.
Authors:Juan Pablo Bertucci, Sudarshan Raghuraman, Mauro Salazar, Theo Hofman
Abstract:
The major challenges to battery electric truck adoption are their high cost and grid congestion.In this context, stationary energy storage systems can help mitigate both issues. Since their design and operation are strongly coupled, to make the best out of them, they should be jointly optimized. This paper presents a co-design framework for hybrid energy storage systems where their technology and sizing are optimized jointly with their operational strategies. Specifically, we consider a microgrid supporting truck chargers that consists of utility grid, solar panels, and energy storage systems including batteries, supercapacitors and flywheels. We frame the co-design problem as a mixed-integer linear program that can be solved with global optimality guarantees. We showcase our framework in a case-study of a distribution center in the Netherlands. Our results show that although the battery-only configuration is already competitive, adding supercapacitors or flywheel storage decrease total cost and increase energy sold back to the grid. Overall, the fully hybrid solution (Battery+Supercapacitors+Flywheel) offers the best outcomes, achieving the lowest overall cost (1.96\% lower compared to battery-only) and reduced grid dependency, but at a higher (2.6\%) initial investment.
中文摘要:本文提出混合储能系统的协同设计框架,通过联合优化技术选型、容量配置与运行策略来降低电动卡车充电微电网的成本和电网依赖,其中完全混合方案虽需更高初始投资但综合表现最佳。
English Summary: This paper introduces a co-design framework for hybrid energy storage systems that jointly optimizes technology selection, sizing, and operational strategies to reduce costs and grid dependency for electric truck charging microgrids, with the fully hybrid solution showing the best overall performance despite higher initial investment.
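A tiny caricature of the co-design MILP, assuming the `pulp` package is installed: a binary install decision and a continuous storage capacity are optimized jointly with dispatch over a few representative hours. All costs and profiles are invented placeholders; a real model would add efficiencies, multiple technologies, and a full horizon.

```python
import pulp

T = range(4)                      # four representative hours
demand = [50, 120, 200, 80]       # truck-charger load (kW)
solar  = [80, 100, 40, 0]         # PV output (kW)

m = pulp.LpProblem("hess_codesign", pulp.LpMinimize)
inst = pulp.LpVariable("install_battery", cat="Binary")
cap  = pulp.LpVariable("battery_kwh", lowBound=0)
grid = pulp.LpVariable.dicts("grid_kw", T, lowBound=0)
ch   = pulp.LpVariable.dicts("charge_kw", T, lowBound=0)
dis  = pulp.LpVariable.dicts("discharge_kw", T, lowBound=0)
soc  = pulp.LpVariable.dicts("soc_kwh", T, lowBound=0)

for t in T:
    m += grid[t] + solar[t] + dis[t] == demand[t] + ch[t]          # power balance
    m += soc[t] <= cap                                             # sizing couples operation
    m += soc[t] == (soc[t - 1] if t > 0 else 0) + ch[t] - dis[t]   # storage dynamics
m += cap <= 500 * inst                                             # capacity only if installed
m += 2.0 * inst + 0.1 * cap + pulp.lpSum(0.2 * grid[t] for t in T) # fixed + amortized capex + energy

m.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(cap), [pulp.value(grid[t]) for t in T])
```

The single solve is the point: because sizing (`inst`, `cap`) and operation (`grid`, `ch`, `dis`, `soc`) sit in one mixed-integer linear program, the solver returns a jointly optimal design with a global optimality certificate.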
Authors:Nie Lin, Yansen Wang, Dongqi Han, Weibang Jiang, Jingyuan Li, Ryosuke Furuta, Yoichi Sato, Dongsheng Li
Abstract:
The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. Notably, the rise of multimodal AI models has brought new possibilities that have never been imagined before. Here, we present EgoBrain -- the world's first large-scale, temporally aligned multimodal dataset that synchronizes egocentric vision and EEG of the human brain over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a multimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and cross-environment challenges, achieving an action recognition accuracy of 66.70%. EgoBrain paves the way for a unified framework for brain-computer interface with multiple modalities. All data, tools, and acquisition protocols are openly shared to foster open science in cognitive computing.
中文: EgoBrain推出了首个同步脑电图与第一人称视觉的大规模多模态数据集,通过多模态学习框架实现了66.70%的行为识别准确率,为人类行为分析开辟了新范式。
English: EgoBrain introduces the first large-scale multimodal dataset synchronizing EEG and egocentric vision, enabling a novel framework for human behavior analysis with a 66.70% action recognition accuracy through multimodal learning.
Authors:Yiyang Zhou, Yangfan He, Yaofeng Su, Siwei Han, Joel Jang, Gedas Bertasius, Mohit Bansal, Huaxiu Yao
Abstract:
Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism (adjusting predictions from conservative, neutral, and aggressive viewpoints), but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications (video understanding, video reasoning enhancement, and vision-language-action model alignment) demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.
Chinese: ReAgent-V是一种新型代理视频理解框架,通过实时奖励生成和多视角迭代优化显著提升推理能力,在多项应用中实现了性能突破。
English: ReAgent-V is a novel agentic video understanding framework that enhances reasoning through real-time reward generation and iterative answer refinement, achieving significant performance gains across multiple applications.
Authors:Brian Hu Zhang, Tuomas Sandholm
Abstract:
Since the advent of AI, games have served as progress benchmarks. Meanwhile, imperfect-information variants of chess have existed for over a century, present extreme challenges, and have been the focus of significant AI research. Beyond the calculation needed in regular chess, they require reasoning about information gathering, the opponent's knowledge, signaling, etc. The most popular variant, Fog of War (FoW) chess (a.k.a. dark chess), became a recognized challenge problem in AI after superhuman performance was reached in no-limit Texas hold'em poker. We present Obscuro, the first superhuman AI for FoW chess. It introduces advances to search in imperfect-information games, enabling strong, scalable reasoning. Experiments against the prior state-of-the-art AI and human players -- including the world's best -- show that Obscuro is significantly stronger. FoW chess is the largest (by amount of imperfect information) turn-based game in which superhuman performance has been achieved and the largest game in which imperfect-information search has been successfully applied.
中文: Obscuro是首个在战争迷雾象棋中实现超人类水平的人工智能,通过先进的搜索技术实现了强大的推理能力,显著超越了现有最佳AI和顶尖人类棋手。
English: Obscuro is the first superhuman AI for Fog of War chess, introducing advanced search techniques that enable strong reasoning and outperform both prior AI systems and top human players.
Authors:Asım Ersoy, Basel Mousi, Shammur Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani
Abstract:
The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts -- showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities: Do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility, we make scripts and other resources available to the community.
中文: 本研究通过无监督分析方法,探究语音和多模态模型是否像大型语言模型一样能够形成语义概念,并比较不同训练模式下概念结构的异同。
English: This study investigates whether speech and multimodal models develop semantic concepts similar to large language models, using unsupervised analysis to explore conceptual structures across different training modalities.
Authors:Zitao Chen, Yinjun Jia, Zitong Tian, Wei-Ying Ma, Yanyan Lan
Abstract:
Medicinal chemists often optimize drugs by considering their 3D structures and designing structurally distinct molecules that retain key features, such as shapes, pharmacophores, or chemical properties. Previous deep learning approaches address this through supervised tasks like molecule inpainting or property-guided optimization. In this work, we propose a flexible zero-shot molecule manipulation method by navigating in a shared latent space of 3D molecules. We introduce a Variational AutoEncoder (VAE) for 3D molecules, named MolFLAE, which learns a fixed-dimensional, SE(3)-equivariant latent space independent of atom counts. MolFLAE encodes 3D molecules using an SE(3)-equivariant neural network into a fixed number of latent nodes, distinguished by learned embeddings. The latent space is regularized, and molecular structures are reconstructed via a Bayesian Flow Network (BFN) conditioned on the encoder's latent output. MolFLAE achieves competitive performance on standard unconditional 3D molecule generation benchmarks. Moreover, the latent space of MolFLAE enables zero-shot molecule manipulation, including atom number editing, structure reconstruction, and coordinated latent interpolation for both structure and properties. We further demonstrate our approach on a drug optimization task for the human glucocorticoid receptor, generating molecules with improved hydrophilicity while preserving key interactions, under computational evaluations. These results highlight the flexibility, robustness, and real-world utility of our method, opening new avenues for molecule editing and optimization.
中文: 本研究提出的MolFLAE方法通过变分自编码器构建等变潜在空间,实现无需重新训练的零样本分子编辑与性质优化,在药物优化任务中展现出实际应用价值。
English: This study introduces MolFLAE, a zero-shot 3D molecule manipulation method that uses a variational autoencoder to create an equivariant latent space, enabling structure editing and property optimization without retraining, as demonstrated in drug optimization tasks.
Authors:Mengke Li, Zhikai Hu, Yang Lu, Weichao Lan, Yiu-ming Cheung, Hui Huang
Abstract:
The imbalanced distribution of long-tailed data presents a significant challenge for deep learning models, causing them to prioritize head classes while neglecting tail classes. Two key factors contributing to low recognition accuracy are the deformed representation space and a biased classifier, stemming from insufficient semantic information in tail classes. To address these issues, we propose permutation-invariant and head-to-tail feature fusion (PI-H2T), a highly adaptable method. PI-H2T enhances the representation space through permutation-invariant representation fusion (PIF), yielding more clustered features and automatic class margins. Additionally, it adjusts the biased classifier by transferring semantic information from head to tail classes via head-to-tail fusion (H2TF), improving tail class diversity. Theoretical analysis and experiments show that PI-H2T optimizes both the representation space and decision boundaries. Its plug-and-play design ensures seamless integration into existing methods, providing a straightforward path to further performance improvements. Extensive experiments on long-tailed benchmarks confirm the effectiveness of PI-H2T.
中文: 提出的PI-H2T方法通过置换不变表示融合增强特征空间,并利用头尾特征融合传递语义信息,有效解决了长尾数据分布不均导致模型偏向头部类别的问题,显著提升了整体识别性能。
English: The proposed PI-H2T method addresses long-tailed data imbalance by enhancing feature representation through permutation-invariant fusion and transferring semantic information from head to tail classes, effectively optimizing both representation space and classifier performance.
Authors:Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Guojun Yin, Wei Lin, Qichao Zhang, Dongbin Zhao
Abstract:
Ensembling large language models (LLMs) can effectively combine diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting strategies that fail to adapt to the dynamic, context-dependent characteristics of LLM capabilities. In this work, we propose Reinforcement Learning-Assisted Ensemble for LLMs (RLAE), a novel framework that reformulates LLM ensemble through the lens of a Markov Decision Process (MDP). Our approach introduces a RL agent that dynamically adjusts ensemble weights by considering both input context and intermediate generation states, with the agent being trained using rewards that directly correspond to the quality of final outputs. We implement RLAE using both single-agent and multi-agent reinforcement learning algorithms ($\text{RLAE}_\text{PPO}$ and $\text{RLAE}_\text{MAPPO}$ ), demonstrating substantial improvements over conventional ensemble methods. Extensive evaluations on a diverse set of tasks show that RLAE outperforms existing approaches by up to $3.3\%$ accuracy points, offering a more effective framework for LLM ensembling. Furthermore, our method exhibits superior generalization capabilities across different tasks without the need for retraining, while simultaneously achieving lower time latency.
中文: 提出的强化学习辅助集成框架通过强化学习动态调整集成权重,在多项任务中比传统方法准确率最高提升3.3%,同时保持更低延迟和更强泛化能力。
English: The proposed Reinforcement Learning-Assisted Ensemble (RLAE) framework dynamically adjusts ensemble weights through reinforcement learning, significantly outperforming conventional methods by up to 3.3% accuracy while maintaining lower latency and better generalization across tasks.
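A minimal sketch of the core idea, context-dependent mixing of member-model distributions, is given below. The policy network, state featurization, and sizes are assumptions; the PPO/MAPPO training loop that RLAE uses to fit the policy is omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch of context-dependent LLM ensembling in the spirit of RLAE:
# a small policy network maps the current generation state to ensemble
# weights, and member-model token distributions are mixed accordingly.
# All names and sizes are illustrative assumptions.
class EnsemblePolicy(nn.Module):
    def __init__(self, state_dim=64, num_models=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, num_models)
        )

    def forward(self, state):
        # Softmax keeps the per-step ensemble weights on the simplex.
        return torch.softmax(self.net(state), dim=-1)

def mixed_next_token_probs(member_logits, weights):
    # member_logits: (num_models, vocab); weights: (num_models,)
    probs = torch.softmax(member_logits, dim=-1)
    return (weights.unsqueeze(-1) * probs).sum(dim=0)

policy = EnsemblePolicy()
state = torch.randn(64)                # summary of input + partial output
member_logits = torch.randn(3, 32000)  # one row per ensemble member
w = policy(state)
p = mixed_next_token_probs(member_logits, w)
print(w, p.sum())  # weights on the simplex; mixture sums to 1
```

In training, the reward attached to these per-step weighting decisions would come from the quality of the final output, as the abstract describes.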
Authors:Robin Inho Kee, Taehyeun Kim, Anouck Girard, Ilya Kolmanovsky
Abstract:
This paper introduces a Time Shift Governor (TSG)-guided Model Predictive Controller with Control Barrier Function (CBF)-based constraints for adaptive cruise control (ACC). The MPC-CBF approach handles obstacle-free curved-road tracking, while following-distance and obstacle-avoidance constraints are enforced using standard CBFs and relaxed Collision Cone CBFs. To address scenarios involving rapidly moving obstacles or rapid changes in the leading vehicle's behavior, the TSG augmentation is employed, altering the target reference to enforce constraints. Simulation results demonstrate the effectiveness of the TSG-guided MPC-CBF approach.
中文: 本文提出了一种基于时间偏移引导器的模型预测控制方法,结合控制屏障函数实现自适应巡航控制,通过动态调整参考目标有效处理道路跟踪与障碍物避让问题。
English: This paper presents a Time Shift Governor-enhanced Model Predictive Controller with Control Barrier Functions for adaptive cruise control, effectively handling road tracking and obstacle avoidance while adapting to dynamic scenarios through reference adjustment.
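The following sketch illustrates the kind of discrete-time CBF constraint such an MPC would enforce for the following-distance requirement. The headway barrier, constants, and decay rate gamma are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a discrete-time control barrier function (CBF) constraint
# for ACC following distance. The dynamics, headway constants, and gamma are
# illustrative assumptions, not the paper's exact design.
def headway_barrier(gap, v_ego, tau=1.5, d0=5.0):
    # h >= 0 encodes "gap exceeds a time-headway safety margin".
    return gap - tau * v_ego - d0

def cbf_constraint_ok(gap_k, v_k, gap_next, v_next, gamma=0.3):
    # Discrete-time CBF condition: h_{k+1} >= (1 - gamma) * h_k keeps the
    # safe set forward-invariant; an MPC adds this as a per-step constraint.
    h_k = headway_barrier(gap_k, v_k)
    h_next = headway_barrier(gap_next, v_next)
    return h_next >= (1.0 - gamma) * h_k

# Candidate one-step prediction from the MPC: decelerating slightly while
# the gap shrinks still satisfies the barrier decay condition.
print(cbf_constraint_ok(gap_k=40.0, v_k=20.0, gap_next=38.5, v_next=19.5))  # True
```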
Authors:Yuta Sato, Kazuhiko Kawamoto, Hiroshi Kera
Abstract:
Chain of thought, the process of step-by-step reasoning, is fundamental to how Transformers solve complex tasks. Beyond which intermediate steps are taken, the order of these steps critically affects the difficulty of the reasoning. This study addresses a novel task of unraveling chain of thought: reordering decoder input tokens into a learning-friendly sequence for Transformers to learn arithmetic tasks. The proposed pipeline first trains a Transformer on a mixture of target sequences arranged in different orders and then identifies benign orders as those with fast loss drops in the early stage of training. As the search space grows factorially with sequence length, we propose a two-stage hierarchical approach for inter- and intra-block reordering. Experiments on four order-sensitive arithmetic tasks show that our method identifies a learning-friendly order out of a few billion candidates. Notably, on the multiplication task, it recovered the reverse-digit order reported in prior studies.
中文摘要:本研究提出了一种通过早期损失分析和分层重排序来优化Transformer推理步骤顺序的方法,成功发现了如反向数字顺序等高效学习序列,显著提升了算术任务的学习效果。
English Summary: This study introduces a method to optimize the order of reasoning steps in Transformers by identifying learning-friendly sequences through early loss analysis and hierarchical reordering, successfully discovering efficient orders like reverse-digit for arithmetic tasks.
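Below is a minimal sketch of the two-stage hierarchical search. The scoring function is a deterministic stand-in for the paper's criterion (early-stage training loss of a Transformer on that order), kept trivial so the example runs instantly.

```python
import itertools

# Minimal sketch of the two-stage (inter-block, then intra-block) order
# search described above. `early_loss_proxy` is a stand-in: in the paper's
# pipeline it would be the loss after a few early training steps.
tokens = list("abcdefgh")
blocks = [tokens[i:i + 2] for i in range(0, len(tokens), 2)]  # 4 blocks of 2

def early_loss_proxy(order):
    # Stand-in for "train briefly, read off the loss"; deterministic toy score.
    return sum((i * ord(t)) % 97 for i, t in enumerate(order)) / len(order)

# Stage 1: search over block orderings (4! = 24 instead of 8! = 40320).
best_blocks = min(itertools.permutations(blocks),
                  key=lambda bs: early_loss_proxy([t for b in bs for t in b]))

# Stage 2: refine the order within each block independently.
best_order = []
for block in best_blocks:
    best_order += min(itertools.permutations(block),
                      key=lambda b: early_loss_proxy(best_order + list(b)))

print("".join(best_order))
```

The hierarchical split is what keeps the search tractable: factorial cost is paid only over blocks and within each small block, not over the full sequence.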
Authors:Bartlomiej Sobieski, Matthew Tivnan, Yuang Wang, Siyeop Yoon, Pengfei Jin, Dufan Wu, Quanzheng Li, Przemyslaw Biecek
Abstract:
Solving inverse problems -- recovering signals from incomplete or noisy measurements -- is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic processes conditioned on paired clean and corrupted data. While the former typically assume knowledge of the measurement model, the latter have largely overlooked this structural information. We introduce System embedded Diffusion Bridge Models (SDBs), a new class of supervised bridge methods that explicitly embed the known linear measurement system into the coefficients of a matrix-valued SDE. This principled integration yields consistent improvements across diverse linear inverse problems and demonstrates robust generalization under system misspecification between training and deployment, offering a promising solution to real-world applications.
中文摘要:本文提出系统嵌入扩散桥模型(SDBs),通过将已知线性测量系统嵌入随机微分方程系数的新型监督方法,在各类逆问题中实现稳定改进,并在系统失配情况下展现强大泛化能力。
English Summary: The paper introduces System embedded Diffusion Bridge Models (SDBs), a novel supervised approach that integrates known linear measurement systems into stochastic differential equations, achieving consistent improvements and robust generalization across diverse inverse problems.
Authors:Wei Zhou, Ji Sun, Xuanhe Zhou, Guoliang Li, Luyang Liu, Hao Wu, Tianyuan Wang
Abstract:
In the financial industry, data is the lifeblood of operations, and DBAs shoulder significant responsibilities for SQL tuning, database deployment, diagnosis, and service repair. In recent years, both database vendors and customers have increasingly turned to autonomous database platforms in an effort to alleviate the heavy workload of DBAs. However, existing autonomous database platforms are limited in their capabilities, primarily addressing single-point issues such as NL2SQL, anomaly detection, and SQL tuning. Manual intervention remains a necessity for comprehensive database maintenance. GaussMaster aims to revolutionize this landscape by introducing an LLM-based database copilot system. This innovative solution is designed not only to assist developers in writing efficient SQL queries but also to provide comprehensive care for database services. When database instances exhibit abnormal behavior, GaussMaster is capable of orchestrating the entire maintenance process automatically. It achieves this by analyzing hundreds of metrics and logs, employing a Tree-of-thought approach to identify root causes, and invoking appropriate tools to resolve issues. We have successfully implemented GaussMaster in real-world scenarios, such as the banking industry, where it has achieved zero human intervention for over 34 database maintenance scenarios. In this paper, we present significant improvements in these tasks with code at https://gitcode.com/opengauss/openGauss-GaussMaster.
Chinese: 金融行业高度依赖数据,现有自治数据库平台虽旨在减轻DBA负担,但多局限于处理单点问题,仍需人工干预;GaussMaster推出基于大语言模型的数据库协作系统,通过分析指标、定位根因并解决问题,实现全自动维护,在银行等实际场景中已达成34个以上维护场景的零人工介入。
English: The financial industry relies heavily on data, and while autonomous database platforms aim to reduce DBA workloads, they often address only isolated issues, necessitating manual intervention; GaussMaster introduces an LLM-based copilot system that automates comprehensive database maintenance by analyzing metrics, identifying root causes, and resolving issues, achieving zero human intervention in over 34 scenarios in real-world applications like banking.
Authors:Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Merouane Debbah
Abstract:
Autonomous AI agents powered by large language models (LLMs) with structured function-calling interfaces have dramatically expanded capabilities for real-time data retrieval, complex computation, and multi-step orchestration. Yet, the explosive proliferation of plugins, connectors, and inter-agent protocols has outpaced discovery mechanisms and security practices, resulting in brittle integrations vulnerable to diverse threats. In this survey, we introduce the first unified, end-to-end threat model for LLM-agent ecosystems, spanning host-to-tool and agent-to-agent communications, formalize adversary capabilities and attacker objectives, and catalog over thirty attack techniques. Specifically, we organized the threat model into four domains: Input Manipulation (e.g., prompt injections, long-context hijacks, multimodal adversarial inputs), Model Compromise (e.g., prompt- and parameter-level backdoors, composite and encrypted multi-backdoors, poisoning strategies), System and Privacy Attacks (e.g., speculative side-channels, membership inference, retrieval poisoning, social-engineering simulations), and Protocol Vulnerabilities (e.g., exploits in Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent Network Protocol (ANP), and Agent-to-Agent (A2A) protocol). For each category, we review representative scenarios, assess real-world feasibility, and evaluate existing defenses. Building on our threat taxonomy, we identify key open challenges and future research directions, such as securing MCP deployments through dynamic trust management and cryptographic provenance tracking; designing and hardening Agentic Web Interfaces; and achieving resilience in multi-agent and federated environments. Our work provides a comprehensive reference to guide the design of robust defense mechanisms and establish best practices for resilient LLM-agent workflows.
中文: 本调查首次提出大语言模型智能体生态系统的统一威胁模型,系统归类四大攻击领域的三十余种威胁,并指明关键安全挑战以指导未来防御机制设计。
English: This survey presents the first unified threat model for LLM-agent ecosystems, categorizing over thirty attacks across four domains and identifying key security challenges to guide future defense mechanisms.
Authors:Dequan Kong, Zhe Zhu, Honghua Chen, Mingqiang Wei
Abstract:
Existing diffusion-based 3D shape completion methods typically use a conditional paradigm, injecting incomplete shape information into the denoising network via deep feature interactions (e.g., concatenation, cross-attention) to guide sampling toward complete shapes, often represented by voxel-based distance functions. However, these approaches fail to explicitly model the optimal global transport path, leading to suboptimal completions. Moreover, performing diffusion directly in voxel space imposes resolution constraints, limiting the generation of fine-grained geometric details. To address these challenges, we propose BridgeShape, a novel framework for 3D shape completion via latent diffusion Schrödinger bridge. The key innovations lie in two aspects: (i) BridgeShape formulates shape completion as an optimal transport problem, explicitly modeling the transition between incomplete and complete shapes to ensure a globally coherent transformation. (ii) We introduce a Depth-Enhanced Vector Quantized Variational Autoencoder (VQ-VAE) to encode 3D shapes into a compact latent space, leveraging self-projected multi-view depth information enriched with strong DINOv2 features to enhance geometric structural perception. By operating in a compact yet structurally informative latent space, BridgeShape effectively mitigates resolution constraints and enables more efficient and high-fidelity 3D shape completion. BridgeShape achieves state-of-the-art performance on large-scale 3D shape completion benchmarks, demonstrating superior fidelity at higher resolutions and for unseen object classes.
中文: 现有基于条件扩散的3D形状补全方法因未显式建模全局最优传输路径且受体素分辨率限制而效果欠佳,而BridgeShape通过将补全构建为潜在空间中的最优传输问题,并利用深度增强的VQ-VAE结合DINOv2特征提升几何感知,实现了最先进的补全效果。
English: Existing 3D shape completion methods using conditional diffusion often produce suboptimal results due to inadequate modeling of global transport paths and resolution constraints, but BridgeShape overcomes these by formulating completion as an optimal transport problem in a compact latent space enhanced with depth and DINOv2 features, achieving state-of-the-art performance.
Authors:Qilin Shu, Qixian Zhang, Qi Zhang, Hongyun Zhang, Duoqian Miao, Cairong Zhao
Abstract:
The person search task aims to locate a target person within a set of scene images. In recent years, transformer-based models in this field have made some progress. However, they still face two primary challenges: 1) the self-attention mechanism tends to suppress high-frequency components in the features, which severely impacts model performance; 2) the computational cost of transformers is relatively high. To address these issues, we propose a novel High-frequency Augmentation and Multi-Wave mixing (HAMW) method for person search. HAMW is designed to enhance the discriminative feature extraction capabilities of transformers while reducing computational overhead and improving efficiency. Specifically, we develop a three-stage framework that progressively optimizes both detection and re-identification performance. Our model enhances the perception of high-frequency features by learning from augmented inputs containing additional high-frequency components. Furthermore, we replace the self-attention layers in the transformer with a strategy based on multi-level Haar wavelet fusion to capture multi-scale features. This not only lowers the computational complexity but also alleviates the suppression of high-frequency features and enhances the ability to exploit multi-scale information. Extensive experiments demonstrate that HAMW achieves state-of-the-art performance on both the CUHK-SYSU and PRW datasets.
中文摘要:本文提出的HAMW方法通过高频增强和多级小波融合策略,有效解决了行人搜索中Transformer模型对高频特征抑制和计算成本高的问题,在多个基准数据集上实现了最优性能。
English Summary: The proposed HAMW method enhances transformer-based person search by addressing high-frequency feature suppression and computational inefficiency through high-frequency augmentation and multi-wavelet fusion, achieving state-of-the-art results on benchmark datasets.
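For reference, one level of the 2D Haar decomposition that underlies such multi-level wavelet fusion can be written in a few lines; the band-mixing strategy and recursion depth below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of one level of a 2D Haar wavelet decomposition, the
# building block behind multi-level Haar fusion used in place of
# self-attention. Fusion weights and recursion are illustrative.
def haar2d(x):
    # x: (H, W) with even H, W. Returns the low-pass band (LL) and the three
    # high-frequency detail bands (LH, HL, HH), each of shape (H/2, W/2).
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # local average: coarse structure
    lh = (a - b + c - d) / 4.0   # horizontal detail
    hl = (a + b - c - d) / 4.0   # vertical detail
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

x = np.random.rand(8, 8)
ll, lh, hl, hh = haar2d(x)
# A second level applied to LL yields coarser scales; mixing bands across
# levels gives multi-scale feature interaction without attention, and the
# explicit detail bands are what preserve high-frequency information.
ll2, *details2 = haar2d(ll)
print(ll.shape, ll2.shape)  # (4, 4) (2, 2)
```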
Authors:Zihao Teng, Jiancheng An, Lu Gan, Naofal Al-Dhahir, Zhu Han
Abstract:
Flexible intelligent metasurface (FIM) has emerged as a transformative technology to enhance wireless sensing by dynamically morphing its three-dimensional (3D) surface shape and electromagnetic response. Unlike conventional rigid arrays, an FIM consists of low-cost radiating elements that can independently adjust their positions and radiation characteristics, thereby allowing for real-time optimization of the sensing environment. This paper investigates the impact of FIM on wireless sensing performance. Specifically, we focus on the maximization of the cumulated power of the probing signals at the target locations under the per-antenna power constraint by jointly optimizing the transmit covariance matrix and the surface shape of the transmitting FIM. We propose a block coordinate descent (BCD) algorithm to find a locally optimal solution, by alternately updating the FIM surface shape and the transmit covariance matrix, while keeping the other one fixed at each step. Furthermore, we analyze the computational complexity and convergence properties of the proposed algorithm and demonstrate that FIM enhances wireless sensing by providing a new design degree-of-freedom to coordinate the correlation between steering vectors at different angles. Numerical results demonstrate that FIM significantly improves wireless sensing performance under the considered multi-target scenario.
Chinese: 柔性智能超表面(FIM)技术通过动态调整三维形态和电磁特性来增强无线传感,采用所提出的算法优化目标处信号功率,并通过协调不同角度的导向矢量相关性显著提升多目标场景下的性能。
English: Flexible intelligent metasurface (FIM) technology enhances wireless sensing by dynamically adjusting its 3D shape and electromagnetic properties, using a proposed algorithm to optimize signal power at targets while improving performance through coordinated steering vector correlation.
Authors:Saif Khan Mohammed, Sandesh Rao Mattu, Nishant Mehrotra, Venkatesh Khammammetti, Robert Calderbank
Abstract:
We communicate over wireless channels by first estimating and then equalizing the effective channel. In Zak-OTFS (orthogonal time frequency space) modulation the carrier waveform is a pulse in the delay-Doppler (DD) domain, formally a quasi-periodic localized function with specific periods along delay and Doppler. When the channel delay spread is less than the delay period, and the channel Doppler spread is less than the Doppler period, the response to a single Zak-OTFS carrier provides an image of the scattering environment and can be used to predict the effective channel at all other carriers. This makes DD domain channel estimation straightforward, and there is no loss in spectral efficiency since it is possible to design data and pilot signals that are mutually unbiased. However, equalization in the DD domain has high complexity ${\mathcal O}(M^3N^3)$ where $M$, $N$ are respectively the number of delay and Doppler bins in an OTFS frame, and $MN$ is the number of information symbols.
We demonstrate that equalization in the frequency domain (FD) reduces complexity to only ${\mathcal O}(M^2 N^2)$ by taking advantage of the banded structure of the effective FD channel. We also derive a low-complexity method to reconstruct the effective FD channel from the estimated DD domain effective channel.
中文: Zak-OTFS调制通过时延-多普勒域导频实现高效信道估计且无频谱损失,但该域均衡复杂度高;而频域均衡利用信道带状结构,大幅降低了复杂度。
English: Zak-OTFS modulation enables efficient channel estimation without spectral loss by using delay-Doppler domain pilots, but equalization in this domain is complex, whereas frequency domain equalization reduces complexity significantly by leveraging the banded channel structure.
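The complexity saving comes from the banded structure: solving a banded linear system costs O(n b^2) rather than O(n^3) for a dense solve. A minimal sketch with a random placeholder channel (not a Zak-OTFS channel model):

```python
import numpy as np
from scipy.linalg import solve_banded

# Minimal sketch of banded frequency-domain equalization: solve H x = y
# where H is banded with bandwidth b around the main diagonal. Channel
# values here are random placeholders.
rng = np.random.default_rng(0)
n, b = 64, 2          # n frequency bins, b off-diagonals on each side

# Banded storage expected by solve_banded: row 0 is the b-th superdiagonal,
# row b is the main diagonal, row 2b is the b-th subdiagonal.
ab = rng.standard_normal((2 * b + 1, n)) + 1j * rng.standard_normal((2 * b + 1, n))
ab[b, :] += 5.0       # boost the main diagonal so the system is well conditioned

x_true = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Expand to dense only to synthesize the received vector y = H x.
H = np.zeros((n, n), dtype=complex)
for k in range(-b, b + 1):
    H += np.diag(ab[b - k, max(k, 0): n + min(k, 0)], k)
y = H @ x_true

x_hat = solve_banded((b, b), ab, y)       # banded equalizer, O(n b^2)
print(np.abs(x_hat - x_true).max())       # near machine precision
```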
Authors:Dayong Su, Yafei Zhang, Huafeng Li, Jinxing Li, Yu Liu
Abstract:
Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse seamlessly integrates multi-directional information from input images and correlates cross-modal alignment with restoration, enabling joint optimization of both tasks within a unified framework. Additionally, we design an Omni Unified Feature Representation scheme, which leverages Spatial Mamba to encode multi-directional features and mitigate modality differences in feature alignment. To enable simultaneous restoration and fusion within an All-in-One configuration, we propose a Universal Feature Restoration & Fusion module, incorporating the Adaptive LoRA Synergistic Network (ALSN) based on LoRA principles. By leveraging ALSN's adaptive feature representation along with degradation-type guidance, we enable joint restoration and fusion within a single-stage framework. Compared to staged approaches, UniFuse unifies alignment, restoration, and fusion within a single framework. Experimental results across multiple datasets demonstrate the method's effectiveness and significant advantages over existing approaches.
中文摘要:UniFuse是一种统一的医学图像融合框架,通过退化感知提示学习和自适应特征表示,在单一框架中联合优化图像对齐、恢复与融合任务,实验证明其性能显著优于现有方法。
English Summary: UniFuse is a unified medical image fusion framework that jointly optimizes alignment, restoration, and fusion through degradation-aware prompts and adaptive feature representation, demonstrating superior performance over existing methods.
Authors:Weiyin Xie, Chunxi Huang, Jiyao Wang, Dengbo He
Abstract:
Electric vehicles (EVs) are a promising alternative to fuel vehicles (FVs), given some unique characteristics of EVs, such as lower air pollution and maintenance costs. However, the increasing prevalence of EVs is accompanied by widespread complaints regarding the high likelihood of motion sickness (MS) induction, especially when compared to FVs, which has become one of the major obstacles to the acceptance and popularity of EVs. Despite the prevalence of such complaints online and among EV users, the association between vehicle type (i.e., EV versus FV) and MS prevalence and severity has not been quantified. Thus, this study aims to investigate the existence of EV-induced MS and explore the potential factors leading to it. A survey study was conducted to collect passengers' MS experiences in EVs and FVs over the past year. In total, 639 valid responses were collected from mainland China. The results show that FVs were associated with a higher frequency of MS, while EVs were found to induce more severe MS symptoms. Further, we found that passengers' MS severity was associated with individual differences (i.e., age, gender, sleep habits, and susceptibility to motion-induced MS), in-vehicle activities (i.e., chatting with others and watching in-vehicle displays), and road conditions (i.e., congestion and slope), while MS frequency was associated with vehicle ownership and riding frequency. The results from this study can guide the directions of future empirical studies that aim to quantify the inducers of MS in EVs and FVs, as well as the optimization of EVs to reduce MS.
中文: 电动汽车比燃油车引发更严重的晕车症状,其严重程度受个体差异、车内活动和路况影响,而晕车频率则与车辆拥有情况和乘坐频率相关。
English: Electric vehicles are linked to more severe motion sickness symptoms than fuel vehicles, with severity influenced by individual traits, in-vehicle activities, and road conditions, while frequency relates to vehicle ownership and riding habits.
Authors:Abdallah Lakhdari, Jiajie Li, Amani Abusafia, Athman Bouguettaya
Abstract:
Fall detection is critical to support the growing elderly population, projected to reach 2.1 billion by 2050. However, existing methods often face data scarcity challenges or compromise privacy. We propose a novel IoT-based Fall Detection as a Service (FDaaS) framework to assist the elderly in living independently and safely by accurately detecting falls. We design a service-oriented architecture that leverages Ultra-wideband (UWB) radar sensors as an IoT health-sensing service, ensuring privacy and minimal intrusion. We address the challenges of data scarcity by utilizing a Fall Detection Generative Pre-trained Transformer (FD-GPT) that uses augmentation techniques. We developed a protocol to collect a comprehensive dataset of the elderly daily activities and fall events. This resulted in a real dataset that carefully mimics the elderly's routine. We rigorously evaluate and compare various models using this dataset. Experimental results show our approach achieves 90.72% accuracy and 89.33% precision in distinguishing between fall events and regular activities of daily living.
中文: 提出的物联网跌倒检测即服务框架采用保护隐私的超宽带雷达传感器和生成模型解决数据稀缺问题,在老年人跌倒检测中实现了超过90%的准确率。
English: The proposed IoT-based Fall Detection as a Service (FDaaS) framework uses privacy-preserving UWB radar sensors and a generative model to overcome data scarcity, achieving over 90% accuracy in detecting falls among the elderly.
Authors:Dechao Meng, Steven Xiao, Xindi Zhang, Guangyuan Wang, Peng Zhang, Qi Wang, Bang Zhang, Liefeng Bo
Abstract:
Audio-driven portrait animation, which synthesizes realistic videos from reference images using audio signals, faces significant challenges in real-time generation of high-fidelity, temporally coherent animations. While recent diffusion-based methods improve generation quality by integrating audio into denoising processes, their reliance on frame-by-frame UNet architectures introduces prohibitive latency and struggles with temporal consistency. This paper introduces MirrorMe, a real-time, controllable framework built on the LTX video model, a diffusion transformer that compresses video spatially and temporally for efficient latent space denoising. To address LTX's trade-offs between compression and semantic fidelity, we propose three innovations: 1. A reference identity injection mechanism via VAE-encoded image concatenation and self-attention, ensuring identity consistency; 2. A causal audio encoder and adapter tailored to LTX's temporal structure, enabling precise audio-expression synchronization; and 3. A progressive training strategy combining close-up facial training, half-body synthesis with facial masking, and hand pose integration for enhanced gesture control. Extensive experiments on the EMTD Benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
中文:MirrorMe是一种基于LTX视频扩散变换器的实时音频驱动人像动画框架,通过身份注入、音频同步和渐进式训练等创新技术,实现了卓越的生成质量与时间一致性。
English: MirrorMe is a real-time audio-driven portrait animation framework that leverages the LTX video diffusion transformer and introduces innovations in identity preservation, audio synchronization, and progressive training to achieve superior fidelity and temporal coherence.
Authors:Najmeh Forouzandehmehr, Reza Yousefi Maragheh, Sriram Kollipara, Kai Zhao, Topojoy Biswas, Evren Korpeoglu, Kannan Achan
Abstract:
Automated content-aware layout generation -- the task of arranging visual elements such as text, logos, and underlays on a background canvas -- remains a fundamental yet under-explored problem in intelligent design systems. While recent advances in deep generative models and large language models (LLMs) have shown promise in structured content generation, most existing approaches lack grounding in contextual design exemplars and fall short in handling semantic alignment and visual coherence. In this work we introduce CAL-RAG, a retrieval-augmented, agentic framework for content-aware layout generation that integrates multimodal retrieval, large language models, and collaborative agentic reasoning. Our system retrieves relevant layout examples from a structured knowledge base and invokes an LLM-based layout recommender to propose structured element placements. A vision-language grader agent evaluates the layout with visual metrics, and a feedback agent provides targeted refinements, enabling iterative improvement. We implement our framework using LangGraph and evaluate it on the PKU PosterLayout dataset, a benchmark rich in semantic and structural variability. CAL-RAG achieves state-of-the-art performance across multiple layout metrics -- including underlay effectiveness, element alignment, and overlap -- substantially outperforming strong baselines such as LayoutPrompter. These results demonstrate that combining retrieval augmentation with agentic multi-step reasoning yields a scalable, interpretable, and high-fidelity solution for automated layout generation.
中文摘要:CAL-RAG是一种创新框架,通过结合多模态检索、基于大语言模型的建议和迭代式智能体优化,显著提升了自动化布局生成的性能,在多项布局指标上达到领先水平。
English Summary: CAL-RAG is a novel framework that enhances automated layout generation by integrating multimodal retrieval, LLM-based recommendations, and iterative agentic refinement, achieving state-of-the-art performance on layout metrics.
Authors:Avash Palikhe, Zhenyu Yu, Zichong Wang, Wenbin Zhang
Abstract:
Large Language Models (LLMs) have played a pivotal role in advancing Artificial Intelligence (AI). However, despite their achievements, LLMs often struggle to explain their decision-making processes, making them a 'black box' and presenting a substantial challenge to explainability. This lack of transparency poses a significant obstacle to the adoption of LLMs in high-stakes domain applications, where interpretability is particularly essential. To overcome these limitations, researchers have developed various explainable artificial intelligence (XAI) methods that provide human-interpretable explanations for LLMs. However, a systematic understanding of these methods remains limited. To address this gap, this survey provides a comprehensive review of explainability techniques by categorizing XAI methods based on the underlying transformer architectures of LLMs: encoder-only, decoder-only, and encoder-decoder models. Then these techniques are examined in terms of their evaluation for assessing explainability, and the survey further explores how these explanations are leveraged in practical applications. Finally, it discusses available resources, ongoing research challenges, and future directions, aiming to guide continued efforts toward developing transparent and responsible LLMs.
中文摘要:大语言模型因其“黑箱”特性面临透明度挑战,本文通过基于Transformer架构对可解释人工智能方法进行分类,系统评估其解释能力与应用,旨在推动透明可靠的大语言模型发展。
English Summary: Large Language Models (LLMs) face transparency challenges due to their "black box" nature, prompting the development of explainable AI (XAI) methods categorized by transformer architectures to enhance interpretability and guide future research toward responsible AI.
Authors:Wanxin Tian, Shijie Zhang, Kevin Zhang, Xiaowei Chi, Yulin Luo, Junyu Lu, Chunkai Fan, Qiang Zhou, Yiming Zhao, Ning Liu, Siyu Lin, Zhiyuan Qin, Xiaozhu Ju, Shanghang Zhang, Jian Tang
Abstract:
Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for the embodied domain with long-horizon, real-world tasks. Despite current advancements in reinforcement fine-tuning (RFT) showing strong performance in enhancing reasoning in LLMs, its potential to enable self-evolving embodied intelligence with multi-modal interactions remains largely unexplored. Specifically, reinforcement fine-tuning faces two fundamental obstacles in embodied settings: (i) the lack of accessible intermediate rewards in multi-step reasoning tasks limits effective learning signals, and (ii) reliance on hand-crafted reward functions restricts generalization to novel tasks and environments. To address these challenges, we present Self-Evolving Embodied Agents-R1, SEEA-R1, the first RFT framework designed for enabling the self-evolving capabilities of embodied agents. Specifically, to convert sparse delayed rewards into denser intermediate signals that improve multi-step reasoning, we propose Tree-based group relative policy optimization (Tree-GRPO), which integrates Monte Carlo Tree Search into GRPO. To generalize reward estimation across tasks and scenes, supporting autonomous adaptation and reward-driven self-evolution, we further introduce Multi-modal Generative Reward Model (MGRM). To holistically evaluate the effectiveness of SEEA-R1, we evaluate on the ALFWorld benchmark, surpassing state-of-the-art methods with scores of 85.07% (textual) and 36.19% (multi-modal), outperforming prior models including GPT-4o. SEEA-R1 also achieves scores of 80.3% without environmental reward, surpassing all open-source baselines and highlighting its scalability as a self-evolving embodied agent. Additional experiments and qualitative analysis further support the potential of SEEA-R1 for future research in scalable embodied intelligence.
中文: SEEA-R1框架通过Tree-GRPO和MGRM技术解决了强化微调中的稀疏奖励与泛化难题,成功实现了具身智能体的自我进化能力,在ALFWorld基准测试中取得了超越现有最佳模型的优异表现。
English: The SEEA-R1 framework introduces Tree-GRPO and MGRM to overcome sparse rewards and generalization limitations in reinforcement fine-tuning, enabling self-evolving embodied agents that achieve state-of-the-art performance on benchmarks like ALFWorld.
Authors:Wenhao Li, Hongkuan Zhang, Hongwei Zhang, Zhengxu Li, Zengjie Dong, Yafan Chen, Niranjan Bidargaddi, Hong Liu
Abstract:
Current medical language models, adapted from large language models (LLMs), typically predict ICD code-based diagnosis from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative CPGs. Unlike conventional Retrieval-Augmented Generation based approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content without relying on model-generated text. It (1) integrates LLM predictions with EHR data to create semantically rich queries, (2) retrieves relevant CPG knowledge snippets via embedding similarity, and (3) fuses guideline content with model output to generate clinically aligned recommendations. A prototype system for hypertension diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence compared to RAG-based baselines, while maintaining a lightweight architecture suitable for localized healthcare deployment. This work provides a scalable, low-cost, and hallucination-free method for grounding medical language models in evidence-based clinical practice, with strong potential for broader clinical deployment.
中文: 当前医学语言模型依赖ICD编码进行诊断,但无法体现临床医生的细致推理,因此作者提出GARMLE-G框架,通过整合电子健康记录与临床指南生成基于证据的无幻觉建议,在高血压诊断中表现出优于现有方法的性能。
English: Current medical language models often rely on ICD codes for diagnosis, but these fail to capture clinicians' nuanced reasoning, so the authors introduce GARMLE-G, a framework that integrates EHR data with clinical practice guidelines to generate evidence-based, hallucination-free recommendations, demonstrating superior performance in hypertension diagnosis compared to existing methods.
Authors:Xiaoyan Feng, He Zhang, Yanjun Zhang, Leo Yu Zhang, Shirui Pan
Abstract:
Recent advances in Large Language Models (LLMs) have raised urgent concerns about LLM-generated text authenticity, prompting regulatory demands for reliable identification mechanisms. Although watermarking offers a promising solution, existing approaches struggle to simultaneously achieve three critical requirements: text quality preservation, model-agnostic detection, and message embedding capacity, which are crucial for practical implementation. To achieve these goals, the key challenge lies in balancing the trade-off between text quality preservation and message embedding capacity. To address this challenge, we propose BiMark, a novel watermarking framework that achieves these requirements through three key innovations: (1) a bit-flip unbiased reweighting mechanism enabling model-agnostic detection, (2) a multilayer architecture enhancing detectability without compromising generation quality, and (3) an information encoding approach supporting multi-bit watermarking. Through theoretical analysis and extensive experiments, we validate that, compared to state-of-the-art multi-bit watermarking methods, BiMark achieves up to 30% higher extraction rates for short texts while maintaining text quality indicated by lower perplexity, and performs comparably to non-watermarked text on downstream tasks such as summarization and translation.
中文摘要:BiMark是一种新颖的水印框架,通过创新的位翻转重加权机制、多层架构和信息编码方法,解决了现有方法在文本质量保持、模型无关检测和多比特信息嵌入方面的局限,实验证明其在保持文本质量和下游任务性能的同时显著提高了短文本提取率。
English Summary: BiMark is a novel watermarking framework that addresses the limitations of existing methods by achieving high text quality preservation, model-agnostic detection, and multi-bit message embedding through innovative mechanisms, validated by superior extraction rates and maintained performance in downstream tasks.
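As background, the following sketches a generic unbiased (distortion-free) watermark using keyed Gumbel-max sampling. This is a stand-in to illustrate key-based, model-agnostic detection; it is not BiMark's bit-flip reweighting or multilayer scheme, whose details are in the paper.

```python
import hashlib, math, random

# Minimal sketch of an unbiased watermark: sample tokens with keyed
# pseudorandom noise so that, averaged over keys, the output distribution
# equals the model's (text quality preserved), while a key-holder can
# detect the watermark without the model. NOT BiMark's exact mechanism.
VOCAB, KEY = 50, b"secret-key"

def keyed_uniforms(context, key=KEY):
    # One pseudorandom uniform per vocab entry, derived from key + context.
    out = []
    for t in range(VOCAB):
        h = hashlib.sha256(key + context.encode() + t.to_bytes(2, "big")).digest()
        out.append(int.from_bytes(h[:8], "big") / 2**64)
    return out

def watermarked_sample(probs, context):
    # Gumbel-max with keyed noise: argmax_t u_t^(1/p_t) is distributed as
    # `probs` when averaged over keys, so generation stays unbiased.
    u = keyed_uniforms(context)
    return max(range(VOCAB), key=lambda t: u[t] ** (1.0 / max(probs[t], 1e-12)))

def detection_score(tokens, contexts):
    # Detector only needs the key: watermarked tokens have systematically
    # large u values, inflating this score relative to unmarked text.
    return sum(-math.log(1.0 - keyed_uniforms(c)[t] + 1e-12)
               for t, c in zip(tokens, contexts))

random.seed(0)
probs = [random.random() for _ in range(VOCAB)]
probs = [p / sum(probs) for p in probs]
tok = watermarked_sample(probs, context="the quick brown")
print(tok, detection_score([tok], ["the quick brown"]))
```

Multi-bit embedding, as in the abstract, would additionally modulate which key (or key bit) is used per position so the detector can read a message back out.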
Authors:Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling
Abstract:
Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear if the failure to discover a particular concept is a failure on the SAE's side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAEs features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.
中文: 稀疏自编码器(SAEs)因后训练和特征不稳定而存在可解释性局限,但在Transformer架构中引入TopK激活函数可实现内在稀疏表示,既保持模型性能,又提供强大的可解释性和可控性。
English: Sparse autoencoders (SAEs) face limitations in interpretability due to post-hoc training and feature instability, but integrating a TopK activation function into transformer architecture enables intrinsic sparse representations that maintain model performance while offering robust interpretability and controllability.
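The architectural change is small enough to sketch directly: a TopK activation that zeroes all but the k largest entries per hidden vector, making the hidden state itself a sparse, SAE-like feature vector. Layer sizes below are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of the TopK activation built into chosen transformer
# layers: keep the k largest activations per token and zero the rest.
class TopK(nn.Module):
    def __init__(self, k):
        super().__init__()
        self.k = k

    def forward(self, x):
        # x: (..., d). Zero everything except the top-k entries per vector.
        vals, idx = torch.topk(x, self.k, dim=-1)
        out = torch.zeros_like(x)
        return out.scatter_(-1, idx, vals)

# Drop-in use inside an MLP block (sizes are illustrative assumptions).
mlp = nn.Sequential(nn.Linear(512, 2048), TopK(k=32), nn.Linear(2048, 512))
h = mlp(torch.randn(4, 16, 512))              # (batch, seq, d_model)
sparse = TopK(32)(torch.randn(4, 16, 2048))
print((sparse != 0).sum(-1)[0, 0])            # exactly 32 active features
```

Because sparsity is enforced during training rather than fit post hoc, the same named features persist across checkpoints, which is what enables the steering and neuron-formation analyses the abstract describes.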
Authors:Junwen Wang, Oscar Maccormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren
Abstract:
Hyperspectral imaging (HSI) shows great promise for surgical applications, offering detailed insights into biological tissue differences beyond what the naked eye can perceive. Refined labelling efforts are underway to train vision systems to distinguish large numbers of subtly varying classes. However, commonly used learning methods for biomedical segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. In this work, we introduce two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations. Extensive experiments demonstrate that our proposed method reaches state-of-the-art performance on a sparsely annotated HSI dataset comprising $107$ classes organised in a clinically-defined semantic tree structure. Furthermore, our method enables effective detection of out-of-distribution (OOD) pixels without compromising segmentation performance on in-distribution (ID) pixels.
Chinese: 本研究提出了两种基于树结构的语义损失函数,利用标签的层次化组织改进高光谱成像分割,在稀疏标注数据上达到最优性能,并能在不影响分布内像素分割的同时有效检测分布外像素。
English: This study introduces two tree-based semantic loss functions that leverage hierarchical label structures to improve hyperspectral imaging segmentation, achieving state-of-the-art performance on sparse annotations and enabling effective out-of-distribution detection without compromising in-distribution accuracy.
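A minimal sketch of one plausible form of such a tree-based loss, expected tree distance under the predicted distribution, is shown below with a toy 5-class hierarchy; the paper's exact losses and its 107-class clinical tree are not reproduced here.

```python
import torch

# Minimal sketch of a tree-based semantic loss: weight each predicted class
# by its distance to the target in the label hierarchy, so semantically
# "nearby" classes cost less. Toy hierarchy and distances are assumptions.
# Toy tree: root -> {soft: {muscle, fat}, hard: {bone, cartilage}}, bg.
parents = {"muscle": "soft", "fat": "soft", "bone": "hard",
           "cartilage": "hard", "soft": "root", "hard": "root", "bg": "root"}
classes = ["muscle", "fat", "bone", "cartilage", "bg"]

def tree_dist(a, b):
    # Number of edges between two leaves via their lowest common ancestor.
    def path(n):
        out = [n]
        while n != "root":
            n = parents[n]
            out.append(n)
        return out
    pa, pb = path(a), path(b)
    common = next(n for n in pa if n in pb)
    return pa.index(common) + pb.index(common)

D = torch.tensor([[tree_dist(a, b) for b in classes] for a in classes],
                 dtype=torch.float)

def tree_semantic_loss(logits, target):
    # Expected tree distance under the predicted distribution: confusing
    # muscle with fat (distance 2) is cheaper than with bone (distance 4).
    probs = torch.softmax(logits, dim=-1)
    return (probs * D[target]).sum(-1).mean()

logits = torch.randn(8, 5)               # 8 pixels, 5 classes
target = torch.randint(0, 5, (8,))
print(tree_semantic_loss(logits, target))
```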
Authors:Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, Zicheng Liu
Abstract:
Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.
中文: 本文提出了一种新颖的一维二进制图像潜在表示方法,仅需标准方法1/32的标记数量即可实现具有竞争力的文生图性能,在保持高分辨率细节的同时显著提升了训练和推理速度。
English: This paper introduces a novel 1D binary latent representation for images that achieves competitive text-to-image generation performance with up to 32 times fewer tokens than standard methods, enabling faster training and inference while maintaining high-resolution details.
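A minimal sketch of a binary latent bottleneck with a straight-through estimator, one standard way to realize trainable binary tokens; dimensions are illustrative and the paper's encoder/decoder design is not reproduced here.

```python
import torch
import torch.nn as nn

# Minimal sketch of a 1D binary image latent: an encoder emits a short
# sequence of real-valued vectors, each binarized with a straight-through
# estimator so gradients still flow. Sizes are illustrative assumptions.
class BinaryBottleneck(nn.Module):
    def forward(self, z):
        hard = (z > 0).float() * 2.0 - 1.0   # each entry becomes +/-1
        # Straight-through: forward pass uses the hard bits, backward pass
        # treats the binarization as identity.
        return z + (hard - z).detach()

encoder = nn.Linear(3 * 64 * 64, 128 * 16)   # image -> 128 tokens x 16 bits
quant = BinaryBottleneck()

img = torch.randn(2, 3 * 64 * 64, requires_grad=True)
bits = quant(encoder(img).view(2, 128, 16))
print(torch.allclose(bits.abs(), torch.ones_like(bits)))  # entries are +/-1
bits.sum().backward()                        # gradients reach the encoder
print(img.grad is not None)                  # True
```

Each image is then a fixed-length sequence of binary vectors rather than indices into a one-hot codebook, which is what keeps the token count at 128 regardless of resolution.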
Authors:Chengkuan Chen, Luca L. Weishaupt, Drew F. K. Williamson, Richard J. Chen, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y. Lu, Faisal Mahmood
Abstract:
Pathology is experiencing rapid digital transformation driven by whole-slide imaging and artificial intelligence (AI). While deep learning-based computational pathology has achieved notable success, traditional models primarily focus on image analysis without integrating natural language instruction or rich, text-based context. Current multimodal large language models (MLLMs) in computational pathology face limitations, including insufficient training data, inadequate support and evaluation for multi-image understanding, and a lack of autonomous, diagnostic reasoning capabilities. To address these limitations, we introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question answer turns. Extensive evaluations across diverse pathology benchmarks demonstrated that PathChat+ substantially outperforms the prior PathChat copilot, as well as both state-of-the-art (SOTA) general-purpose and other pathology-specific models. Furthermore, we present SlideSeek, a reasoning-enabled multi-agent AI system leveraging PathChat+ to autonomously evaluate gigapixel whole-slide images (WSIs) through iterative, hierarchical diagnostic reasoning, reaching high accuracy on DDxBench, a challenging open-ended differential diagnosis benchmark, while also capable of generating visually grounded, humanly-interpretable summary reports.
Chinese: PathChat+ 是一种专为病理学设计的多模态大语言模型,通过海量数据训练显著超越现有模型,并结合SlideSeek系统实现全切片图像的自主推理分析,在疑难鉴别诊断中达到高精度且能生成可解释的总结报告。
English: PathChat+ is an advanced multimodal large language model designed for pathology, trained on extensive datasets to outperform existing models and integrated with SlideSeek for autonomous, reasoning-based analysis of whole-slide images, achieving high diagnostic accuracy and generating interpretable reports.
Authors:Yongqian Sun, Xijie Pan, Xiao Xiong, Lei Tao, Jiaju Wang, Shenglin Zhang, Yuan Yuan, Yuqi Li, Kunlin Jian
Abstract:
Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and lack of accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches. A failure graph is constructed based on the output of the state classifier, and then it performs a customized random walk on the graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show ClusterRCA achieves high accuracy in diagnosing network failure for HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
Chinese: 本文提出ClusterRCA新型框架,通过融合基于分类器和图结构的方法,利用多模态数据并在故障图上执行定制化随机游走,实现了高性能计算系统网络故障的精准诊断。
English: This paper introduces ClusterRCA, a novel framework that combines classifier-based and graph-based approaches to accurately diagnose network failures in HPC systems by leveraging multimodal data and performing customized random walks on failure graphs.
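The random-walk step can be sketched as personalized PageRank on an evidence-weighted failure graph; the toy graph and weights below are illustrative, and ClusterRCA's customized walk and state classifier are not reproduced.

```python
import numpy as np

# Minimal sketch of random-walk root-cause localization on a failure graph:
# nodes are candidate culprit nodes, edge weights come from anomaly
# evidence (in ClusterRCA, the state classifier's output), and the
# stationary visit distribution ranks root causes. Toy values throughout.
nodes = ["nodeA", "nodeB", "nodeC", "nodeD"]
W = np.array([[0.0, 0.8, 0.1, 0.1],   # row i: evidence-weighted edges out of i
              [0.2, 0.0, 0.7, 0.1],
              [0.1, 0.6, 0.0, 0.3],
              [0.3, 0.5, 0.2, 0.0]])

P = W / W.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
restart = 0.15
r = np.full(len(nodes), 1.0 / len(nodes))

pi = r.copy()
for _ in range(100):                  # power iteration with restart (PPR)
    pi = (1 - restart) * (P.T @ pi) + restart * r

ranking = sorted(zip(nodes, pi), key=lambda kv: -kv[1])
print(ranking[0])                     # most-visited node = suspected culprit
```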
Authors:Anqi Mao, Mehryar Mohri, Yutao Zhong
Abstract:
The problem of learning to defer with multiple experts consists of optimally assigning input instances to experts, balancing the trade-off between their accuracy and computational cost. This is a critical challenge in natural language generation, but also in other fields such as image processing and medical diagnostics. Recent studies have proposed surrogate loss functions to optimize deferral, but challenges remain in ensuring their consistency properties. This paper introduces novel surrogate loss functions and efficient algorithms with strong theoretical learning guarantees. We address open questions regarding realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for both single-stage (jointly learning predictor and deferral function) and two-stage (learning only the deferral function with a fixed expert) learning scenarios. For single-stage deferral, we introduce a family of new realizable $H$-consistent surrogate losses and further prove $H$-consistency for a selected member. For two-stage deferral, we derive new surrogate losses that achieve realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for the two-expert scenario and, under natural assumptions, the multiple-expert scenario. Additionally, we provide enhanced theoretical guarantees under low-noise assumptions for both scenarios. Finally, we report the results of experiments using our proposed surrogate losses, comparing their performance against existing baselines.
中文: 本文针对多专家延迟学习问题,提出了新型代理损失函数和高效算法,在单阶段和双阶段学习场景中均实现了严格的理论一致性保证,并通过实验验证了其优越性能。
English: This paper introduces novel surrogate loss functions and algorithms for learning to defer with multiple experts, providing strong theoretical guarantees and experimental validation across single-stage and two-stage learning scenarios.
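For intuition, the decision rule that deferral surrogates target can be sketched directly: route each input to the model or to the expert minimizing estimated error plus consultation cost. The estimates and costs below are placeholders; the paper's contribution is surrogate losses that make this rule learnable with consistency guarantees.

```python
import numpy as np

# Minimal sketch of two-stage learning to defer: with the predictor fixed,
# a deferral function routes each input either to the model or to one of
# several experts, trading estimated error against consultation cost.
# Error estimates and costs are illustrative placeholders.
rng = np.random.default_rng(1)

def defer(p_model_error, expert_error_est, expert_cost):
    # Route to whichever option minimizes estimated error + cost;
    # the model itself has zero consultation cost.
    options = [("model", p_model_error)]
    options += [(f"expert_{j}", e + c)
                for j, (e, c) in enumerate(zip(expert_error_est, expert_cost))]
    return min(options, key=lambda kv: kv[1])[0]

for _ in range(3):
    p_err = rng.uniform(0.0, 0.5)           # model's estimated error
    e_est = rng.uniform(0.0, 0.3, size=2)   # two experts' estimated errors
    print(defer(p_err, e_est, expert_cost=[0.05, 0.15]))
```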
Authors:Rui Huang, Guangyao Zhai, Zuria Bauer, Marc Pollefeys, Federico Tombari, Leonidas Guibas, Gao Huang, Francis Engelmann
Abstract:
Traditionally, 3D scene synthesis requires expert knowledge and significant manual effort. Automating this process could greatly benefit fields such as architectural design, robotics simulation, virtual reality, and gaming. Recent approaches to 3D scene synthesis often rely on the commonsense reasoning of large language models (LLMs) or strong visual priors of modern image generation models. However, current LLMs demonstrate limited 3D spatial reasoning ability, which restricts their ability to generate realistic and coherent 3D scenes. Meanwhile, image generation-based methods often suffer from constraints in viewpoint selection and multi-view inconsistencies. In this work, we present Video Perception models for 3D Scene synthesis (VIPScene), a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene. This enables flexible scene synthesis with high realism and structural consistency. For more precise analysis, we further introduce First-Person View Score (FPVScore) for coherence and plausibility evaluation, utilizing continuous first-person perspective to capitalize on the reasoning ability of multimodal large language models. Extensive experiments show that VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios. The code will be released.
中文: VIPScene提出了一种创新框架,通过利用视频生成模型对三维物理世界的编码知识来创建逼真且结构一致的3D场景,结合视频生成与三维重建技术,在实验中显著优于现有方法。
English: VIPScene introduces a novel framework that leverages video generation models' 3D physical world knowledge to create realistic and structurally consistent 3D scenes, outperforming existing methods through integrated video generation and 3D reconstruction.
Authors:Qihang Jin, Enze Ge, Yuhang Xie, Hongying Luo, Junhao Song, Ziqian Bi, Chia Xin Liang, Jibin Guan, Joe Yeong, Junfeng Hao
Abstract:
Multi-modal learning is a fast-growing area of artificial intelligence. It helps machines understand complex things by combining information from different sources, such as images, text, and audio. By drawing on the strengths of each modality, multi-modal learning allows AI systems to build stronger and richer internal representations, which help machines interpret, reason, and make decisions in real-life situations. The field includes core techniques such as representation learning (to extract shared features from different data types), alignment methods (to match information across modalities), and fusion strategies (to combine modalities with deep learning models). Despite good progress, major problems remain, such as dealing with different data formats, missing or incomplete inputs, and defending against adversarial attacks. Researchers are now exploring new methods, including unsupervised and semi-supervised learning and AutoML tools, to make models more efficient and easier to scale. More attention is also being paid to designing better evaluation metrics and building shared benchmarks that make it easier to compare model performance across tasks and domains. As the field continues to grow, multi-modal learning is expected to improve many areas: computer vision, natural language processing, speech recognition, and healthcare. In the future, it may help build AI systems that understand the world in a way more like humans: flexible, context-aware, and able to deal with real-world complexity.
中文: 多模态学习通过整合图像、文本等多元数据提升人工智能的理解与推理能力,尽管面临数据异构性和安全性等挑战,推动着研究向更类人、自适应系统的方向发展。
English: Multi-modal learning enhances AI's ability to interpret and reason by integrating diverse data sources like images and text, though challenges such as data heterogeneity and security persist, driving ongoing research for more human-like, adaptable systems.
Authors:Kazuki Yoda, Kazuhiko Kawamoto, Hiroshi Kera
Abstract:
The hardness of learning a function that attains a target task relates to its input-sensitivity. For example, image classification tasks are input-insensitive as minor corruptions should not affect the classification results, whereas arithmetic and symbolic computation, which have been recently attracting interest, are highly input-sensitive as each input variable connects to the computation results. This study presents the first learning-based Quick Response (QR) code decoding and investigates learning functions of medium sensitivity. Our experiments reveal that Transformers can successfully decode QR codes, even beyond the theoretical error-correction limit, by learning the structure of embedded texts. They generalize from English-rich training data to other languages and even random strings. Moreover, we observe that the Transformer-based QR decoder focuses on data bits while ignoring error-correction bits, suggesting a decoding mechanism distinct from standard QR code readers.
中文摘要:本研究首次实现基于学习的QR码解码,证明Transformer模型能通过学习文本结构突破理论纠错极限,实现跨语言泛化,并展现出与传统解码器不同的数据位聚焦机制。
English Summary: This study demonstrates that Transformers can effectively decode QR codes beyond theoretical error-correction limits by learning text structures, showing generalization across languages and unique focus on data bits rather than error-correction bits.
Authors:Zechun Deng, Ziwei Liu, Ziqian Bi, Junhao Song, Chia Xin Liang, Joe Yeong, Junfeng Hao
Abstract:
This paper investigates real-time decision support systems that leverage low-latency AI models, bringing together recent progress in holistic AI-driven decision tools, integration with Edge-IoT technologies, and approaches for effective human-AI teamwork. It looks into how large language models can assist decision-making, especially when resources are limited. The research also examines the effects of technical developments such as DeLLMa, methods for compressing models, and improvements for analytics on edge devices, while also addressing issues like limited resources and the need for adaptable frameworks. Through a detailed review, the paper offers practical perspectives on development strategies and areas of application, adding to the field by pointing out opportunities for more efficient and flexible AI-supported systems. The conclusions set the stage for future breakthroughs in this fast-changing area, highlighting how AI can reshape real-time decision support.
本文探讨了低延迟人工智能、边缘物联网技术及人机协作在实时决策支持系统中的整合,针对资源限制等挑战,提出了未来发展路径和应用前景。
This paper explores the integration of low-latency AI, Edge-IoT technologies, and human-AI collaboration to enhance real-time decision support, addressing challenges like resource constraints while outlining future development directions and applications.
Authors:Felix Faltings, Hannes Stark, Regina Barzilay, Tommi Jaakkola
Abstract:
We develop ProxelGen, a protein structure generative model that operates on 3D densities as opposed to the prevailing 3D point cloud representations. Representing proteins as voxelized densities, or proxels, enables new tasks and conditioning capabilities. We generate proteins encoded as proxels via a 3D CNN-based VAE in conjunction with a diffusion model operating on its latent space. Compared to state-of-the-art models, ProxelGen's samples achieve higher novelty, better FID scores, and the same level of designability as the training set. ProxelGen's advantages are demonstrated in a standard motif scaffolding benchmark, and we show how 3D density-based generation allows for more flexible shape conditioning.
中文: ProxelGen是一种创新的蛋白质结构生成模型,它采用三维密度表示和混合VAE-扩散方法,在创新性、FID评分和形状条件灵活性方面均优于现有技术。
English: ProxelGen is a novel protein structure generative model that uses 3D density representations and a hybrid VAE-diffusion approach to achieve superior novelty, FID scores, and flexible shape conditioning compared to existing methods.
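Since the representation itself is the paper's main departure, a toy sketch of converting atom coordinates into a Gaussian-smeared density volume may help fix ideas; the grid size, sigma, and function names are hypothetical, not ProxelGen's actual preprocessing:

```python
import numpy as np

def voxelize(coords: np.ndarray, grid: int = 32, sigma: float = 1.0,
             extent: float = 16.0) -> np.ndarray:
    """Smear 3D points (N, 3), in Angstroms, onto a (grid,)*3 density volume.

    Hypothetical stand-in for a proxel encoder's input step: each atom
    contributes an isotropic Gaussian to the density field.
    """
    axis = np.linspace(-extent, extent, grid)
    X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
    density = np.zeros((grid, grid, grid))
    for x, y, z in coords:
        d2 = (X - x) ** 2 + (Y - y) ** 2 + (Z - z) ** 2
        density += np.exp(-d2 / (2.0 * sigma ** 2))
    return density / max(len(coords), 1)

# Toy usage: a 10-residue "protein" as random CA coordinates.
rng = np.random.default_rng(0)
vol = voxelize(rng.normal(scale=5.0, size=(10, 3)))
print(vol.shape)  # (32, 32, 32) -- input shape for a 3D CNN VAE
```

A volume like this is what the 3D CNN VAE compresses, with the diffusion model then operating in that latent space.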
Authors:Hongyu Wu, Pengwan Yang, Yuki M. Asano, Cees G. M. Snoek
Abstract:
This paper aims to achieve the segmentation of any 3D part in a scene based on natural language descriptions, extending beyond traditional object-level 3D scene understanding and addressing both data and methodological challenges. Due to the expensive acquisition and annotation burden, existing datasets and methods are predominantly limited to object-level comprehension. To overcome the limitations of data and annotation availability, we introduce the 3D-PU dataset, the first large-scale 3D dataset with dense part annotations, created through an innovative and cost-effective method for constructing synthetic 3D scenes with fine-grained part-level annotations, paving the way for advanced 3D-part scene understanding. On the methodological side, we propose OpenPart3D, a 3D-input-only framework to effectively tackle the challenges of part-level segmentation. Extensive experiments demonstrate the superiority of our approach in open-vocabulary 3D scene understanding tasks at the part level, with strong generalization capabilities across various 3D scene datasets.
本文提出3D-PU数据集和OpenPart3D框架,通过解决数据稀缺问题实现了基于自然语言的部分级三维场景分割,突破了传统对象级理解的局限。
This paper introduces the 3D-PU dataset and OpenPart3D framework to enable part-level 3D scene segmentation from natural language, overcoming data scarcity and advancing beyond object-level understanding.
Authors:Varun Belagali, Pierre Marza, Srikar Yellapragada, Zilinghan Li, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Stergios Christodoulidis, Maria Vakalopoulou, Dimitris Samaras
Abstract:
Learning dense correspondences, critical for applications such as video label propagation, is hindered by tedious and unscalable manual annotation. Self-supervised methods address this by using a cross-view pretext task, often modeled with a masked autoencoder, where a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge: collecting diverse video datasets is difficult and costly, while simple image crops lack necessary pose variations. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video and crop-based anchors. We present a quantitative method to evaluate local and global consistency of generated images, discussing their use for cross-view self-supervised pretraining. Furthermore, we enhance the standard single-anchor MAE setting to a multi-anchor strategy to effectively modulate the difficulty of the pretext task. CDG-MAE significantly outperforms state-of-the-art MAE methods reliant only on images and substantially narrows the performance gap to video-based approaches.
中文摘要:CDG-MAE提出了一种基于静态图像生成合成视角的自监督方法,有效解决了密集对应学习中的数据获取难题,其性能显著优于纯图像方法并大幅缩小了与视频方法的差距。
English Summary: CDG-MAE introduces a self-supervised method using synthetic views from static images to overcome data limitations in learning dense correspondences, significantly outperforming image-based approaches and narrowing the gap with video methods.
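The multi-anchor pretext task boils down to reconstructing the masked tokens of a target view from several encoded anchor views; the toy torch sketch below illustrates only that loss structure, not the paper's ViT architecture or diffusion-based view generation:

```python
import torch
import torch.nn as nn

class ToyMAE(nn.Module):
    """Minimal stand-in for a cross-view MAE (not CDG-MAE's real ViT)."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        self.decoder = nn.Linear(dim, dim)

    def loss(self, target, anchors, mask_ratio=0.75):
        B, N, D = target.shape
        n_mask = int(N * mask_ratio)
        idx = torch.rand(B, N).argsort(dim=1)
        masked = idx[:, :n_mask]                    # tokens to reconstruct
        # Average the encoded anchor views as cross-view context; harder
        # pretext tasks use more anchors with larger pose changes.
        ctx = torch.stack([self.encoder(a) for a in anchors]).mean(0)
        pred = self.decoder(ctx)                    # (B, N, D)
        gather = lambda t: t.gather(1, masked[:, :, None].expand(-1, -1, D))
        return ((gather(pred) - gather(target)) ** 2).mean()

mae = ToyMAE()
views = [torch.randn(2, 16, 64) for _ in range(3)]  # three synthetic anchors
print(mae.loss(torch.randn(2, 16, 64), views).item())
```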
Authors:Matteo Rufolo, Dario Piga, Marco Forgione
Abstract:
Meta learning aims at learning how to solve tasks, and thus allows one to estimate models that can be quickly adapted to new scenarios. This work explores distributionally robust minimization in meta learning for system identification. Standard meta learning approaches optimize the expected loss, overlooking task variability. We adopt an alternative, distributionally robust optimization paradigm that prioritizes high-loss tasks, enhancing performance in worst-case scenarios. Evaluated on a meta model trained on a class of synthetic dynamical systems and tested in both in-distribution and out-of-distribution settings, the proposed approach reduces failures in safety-critical applications.
中文: 本研究在元学习的系统辨识中采用分布式鲁棒优化方法,通过关注最坏情况下的任务来提升性能,并减少安全关键应用中的故障风险。
English: This study introduces a distributionally robust optimization method in meta learning for system identification, focusing on worst-case scenarios to improve performance and reduce failures in safety-critical applications.
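The switch from expected-loss to distributionally robust meta-training amounts to re-weighting each task batch toward its worst performers. A minimal sketch, assuming a CVaR-style top-fraction rule (one of several DRO variants; the paper's exact formulation may differ):

```python
import torch

def cvar_meta_loss(task_losses: torch.Tensor, alpha: float) -> torch.Tensor:
    """Average only the worst alpha-fraction of per-task losses (a CVaR objective).

    Standard meta-learning would return task_losses.mean(); weighting the
    high-loss tasks instead targets worst-case performance.
    """
    k = max(1, int(alpha * task_losses.numel()))
    worst, _ = torch.topk(task_losses, k)
    return worst.mean()

losses = torch.tensor([0.1, 0.4, 2.3, 0.2, 1.1])  # one loss per sampled system
print(cvar_meta_loss(losses, alpha=0.4))          # tensor(1.7000): mean of {2.3, 1.1}
```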
Authors:Jiao Chen, Kehui Yao, Reza Yousefi Maragheh, Kai Zhao, Jianpeng Xu, Jason Cho, Evren Korpeoglu, Sushant Kumar, Kannan Achan
Abstract:
Current recommendation systems often require some form of textual data summarization, such as generating concise and coherent titles for product carousels or other grouped item displays. While large language models have shown promise in NLP domains for textual summarization, these approaches do not directly apply to recommendation systems, where explanations must be highly relevant to the core features of item sets and adhere to strict word-limit constraints. In this paper, we propose CARTS (Collaborative Agents for Recommendation Textual Summarization), a multi-agent LLM framework designed for structured summarization in recommendation systems. CARTS decomposes the task into three stages: Generation Augmented Generation (GAG), a refinement circle, and arbitration, where successive agent roles are responsible for extracting salient item features, iteratively refining candidate titles based on relevance and length feedback, and selecting the final title through a collaborative arbitration process. Experiments on large-scale e-commerce data and live A/B testing show that CARTS significantly outperforms single-pass and chain-of-thought LLM baselines, delivering higher title relevance and improved user engagement metrics.
Chinese: 本文提出CARTS框架,通过多智能体协作分解特征提取、迭代优化和仲裁过程,为推荐系统生成精准简洁的标题,实验证明其显著提升标题相关性和用户参与度。
English: The paper introduces CARTS, a multi-agent LLM framework that enhances recommendation systems by generating highly relevant and concise titles through collaborative stages of feature extraction, iterative refinement, and arbitration, demonstrating superior performance in user engagement over existing methods.
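The three-stage decomposition can be pictured as a small agent loop; the sketch below is purely schematic, with a hypothetical llm() callable standing in for CARTS's actual prompts, agents, and arbitration policy:

```python
def carts_title(items, llm, max_words=8, rounds=3, n_candidates=4):
    """Schematic GAG -> refinement -> arbitration loop (hypothetical llm())."""
    features = llm(f"List the salient shared features of: {items}")  # GAG
    candidates = [llm(f"Title from features: {features}") for _ in range(n_candidates)]
    for _ in range(rounds):                                          # refinement circle
        candidates = [
            c if len(c.split()) <= max_words
            else llm(f"Shorten to <= {max_words} words, keep relevance: {c}")
            for c in candidates
        ]
    # Arbitration: a judge agent picks the final title.
    return llm(f"Pick the most relevant title for {items}: {candidates}")
```

The length check mirrors the strict word-limit constraint the abstract emphasizes; relevance feedback would be produced by another agent in the same loop.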
Authors:Ninareh Mehrabi, Tharindu Kumarage, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
Abstract:
Warning: This paper contains content that may be inappropriate or offensive.
AI agents have gained significant recent attention due to their autonomous tool usage capabilities and their integration in various real-world applications. This autonomy poses novel challenges for the safety of such systems, both in single- and multi-agent scenarios. We argue that existing red teaming or safety evaluation frameworks fall short in evaluating safety risks in complex behaviors, thought processes and actions taken by agents. Moreover, they fail to consider risks in multi-agent setups where various vulnerabilities can be exposed when agents engage in complex behaviors and interactions with each other. To address this shortcoming, we introduce the term kaleidoscopic teaming, which seeks to capture the complex and wide range of vulnerabilities that can arise in agents in both single-agent and multi-agent scenarios. We also present a new kaleidoscopic teaming framework that generates a diverse array of scenarios modeling real-world human societies. Our framework evaluates the safety of agents in both single-agent and multi-agent setups. In the single-agent setup, an agent is given a scenario that it needs to complete using the tools it has access to. In the multi-agent setup, multiple agents either compete against each other or cooperate to complete a task in the scenario, through which we capture existing safety vulnerabilities in agents. We introduce new in-context optimization techniques that can be used in our kaleidoscopic teaming framework to generate better scenarios for safety analysis. Lastly, we present appropriate metrics that can be used along with our framework to measure the safety of agents. Utilizing our kaleidoscopic teaming framework, we identify vulnerabilities in various models with respect to their safety in agentic use-cases.
中文摘要:本文提出创新的"万花筒式组队"框架,通过模拟真实社会情境来评估单智能体与多智能体系统的安全性,弥补现有红队测试方法的不足,并开发了新的优化技术与评估指标来识别各类安全漏洞。
English Summary: This paper introduces a novel "kaleidoscopic teaming" framework to address limitations in current safety evaluation methods for AI agents, proposing new techniques to identify vulnerabilities in both single-agent and multi-agent scenarios through realistic social simulations.
Authors:Dong Xiao, Guangyao Chen, Peixi Peng, Yangru Huang, Yifan Zhao, Yongxing Dai, Yonghong Tian
Abstract:
Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynchronous hybrid network that combines event streams from event cameras with image data from RGB cameras. Our network utilizes the high temporal resolution of event cameras through an asynchronous Graph Neural Network and integrates it with spatial features extracted by a CNN from RGB images. This combination effectively captures both the temporal dynamics and spatial details of the driving environment, enabling swift and precise anomaly detection. Extensive experiments on benchmark datasets show that our approach outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.
中文: 本文提出了一种用于自动驾驶的实时异常检测方法,通过多模态异步混合网络结合事件和RGB相机数据,实现了高精度和毫秒级响应时间。
English: This paper introduces a real-time anomaly detection method for autonomous driving that combines event and RGB camera data through a multimodal asynchronous hybrid network, achieving both high accuracy and millisecond-level response times.
Authors:Chaonan Ji, Jinwei Qi, Peng Zhang, Bang Zhang, Liefeng Bo
Abstract:
In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video, preserving the original body and background of the target video while allowing head expressions and movements to be tweaked during swapping as needed. Existing face-swapping methods mainly focus on localized facial replacement, neglecting holistic head morphology, while head-swapping approaches struggle with hairstyle diversity and complex backgrounds, and none of these methods allow users to modify the transplanted head's expressions after swapping. To tackle these challenges, our method incorporates several innovative strategies through a unified latent diffusion paradigm. 1) Identity-preserving context fusion: We propose a shape-agnostic mask strategy to explicitly disentangle foreground head identity features from background/body contexts, combined with a hair enhancement strategy to achieve robust holistic head identity preservation across diverse hair types and complex backgrounds. 2) Expression-aware landmark retargeting and editing: We propose a disentangled 3DMM-driven retargeting module that decouples identity, expression, and head poses, minimizing the impact of original expressions in input images and supporting expression editing. A scale-aware retargeting strategy is further employed to minimize cross-identity expression distortion for higher transfer precision. Experimental results demonstrate that our method excels in seamless background integration while preserving the identity of the source portrait, as well as showcasing superior expression transfer capabilities applicable to both real and virtual characters.
中文: 本文提出一种基于扩散模型的视频头部替换框架,通过身份保持融合和解耦标志点重定向技术,在保留目标视频背景的同时实现头部表情编辑功能。
English: This paper introduces a diffusion-based framework for video head swapping that preserves target video backgrounds while enabling expression editing through identity-preserving fusion and disentangled landmark retargeting.
Authors:Can Lin, Daniele Affinita, Marco E. P. Zimmatore, Daniele Nardi, Domenico D. Bloisi, Vincenzo Suriani
Abstract:
Robust and accurate ball detection is a critical component for autonomous humanoid soccer robots, particularly in dynamic and challenging environments such as RoboCup outdoor fields. However, traditional supervised approaches require extensive manual annotation, which is costly and time-intensive. To overcome this problem, we present a self-supervised learning framework for domain-adaptive feature extraction to enhance ball detection performance. The proposed approach leverages a general-purpose pretrained model to generate pseudo-labels, which are then used in a suite of self-supervised pretext tasks -- including colorization, edge detection, and triplet loss -- to learn robust visual features without relying on manual annotations. Additionally, a model-agnostic meta-learning (MAML) strategy is incorporated to ensure rapid adaptation to new deployment scenarios with minimal supervision. A new dataset comprising 10,000 labeled images from outdoor RoboCup SPL matches is introduced, used to validate the method, and made available to the community. Experimental results demonstrate that the proposed pipeline outperforms baseline models in terms of accuracy, F1 score, and IoU, while also exhibiting faster convergence.
中文: 本文提出了一种自监督学习框架,利用伪标签和前置任务来提升足球机器人球体检测能力,无需昂贵的人工标注,并通过新的户外数据集验证了其优越性能。
English: This paper introduces a self-supervised learning framework that uses pseudo-labels and pretext tasks to enhance ball detection for soccer robots, eliminating the need for costly manual annotations and demonstrating superior performance through a new outdoor dataset.
Authors:Zeqiang Lai, Yunfei Zhao, Haolin Liu, Zibo Zhao, Qingxiang Lin, Huiwen Shi, Xianghui Yang, Mingxin Yang, Shuhui Yang, Yifei Feng, Sheng Zhang, Xin Huang, Di Luo, Fan Yang, Fang Yang, Lifu Wang, Sicong Liu, Yixuan Tang, Yulin Cai, Zebin He, Tian Liu, Yuhong Liu, Jie Jiang, Linus, Jingwei Huang, Chunchao Guo
Abstract:
In this report, we present Hunyuan3D 2.5, a robust suite of 3D diffusion models aimed at generating high-fidelity and detailed textured 3D assets. Hunyuan3D 2.5 follows the two-stage pipeline of its previous version, Hunyuan3D 2.0, while demonstrating substantial advancements in both shape and texture generation. In terms of shape generation, we introduce a new shape foundation model -- LATTICE, which is trained with scaled high-quality datasets, model size, and compute. Our largest model reaches 10B parameters and generates sharp and detailed 3D shapes with precise image-3D following while keeping mesh surfaces clean and smooth, significantly closing the gap between generated and handcrafted 3D shapes. In terms of texture generation, the suite is upgraded with physically-based rendering (PBR) via a novel multi-view architecture extended from the Hunyuan3D 2.0 Paint model. Our extensive evaluation shows that Hunyuan3D 2.5 significantly outperforms previous methods in both shape and end-to-end texture generation.
Hunyuan3D 2.5是一套强大的3D扩散模型,通过引入LATTICE形状基础模型和基于物理渲染的纹理生成技术,在创建高保真3D资产方面显著超越了先前方法。
Hunyuan3D 2.5 is a robust 3D diffusion model suite that introduces the LATTICE shape foundation model and PBR-enhanced texture generation, significantly outperforming previous methods in creating high-fidelity 3D assets.
Authors:Xiaojun Dong, Andy Li, Yan Gu, Yihan Sun
Abstract:
We propose Orionet, efficient parallel implementations of Point-to-Point Shortest Paths (PPSP) queries using bidirectional search (BiDS) and other heuristics, with an additional focus on batch PPSP queries. We present a framework for parallel PPSP built on existing single-source shortest paths (SSSP) frameworks by incorporating pruning conditions. As a result, we develop efficient parallel PPSP algorithms based on early termination, bidirectional search, A$^*$ search, and bidirectional A$^*$ all with simple and efficient implementations.
We extend our idea to batch PPSP queries, which are widely used in real-world scenarios. We first design a simple and flexible abstraction to represent the batch so that PPSP queries can leverage the information shared within the batch. Orionet formalizes the batch as a query graph represented by edges between queried sources and targets. In this way, we directly extend our PPSP framework to batched queries in a simple and efficient way.
We evaluate Orionet on both single and batch PPSP queries using various graph types and distance percentiles of queried pairs, and compare it against two baselines, GraphIt and MBQ. Both of them support parallel single PPSP and A$^*$ using unidirectional search. On 14 graphs we tested, on average, our bidirectional search is 2.9$\times$ faster than GraphIt, and 6.8$\times$ faster than MBQ. Our bidirectional A$^*$ is 4.4$\times$ and 6.2$\times$ faster than the A$^*$ in GraphIt and MBQ, respectively. For batched PPSP queries, we also provide in-depth experimental evaluation, and show that Orionet provides strong performance compared to the plain solutions.
中文: Orionet提出了基于双向搜索和批处理的并行点对点最短路径查询高效算法,相比现有方法实现了显著的性能提升。
English: Orionet introduces efficient parallel algorithms for point-to-point shortest path queries using bidirectional search and batch processing, achieving significant speed improvements over existing methods.
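For reference, the sequential bidirectional Dijkstra that BiDS approaches build on looks as follows; Orionet's contribution lies in parallelizing this and adding pruning conditions, which this sketch does not attempt:

```python
import heapq

def bidirectional_dijkstra(adj, radj, s, t):
    """Standard bidirectional Dijkstra on a directed graph.

    adj/radj: {u: [(v, w), ...]} forward and reverse adjacency lists.
    Stops once the two frontiers provably bracket the best meeting point.
    """
    dist = [{s: 0.0}, {t: 0.0}]
    pq = [[(0.0, s)], [(0.0, t)]]
    graphs = [adj, radj]
    best = float("inf")
    while pq[0] and pq[1]:
        side = 0 if pq[0][0][0] <= pq[1][0][0] else 1
        d, u = heapq.heappop(pq[side])
        if d > dist[side].get(u, float("inf")):
            continue  # stale queue entry
        if u in dist[1 - side]:
            best = min(best, d + dist[1 - side][u])
        # Termination: no undiscovered path can beat `best`.
        if pq[0] and pq[1] and pq[0][0][0] + pq[1][0][0] >= best:
            break
        for v, w in graphs[side].get(u, []):
            nd = d + w
            if nd < dist[side].get(v, float("inf")):
                dist[side][v] = nd
                heapq.heappush(pq[side], (nd, v))
    return best

adj = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.0)]}
radj = {"b": [("a", 1.0)], "c": [("a", 4.0), ("b", 1.0)]}
print(bidirectional_dijkstra(adj, radj, "a", "c"))  # 2.0 via a-b-c
```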
Authors:Doyeop Kwak, Youngjoon Jang, Seongyu Kim, Joon Son Chung
Abstract:
Speech signals in real-world environments are frequently affected by various distortions such as additive noise, reverberation, and bandwidth limitation, which may appear individually or in combination. Traditional speech enhancement methods typically rely on either masking, which focuses on suppressing non-speech components while preserving observable structure, or mapping, which seeks to recover clean speech through direct transformation of the input. Each approach offers strengths in specific scenarios but may be less effective outside its target conditions. We propose the Erase and Draw Network (EDNet), a distortion-agnostic speech enhancement framework designed to handle a broad range of distortion types without prior assumptions about task or input characteristics. EDNet consists of two main components: (1) the Gating Mamba (GM) module, which adaptively combines masking and mapping through a learnable gating mechanism that selects between suppression (Erase) and reconstruction (Draw) based on local signal features, and (2) Phase Shift-Invariant Training (PSIT), a shift-tolerant supervision strategy that improves phase estimation by enabling dynamic alignment during training while remaining compatible with standard loss functions. Experimental results on denoising, dereverberation, bandwidth extension, and multi-distortion enhancement tasks show that EDNet consistently achieves strong performance across conditions, demonstrating its architectural flexibility and adaptability to diverse task settings.
中文摘要:Erase and Draw Network (EDNet) 是一种与失真类型无关的语音增强框架,通过门控机制自适应结合掩蔽和映射策略,在噪声、混响和带宽限制等多种失真条件下均表现出强大的性能。
English Summary: The Erase and Draw Network (EDNet) is a distortion-agnostic speech enhancement framework that adaptively combines masking and mapping strategies through a gating mechanism, achieving robust performance across diverse distortion types including noise, reverberation, and bandwidth limitations.
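The Erase/Draw split reduces to a learned per-position blend of a masking branch and a mapping branch; the toy 1-D conv sketch below shows that gating pattern only (the actual GM module is Mamba-based and far richer):

```python
import torch
import torch.nn as nn

class ToyGatedEnhancer(nn.Module):
    """g(x) * (mask(x) * x) + (1 - g(x)) * map(x): suppress vs. reconstruct.

    Toy 1-D conv stand-in for EDNet's Gating Mamba module.
    """
    def __init__(self, ch=1, hidden=16):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv1d(ch, hidden, 5, padding=2), nn.ReLU(),
                                  nn.Conv1d(hidden, ch, 5, padding=2), nn.Sigmoid())
        self.map = nn.Sequential(nn.Conv1d(ch, hidden, 5, padding=2), nn.ReLU(),
                                 nn.Conv1d(hidden, ch, 5, padding=2))
        self.gate = nn.Sequential(nn.Conv1d(ch, ch, 5, padding=2), nn.Sigmoid())

    def forward(self, x):                 # x: (B, 1, T) noisy waveform/feature
        g = self.gate(x)                  # per-position Erase-vs-Draw choice
        return g * (self.mask(x) * x) + (1 - g) * self.map(x)

net = ToyGatedEnhancer()
print(net(torch.randn(2, 1, 160)).shape)  # torch.Size([2, 1, 160])
```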
Authors:Jiyao Wang, Xiao Yang, Hao Lu, Dengbo He, Kaishun Wu
Abstract:
Multi-source synsemantic domain generalization (MSSDG) for multi-task remote physiological measurement seeks to enhance the generalizability of these measurements and is attracting increasing attention. However, challenges like partial labeling and environmental noise may disrupt task-specific accuracy. Meanwhile, given that real-time adaptation is necessary for personalized products, test-time personalized adaptation (TTPA) after MSSDG is also worth exploring, yet the gap between previous generalization and personalization methods is significant and hard to bridge. We therefore propose GAP, a unified framework for MSSDG and TTPA that employs Priors in biometrics and remote photoplethysmography (rPPG). We first disentangle the information in face videos into invariant semantics, individual bias, and noise. Multiple modules incorporating priors and our observations are then applied at different stages and to different facial information. Based on the different principles for achieving generalization and personalization, our framework can simultaneously address MSSDG and TTPA under multi-task remote physiological estimation with minimal adjustments. We extend the MSSDG benchmark to the TTPA protocol on six publicly available datasets and introduce a new real-world driving dataset with complete labeling. Extensive experiments validate our approach, and the code along with the new dataset will be released.
中文: 本研究提出名为GAP的统一框架,通过分解面部信息并运用先验知识,在远程生理测量中同时解决多源同语义域泛化与测试时个性化适配问题。
English: The study introduces a unified framework called GAP that simultaneously addresses multi-source synsemantic domain generalization and test-time personalized adaptation for remote physiological measurement by disentangling facial information and applying priors across different stages.
Authors:Chenxi Wang, Yixuan Zhang, Lang Gao, Zixiang Xu, Zirui Song, Yanbo Wang, Xiuying Chen
Abstract:
Language is not only a tool for communication but also a medium for human cognition and reasoning. If, as linguistic relativity suggests, the structure of language shapes cognitive patterns, then large language models (LLMs) trained on human language may also internalize the habitual logical structures embedded in different languages. To examine this hypothesis, we introduce BICAUSE, a structured bilingual dataset for causal reasoning, which includes semantically aligned Chinese and English samples in both forward and reversed causal forms. Our study reveals three key findings: (1) LLMs exhibit typologically aligned attention patterns, focusing more on causes and sentence-initial connectives in Chinese, while showing a more balanced distribution in English. (2) Models internalize language-specific preferences for causal word order and often rigidly apply them to atypical inputs, leading to degraded performance, especially in Chinese. (3) When causal reasoning succeeds, model representations converge toward semantically aligned abstractions across languages, indicating a shared understanding beyond surface form. Overall, these results suggest that LLMs not only mimic surface linguistic forms but also internalize the reasoning biases shaped by language. Rooted in cognitive linguistic theory, this phenomenon is for the first time empirically verified through structural analysis of model internals.
中文摘要:大型语言模型内化了特定语言的推理偏差,BICAUSE双语数据集显示模型在中文中更关注因果链与句首连接词,在英语中分布更均衡,且会僵化应用习得语序,但在成功推理时能实现跨语言的语义抽象趋同。
English Summary: Large language models internalize language-specific reasoning biases, as demonstrated by the BICAUSE dataset showing distinct causal attention patterns in Chinese and English, with models rigidly applying learned word orders and converging on shared semantic abstractions during successful reasoning.
Authors:Emanuele Musumeci, Michele Brienza, Francesco Argenziano, Vincenzo Suriani, Daniele Nardi, Domenico D. Bloisi
Abstract:
Classical planning in AI and Robotics addresses complex tasks by shifting from imperative to declarative approaches (e.g., PDDL). However, these methods often fail in real scenarios due to limited robot perception and the need to ground perceptions to planning predicates. This often results in heavily hard-coded behaviors that struggle to adapt, even in scenarios where goals can be achieved through relaxed planning. Meanwhile, Large Language Models (LLMs) lead to planning systems that leverage commonsense reasoning, but often at the cost of generating unfeasible and/or unsafe plans. To address these limitations, we present an approach integrating classical planning with LLMs, leveraging their ability to extract commonsense knowledge and ground actions. We propose a hierarchical formulation that enables robots to make unfeasible tasks tractable by defining functionally equivalent goals through gradual relaxation. This mechanism supports partial achievement of the intended objective, suited to the agent's specific context. Our method demonstrates its ability to adapt and execute tasks effectively within environments modeled using 3D Scene Graphs through comprehensive qualitative and quantitative evaluations. We also show how this method succeeds in complex scenarios where other benchmark methods are more likely to fail. Code, dataset, and additional material are released to the community.
Chinese: 本文提出一种将经典规划与大语言模型相结合的方法,通过逐步放宽约束来定义功能等价的目标,使机器人能够在以3D场景图建模的环境中适应并完成原本不可行的任务。
English: This work integrates classical planning with LLMs, defining functionally equivalent goals through gradual constraint relaxation so that robots can adapt and complete otherwise unfeasible tasks in environments modeled with 3D Scene Graphs.
Authors:Emanuele Musumeci, Michele Brienza, Francesco Argenziano, Abdel Hakim Drid, Vincenzo Suriani, Daniele Nardi, Domenico D. Bloisi
Abstract:
Embodied agents need to plan and act reliably in real and complex 3D environments. Classical planning (e.g., PDDL) offers structure and guarantees, but in practice it fails under noisy perception and incorrect predicate grounding. On the other hand, Large Language Models (LLMs)-based planners leverage commonsense reasoning, yet frequently propose actions that are unfeasible or unsafe. Following recent works that combine the two approaches, we introduce ContextMatters, a framework that fuses LLMs and classical planning to perform hierarchical goal relaxation: the LLM helps ground symbols to the scene and, when the target is unreachable, it proposes functionally equivalent goals that progressively relax constraints, adapting the goal to the context of the agent's environment. Operating on 3D Scene Graphs, this mechanism turns many nominally unfeasible tasks into tractable plans and enables context-aware partial achievement when full completion is not achievable. Our experimental results show a +52.45% Success Rate improvement over a state-of-the-art LLM+PDDL baseline, demonstrating the effectiveness of our approach. Moreover, we validate the execution of ContextMatters in a real-world scenario by deploying it on a TIAGo robot. Code, dataset, and supplementary materials are available to the community at https://lab-rococo-sapienza.github.io/context-matters/.
Chinese: ContextMatters框架融合了大语言模型与经典规划方法,通过分层目标松弛机制,在复杂3D环境中显著提升了任务成功率与适应性。
English: The ContextMatters framework integrates Large Language Models and classical planning to enable hierarchical goal relaxation, significantly improving task success rates and adaptability in complex 3D environments.
Authors:Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, Sergey Levine
Abstract:
Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies -- a state-of-the-art BC methodology -- we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.
中文摘要:本研究提出DSRL方法,通过在潜在噪声空间中进行强化学习,无需修改原始策略权重即可实现行为克隆扩散策略的自主现实世界适应,具有高样本效率并有效提升现实世界策略性能。
English Summary: This research introduces DSRL, a sample-efficient reinforcement learning method that enables autonomous real-world adaptation of behaviorally cloned diffusion policies by optimizing in their latent noise space without modifying the original policy weights.
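The central trick is to treat the frozen diffusion policy as a black box and let an RL actor choose its initial latent noise; a schematic sketch, where diffusion_policy and all dimensions are stand-ins rather than the paper's implementation:

```python
import torch
import torch.nn as nn

latent_dim, obs_dim, act_dim = 16, 10, 4

class NoiseActor(nn.Module):
    """RL actor that outputs the latent noise fed to a frozen BC policy."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, latent_dim))
    def forward(self, obs):
        return self.net(obs)

def diffusion_policy(obs, noise):
    """Black-box stand-in: maps (obs, initial noise) -> action.

    DSRL only needs forward access to this; its weights never change.
    """
    torch.manual_seed(0)  # placeholder for the frozen denoising process
    W = torch.randn(obs_dim + latent_dim, act_dim)
    return torch.cat([obs, noise], dim=-1) @ W

actor = NoiseActor()
obs = torch.randn(1, obs_dim)
action = diffusion_policy(obs, actor(obs))  # steer actions via the noise seed
print(action.shape)                         # torch.Size([1, 4])
# An off-policy RL algorithm would then optimize the actor's parameters
# against task reward while the diffusion policy itself stays frozen.
```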
Authors:Shuo Xing, Lanqing Guo, Hongyuan Hua, Seoyoung Lee, Peiran Li, Yufei Wang, Zhangyang Wang, Zhengzhong Tu
Abstract:
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT delivers significant average-accuracy gains, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery in the new era of AI being the main data customer.
English Summary: Multimodal Large Language Models (MLLMs) surprisingly perform better with visually degraded images than high-fidelity ones, leading to the development of Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight module that dynamically adapts images to boost model accuracy across tasks without external resources.
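A rough sketch of the first VQ-TTT component, a learnable low-rank kernel placed before a frozen vision encoder; the rank-1 depthwise construction and sizes here are assumptions for illustration (the LoRA part is omitted):

```python
import torch
import torch.nn as nn

class LowRankPreFilter(nn.Module):
    """Learnable low-rank conv inserted before a frozen vision encoder.

    Built from a rank-1 outer product of two 1-D kernels, so the filter can
    cheaply reshape an image's frequency content per task. Sizes are
    illustrative; VQ-TTT additionally applies LoRA to shallow encoder layers.
    """
    def __init__(self, k: int = 7):
        super().__init__()
        self.u = nn.Parameter(torch.zeros(k))  # start near identity
        self.v = nn.Parameter(torch.zeros(k))
        self.u.data[k // 2] = 1.0
        self.v.data[k // 2] = 1.0
        self.k = k

    def forward(self, x):                       # x: (B, 3, H, W)
        kernel = torch.outer(self.u, self.v)    # (k, k), rank 1
        kernel = kernel.expand(3, 1, self.k, self.k)
        return nn.functional.conv2d(x, kernel, padding=self.k // 2, groups=3)

pre = LowRankPreFilter()
img = torch.randn(1, 3, 224, 224)
out = pre(img)              # initially ~= img; tuned per input at test time
print(torch.allclose(out, img, atol=1e-5), out.shape)
```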
Authors:Yusuf Sulistyo Nugroho, Farah Danisha Salam, Brittany Reid, Raula Gaikovina Kula, Kazumasa Shimari, Kenichi Matsumoto
Abstract:
Documenting code snippets is essential to pinpoint key areas where both developers and users should pay attention. Examples include usage examples and other Application Programming Interfaces (APIs), which are especially important for third-party libraries. With the rise of Large Language Models (LLMs), the key goal is to investigate the kinds of descriptions developers commonly use and evaluate how well an LLM, in this case Llama, can support description generation. We use NPM Code Snippets, consisting of 185,412 packages with 1,024,579 code snippets. From there, we use 400 code snippets (and their descriptions) as samples. First, our manual classification found that the majority of original descriptions (55.5%) highlight example-based usage. This finding emphasizes the importance of clear documentation, as some descriptions lacked sufficient detail to convey intent. Second, the LLM correctly identified the majority of original descriptions as "Example" (79.75%), which is identical to our manual finding, showing a propensity for generalization. Third, compared to the originals, the produced descriptions had an average similarity score of 0.7173, suggesting relevance but room for improvement. Scores below 0.9 indicate some irrelevance. Our results show that, depending on the task of the code snippet, the intention of the documentation may vary, ranging from usage or installation instructions to descriptive learning examples for any user of a library.
中文摘要: 本研究通过分析NPM代码片段发现开发者文档主要采用示例用法,并评估Llama大语言模型的描述生成能力,结果显示其能准确识别示例类型但生成内容与原始描述的相似度中等,仍有改进空间。
English Summary: This study analyzes developer documentation patterns using NPM code snippets and evaluates Llama LLM's ability to generate descriptions, finding it effectively identifies example-based usage but produces descriptions with moderate similarity to originals, indicating room for improvement.
Authors:Ziqiao Peng, Wentao Hu, Junyuan Ma, Xiangyu Zhu, Xiaomei Zhang, Hao Zhao, Hui Tian, Jun He, Hongyan Liu, Zhaoxin Fan
Abstract:
Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic results. To address the critical issue of synchronization, identified as the ''devil'' in creating realistic talking heads, we introduce SyncTalk++, which features a Dynamic Portrait Renderer with Gaussian Splatting to ensure consistent subject identity preservation and a Face-Sync Controller that aligns lip movements with speech while innovatively using a 3D facial blendshape model to reconstruct accurate facial expressions. To ensure natural head movements, we propose a Head-Sync Stabilizer, which optimizes head poses for greater stability. Additionally, SyncTalk++ enhances robustness to out-of-distribution (OOD) audio by incorporating an Expression Generator and a Torso Restorer, which generate speech-matched facial expressions and seamless torso regions. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second. Extensive experiments and user studies demonstrate that SyncTalk++ outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk++.
中文: SyncTalk++通过动态肖像渲染器和面部同步控制器解决语音驱动说话头合成的同步问题,结合头部同步稳定器优化头部姿态,显著提升了真实感和渲染速度。
English: SyncTalk++ tackles the synchronization challenge in speech-driven talking head synthesis by integrating a Dynamic Portrait Renderer and Face-Sync Controller for lip, expression, and identity alignment, along with a Head-Sync Stabilizer for natural head movements, achieving superior realism and speed.
Authors:Yu Qi, Lipeng Gu, Honghua Chen, Liangliang Nan, Mingqiang Wei
Abstract:
Existing 3D visual grounding methods rely on precise text prompts to locate objects within 3D scenes. Speech, as a natural and intuitive modality, offers a promising alternative. Real-world speech inputs, however, often suffer from transcription errors due to accents, background noise, and varying speech rates, limiting the applicability of existing 3DVG methods. To address these challenges, we propose SpeechRefer, a novel 3DVG framework designed to enhance performance in the presence of noisy and ambiguous speech-to-text transcriptions. SpeechRefer integrates seamlessly with existing 3DVG models and introduces two key innovations. First, the Speech Complementary Module captures acoustic similarities between phonetically related words and highlights subtle distinctions, generating complementary proposal scores from the speech signal. This reduces dependence on potentially erroneous transcriptions. Second, the Contrastive Complementary Module employs contrastive learning to align erroneous text features with corresponding speech features, ensuring robust performance even when transcription errors dominate. Extensive experiments on the SpeechRefer and SpeechNr3D datasets demonstrate that SpeechRefer improves the performance of existing 3DVG methods by a large margin, which highlights SpeechRefer's potential to bridge the gap between noisy speech inputs and reliable 3DVG, enabling more intuitive and practical multimodal systems.
中文: 现有3D视觉定位方法依赖精确文本提示,而语音输入常因转录错误受限;SpeechRefer框架通过捕捉语音声学相似性并采用对比学习对齐特征,显著提升了在噪声转录下的性能,为多模态系统提供更实用的解决方案。
English: Current 3D visual grounding methods depend on accurate text prompts, but speech inputs often contain transcription errors; the proposed SpeechRefer framework enhances robustness by leveraging acoustic similarities and contrastive learning to align speech and text features, significantly improving performance despite noisy transcriptions.
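The Contrastive Complementary Module's alignment step is, at its core, a symmetric InfoNCE between paired speech and (possibly erroneous) text features; a generic sketch with assumed dimensions and temperature, not SpeechRefer's exact head:

```python
import torch
import torch.nn.functional as F

def align_loss(speech_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE: pull each transcription's features toward the
    features of the speech it came from, so speech evidence can correct
    erroneous text. Generic sketch, not SpeechRefer's exact formulation.
    """
    s = F.normalize(speech_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = s @ t.T / temperature        # (B, B) similarity matrix
    labels = torch.arange(len(s))         # diagonal pairs are positives
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

print(align_loss(torch.randn(8, 256), torch.randn(8, 256)))
```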
Authors:Silvia Casola, Yang Janet Liu, Siyao Peng, Oliver Kraus, Albert Gatt, Barbara Plank
Abstract:
Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of the reference set on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.
中文摘要:本研究发现广泛使用的基于参考的摘要评估指标(特别是ROUGE)在不同参考集间存在显著不稳定性,这削弱了模型比较的可靠性且与人工评判相关性弱,建议纳入参考集差异以提升评估一致性。
English Summary: This study reveals that widely-used reference-based summarization metrics, particularly ROUGE, show significant instability across different reference sets, undermining reliable model comparisons and showing weak correlation with human judgments, recommending incorporating reference variation for more consistent evaluation.
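The instability analysis amounts to re-scoring systems under resampled reference subsets and checking whether the ranking holds; a schematic sketch with a hypothetical score(summary, refs) standing in for any reference-based metric such as ROUGE:

```python
import random

def ranking_stability(systems, references, score, trials=100, subset=3):
    """Count how often the top-ranked system changes across reference subsets.

    systems: {name: summary}; score(summary, refs) is a stand-in for any
    reference-based metric (higher assumed better). A stable metric should
    concentrate the returned distribution on a single winner.
    """
    winners = []
    for _ in range(trials):
        refs = random.sample(references, subset)
        best = max(systems, key=lambda name: score(systems[name], refs))
        winners.append(best)
    return {w: winners.count(w) / trials for w in set(winners)}
```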
Authors:Zeyuan Chen, Qiyang Yan, Yuanpei Chen, Tianhao Wu, Jiyao Zhang, Zihan Ding, Jinzhou Li, Yaodong Yang, Hao Dong
Abstract:
Dexterous grasping in cluttered scenes presents significant challenges due to diverse object geometries, occlusions, and potential collisions. Existing methods primarily focus on single-object grasping or grasp-pose prediction without interaction, which are insufficient for complex, cluttered scenes. Recent vision-language-action models offer a potential solution but require extensive real-world demonstrations, making them costly and difficult to scale. To address these limitations, we revisit the sim-to-real transfer pipeline and develop key techniques that enable zero-shot deployment in reality while maintaining robust generalization. We propose ClutterDexGrasp, a two-stage teacher-student framework for closed-loop target-oriented dexterous grasping in cluttered scenes. The framework features a teacher policy trained in simulation using clutter density curriculum learning, incorporating both a geometry and spatially-embedded scene representation and a novel comprehensive safety curriculum, enabling general, dynamic, and safe grasping behaviors. Through imitation learning, we distill the teacher's knowledge into a student 3D diffusion policy (DP3) that operates on partial point cloud observations. To the best of our knowledge, this represents the first zero-shot sim-to-real closed-loop system for target-oriented dexterous grasping in cluttered scenes, demonstrating robust performance across diverse objects and layouts. More details and videos are available at https://clutterdexgrasp.github.io/.
中文: 本研究提出ClutterDexGrasp框架,通过师生架构和模拟安全课程训练,实现了杂乱场景中目标导向灵巧抓取的零样本仿真到现实迁移,无需真实世界演示即展现强大泛化能力。
English: The study introduces ClutterDexGrasp, a two-stage teacher-student framework that achieves zero-shot sim-to-real transfer for closed-loop dexterous grasping in cluttered scenes through imitation learning and a safety curriculum, demonstrating robust generalization without real-world training data.
Authors:Xiyu Zhao, Qimei Cui, Weicai Li, Wei Ni, Ekram Hossain, Quan Z. Sheng, Xiaofeng Tao, Ping Zhang
Abstract:
Personalized federated learning (PFL), e.g., the renowned Ditto, strikes a balance between personalization and generalization by conducting federated learning (FL) to guide personalized learning (PL). While FL is unaffected by personalized model training, in Ditto, PL depends on the outcome of the FL. However, the clients' concern about their privacy and the consequent perturbation of their local models can affect the convergence and (performance) fairness of PL. This paper presents a PFL framework, DP-Ditto, which is a non-trivial extension of Ditto under the protection of differential privacy (DP), and analyzes the trade-off among its privacy guarantee, model convergence, and performance distribution fairness. We also analyze the convergence upper bound of the personalized models under DP-Ditto and derive the optimal number of global aggregations given a privacy budget. Further, we analyze the performance fairness of the personalized models, and reveal the feasibility of optimizing DP-Ditto jointly for convergence and fairness. Experiments validate our analysis and demonstrate that DP-Ditto can surpass the DP-perturbed versions of the state-of-the-art PFL models, such as FedAMP, pFedMe, APPLE, and FedALA, by over 32.71% in fairness and 9.66% in accuracy.
中文:DP-Ditto是Ditto个性化联邦学习框架在差分隐私保护下的重要扩展,它在保障隐私的同时优化了模型收敛与性能公平性,并在准确性和公平性上显著优于现有先进方法。
English: DP-Ditto is a differentially private extension of the Ditto framework for personalized federated learning, enhancing privacy while maintaining a balance between model convergence, fairness, and outperforming existing methods in accuracy and fairness metrics.
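The DP protection enters where each client perturbs its local update before sharing; a minimal Gaussian-mechanism sketch of that step, with illustrative clip and noise values (DP-Ditto's contribution is analyzing how this perturbation propagates into PL convergence and fairness):

```python
import torch

def dp_perturb(update: torch.Tensor, clip: float = 1.0,
               noise_multiplier: float = 1.1) -> torch.Tensor:
    """Clip a client's model update and add Gaussian noise (DP-SGD style).

    Illustrative parameters; the privacy budget epsilon follows from
    noise_multiplier, clip, and the number of global aggregations.
    """
    norm = update.norm()
    clipped = update * min(1.0, clip / (norm + 1e-12))
    return clipped + torch.randn_like(update) * noise_multiplier * clip

local_update = torch.randn(1000) * 0.1
print(dp_perturb(local_update).norm())  # noisy, norm-bounded update sent to the server
```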
Authors:Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the $Pass@K$ metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the $Pass@K$ metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, $CoT$-$Pass@K$, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using $CoT$-$Pass@K$, we observe that RLVR can incentivize the generalization of correct reasoning for all values of $K$. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
中文: 本文指出Pass@K指标在衡量推理能力时存在缺陷,提出同时要求推理过程与最终答案均正确的CoT-Pass@K指标,并从理论与实验两方面证明RLVR能够真正激励大语言模型形成正确推理。
English: This paper identifies Pass@K as a flawed measure of reasoning, introduces CoT-Pass@K, which credits a solution only when both the chain of thought and the final answer are correct, and shows theoretically and empirically that RLVR genuinely incentivizes correct reasoning in LLMs.
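The metric can reuse the standard unbiased pass@k estimator, with "correct" tightened to require a sound chain of thought; a sketch (how CoT validity is judged is left to a separate grading step):

```python
from math import comb

def cot_pass_at_k(n: int, c_cot: int, k: int) -> float:
    """Unbiased pass@k estimator with c_cot = samples whose reasoning AND
    final answer are both correct (out of n generations).

    Plain Pass@K would count answer-only correctness, crediting lucky
    guesses reached through flawed chains of thought.
    """
    if n - c_cot < k:
        return 1.0
    return 1.0 - comb(n - c_cot, k) / comb(n, k)

# 100 samples: 40 have the right answer, but only 25 also have sound reasoning.
print(cot_pass_at_k(100, 40, 10))  # answer-only Pass@10
print(cot_pass_at_k(100, 25, 10))  # stricter CoT-Pass@10
```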
Authors:Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang
Abstract:
Recent advancements in long chain-of-thought (CoT) reasoning, particularly through the Group Relative Policy Optimization algorithm used by DeepSeek-R1, have led to significant interest in the potential of Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). While RLVR promises to improve reasoning by allowing models to learn from free exploration, there remains debate over whether it truly enhances reasoning abilities or simply boosts sampling efficiency. This paper systematically investigates the impact of RLVR on LLM reasoning. We revisit Pass@K experiments and demonstrate that RLVR can extend the reasoning boundary for both mathematical and coding tasks. This is supported by our introduction of a novel evaluation metric, CoT-Pass@K, which captures reasoning success by accounting for both the final answer and intermediate reasoning steps. Furthermore, we present a theoretical framework explaining RLVR's incentive mechanism, demonstrating how it can encourage correct reasoning even when rewards are based solely on answer correctness. Our analysis of RLVR's training dynamics reveals that it incentivizes correct reasoning early in the process, with substantial improvements in reasoning quality confirmed through extensive evaluations. These findings provide strong evidence of RLVR's potential to enhance LLM reasoning, offering valuable insights into its mechanisms and performance improvements.
中文: 最新研究表明,基于可验证奖励的强化学习(RLVR)通过扩展数学和编程任务的推理边界,显著提升了大语言模型的推理能力,这得到了新型评估指标CoT-Pass@K和显示早期正确推理激励的理论框架的支持。
English: Recent research demonstrates that Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances LLM reasoning capabilities by extending reasoning boundaries in mathematical and coding tasks, supported by a novel CoT-Pass@K metric and theoretical framework showing early incentive for correct reasoning.
Authors:Yinuo Zheng, Lipeng Gu, Honghua Chen, Liangliang Nan, Mingqiang Wei
Abstract:
3D visual grounding (3DVG) is a critical task in scene understanding that aims to identify objects in 3D scenes based on text descriptions. However, existing methods rely on separately pre-trained vision and text encoders, resulting in a significant gap between the two modalities in terms of spatial geometry and semantic categories. This discrepancy often causes errors in object positioning and classification. The paper proposes UniSpace-3D, which innovatively introduces a unified representation space for 3DVG, effectively bridging the gap between visual and textual features. Specifically, UniSpace-3D incorporates three innovative designs: i) a unified representation encoder that leverages the pre-trained CLIP model to map visual and textual features into a unified representation space, effectively bridging the gap between the two modalities; ii) a multi-modal contrastive learning module that further reduces the modality gap; iii) a language-guided query selection module that utilizes the positional and semantic information to identify object candidate points aligned with textual descriptions. Extensive experiments demonstrate that UniSpace-3D outperforms baseline models by at least 2.24% on the ScanRefer and Nr3D/Sr3D datasets. The code will be made available upon acceptance of the paper.
中文: 本文提出UniSpace-3D方法,通过构建统一表示空间有效弥合3D视觉定位中视觉与文本的模态差异,在多个数据集上显著超越了基线模型。
English: The paper introduces UniSpace-3D, a novel method that bridges the gap between vision and text in 3D visual grounding by creating a unified representation space, significantly improving performance over existing approaches.
Authors:Arya Fayyazi, Mehdi Kamal, Massoud Pedram
Abstract:
This paper introduces MARCO (Multi-Agent Reinforcement learning with Conformal Optimization), a novel hardware-aware framework for efficient neural architecture search (NAS) targeting resource-constrained edge devices. By significantly reducing search time and maintaining accuracy under strict hardware constraints, MARCO bridges the gap between automated DNN design and CAD for edge AI deployment. MARCO's core technical contribution lies in its unique combination of multi-agent reinforcement learning (MARL) with Conformal Prediction (CP) to accelerate the hardware/software co-design process for deploying deep neural networks. Unlike conventional once-for-all (OFA) supernet approaches that require extensive pretraining, MARCO decomposes the NAS task into a hardware configuration agent (HCA) and a Quantization Agent (QA). The HCA optimizes high-level design parameters, while the QA determines per-layer bit-widths under strict memory and latency budgets using a shared reward signal within a centralized-critic, decentralized-execution (CTDE) paradigm. A key innovation is the integration of a calibrated CP surrogate model that provides statistical guarantees (with a user-defined miscoverage rate) to prune unpromising candidate architectures before incurring the high costs of partial training or hardware simulation. This early filtering drastically reduces the search space while ensuring that high-quality designs are retained with a high probability. Extensive experiments on MNIST, CIFAR-10, and CIFAR-100 demonstrate that MARCO achieves a 3-4x reduction in total search time compared to an OFA baseline while maintaining near-baseline accuracy (within 0.3%). Furthermore, MARCO also reduces inference latency. Validation on a MAX78000 evaluation board confirms that simulator trends hold in practice, with simulator estimates deviating from measured values by less than 5%.
中文: 本文提出MARCO框架,通过多智能体强化学习与保形预测相结合,在严格硬件约束下为边缘设备实现高效的神经架构搜索,大幅缩短搜索时间的同时保持模型精度。
English: This paper presents MARCO, a novel hardware-aware framework that combines multi-agent reinforcement learning with conformal prediction to efficiently perform neural architecture search for edge devices, significantly reducing search time while maintaining accuracy under strict hardware constraints.
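The conformal filter reduces to a split-conformal quantile over calibration residuals of an accuracy surrogate; a minimal sketch with placeholder numbers and threshold, not MARCO's actual surrogate or budgets:

```python
import numpy as np

def conformal_keep(pred_acc, cal_pred, cal_true, target_acc=0.9, alpha=0.1):
    """Split conformal prediction as an early filter for NAS candidates.

    Keep a candidate only if the (1 - alpha) upper-confidence bound on its
    predicted accuracy clears the target; otherwise prune it before paying
    for partial training or hardware simulation.
    """
    residuals = np.abs(np.asarray(cal_true) - np.asarray(cal_pred))
    n = len(residuals)
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n),
                    method="higher")
    return pred_acc + q >= target_acc  # upper bound of the conformal interval

cal_pred = [0.82, 0.88, 0.91, 0.85, 0.79, 0.90, 0.87, 0.84, 0.93, 0.80]
cal_true = [0.80, 0.90, 0.89, 0.86, 0.81, 0.92, 0.85, 0.86, 0.91, 0.78]
print(conformal_keep(0.88, cal_pred, cal_true))  # True -> worth training further
```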
Authors:Yuan Gao, Shaoyan Pan, Mingzhe Hu, Huiqiao Xie, Jill Remick, Chih-Wei Chang, Justin Roper, Zhen Tian, Xiaofeng Yang
Abstract:
Cone-beam CT (CBCT) is widely used in clinical radiotherapy for image-guided treatment, improving setup accuracy, adaptive planning, and motion management. However, slow gantry rotation limits performance by introducing motion artifacts, blurring, and increased dose. This work aims to develop a clinically feasible method for reconstructing high-quality CBCT volumes from consecutive limited-angle acquisitions, addressing imaging challenges in time- or dose-constrained settings. We propose a limited-angle (LA) geometry-integrated cycle-domain (LA-GICD) framework for CBCT reconstruction, comprising two denoising diffusion probabilistic models (DDPMs) connected via analytic cone-beam forward and back projectors. A Projection-DDPM completes missing projections, followed by back-projection, and an Image-DDPM refines the volume. This dual-domain design leverages complementary priors from projection and image spaces to achieve high-quality reconstructions from limited-angle (<= 90 degrees) scans. Performance was evaluated against full-angle reconstruction. Four board-certified medical physicists conducted assessments. A total of 78 planning CTs in common CBCT geometries were used for training and evaluation. The method achieved a mean absolute error of 35.5 HU, SSIM of 0.84, and PSNR of 29.8 dB, with visibly reduced artifacts and improved soft-tissue clarity. LA-GICD's geometry-aware dual-domain learning, embedded in analytic forward/backward operators, enabled artifact-free, high-contrast reconstructions from a single 90-degree scan, reducing acquisition time and dose four-fold. LA-GICD improves limited-angle CBCT reconstruction with strong data fidelity and anatomical realism. It offers a practical solution for short-arc acquisitions, enhancing CBCT use in radiotherapy by providing clinically applicable images with reduced scan time and dose for more accurate, personalized treatments.
中文摘要:本研究提出的有限角度几何集成循环域(LA-GICD)框架通过双域去噪扩散模型,仅需90度扫描即可重建高质量CBCT图像,在将采集时间和辐射剂量降低四倍的同时,保持了临床适用的图像质量。
English Summary: This study introduces a limited-angle geometry-integrated cycle-domain (LA-GICD) framework that uses dual-domain denoising diffusion models to reconstruct high-quality CBCT images from 90-degree scans, reducing acquisition time and radiation dose by fourfold while maintaining clinical image quality.
Authors:Bayu Fedra Abdullah, Yusuf Sulistyo Nugroho, Brittany Reid, Raula Gaikovina Kula, Kazumasa Shimari, Kenichi Matsumoto
Abstract:
Large Language Models (LLMs) are increasingly used in software security, but their trustworthiness in generating accurate vulnerability advisories remains uncertain. This study investigates the ability of ChatGPT to (1) generate plausible security advisories from CVE-IDs, (2) differentiate real from fake CVE-IDs, and (3) extract CVE-IDs from advisory descriptions. Using a curated dataset of 100 real and 100 fake CVE-IDs, we manually analyzed the credibility and consistency of the model's outputs. The results show that ChatGPT generated plausible security advisories for 96% of the real CVE-IDs and 97% of the fake CVE-IDs given as input, demonstrating a limitation in differentiating between real and fake IDs. Furthermore, when these generated advisories were reintroduced to ChatGPT to identify their original CVE-ID, the model produced a fake CVE-ID in 6% of cases involving advisories generated from real CVE-IDs. These findings highlight both the strengths and limitations of ChatGPT in cybersecurity applications. While the model demonstrates potential for automating advisory generation, its inability to reliably authenticate CVE-IDs or maintain consistency upon re-evaluation underscores the risks associated with its deployment in critical security tasks. Our study emphasizes the importance of using LLMs with caution in cybersecurity workflows and suggests the need for further improvements in their design to improve reliability and applicability in security advisory generation.
中文: 本研究评估了ChatGPT基于CVE-ID生成安全公告、区分真伪标识符及从描述中提取CVE-ID的能力,发现其虽具自动化潜力,但在认证准确性和一致性方面存在显著缺陷。
English: This study evaluates ChatGPT's performance in generating security advisories from CVE-IDs, distinguishing real from fake identifiers, and extracting CVE-IDs from descriptions, revealing its potential for automation but significant limitations in authentication accuracy and consistency.
Authors:Kai Lan, Jiayong Zhu, Jiangtong Li, Dawei Cheng, Guang Chen, Changjun Jiang
Abstract:
Large Multimodal Models (LMMs) demonstrate significant cross-modal reasoning capabilities. However, financial applications face challenges due to the lack of high-quality multimodal reasoning datasets and the inefficiency of existing training paradigms for reasoning enhancement. To address these issues, we propose an integrated framework, FinLMM-R1, combining an automated and scalable pipeline for data construction with enhanced training strategies to improve the multimodal reasoning of LMMs. The Automated and Scalable Pipeline (ASP) resolves textual-visual misalignment in financial reports through a separate paradigm of question-answer generation and image-question alignment, ensuring data integrity and extraction efficiency. Through ASP, we collect 89,378 aligned image-question pairs from 23,397 financial reports, covering tasks such as arithmetic reasoning, statistics reasoning, financial explanation, and financial knowledge. Moreover, we introduce Thinking with Adversarial Reward in LMM (TAR-LMM), extending the prior two-stage training framework [1] with additional reward mechanisms. In the first stage, we focus on text-only tasks with format and accuracy rewards to guide the model in generating well-structured thinking contents. In the second stage, we construct multi-image contrastive samples with additional reward components, including image selection, thinking content length, and adversarial reward, to jointly optimize the LMM across visual perception, reasoning efficiency, and logical coherence. Extensive experiments on 7 benchmarks show that the ASP-derived dataset and training framework significantly improve answer accuracy and reasoning depth over existing reasoning LMMs in both general and financial multimodal contexts.
中文:FinLMM-R1框架通过自动数据管道生成对齐的图文问答对,并结合多阶段奖励训练策略,有效解决了金融多模态推理的数据和效率难题,在多个基准测试中显著提升了模型性能。
English: The FinLMM-R1 framework addresses financial multimodal reasoning challenges through an automated data pipeline that generates aligned image-question pairs and a training strategy with reward mechanisms, significantly enhancing model performance across various benchmarks.
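The second-stage reward combines several components (format, accuracy, image selection, trace length). The sketch below shows one hedged way such a composite reward could be scored; the tag format, weights, and checks are illustrative assumptions rather than the paper's exact design.

```python
# Illustrative composite reward with format, accuracy, image-selection, and
# length terms. All thresholds and weights are assumptions.
import re

def composite_reward(response, gold_answer, picked_images, gold_images,
                     max_len=2048, w=(1.0, 1.0, 0.5, 0.25)):
    w_acc, w_fmt, w_img, w_len = w
    # Format: response must wrap reasoning and answer in expected tags.
    fmt_ok = bool(re.search(r"<think>.*</think>.*<answer>.*</answer>", response, re.S))
    # Accuracy: exact match on the extracted final answer.
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    acc_ok = bool(m) and m.group(1).strip() == gold_answer.strip()
    # Image selection: overlap between picked and ground-truth evidence images.
    img_score = len(set(picked_images) & set(gold_images)) / max(len(gold_images), 1)
    # Length: mildly penalize overly long reasoning traces.
    len_score = max(0.0, 1.0 - len(response) / max_len)
    return w_acc * acc_ok + w_fmt * fmt_ok + w_img * img_score + w_len * len_score
```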
Authors:Jingxuan Zhang, Zhenhua Xu, Rui Hu, Wenpeng Xing, Xuhong Zhang, Meng Han
Abstract:
Large Language Models (LLMs) have become increasingly prevalent across various sectors, raising critical concerns about model ownership and intellectual property protection. Although backdoor-based fingerprinting has emerged as a promising solution for model authentication, effective attacks for removing these fingerprints remain largely unexplored. Therefore, we present Mismatched Eraser (MEraser), a novel method for effectively removing backdoor-based fingerprints from LLMs while maintaining model performance. Our approach leverages a two-phase fine-tuning strategy utilizing carefully constructed mismatched and clean datasets. Through extensive evaluation across multiple LLM architectures and fingerprinting methods, we demonstrate that MEraser achieves complete fingerprint removal while maintaining model performance with minimal training data of fewer than 1,000 samples. Furthermore, we introduce a transferable erasure mechanism that enables effective fingerprint removal across different models without repeated training. In conclusion, our approach provides a practical solution for fingerprint removal in LLMs, reveals critical vulnerabilities in current fingerprinting techniques, and establishes comprehensive evaluation benchmarks for developing more resilient model protection methods in the future.
中文: 本研究提出MEraser方法,通过两阶段微调策略,在保持模型性能的同时,仅用少量样本即可有效移除大语言模型中的后门指纹,并能实现跨模型迁移擦除。
English: The study introduces MEraser, a novel two-phase fine-tuning method that effectively removes backdoor-based fingerprints from Large Language Models while preserving performance, using minimal data and enabling transferable erasure across models.
Authors:Zachary Doucet, Rishi Sharma, Martijn de Vos, Rafael Pires, Anne-Marie Kermarrec, Oana Balmau
Abstract:
Mixture-of-Experts (MoE) models offer computational efficiency during inference by activating only a subset of specialized experts for a given input. This enables efficient model scaling on multi-GPU systems that use expert parallelism without compromising performance. However, load imbalance among experts and GPUs introduces waiting times, which can significantly increase inference latency. To address this challenge, we propose HarMoEny, a novel solution to address MoE load imbalance through two simple techniques: (i) dynamic token redistribution to underutilized GPUs and (ii) asynchronous prefetching of experts from the system to GPU memory. These techniques achieve a near-perfect load balance among experts and GPUs and mitigate delays caused by overloaded GPUs. We implement HarMoEny and compare its latency and throughput with four MoE baselines using real-world and synthetic datasets. Under heavy load imbalance, HarMoEny increases throughput by 37%-70% and reduces time-to-first-token by 34%-41%, compared to the next-best baseline. Moreover, our ablation study demonstrates that HarMoEny's scheduling policy reduces the GPU idling time by up to 84% compared to the baseline policies.
中文: HarMoEny通过动态令牌重分配和异步专家预取技术,解决了专家混合模型的负载不均问题,在严重不平衡条件下将吞吐量提升高达70%,首词延迟降低41%。
English: HarMoEny enhances Mixture-of-Experts models by dynamically redistributing tokens to underutilized GPUs and prefetching experts asynchronously, achieving up to 70% higher throughput and 41% lower latency under load imbalance.
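The token-redistribution idea can be shown with a small greedy balancer: repeatedly move half of the imbalance from the most loaded GPU to the least loaded one. This is a sketch of the balancing principle only; the real scheduler also coordinates expert placement and asynchronous prefetching.

```python
# Greedy token redistribution: shift token batches from the most loaded GPU
# to the least loaded until loads are near-balanced. Illustrative sketch.
def redistribute(loads, tolerance=1):
    """loads: list of token counts per GPU. Returns list of (src, dst, n) moves."""
    moves = []
    while max(loads) - min(loads) > tolerance:
        src = loads.index(max(loads))
        dst = loads.index(min(loads))
        n = (loads[src] - loads[dst]) // 2     # move half the imbalance
        if n == 0:
            break
        loads[src] -= n
        loads[dst] += n
        moves.append((src, dst, n))
    return moves

# Example: badly skewed loads across 4 GPUs.
print(redistribute([900, 100, 80, 120]))
```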
Authors:Guojun Huang, Jiancheng An, Lu Gan, Dusit Niyato, Mérouane Debbah, Tie Jun Cui
Abstract:
Semantic communication (SemCom) powered by generative artificial intelligence enables highly efficient and reliable information transmission. However, it still necessitates the transmission of substantial amounts of data when dealing with complex scene information. In contrast, the stacked intelligent metasurface (SIM), leveraging wave-domain computing, provides a cost-effective solution for directly imaging complex scenes. Building on this concept, we propose an innovative SIM-aided multi-modal SemCom system. Specifically, an SIM is positioned in front of the transmit antenna for transmitting visual semantic information of complex scenes via imaging on the uniform planar array at the receiver. Furthermore, the simple scene description that contains textual semantic information is transmitted via amplitude-phase modulation over electromagnetic waves. To simultaneously transmit multi-modal information, we optimize the amplitude and phase of meta-atoms in the SIM using a customized gradient descent algorithm. The optimization aims to gradually minimize the mean squared error between the normalized energy distribution on the receiver array and the desired pattern corresponding to the visual semantic information. By combining the textual and visual semantic information, a conditional generative adversarial network is used to recover the complex scene accurately. Extensive numerical results verify the effectiveness of the proposed multi-modal SemCom system in reducing bandwidth overhead as well as the capability of the SIM for imaging the complex scene.
中文: 该SIM辅助多模态语义通信系统通过智能超表面实现复杂场景的视觉成像,并结合文本信息的电磁波传输,有效降低带宽消耗,同时利用条件生成对抗网络精确重建场景。
English: The proposed SIM-aided multi-modal semantic communication system efficiently transmits complex scene information by combining visual imaging through stacked intelligent metasurfaces and textual data via electromagnetic waves, significantly reducing bandwidth usage while ensuring accurate scene reconstruction.
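The meta-atom optimization reduces to differentiable phase tuning. Below is a minimal sketch, assuming a synthetic complex propagation matrix A in place of the physical wave model: unit-amplitude excitations exp(j*theta) are propagated to the receiver array, and the normalized energy pattern is fit to a target by gradient descent.

```python
# Gradient-descent sketch of shaping a receiver energy pattern with meta-atom
# phases. A and the target pattern are synthetic stand-ins.
import torch

M, N = 64, 256                     # receiver elements, meta-atoms
A = torch.randn(M, N, dtype=torch.cfloat) / N**0.5   # assumed propagation matrix
target = torch.softmax(torch.randn(M), dim=0)        # desired energy pattern

theta = torch.zeros(N, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)
for step in range(500):
    field = A @ torch.polar(torch.ones(N), theta)    # complex field at receiver
    energy = field.abs() ** 2
    energy = energy / energy.sum()                   # normalized distribution
    loss = torch.mean((energy - target) ** 2)        # MSE to the desired pattern
    opt.zero_grad()
    loss.backward()
    opt.step()
```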
Authors:Navodini Wijethilake, Reuben Dorent, Marina Ivory, Aaron Kujawa, Stefan Cornelissen, Patrick Langenhuizen, Mohamed Okasha, Anna Oviedova, Hexin Dong, Bogyeong Kang, Guillaume Sallé, Luyi Han, Ziyuan Zhao, Han Liu, Yubo Fan, Tao Yang, Shahad Hardan, Hussain Alasmawi, Santosh Sanjeev, Yuzhou Zhuang, Satoshi Kondo, Maria Baldeon Calisto, Shaikh Muhammad Uzair Noman, Cancan Chen, Ipek Oguz, Rongguo Zhang, Mina Rezaei, Susana K. Lai-Yuen, Satoshi Kasai, Yunzhi Huang, Chih-Cheng Hung, Mohammad Yaqub, Lisheng Wang, Benoit M. Dawant, Cuntai Guan, Ritse Mann, Vincent Jaouen, Tae-Eui Kam, Li Zhang, Jonathan Shapey, Tom Vercauteren
Abstract:
The cross-Modality Domain Adaptation (crossMoDA) challenge series, initiated in 2021 in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), focuses on unsupervised cross-modality segmentation, learning from contrast-enhanced T1 (ceT1) and transferring to T2 MRI. The task is an extreme example of domain shift chosen to serve as a meaningful and illustrative benchmark. From a clinical application perspective, it aims to automate Vestibular Schwannoma (VS) and cochlea segmentation on T2 scans for more cost-effective VS management. Over time, the challenge objectives have evolved to enhance its clinical relevance. The challenge evolved from using single-institutional data and basic segmentation in 2021 to incorporating multi-institutional data and Koos grading in 2022, and by 2023, it included heterogeneous routine data and sub-segmentation of intra- and extra-meatal tumour components. In this work, we report the findings of the 2022 and 2023 editions and perform a retrospective analysis of the challenge progression over the years. The observations from the successive challenge contributions indicate that the number of outliers decreases with an expanding dataset. This is notable since the diversity of scanning protocols of the datasets concurrently increased. The winning approach of the 2023 edition reduced the number of outliers on the 2021 and 2022 testing data, demonstrating how increased data heterogeneity can enhance segmentation performance even on homogeneous data. However, the cochlea Dice score declined in 2023, likely due to the added complexity from tumour sub-annotations affecting overall segmentation performance. While progress is still needed for clinically acceptable VS segmentation, the plateauing performance suggests that a more challenging cross-modal task may better serve future benchmarking.
中文: crossMoDA挑战赛致力于从对比增强T1到T2核磁共振的无监督前庭神经鞘瘤与耳蜗分割研究,通过多中心数据和临床任务升级应对模态差异,在数据异质性增强背景下实现异常值减少,但耳蜗分割精度受肿瘤细分标注影响出现波动。
English: The crossMoDA challenge advances unsupervised segmentation of vestibular schwannoma and cochlea from ceT1 to T2 MRI, evolving with multi-institutional data and enhanced clinical tasks to address domain shifts while observing improved outlier reduction despite increased data heterogeneity.
Authors:Alexis Plaquet, Naohiro Tawara, Marc Delcroix, Shota Horiguchi, Atsushi Ando, Shoko Araki, Hervé Bredin
Abstract:
End-to-End Neural Diarization with Vector Clustering is a powerful and practical approach to performing speaker diarization. Multiple enhancements have been proposed for the segmentation model of these pipelines, but their synergy had not been thoroughly evaluated. In this work, we provide an in-depth analysis of the impact of major architecture choices on the performance of the pipeline. We investigate different encoders (SincNet, pretrained and finetuned WavLM), different decoders (LSTM, Mamba, and Conformer), different losses (multilabel and multiclass powerset), and different chunk sizes. Through in-depth experiments covering nine datasets, we found that the finetuned WavLM-based encoder always results in the best systems by a wide margin. The LSTM decoder is outclassed by Mamba- and Conformer-based decoders, and while we found Mamba more robust to other architecture choices, it is slightly inferior to our best architecture, which uses a Conformer decoder. We found that multilabel and multiclass powerset losses do not have the same distribution of errors. We confirmed that the multiclass loss helps almost all models attain superior performance, except when finetuning WavLM, in which case multilabel is the superior choice. We also evaluated the impact of the chunk size on all aforementioned architecture choices and found that newer architectures tend to better handle long chunk sizes, which can greatly improve pipeline performance. Our best system achieved state-of-the-art results on five widely used speaker diarization datasets.
中文: 本研究全面评估了端到端神经说话人日志系统中的关键架构选择,发现采用微调WavLM编码器与Conformer解码器配合多标签损失的组合在多个数据集上实现了最优性能。
English: This study thoroughly evaluates key architectural choices in end-to-end neural diarization systems, finding that fine-tuned WavLM encoders with Conformer decoders using multilabel loss achieve state-of-the-art performance across multiple datasets.
Authors:Xingwei Sun, Heinrich Dinkel, Yadong Niu, Linzhang Wang, Junbo Zhang, Jian Luan
Abstract:
Recent research has delved into speech enhancement (SE) approaches that leverage audio embeddings from pre-trained models, diverging from time-frequency masking or signal prediction techniques. This paper introduces an efficient and extensible SE method. Our approach involves initially extracting audio embeddings from noisy speech using a pre-trained audioencoder, which are then denoised by a compact encoder network. Subsequently, a vocoder synthesizes the clean speech from denoised embeddings. An ablation study substantiates the parameter efficiency of the denoise encoder with a pre-trained audioencoder and vocoder. Experimental results on both speech enhancement and speaker fidelity demonstrate that our generative audioencoder-based SE system outperforms models utilizing discriminative audioencoders. Furthermore, subjective listening tests validate that our proposed system surpasses an existing state-of-the-art SE model in terms of perceptual quality.
中文: 本文提出了一种高效的语音增强方法,通过预训练音频编码器提取噪声语音嵌入,经紧凑编码器去噪后由声码器合成纯净语音,在客观指标和主观听感上均优于现有先进模型。
English: This paper presents an efficient speech enhancement method that uses pre-trained audio embeddings, denoises them through a compact encoder, and synthesizes clean speech with a vocoder, outperforming existing models in both objective and subjective evaluations.
Authors:Junyong Lin, Lu Dai, Ruiqian Han, Yijie Sui, Ruilin Wang, Xingliang Sun, Qinglin Wu, Min Feng, Hao Liu, Hui Xiong
Abstract:
Scientific researchers need detailed information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA & retrieval that more accurately reflects the information needs of professional science researchers, and used it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets, and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic papers to augment the dataset representation. We then proposed a question generation framework employing cognitive taxonomy to ensure the quality of synthesized questions. We also designed a method to automatically filter synthetic answers based on the perplexity shift of LLMs, which is highly aligned with human judgment of answers' validity. Collectively, these methodologies culminated in the creation of ScIRGen-Geo, a 61k QA dataset. We benchmarked representative methods on the ScIRGen-Geo dataset for their question-answering and retrieval capabilities, finding that current methods still struggle to reason over complex questions. This work advances the development of more sophisticated tools to support the intricate information needs of the scientific community.
中文摘要:ScIRGen框架通过从学术论文中提取数据集信息、基于认知分类生成问题、利用大语言模型困惑度筛选答案,构建了更符合真实科研需求的大规模科学问答数据集,揭示了现有方法在处理复杂问题时的不足。
English Summary: The ScIRGen framework was developed to generate a large-scale scientific QA dataset that better reflects real-world research needs by extracting dataset information from papers, creating cognitively-grounded questions, and filtering answers using LLM perplexity, revealing current methods' limitations in handling complex queries.
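The perplexity-shift filter can be stated compactly: keep an answer only if conditioning on the evidence makes it markedly more predictable to the LM. In the sketch below, nll is an assumed helper returning the mean negative log-likelihood of a text under the model given a context; the margin is illustrative.

```python
# Perplexity-shift answer filter. `nll(text, context=...)` is an assumed
# helper wrapping an LM; the 0.2 margin is an illustrative threshold.
import math

def keep_answer(question, answer, evidence, nll, min_shift=0.2):
    ppl_without = math.exp(nll(answer, context=question))
    ppl_with = math.exp(nll(answer, context=question + "\n" + evidence))
    # A genuine answer should become markedly more predictable given evidence.
    return (ppl_without - ppl_with) / ppl_without >= min_shift
```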
Authors:Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu, Chengyu Fang, Xiu Li, Chunming He
Abstract:
Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released.
中文: 提出的不确定性掩蔽伯努利扩散(UMBD)模型为伪装目标检测引入了首个生成式优化框架,通过不确定性引导的掩蔽机制选择性优化分割质量差的区域,在保持正确分割区域的同时以微小计算代价实现了显著性能提升。
English: The proposed Uncertainty-Masked Bernoulli Diffusion (UMBD) model introduces a generative refinement framework for Camouflaged Object Detection, using uncertainty-guided masking to selectively enhance poorly segmented regions while maintaining correct areas, achieving significant performance improvements with minimal computational cost.
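A hedged sketch of the uncertainty-guided masking step: Bernoulli flip noise is applied only where estimated uncertainty exceeds a threshold, so confidently segmented pixels pass through untouched. The threshold and time-scaled flip rate are assumptions, not the paper's schedule.

```python
# Uncertainty-masked Bernoulli noising: refine only uncertain pixels.
import torch

def masked_bernoulli_step(seg_logits, uncertainty, t, tau=0.5, beta=0.1):
    """seg_logits: (B,1,H,W) current segmentation; uncertainty: same shape in [0,1];
    t: scalar diffusion time in [0,1]."""
    probs = torch.sigmoid(seg_logits)
    mask = (uncertainty > tau).float()            # refine only uncertain pixels
    flip_prob = beta * t * mask                   # time-scaled Bernoulli flip rate
    flips = torch.bernoulli(flip_prob.clamp(0, 1))
    noisy = probs * (1 - flips) + (1 - probs) * flips
    return noisy
```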
Authors:Yucong Luo, Yitong Zhou, Mingyue Cheng, Jiahao Wang, Daoyu Wang, Tingyue Pan, Jintao Zhang
Abstract:
To advance time series forecasting (TSF), various methods have been proposed to improve prediction accuracy, evolving from statistical techniques to data-driven deep learning architectures. Despite their effectiveness, most existing methods still adhere to a fast-thinking paradigm: they rely on extracting historical patterns and mapping them to future values as their core modeling philosophy, lacking an explicit thinking process that incorporates intermediate time series reasoning. Meanwhile, emerging slow-thinking LLMs (e.g., OpenAI-o1) have shown remarkable multi-step reasoning capabilities, offering an alternative way to overcome these issues. However, prompt engineering alone presents several limitations, including high computational cost, privacy risks, and limited capacity for in-depth domain-specific time series reasoning. To address these limitations, a more promising approach is to train LLMs to develop slow-thinking capabilities and acquire strong time series reasoning skills. For this purpose, we propose Time-R1, a two-stage reinforcement fine-tuning framework designed to enhance the multi-step reasoning ability of LLMs for time series forecasting. Specifically, the first stage conducts supervised fine-tuning for warmup adaptation, while the second stage employs reinforcement learning to improve the model's generalization ability. In particular, we design a fine-grained multi-objective reward specifically for time series forecasting, and then introduce GRIP (group-based relative importance for policy optimization), which leverages non-uniform sampling to further encourage and optimize the model's exploration of effective reasoning paths. Experiments demonstrate that Time-R1 significantly improves forecast performance across diverse datasets.
中文: 该摘要提出了Time-R1框架,通过监督微调与强化学习两阶段训练,结合专门设计的时间序列奖励机制,显著提升大语言模型在时序预测中的多步推理能力,并在多个数据集上验证了其卓越性能。
English: The abstract introduces Time-R1, a two-stage reinforcement fine-tuning framework that enhances LLMs' multi-step reasoning for time series forecasting by combining supervised fine-tuning with reinforcement learning and a specialized reward mechanism, achieving superior performance across datasets.
Authors:Jin Huang, Honghua Chen, Mingqiang Wei
Abstract:
The most essential feature of aviation equipment is high quality, encompassing high performance, high stability, and high reliability. In this paper, we propose a novel hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs structured light scanners to obtain comprehensive 3D measurements of manufactured workpieces. The measured point cloud is registered with the reference CAD model, followed by an error analysis conducted at three hierarchical levels: global, part, and feature. At the global level, the error analysis evaluates the overall deviation of the scanned point cloud from the reference CAD model. At the part level, error analysis is performed on patches of the point cloud. To obtain a set of meaningful patches, we propose a novel optimization-based primitive refinement method, in which two basic operations, splitting and merging, refine the coarse primitives. At the feature level, error analysis is performed on circular holes, which are commonly found in CAD models. To facilitate this, a two-stage algorithm is introduced for the detection of circular holes. First, edge points are identified using a tensor-voting algorithm. Then, multiple circles are fitted through a hypothesize-and-clusterize framework, ensuring accurate detection and analysis of the circular features. Experimental results on various aircraft CAD models demonstrate the effectiveness of our proposed method.
中文: 本文提出HEA-MM分层误差评估框架,通过结构光扫描和基于优化的方法,在全局、部件和特征三个层次上分析飞机CAD模型,以确保制造质量。
English: This paper introduces HEA-MM, a hierarchical error assessment framework that evaluates aircraft CAD models through global, part, and feature-level analyses using structured light scanning and optimization-based methods to ensure manufacturing quality.
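The hypothesize-and-clusterize circle detection can be illustrated with a RANSAC-style core: fit a circumcircle to three random edge points, score it by inlier count, and keep the best hypothesis. This 2D sketch omits the tensor-voting edge detection and the clustering of multiple circles.

```python
# RANSAC-style circle fitting on 2D edge points (illustrative core only).
import numpy as np

def circle_from_3pts(p1, p2, p3):
    # Circumcircle via the perpendicular-bisector linear system.
    ax, ay = p1; bx, by = p2; cx, cy = p3
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None                               # collinear points
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    r = np.hypot(ax - ux, ay - uy)
    return np.array([ux, uy]), r

def ransac_circle(points, iters=500, tol=0.01, rng=np.random.default_rng(0)):
    """points: (N, 2) edge points. Returns best (center, radius) hypothesis."""
    best, best_inliers = None, 0
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        fit = circle_from_3pts(*sample)
        if fit is None:
            continue
        center, r = fit
        inliers = np.sum(np.abs(np.linalg.norm(points - center, axis=1) - r) < tol)
        if inliers > best_inliers:
            best, best_inliers = (center, r), inliers
    return best
```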
Authors:Xueying Du, Kai Yu, Chong Wang, Yi Zou, Wentai Deng, Zuoyu Ou, Xin Peng, Lingming Zhang, Yiling Lou
Abstract:
Static bug analyzers play a crucial role in ensuring software quality. However, existing analyzers for bug detection in large codebases often suffer from high false positive rates. This is primarily due to the limited capabilities of analyzers in path feasibility validation with multiple conditional branches and complex data dependencies. While current LLM-based approaches attempt to address this issue, their effectiveness remains limited due to insufficient constraint cascade analysis and scalability challenges in large projects. To address this challenge, we propose an iterative path feasibility analysis framework, LLM4PFA. By leveraging LLM-agent-based targeted constraint reasoning and key context-aware analysis driven by agent planning, LLM4PFA effectively enhances complex inter-procedural path feasibility analysis to minimize false positives in static bug detection. Evaluation results show that LLM4PFA precisely filters out 72% to 96% of false positives reported during static bug detection, significantly outperforming all baselines by 41.1%-105.7%; meanwhile, LLM4PFA misses only 3 of 45 true positives.
Chinese: 提出的LLM4PFA框架通过基于LLM的迭代路径可行性分析和约束推理,显著提升了静态错误检测能力,在减少72%至96%误报的同时仅遗漏了45个真实错误中的3个。
English: The proposed LLM4PFA framework enhances static bug detection by employing iterative path feasibility analysis with LLM-driven constraint reasoning, effectively reducing false positives by 72% to 96% while maintaining high accuracy in identifying true bugs.
Authors:Prakamya Mishra, Jiang Liu, Jialian Wu, Xiaodong Yu, Zicheng Liu, Emad Barsoum
Abstract:
Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks, including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board's spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs and discover that the models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench than on MATH 500 and AIME 2024, respectively; larger models achieve higher performance using shorter reasoning traces, yet most models struggle with the long-term strategic reasoning required by these simple, novel TTT-Bench tasks.
中文: 尽管大型推理模型在复杂STEM任务中表现出色,但在TTT-Bench的简单策略游戏中却频频失败,显示出其在基础推理能力上与人类存在显著差距。
English: Large reasoning models, despite excelling in complex STEM tasks, often fail at simple strategic games in TTT-Bench, revealing significant gaps in basic reasoning abilities compared to humans.
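Programmatic, verifiable problem generation is straightforward for such games. The sketch below samples legal mid-game boards where the player to move has exactly one immediately winning square, which serves as the checkable answer; the board encoding and sampling scheme are assumptions, not the benchmark's exact generator.

```python
# Generate verifiable Tic-Tac-Toe problems: boards with a unique winning move.
import random

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winning_moves(board, player):
    """Empty squares that complete a line for `player`."""
    return [i for i in range(9) if board[i] == "."
            and any(i in line and all(board[j] == player for j in line if j != i)
                    for line in LINES)]

def sample_problem(rng=random.Random(0)):
    while True:
        board = ["."] * 9
        cells = rng.sample(range(9), 6)          # 3 X's and 3 O's: X to move
        for i, c in enumerate(cells):
            board[c] = "X" if i < 3 else "O"
        if any(all(board[j] == p for j in line) for line in LINES for p in "XO"):
            continue                             # game already decided; resample
        wins = winning_moves(board, "X")
        if len(wins) == 1:
            return "".join(board), wins[0]       # board string + verifiable answer

board, answer = sample_problem()
print(board, "-> winning square:", answer)
```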
Authors:Defang Chen, Zhenyu Zhou, Can Wang, Siwei Lyu
Abstract:
Diffusion-based generative models employ stochastic differential equations (SDEs) and their equivalent probability flow ordinary differential equations (ODEs) to establish a smooth transformation between complex high-dimensional data distributions and tractable prior distributions. In this paper, we reveal a striking geometric regularity in the deterministic sampling dynamics: each simulated sampling trajectory lies within an extremely low-dimensional subspace, and all trajectories exhibit an almost identical "boomerang" shape, regardless of the model architecture, applied conditions, or generated content. We characterize several intriguing properties of these trajectories, particularly under closed-form solutions based on kernel-estimated data modeling. We also demonstrate a practical application of the discovered trajectory regularity by proposing a dynamic programming-based scheme to better align the sampling time schedule with the underlying trajectory structure. This simple strategy requires minimal modification to existing ODE-based numerical solvers, incurs negligible computational overhead, and achieves superior image generation performance, especially in the low-budget regime of only 5 to 10 function evaluations.
中文摘要:基于扩散的生成模型通过微分方程实现复杂数据与简单分布间的转换,本研究发现无论模型架构或生成内容如何,所有采样轨迹均呈现一致的极低维“回旋镖”形态,据此提出的动态规划时间调度方案能以微小计算成本显著提升图像生成质量。
English Summary: Diffusion-based generative models use differential equations to transform complex data into simpler distributions, and this study uncovers that all sampling trajectories consistently form an identical low-dimensional "boomerang" shape regardless of variables, leading to an optimized time scheduling method that enhances image generation efficiency with minimal adjustments.
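One way to read the dynamic-programming idea: measure a per-step curvature proxy along one simulated trajectory, then choose the K sampling times that minimize an accumulated approximation cost per coarse segment. The cost model below is an illustrative stand-in for the paper's trajectory-based criterion.

```python
# DP timestep selection: place K coarse steps so that high-curvature spans of
# the trajectory get denser sampling. Cost model is an illustrative proxy.
import numpy as np

def dp_schedule(deviation, K):
    """deviation[i]: curvature proxy on fine interval i.
    Returns K+1 knot indices in [0, N]."""
    N = len(deviation)
    pre = np.concatenate([[0.0], np.cumsum(deviation)])
    seg = lambda a, b: (pre[b] - pre[a]) ** 2      # cost of one coarse segment
    INF = float("inf")
    dp = np.full((K + 1, N + 1), INF)
    arg = np.zeros((K + 1, N + 1), dtype=int)
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for i in range(1, N + 1):
            for j in range(k - 1, i):
                c = dp[k - 1][j] + seg(j, i)
                if c < dp[k][i]:
                    dp[k][i], arg[k][i] = c, j
    knots, i = [N], N
    for k in range(K, 0, -1):                      # backtrack the chosen knots
        i = arg[k][i]
        knots.append(i)
    return knots[::-1]

# Example: curvature concentrated near the end of sampling (the "boomerang" turn).
dev = np.concatenate([np.full(40, 0.01), np.full(10, 0.2)])
print(dp_schedule(dev, K=8))
```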
Authors:Yongqian Sun, Yu Luo, Xidao Wen, Yuan Yuan, Xiaohui Nie, Shenglin Zhang, Tong Liu, Xi Luo
Abstract:
Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely solely on single-modal data (e.g., metrics, logs, and traces) and struggle to simultaneously address multiple downstream tasks, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, the lack of clear reasoning evidence in current techniques often leads to insufficient interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework capable of fully leveraging multimodal data. TrioXpert designs three independent data processing pipelines based on the inherent characteristics of different modalities, comprehensively characterizing the operational status of microservice systems from both numerical and textual dimensions. It employs a collaborative reasoning mechanism using large language models (LLMs) to simultaneously handle multiple tasks while providing clear reasoning evidence to ensure strong interpretability. We conducted extensive evaluations on two popular microservice system datasets, and the experimental results demonstrate that TrioXpert achieves outstanding performance in AD (improving by 4.7% to 57.7%), FT (improving by 2.1% to 40.6%), and RCL (improving by 1.6% to 163.1%) tasks.
Chinese: TrioXpert是一种创新的端到端事件管理框架,利用多模态数据和大型语言模型协同处理异常检测、故障分类和根因定位任务,同时提供清晰的推理证据以确保强可解释性。
English: TrioXpert is an innovative end-to-end incident management framework that leverages multimodal data and large language models to simultaneously handle anomaly detection, failure triage, and root cause localization tasks while providing clear reasoning evidence for enhanced interpretability.
Authors:Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, Si Liu
Abstract:
Embodied navigation stands as a foundation pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, which differ in task objectives and modalities, so datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents, which can follow free-form instructions that include arbitrary combinations of modalities and capabilities. To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse in free-form with arbitrary modality and capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it to a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains specifically designed learning policies and rewards. Importantly, for the TBA-SFT and Nav-GRPO designs, we are inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answering. Thus, we aim to investigate how to achieve thinking-before-action in the embodied navigation field, to improve the model's reasoning ability toward generalists. Specifically, we propose TBA-SFT to utilize the TBA-CoT dataset to fine-tune the model as a cold-start phase and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.
中文: 本研究提出了OctoNav-Bench基准数据集和OctoNav-R1方法,通过构建自由形式的多模态指令数据集和"三思而后行"训练范式,开发能够遵循复杂指令的通用导航智能体,显著提升了导航性能。
English: This work introduces OctoNav-Bench, a comprehensive benchmark for developing generalist navigation agents that follow free-form multi-modal instructions, and OctoNav-R1, a vision-language-action model trained with a hybrid paradigm to enhance reasoning through thinking-before-action, achieving superior performance.
Authors:Taku Okawara, Kenji Koide, Aoki Takanose, Shuji Oishi, Masashi Yokozuka, Kentaro Uno, Kazuya Yoshida
Abstract:
In this letter, we present tightly coupled LiDAR-IMU-leg odometry, which is robust to challenging conditions such as featureless environments and deformable terrains. We developed an online learning-based leg kinematics model named the neural leg kinematics model, which incorporates tactile information (foot reaction force) to implicitly express the nonlinear dynamics between robot feet and the ground. Online training of this model enhances its adaptability to weight load changes of a robot (e.g., for delivery or transportation tasks) and terrain conditions. Based on the neural adaptive leg odometry factor and online uncertainty estimation of the leg kinematics model-based motion predictions, we jointly solve online training of this kinematics model and odometry estimation on a unified factor graph to retain the consistency of both. The proposed method was verified through real experiments using a quadruped robot in two challenging situations: 1) a sandy beach, representing an extremely featureless area with a deformable terrain, and 2) a campus, including multiple featureless areas and terrain types of asphalt, gravel (deformable terrain), and grass. Experimental results showed that our odometry estimation incorporating the neural leg kinematics model outperforms state-of-the-art works. Our project page is available for further details: https://takuokawara.github.io/RAL2025_project_page/
中文: 本研究提出了一种紧密耦合的LiDAR-IMU-腿足里程计系统,通过在线学习的神经腿足运动学模型,在特征缺失环境和可变形地形中实现了更强的适应性,经实验验证其性能优于现有先进方法。
English: This study introduces a tightly coupled LiDAR-IMU-leg odometry system enhanced by an online learning-based neural leg kinematics model, which improves robustness in featureless environments and on deformable terrains through adaptive training and integrated factor graph optimization.
Authors:Cheng Chen, Yunpeng Zhai, Yifan Zhao, Jinyang Gao, Bolin Ding, Jia Li
Abstract:
In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores policies for multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: first, they rely on pre-defined demonstrations or heuristic selection strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; second, individually selecting each demonstration fails to model the interactions among them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling the ability to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify the superior performance of our approach on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs.
Chinese: 本文提出一种强化学习框架,使大型视觉语言模型能够自主选择和优化多模态示例进行上下文学习,在多个视觉问答数据集上显著提升了模型性能。
English: This paper introduces a reinforcement learning framework that enables Large Vision-Language Models to autonomously select and optimize multi-modal demonstrations for in-context learning, significantly improving performance on visual question-answering tasks.
Authors:Yuhe Ding, Jian Liang, Bo Jiang, Zi Wang, Aihua Zheng, Bin Luo
Abstract:
CLIP-based domain generalization aims to improve model generalization to unseen domains by leveraging the powerful zero-shot classification capabilities of CLIP and multiple source datasets. Existing methods typically train a single model across multiple source domains to capture domain-shared information. However, this paradigm inherently suffers from two types of conflicts: 1) sample conflicts, arising from noisy samples and extreme domain shifts among sources; and 2) optimization conflicts, stemming from competition and trade-offs during multi-source training. Both hinder generalization and lead to suboptimal solutions. Recent studies have shown that model merging can effectively mitigate the competition of multi-objective optimization and improve generalization performance. Inspired by these findings, we propose Harmonizing and Merging (HAM), a novel source-model merging framework for CLIP-based domain generalization. During the training of the source models, HAM enriches the source samples while excluding conflicting samples, and harmonizes the update directions of all models. Then, a redundancy-aware historical model merging method is introduced to effectively integrate knowledge across all source models. HAM comprehensively consolidates source domain information while enabling mutual enhancement among source models, ultimately yielding a final model with optimal generalization capabilities. Extensive experiments on five widely used benchmark datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance.
中文: 提出的HAM框架通过协调模型更新和融合源模型,解决了CLIP领域泛化中的样本与优化冲突,在基准数据集上实现了最优性能。
English: The proposed Harmonizing and Merging (HAM) framework addresses sample and optimization conflicts in CLIP-based domain generalization by harmonizing model updates and merging source models, achieving state-of-the-art performance on benchmark datasets.
Authors:Yuxing Long, Jiyao Zhang, Mingjie Pan, Tianshu Wu, Taewhan Kim, Hao Dong
Abstract:
Correct use of electrical appliances has significantly improved human life quality. Unlike simple tools that can be manipulated with common sense, different parts of electrical appliances have specific functions defined by manufacturers. If we want a robot to heat bread in a microwave, we should enable it to review the microwave manual first. From the manual, it can learn about component functions, interaction methods, and representative task steps for the appliance. However, previous manual-related work remains limited to question-answering tasks, while existing manipulation research ignores the manual's important role and fails to comprehend multi-page manuals. In this paper, we propose the first manual-based appliance manipulation benchmark, CheckManual. Specifically, we design a large-model-assisted, human-revised data generation pipeline to create manuals based on CAD appliance models. With these manuals, we establish novel manual-based manipulation challenges, metrics, and simulator environments for model performance evaluation. Furthermore, we propose the first manual-based manipulation planning model, ManualPlan, to set up a group of baselines for the CheckManual benchmark.
Chinese: 本文提出了首个基于手册的电器操作基准CheckManual,通过CAD模型生成手册并设立新挑战及ManualPlan规划模型,以评估模型性能。
English: This paper introduces CheckManual, the first benchmark for appliance manipulation that utilizes manuals generated from CAD models, featuring novel challenges and a planning model called ManualPlan to evaluate performance.
Authors:Telema Harry, Martin Guay, Shimin Wang, Richard D. Braatz
Abstract:
This article addresses the output regulation problem for a class of nonlinear systems using a data-driven approach. An output feedback controller is proposed that integrates a traditional control component with a data-driven learning algorithm based on Gaussian Process (GP) regression to learn the nonlinear internal model. Specifically, a data-driven technique is employed to directly approximate the unknown internal model steady-state map from observed input-output data online. Our method does not rely on model-based observers utilized in previous studies, making it robust and suitable for systems with modelling errors and model uncertainties. Finally, we demonstrate through numerical examples and detailed stability analysis that, under suitable conditions, the closed-loop system remains bounded and converges to a compact set, with the size of this set decreasing as the accuracy of the data-driven model improves over time.
中文: 本文提出一种结合传统控制与高斯过程回归的数据驱动输出反馈控制器,通过在线学习非线性内模实现无模型观测器的鲁棒输出调节,并证明闭环系统能在模型精度提升中渐近收敛。
English: This article introduces a data-driven output feedback controller combining traditional control with Gaussian Process regression to solve nonlinear system regulation, achieving robust performance without model-based observers through online learning and ensuring bounded closed-loop convergence.
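The learning component can be pictured with off-the-shelf GP regression. A minimal offline sketch, assuming a scalar steady-state map (the paper's setting is online and more general):

```python
# GP regression of an unknown steady-state map from noisy input-output data,
# mirroring the learning component above. Data and map are synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
w = rng.uniform(-2, 2, (60, 1))                 # exosystem states observed online
u_ss = np.sin(w).ravel() + 0.3 * w.ravel()**2   # unknown steady-state map
u_obs = u_ss + rng.normal(0, 0.02, 60)          # noisy observations

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(w, u_obs)
w_test = np.linspace(-2, 2, 5).reshape(-1, 1)
mean, std = gp.predict(w_test, return_std=True)  # prediction with uncertainty
```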
Authors:Javier Lopez-Piqueres, Pranav Deshpande, Archan Ray, Mattia J. Villani, Marco Pistoia, Niraj Kumar
Abstract:
We present MetaTT, a unified Tensor Train (TT) adapter framework for global low-rank fine-tuning of pre-trained transformers. Unlike LoRA, which fine-tunes each weight matrix independently, MetaTT uses a single shared TT to factorize all transformer sub-modules -- query, key, value, projection, and feed-forward layers -- by indexing the structural axes like layer and matrix type, and optionally heads and tasks. For a given rank, while LoRA adds parameters proportional to the product across modes, MetaTT only adds parameters proportional to the sum across modes leading to a significantly compressed final adapter. Our benchmarks compare MetaTT with LoRA along with recent state-of-the-art matrix and tensor decomposition based fine-tuning schemes. We observe that when tested on standard language modeling benchmarks, MetaTT leads to the most reduction in the parameters while maintaining similar accuracy to LoRA and even outperforming other tensor-based methods. Unlike CP or other rank-factorizations, the TT ansatz benefits from mature optimization routines -- e.g., DMRG-style rank adaptive minimization in addition to Adam, which we find simplifies training. Because new modes can be appended cheaply, MetaTT naturally extends to shared adapters across many tasks without redesigning the core tensor.
中文: MetaTT提出了一种统一的张量链适配器框架,通过在所有子模块间共享单一张量链来全局微调预训练变换器,在保持与LoRA相当精度的同时大幅压缩参数规模,并优于其他基于张量的方法。
English: MetaTT introduces a unified Tensor Train adapter framework that globally fine-tunes pre-trained transformers by sharing a single TT across all sub-modules, achieving significant parameter compression while maintaining accuracy comparable to LoRA and outperforming other tensor-based methods.
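The sum-versus-product parameter claim is easy to check numerically. The sketch below counts adapter parameters for illustrative shapes: LoRA-style adapters scale with the number of (layer, matrix-type) combinations, while a shared rank-r tensor train adds one core per indexed mode.

```python
# Back-of-envelope comparison of LoRA vs. shared tensor-train adapter sizes.
# Shapes and ranks are illustrative, not the paper's configurations.
def lora_params(d_in, d_out, r, n_layers, n_matrix_types):
    return n_layers * n_matrix_types * r * (d_in + d_out)

def tt_params(mode_dims, r):
    """mode_dims: sizes of TT modes, e.g. [n_layers, n_matrix_types, d_in, d_out].
    A rank-r TT has one (r_left, dim, r_right) core per mode, rank 1 at the ends."""
    total = 0
    for i, d in enumerate(mode_dims):
        r_left = 1 if i == 0 else r
        r_right = 1 if i == len(mode_dims) - 1 else r
        total += r_left * d * r_right
    return total

print(lora_params(768, 768, r=16, n_layers=12, n_matrix_types=6))   # ~1.8M
print(tt_params([12, 6, 768, 768], r=16))                           # ~0.21M
```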
Authors:Zehong Ma, Longhui Wei, Feng Wang, Shiliang Zhang, Qi Tian
Abstract:
Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features. These approaches typically require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting. In this paper, we introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts. Specifically, the magnitude ratio of successive residual outputs decreases monotonically and steadily in most timesteps while rapidly in the last several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and adaptive caching strategy. Unlike existing methods requiring dozens of curated samples for calibration, MagCache only requires a single sample for calibration. Experimental results show that MagCache achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively, while preserving superior visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM, and PSNR, under comparable computational budgets.
中文: 现有视频扩散模型加速技术常依赖统一启发式方法或时间嵌入变体,易因提示特定过拟合导致输出不一致,但本文提出统一幅度定律和幅度感知缓存(MagCache),通过误差建模和自适应缓存策略跳过不重要时间步,仅需单样本校准即可在Open-Sora和Wan 2.1上实现2.1倍和2.68倍加速,同时保持卓越视觉保真度。
English: Current video diffusion acceleration methods often depend on uniform heuristics or time-embedding variations, risking inconsistent outputs due to prompt-specific overfitting, but this paper introduces a unified magnitude law and a Magnitude-aware Cache (MagCache) that adaptively skips timesteps with minimal calibration, achieving significant speedups and superior visual fidelity on benchmarks.
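The caching policy can be sketched as a budgeted skip rule: calibrate the residual magnitude ratios on a single sample, accumulate a modeled error while reusing the cache, and recompute once the budget or a consecutive-skip cap is hit. The error proxy below is an illustrative assumption.

```python
# Budgeted skip schedule driven by calibrated residual magnitude ratios.
def magcache_schedule(magnitude_ratios, budget=0.1, max_consecutive_skips=3):
    """magnitude_ratios[t]: calibrated |r_t| / |r_{t-1}| from a single sample.
    Returns the set of timesteps whose computation can be skipped."""
    skipped, err, run = set(), 0.0, 0
    for t, ratio in enumerate(magnitude_ratios):
        est_error = abs(1.0 - ratio)          # proxy: deviation from magnitude law
        if err + est_error <= budget and run < max_consecutive_skips:
            skipped.add(t)
            err += est_error
            run += 1
        else:
            err, run = 0.0, 0                 # recompute: cache refreshed
    return skipped

# Example with a slowly decaying ratio curve from one calibration sample:
ratios = [0.99, 0.985, 0.98, 0.97, 0.95, 0.90, 0.75, 0.5]
print(sorted(magcache_schedule(ratios)))
```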
Authors:Liang Ma, Jiajun Wen, Min Lin, Rongtao Xu, Xiwen Liang, Bingqian Lin, Jun Ma, Yongxin Wang, Ziming Wei, Haokun Lin, Mingfei Han, Meng Cao, Bokui Chen, Ivan Laptev, Xiaodan Liang
Abstract:
While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that VLMs exhibit pronounced limitations in high-level planning and reasoning, with performance declining notably as task complexity grows. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvement, suggesting spatial tasks rely heavily on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.
中文摘要:PhyBlock基准测试表明,当前视觉语言模型在三维环境中的物理推理和复杂规划任务方面存在显著不足,即使采用思维链提示也收效甚微。
English Summary: The PhyBlock benchmark reveals significant limitations in current vision-language models' ability to handle physical reasoning and complex planning tasks in 3D environments, despite minimal improvement from chain-of-thought prompting.
Authors:Dongge Han, Menglin Xia, Daniel Madrigal Diaz, Samuel Kessler, Ankur Mallick, Xuchao Zhang, Mirian Del Carmen Hipolito Garcia, Jin Xu, Victor Rühle, Saravan Rajmohan
Abstract:
Small language models (SLMs) offer promising and efficient alternatives to large language models (LLMs). However, SLMs' limited capacity restricts their reasoning capabilities and makes them sensitive to prompt variations. To address these challenges, we propose a novel framework that enhances SLM reasoning capabilities through LLM generated blueprints. The blueprints provide structured, high-level reasoning guides that help SLMs systematically tackle related problems. Furthermore, our framework integrates a prompt template search mechanism to mitigate the SLMs' sensitivity to prompt variations. Our framework demonstrates improved SLM performance across various tasks, including math (GSM8K), coding (MBPP), and logic reasoning (BBH). Our approach improves the reasoning capabilities of SLMs without increasing model size or requiring additional training, offering a lightweight and deployment-friendly solution for on-device or resource-constrained environments.
中文: 本文提出了一种新颖框架,通过使用大语言模型生成的蓝图和提示模板搜索机制来增强小语言模型的推理能力,在不增加模型规模或额外训练的情况下,实现了多项任务性能的提升。
English: This paper introduces a novel framework that enhances small language models' reasoning capabilities by using LLM-generated blueprints and prompt template search, achieving improved performance across multiple tasks without increasing model size or requiring additional training.
Authors:Shigang Quan, Shui Liu, Zhenzhe Zheng, Fan Wu
Abstract:
Repeat consumption, such as repurchasing items and relistening to songs, is a common scenario in daily life. To model repeat consumption, repeat-aware recommendation has been proposed to predict which item will be re-interacted with based on the user-item interactions. In this paper, we investigate various inherent characteristics to enhance repeat-aware recommendation. Specifically, we explore these characteristics from two aspects: one is the temporal aspect, where we consider the time interval relationship in the user behavior sequence; the other is the sequential aspect, where we consider the sequential-level relationship in the user behavior sequence. Our intuition is that both the temporal pattern and the sequential pattern reflect users' intentions of repeat consumption. By utilizing these two patterns, a novel model called Temporal and Sequential repeat-aware Recommendation (TSRec) is proposed to enhance repeat-aware recommendation. TSRec has three main components: 1) User-specific Temporal Representation Module (UTRM), which encodes and extracts user historical repeat temporal information. 2) Item-specific Temporal Representation Module (ITRM), which incorporates item time interval information as side information to alleviate the data sparsity problem of user repeat behavior sequences. 3) Sequential Repeat-Aware Module (SRAM), which represents the similarity between the user's current and last repeat sequences. Extensive experimental results on three public benchmarks demonstrate the superiority of TSRec over state-of-the-art methods. The implementation code is available at https://anonymous.4open.science/r/TSRec-2306/.
Chinese: 本文提出TSRec模型,通过利用用户行为中的时间模式和序列关系来增强重复感知推荐,并在多个公开数据集上验证了其优于现有方法的性能。
English: This paper introduces TSRec, a novel model that enhances repeat-aware recommendations by leveraging temporal patterns and sequential relationships in user behavior, demonstrating superior performance over existing methods.
Authors:June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, Kimin Lee
Abstract:
Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve the visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering at the early stage of denoising. Extensive experiments demonstrate that ALG significantly improves the temporal dynamics of generated videos, while preserving image fidelity and text alignment. Especially, under VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
中文: 近期图像转视频模型常因过早接触高频细节而产生动态不足的视频,而提出的自适应低通引导(ALG)方法在保持图像质量和文本对齐的同时,显著提升了生成视频的动态效果。
English: Recent image-to-video models often produce overly static videos due to premature exposure to high-frequency details, but the proposed adaptive low-pass guidance (ALG) effectively enhances motion dynamics while maintaining image quality and text alignment.
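The adaptive low-pass schedule amounts to blurring the conditioning image early and restoring detail as denoising proceeds. A minimal sketch with torchvision's Gaussian blur; the linear sigma schedule and cutoff fraction are assumptions, and the denoiser call in the usage comment is hypothetical.

```python
# Blur the conditioning frame aggressively at early denoising steps so the
# sampler cannot latch onto high-frequency details; restore detail later.
import torchvision.transforms.functional as TF

def lowpass_condition(image, step, total_steps, max_sigma=6.0, cutoff_frac=0.4):
    """image: (B, C, H, W) conditioning frame; step counts from 0."""
    progress = step / total_steps
    if progress >= cutoff_frac:
        return image                     # late steps see the full-detail image
    sigma = max_sigma * (1 - progress / cutoff_frac)
    k = int(2 * round(3 * sigma) + 1)    # odd kernel covering ~3 sigma
    return TF.gaussian_blur(image, kernel_size=k, sigma=sigma)

# Usage inside a sampler loop (hypothetical denoiser signature):
# cond = lowpass_condition(ref_image, step, total_steps)
# x = denoiser(x, t=step, condition=cond)
```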
Authors:Fred Xu, Song Jiang, Zijie Huang, Xiao Luo, Shichang Zhang, Adrian Chen, Yizhou Sun
Abstract:
Taxonomy Expansion, which models complex concepts and their relations, can be formulated as a set representation learning task. The fuzzy set, a generalization of the set, incorporates uncertainty and measures the information within a semantic concept, making it suitable for concept modeling. Existing works usually model sets as vectors or geometric objects such as boxes, which are not closed under set operations. In this work, we propose a sound and efficient formulation of set representation learning based on its volume approximation as a fuzzy set. The resulting embedding framework, Fuzzy Set Embedding (FUSE), satisfies all set operations and compactly approximates the underlying fuzzy set, hence preserving information while being efficient to learn and relying on a minimal neural architecture. We empirically demonstrate the power of FUSE on the task of taxonomy expansion, where FUSE achieves remarkable improvements of up to 23% compared with existing baselines. Our work marks the first attempt to understand and efficiently compute the embeddings of fuzzy sets.
Chinese: 本文提出了模糊集嵌入(FUSE)框架,通过将概念建模为模糊集来实现集合表示学习,确保集合操作下的封闭性,并在分类扩展任务中实现了高达23%的性能提升。
English: This paper introduces Fuzzy Set Embedding (FUSE), a novel framework for set representation learning that models concepts as fuzzy sets, ensuring closure under set operations and achieving up to 23% improvement in taxonomy expansion tasks.
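For intuition about why fuzzy sets are closed under set operations (unlike boxes or vectors), here is a dense-membership toy example. FUSE itself learns compact volume approximations rather than explicit membership grids like this one; the universe and membership values below are invented for illustration.

import numpy as np

# A fuzzy set assigns each element of a universe a membership degree in [0, 1].
universe = ["cat", "dog", "car", "truck"]
animal  = np.array([0.9, 0.9, 0.1, 0.0])
vehicle = np.array([0.0, 0.1, 0.9, 0.9])

union        = np.maximum(animal, vehicle)   # fuzzy OR: still a fuzzy set
intersection = np.minimum(animal, vehicle)   # fuzzy AND: still a fuzzy set
complement   = 1.0 - animal                  # fuzzy NOT: still a fuzzy set
cardinality  = animal.sum()                  # a simple "volume" measure

print(union, intersection, complement, cardinality)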
Authors:Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han
Abstract:
Retrieval-Augmented Generation (RAG) systems fail at complex multi-hop reasoning because they rely on large language models to implicitly connect information from unstructured document collections. This fundamental limitation stems from treating retrieved passages as independent context rather than recognizing the intricate relationships that enable coherent reasoning chains.
We introduce SARG (Structure-Augmented Reasoning Generation), a post-retrieval framework that transforms traditional RAG pipelines by materializing explicit reasoning structures. SARG extracts {cause, relation, effect} triples from retrieved documents, constructs domain-adaptive graphs, and performs multi-hop traversal to discover reasoning chains that bridge query concepts to answers. Unlike existing approaches that modify retrieval mechanisms, SARG operates as a plug-and-play reasoning layer compatible with any RAG system.
Extensive evaluation across diverse domains (general QA, biomedical literature, and financial analysis) demonstrates that SARG achieves substantial improvements over state-of-the-art RAG baselines. Crucially, SARG also provides full reasoning traceability through explicit inference chains, addressing the critical interpretability gap in current RAG systems.
Our results establish that explicit structural reasoning is not merely beneficial but essential for reliable complex question answering, offering a solution to RAG's implicit reasoning bottleneck.
Chinese: RAG系统因依赖隐式信息关联而难以处理复杂推理,但SARG框架通过从检索文档构建显式推理结构,在多个领域显著提升了性能与可解释性,解决了这一根本瓶颈。
English: RAG systems struggle with complex reasoning due to their implicit information connection approach, but SARG overcomes this by constructing explicit reasoning structures from retrieved documents, significantly improving performance and interpretability across multiple domains.
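A minimal sketch of the post-retrieval stage, assuming triples have already been extracted: build a graph from {cause, relation, effect} triples and search for bounded multi-hop chains. The exact-match traversal and hop bound are illustrative simplifications, not SARG's actual algorithm.

from collections import defaultdict, deque

def build_graph(triples):
    # triples: iterable of (cause, relation, effect) extracted from documents.
    adj = defaultdict(list)
    for cause, rel, effect in triples:
        adj[cause].append((rel, effect))
    return adj

def reasoning_chains(adj, source, target, max_hops=4):
    # Breadth-first search for multi-hop chains linking query concepts to answers.
    chains, queue = [], deque([(source, [source], 0)])
    while queue:
        node, path, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for rel, nxt in adj.get(node, []):
            new_path = path + [rel, nxt]
            if nxt == target:
                chains.append(new_path)
            elif nxt not in path:
                queue.append((nxt, new_path, hops + 1))
    return chains

triples = [("smoking", "causes", "inflammation"),
           ("inflammation", "increases", "cancer risk")]
print(reasoning_chains(build_graph(triples), "smoking", "cancer risk"))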
Authors:Jens Piekenbrinck, Christian Schmidt, Alexander Hermans, Narunas Vaskevicius, Timm Linder, Bastian Leibe
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for neural scene reconstruction, offering high-quality novel view synthesis while maintaining computational efficiency. In this paper, we extend the capabilities of 3DGS beyond pure scene representation by introducing an approach for open-vocabulary 3D instance segmentation without requiring manual labeling, termed OpenSplat3D. Our method leverages feature-splatting techniques to associate semantic information with individual Gaussians, enabling fine-grained scene understanding. We incorporate Segment Anything Model instance masks with a contrastive loss formulation as guidance for the instance features to achieve accurate instance-level segmentation. Furthermore, we utilize language embeddings of a vision-language model, allowing for flexible, text-driven instance identification. This combination enables our system to identify and segment arbitrary objects in 3D scenes based on natural language descriptions. We show results on LERF-mask and LERF-OVS as well as the full ScanNet++ validation set, demonstrating the effectiveness of our approach.
Chinese: 本文提出的OpenSplat3D通过将特征喷洒技术与视觉语言模型相结合,无需人工标注即可实现开放词汇的3D实例分割,支持基于自然语言描述的任意物体识别与分割。
English: This paper introduces OpenSplat3D, an extension of 3D Gaussian Splatting that enables open-vocabulary 3D instance segmentation by integrating feature-splatting with vision-language models and contrastive learning for text-driven object identification.
Authors:Ali Hariri, Álvaro Arroyo, Alessio Gravina, Moshe Eliasof, Carola-Bibiane Schönlieb, Davide Bacciu, Kamyar Azizzadenesheli, Xiaowen Dong, Pierre Vandergheynst
Abstract:
ChebNet, one of the earliest spectral GNNs, has largely been overshadowed by Message Passing Neural Networks (MPNNs), which gained popularity for their simplicity and effectiveness in capturing local graph structure. Despite their success, MPNNs are limited in their ability to capture long-range dependencies between nodes. This has led researchers to adapt MPNNs through rewiring or to make use of Graph Transformers (GTs), which compromises the computational efficiency that characterized early spatial message-passing architectures, and typically disregards the graph structure. Almost a decade after its original introduction, we revisit ChebNet to shed light on its ability to model distant node interactions. We find that, out of the box, ChebNet already shows competitive advantages relative to classical MPNNs and GTs on long-range benchmarks, while maintaining good scalability properties for high-order polynomials. However, we uncover that this polynomial expansion leads ChebNet to an unstable regime during training. To address this limitation, we cast ChebNet as a stable and non-dissipative dynamical system, which we coin Stable-ChebNet. Our Stable-ChebNet model allows for stable information propagation, and has controllable dynamics which do not require the use of eigendecompositions, positional encodings, or graph rewiring. Across several benchmarks, Stable-ChebNet achieves near state-of-the-art performance.
Chinese: ChebNet在捕捉远程依赖方面相比MPNN和图变换器展现出竞争优势且扩展性良好,但其训练不稳定性通过Stable-ChebNet得到解决,该模型能确保信息稳定传播,无需复杂修改即可实现接近最优的性能。
English: ChebNet demonstrates competitive advantages over MPNNs and Graph Transformers in capturing long-range dependencies with good scalability, but its training instability is addressed by Stable-ChebNet, which ensures stable information propagation and achieves near state-of-the-art performance without complex modifications.
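For readers unfamiliar with the spectral filtering at ChebNet's core, here is a numpy sketch of the Chebyshev recurrence. The crude Laplacian rescaling assumes lambda_max is roughly 2; this is the classical filter, not the Stable-ChebNet dynamical system itself.

import numpy as np

def chebyshev_filter(L, X, thetas):
    # Apply sum_k theta_k T_k(L) X, with L rescaled so eigenvalues lie in [-1, 1].
    # L: (n x n) rescaled Laplacian; X: (n x d) node features; thetas: coefficients.
    T_prev, T_curr = X, L @ X                      # T_0(L)X = X, T_1(L)X = L X
    out = thetas[0] * T_prev + (thetas[1] * T_curr if len(thetas) > 1 else 0)
    for theta in thetas[2:]:
        # Chebyshev recurrence: T_k = 2 L T_{k-1} - T_{k-2}
        T_prev, T_curr = T_curr, 2 * (L @ T_curr) - T_prev
        out = out + theta * T_curr
    return out

n, d = 5, 3
A = np.random.rand(n, n); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
D = np.diag(A.sum(1))
L = np.eye(n) - np.linalg.inv(np.sqrt(D)) @ A @ np.linalg.inv(np.sqrt(D))
L_hat = L - np.eye(n)                              # crude rescaling, lambda_max ~ 2
X = np.random.randn(n, d)
H = chebyshev_filter(L_hat, X, thetas=[0.5, 0.3, 0.2])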
Authors:Peiran Li, Xinkai Zou, Zhuohang Wu, Ruifeng Li, Shuo Xing, Hanwen Zheng, Zhikai Hu, Yuping Wang, Haoxi Li, Qin Yuan, Yingmo Zhang, Zhengzhong Tu
Abstract:
Recent advances in large language models (LLMs) and vision-language models (VLMs) have enabled powerful autonomous agents capable of complex reasoning and multi-modal tool use. Despite their growing capabilities, today's agent frameworks remain fragile, lacking principled mechanisms for secure information flow, reliability, and multi-agent coordination. In this work, we introduce SAFEFLOW, a new protocol-level framework for building trustworthy LLM/VLM-based agents. SAFEFLOW enforces fine-grained information flow control (IFC), precisely tracking provenance, integrity, and confidentiality of all the data exchanged between agents, tools, users, and environments. By constraining LLM reasoning to respect these security labels, SAFEFLOW prevents untrusted or adversarial inputs from contaminating high-integrity decisions. To ensure robustness in concurrent multi-agent settings, SAFEFLOW introduces transactional execution, conflict resolution, and secure scheduling over shared state, preserving global consistency across agents. We also introduce mechanisms, including write-ahead logging, rollback, and secure caches, that further enhance resilience against runtime errors and policy violations. To validate performance, we build SAFEFLOWBENCH, a comprehensive benchmark suite designed to evaluate agent reliability under adversarial, noisy, and concurrent operational conditions. Extensive experiments demonstrate that agents built with SAFEFLOW maintain impressive task performance and security guarantees even in hostile environments, substantially outperforming the state of the art. Together, SAFEFLOW and SAFEFLOWBENCH lay the groundwork for principled, robust, and secure agent ecosystems, advancing the frontier of reliable autonomy.
Chinese: 本文提出SAFEFLOW协议级框架,通过细粒度信息流控制和鲁棒的多智能体协调机制,提升基于大语言模型与视觉语言模型的智能体安全性与可靠性,并借助SAFEFLOWBENCH基准验证其卓越性能。
English: This paper introduces SAFEFLOW, a protocol-level framework that enhances the security and reliability of LLM/VLM-based agents through fine-grained information flow control and robust multi-agent coordination mechanisms, validated by the SAFEFLOWBENCH benchmark to outperform existing methods.
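A toy flavor of fine-grained information flow control: data carries an integrity label, and a sink refuses inputs below its required level. The four-level lattice and guard API below are invented for illustration and are not SAFEFLOW's actual protocol.

from enum import IntEnum

class Integrity(IntEnum):
    UNTRUSTED = 0    # e.g. scraped web content
    TOOL = 1         # output of an external tool
    USER = 2         # direct user input
    SYSTEM = 3       # trusted system configuration

def guard(dest_min: Integrity):
    # Reject data whose integrity label is below the sink's requirement.
    def check(value, label: Integrity):
        if label < dest_min:
            raise PermissionError(f"label {label.name} below {dest_min.name}")
        return value
    return check

high_integrity_sink = guard(Integrity.USER)
print(high_integrity_sink("approve refund", Integrity.USER))       # allowed
# high_integrity_sink("ignore previous instructions", Integrity.UNTRUSTED)  # raises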
Authors:Shakir Yousefi, Andreas Plesner, Till Aczel, Roger Wattenhofer
Abstract:
Modern neural networks demonstrate state-of-the-art performance on numerous existing benchmarks; however, their high computational requirements and energy consumption prompt researchers to seek more efficient solutions for real-world deployment. Logic gate networks (LGNs) learn a large network of logic gates for efficient image classification. However, learning a network that can solve even a simple problem like CIFAR-10 can take days to weeks. Even then, almost half of the network remains unused, causing a discretization gap. This discretization gap hinders real-world deployment of LGNs, as the performance drop between training and inference negatively impacts accuracy. We inject Gumbel noise with a straight-through estimator during training to significantly speed up training, improve neuron utilization, and decrease the discretization gap. We theoretically show that this results from implicit Hessian regularization, which improves the convergence properties of LGNs. We train networks 4.5× faster in wall-clock time, reduce the discretization gap by 98%, and reduce the number of unused gates by 100%.
Chinese: 研究人员采用Gumbel噪声和直通估计器的方法,将逻辑门网络训练速度提升4.5倍,离散化间隙减少98%,并实现门电路完全利用,显著提升了实际部署的效率和性能。
English: Researchers have developed a method using Gumbel noise and a straight-through estimator to accelerate logic gate network training by 4.5 times, reduce the discretization gap by 98%, and fully utilize all gates, thereby enhancing efficiency and performance for real-world deployment.
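The Gumbel-plus-straight-through mechanism is the part that is easy to sketch: a hard one-hot gate choice in the forward pass with soft gradients in the backward pass. The 4-gate candidate set and relaxed gate functions below are simplifications for illustration.

import torch
import torch.nn.functional as F

# A single learnable logic "neuron" choosing among candidate two-input gates,
# each written as a real-valued relaxation on inputs in [0, 1].
GATES = [
    lambda a, b: a * b,                  # AND
    lambda a, b: a + b - a * b,          # OR
    lambda a, b: a + b - 2 * a * b,      # XOR (relaxed)
    lambda a, b: 1 - a * b,              # NAND
]

def logic_neuron(a, b, logits, tau=1.0):
    # hard=True returns a one-hot sample in the forward pass while routing
    # gradients through the underlying softmax (straight-through estimator).
    w = F.gumbel_softmax(logits, tau=tau, hard=True)
    outs = torch.stack([g(a, b) for g in GATES], dim=-1)
    return (outs * w).sum(-1)

logits = torch.zeros(4, requires_grad=True)
a, b = torch.tensor([1.0, 0.0]), torch.tensor([1.0, 1.0])
y = logic_neuron(a, b, logits)
y.sum().backward()                       # gradients flow into the gate logits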
Authors:Bolin Chen, Shanzhi Yin, Goluck Konuko, Giuseppe Valenzise, Zihan Zhang, Shiqi Wang, Yan Ye
Abstract:
The rise of deep generative models has greatly advanced video compression, reshaping the paradigm of face video coding through their powerful capability for semantic-aware representation and lifelike synthesis. Generative Face Video Coding (GFVC) stands at the forefront of this revolution: it characterizes complex facial dynamics as compact latent codes for bitstream compactness at the encoder side and leverages powerful deep generative models to reconstruct high-fidelity face signals from the compressed latent codes at the decoder side. As such, this well-designed GFVC paradigm enables high-fidelity face video communication at ultra-low bitrate ranges, far surpassing the capabilities of the latest Versatile Video Coding (VVC) standard. To pioneer foundational research and accelerate the evolution of GFVC, this paper presents the first comprehensive survey of GFVC technologies, systematically bridging critical gaps between theoretical innovation and industrial standardization. In particular, we first review a broad range of existing GFVC methods with different feature representations and optimization strategies, and conduct a thorough benchmarking analysis. In addition, we construct a large-scale GFVC-compressed face video database with subjective Mean Opinion Scores (MOSs) based on human perception, aiming to identify the most appropriate quality metrics tailored to GFVC. Moreover, we summarize the GFVC standardization potential with a unified high-level syntax and develop a low-complexity GFVC system, both of which are expected to push forward future practical deployments and applications. Finally, we envision the potential of GFVC in industrial applications and deliberate on the current challenges and future opportunities.
Chinese: 生成式人脸视频编码(GFVC)通过深度生成模型在超低码率下实现高保真人脸视频通信,超越了传统视频编码标准,而本综述通过系统分析和标准化探索,搭建了理论创新与工业应用之间的桥梁。
English: Generative Face Video Coding (GFVC) revolutionizes video compression by using deep generative models to enable high-fidelity face video communication at ultra-low bitrates, surpassing traditional standards like VVC, while this survey bridges theoretical innovations with industrial applications through comprehensive analysis and standardization efforts.
Authors:Yu Xuejun, Jianyuan Zhong, Zijin Feng, Pengyi Zhai, Roozbeh Yousefzadeh, Wei Chong Ng, Haoxiong Liu, Ziyi Shou, Jing Xiong, Yudong Zhou, Claudia Beth Ong, Austen Jeremy Sugiarto, Yaoxi Zhang, Wai Ming Tai, Huan Cao, Dongcai Lu, Jiacheng Sun, Qiang Xu, Shen Xin, Zhenguo Li
Abstract:
Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer using reinforcement learning to enhance the formalization ability of natural language problems, aided by our novel LeanScorer framework for nuanced formalization quality assessment. It also proposes a Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex problems from China's national college entrance exam. Our approach is carefully designed, with a thorough study of each component. Experiments demonstrate Mathesis's effectiveness, with the autoformalizer outperforming the best baseline by 22% in pass-rate on Gaokao-Formal. The full system surpasses other model combinations, achieving 64% accuracy on MiniF2F with pass@32 and a state-of-the-art 18% on Gaokao-Formal.
Chinese: Mathesis提出了首个端到端的定理证明流程,通过强化学习自动形式化自然语言问题并生成形式化证明,在高考形式化基准测试和MiniF2F上取得了最先进的成果。
English: Mathesis introduces the first end-to-end theorem proving pipeline that autoformalizes natural language problems using reinforcement learning and generates formal proofs, achieving state-of-the-art results on benchmarks like Gaokao-Formal and MiniF2F.
Authors:Yanting Gao, Yepeng Liu, Junming Liu, Qi Zhang, Hongyun Zhang, Duoqian Miao, Cairong Zhao
Abstract:
Exploring effective and transferable adversarial examples is vital for understanding the characteristics and mechanisms of Vision Transformers (ViTs). However, adversarial examples generated from surrogate models often exhibit weak transferability in black-box settings due to overfitting. Existing methods improve transferability by diversifying perturbation inputs or applying uniform gradient regularization within surrogate models, yet they have not fully leveraged the shared and unique features of surrogate models trained on the same task, leading to suboptimal transfer performance. Therefore, enhancing perturbations of common information shared by surrogate models and suppressing those tied to individual characteristics offers an effective way to improve transferability. Accordingly, we propose a commonality-oriented gradient optimization strategy (COGO) consisting of two components: Commonality Enhancement (CE) and Individuality Suppression (IS). CE perturbs the mid-to-low frequency regions, leveraging the fact that ViTs trained on the same dataset tend to rely more on mid-to-low frequency information for classification. IS employs adaptive thresholds to evaluate the correlation between backpropagated gradients and model individuality, assigning weights to gradients accordingly. Extensive experiments demonstrate that COGO significantly improves the transfer success rates of adversarial attacks, outperforming current state-of-the-art methods.
Chinese: 提出的共性导向梯度优化(COGO)策略通过增强视觉变换器共有的中低频特征并抑制模型特异性,显著提升了黑盒攻击中对抗样本的迁移成功率,其表现优于现有最优方法。
English: The proposed commonality-oriented gradient optimization (COGO) strategy enhances adversarial example transferability by amplifying shared mid-to-low frequency features across Vision Transformers while suppressing model-specific characteristics, significantly outperforming existing methods in black-box attacks.
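The Commonality Enhancement idea of concentrating perturbations in mid-to-low frequencies can be sketched with an FFT mask. The disk-shaped mask and the radius fraction below are illustrative assumptions, not COGO's exact construction.

import numpy as np

def lowfreq_perturb(grad, keep=0.25):
    # Keep only the low-frequency part of a perturbation via a centered FFT mask.
    F = np.fft.fftshift(np.fft.fft2(grad, axes=(0, 1)), axes=(0, 1))
    h, w = grad.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    r = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = (r <= keep * min(h, w) / 2)[..., None]   # low-frequency disk
    filtered = np.fft.ifft2(np.fft.ifftshift(F * mask, axes=(0, 1)), axes=(0, 1))
    return np.real(filtered)

g = np.random.randn(32, 32, 3)
g_low = lowfreq_perturb(g)   # smooth perturbation component shared across models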
Authors:George Lydakis, Alexander Hermans, Ali Athar, Daan de Geus, Bastian Leibe
Abstract:
Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video-specific training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recent LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Additionally, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline results in temporal reasoning performance close to, and occasionally higher than, what is achieved by video-trained LLMs. This suggests suboptimal utilization of rich temporal features found in real video by current models. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.
Chinese: 视频大语言模型在仅图像训练后展现出超预期的时序推理能力,而视频专项训练带来的提升微乎其微,表明现有模型未能充分利用视频中的时序特征。
English: Video LLMs demonstrate unexpected temporal reasoning capabilities after image-only training, with minimal gains from video-specific training, suggesting current models underutilize temporal features in videos.
Authors:Ruoxuan Zhang, Jidong Gao, Bin Wen, Hongxia Xie, Chenming Zhang, Hong-Han Shuai, Wen-Huang Cheng
Abstract:
Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. Project page is available now.
Chinese: RecipeGen提出了首个基于食谱的多模态生成大规模基准,通过丰富的食谱数据和针对性评估指标,解决了现有数据集在步骤与视觉内容细粒度对齐方面的不足。
English: RecipeGen introduces the first comprehensive benchmark for recipe-based multimodal generation, addressing the lack of fine-grained alignment in existing datasets through extensive recipe data and specialized evaluation metrics.
Authors:Yitao Liu, Chenglei Si, Karthik Narasimhan, Shunyu Yao
Abstract:
Large language model (LLM) agents have been applied to sequential decision-making tasks such as web navigation, but without any environment-specific experiences, they often fail in these complex tasks. Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment-specific experiences. To address this, we propose Contextual Experience Replay (CER), a training-free framework to enable efficient self-improvement for language agents in their context window. Specifically, CER accumulates and synthesizes past experiences into a dynamic memory buffer. These experiences encompass environment dynamics and common decision-making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new tasks, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. On VisualWebArena, CER achieves a competitive success rate of 31.9%. On WebArena, CER also achieves a competitive average success rate of 36.7%, relatively improving the success rate of the GPT-4o agent baseline by 51.0%. We also conduct a comprehensive analysis to demonstrate its efficiency and validity and to understand it better.
Chinese: 提出的情境经验回放(CER)框架通过将过往经验整合到动态记忆缓冲区,使大语言模型智能体能够实现自我改进,在VisualWebArena和WebArena基准测试中分别取得31.9%和36.7%的成功率。
English: The proposed Contextual Experience Replay (CER) framework enables large language model agents to self-improve by synthesizing past experiences into a dynamic memory buffer, achieving success rates of 31.9% on VisualWebArena and 36.7% on WebArena, a 51.0% relative improvement over the GPT-4o agent baseline.
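A minimal sketch of the dynamic memory buffer, with a bag-of-words overlap retriever standing in for CER's actual synthesis and retrieval; the class name, scoring, and capacity rule are assumptions for illustration.

class ExperienceBuffer:
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.entries = []                       # (task description, distilled lesson)

    def add(self, task, lesson):
        self.entries.append((task, lesson))
        self.entries = self.entries[-self.capacity:]   # keep the most recent

    def retrieve(self, query, k=3):
        # Rank stored experiences by word overlap with the new task.
        q = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(q & set(e[0].lower().split())),
                        reverse=True)
        return [lesson for _, lesson in scored[:k]]

buf = ExperienceBuffer()
buf.add("search for a product on a shopping site", "use the site search box first")
buf.add("submit a web form", "fill required fields before clicking submit")
print(buf.retrieve("find a product to buy"))   # lessons to prepend to the context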
Authors:Wendi Sang, Kai Li, Runxuan Yang, Jianqiang Huang, Xiaolin Hu
Abstract:
Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixed signal by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods exhibit complex architectures and rely on future context, operating offline, which renders them unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which enhances the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs Grouped SRUs to integrate historical information across different feature spaces, thereby improving the utilization efficiency of historical information. We further propose a causal transformation template to facilitate the conversion of non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrate that, under causal conditions, our proposed Swift-Net exhibits outstanding performance, highlighting the potential of this method for processing speech in complex environments.
Chinese: Swift-Net是一种新型的流式视听语音分离模型,通过轻量级视觉特征提取、高效的视听融合及因果转换技术实现实时处理,在多个基准数据集上展现出卓越性能。
English: Swift-Net is a novel streaming audio-visual speech separation model that enables real-time processing through lightweight visual feature extraction, efficient audio-visual fusion, and causal transformation techniques, demonstrating superior performance on benchmark datasets.
Authors:Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji
Abstract:
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method.
Chinese: 本研究提出E2H Reasoner方法,通过从易到难的课程学习安排任务来增强语言模型的推理能力,实验证明该方法能显著提升小型模型在多领域的推理表现。
English: This study introduces E2H Reasoner, a curriculum learning approach that schedules tasks from easy to hard to enhance language models' reasoning abilities through reinforcement learning, demonstrating improved performance for small models across multiple domains.
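An easy-to-hard task sampler with fading, in the spirit of the abstract's scheduling: easy levels dominate early and are phased out later to prevent overfitting. The triangular weights and fade point are illustrative assumptions, not the paper's schedule.

import random

def e2h_sample(tasks_by_level, progress, fade=0.7):
    # tasks_by_level: {0: easy tasks, 1: medium, 2: hard}; progress in [0, 1].
    levels = sorted(tasks_by_level)
    weights = []
    for lvl in levels:
        center = lvl / max(len(levels) - 1, 1)        # where this level peaks
        w = max(0.05, 1.0 - abs(progress - center) * 2)
        if progress > fade and lvl == 0:
            w = 0.0                                   # fade easy tasks out entirely
        weights.append(w)
    chosen = random.choices(levels, weights=weights)[0]
    return random.choice(tasks_by_level[chosen])

tasks = {0: ["2+2"], 1: ["solve x^2=9"], 2: ["prove AM-GM"]}
print(e2h_sample(tasks, progress=0.1))   # almost always an easy task
print(e2h_sample(tasks, progress=0.9))   # easy tasks are gone by now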
Authors:Jiaxin Pan, Mojtaba Nayyeri, Osama Mohammed, Daniel Hernandez, Rongchuan Zhang, Cheng Cheng, Steffen Staab
Abstract:
Temporal Knowledge Graphs (TKGs) store temporal facts with quadruple formats (s, p, o, t). Existing Temporal Knowledge Graph Embedding (TKGE) models perform link prediction tasks in transductive or semi-inductive settings, which means the entities, relations, and temporal information in the test graph are fully or partially observed during training. Such reliance on seen elements during inference limits the models' ability to transfer to new domains and generalize to real-world scenarios. A central limitation is the difficulty in learning representations for entities, relations, and timestamps that are transferable and not tied to dataset-specific vocabularies. To overcome these limitations, we introduce the first fully-inductive approach to temporal knowledge graph link prediction. Our model employs sinusoidal positional encodings to capture fine-grained temporal patterns and generates adaptive entity and relation representations using message passing conditioned on both local and global temporal contexts. Our model design is agnostic to temporal granularity and time span, effectively addressing temporal discrepancies across TKGs and facilitating time-aware structural information transfer. As a pretrained, scalable, and transferable model, POSTRA demonstrates strong zero-shot performance on unseen temporal knowledge graphs, effectively generalizing to novel entities, relations, and timestamps. Extensive theoretical analysis and empirical results show that a single pretrained model can improve zero-shot performance on various inductive temporal reasoning scenarios, marking a significant step toward a foundation model for temporal KGs.
Chinese: 本文提出了首个全归纳时序知识图谱链接预测模型POSTRA,通过正弦位置编码和时序上下文消息传递,实现了对未见实体、关系及时间戳的零样本泛化,有效解决了跨图谱的时序差异问题。
English: This paper introduces POSTRA, the first fully-inductive model for temporal knowledge graph link prediction that uses sinusoidal encodings and message passing to achieve strong zero-shot generalization to unseen entities, relations, and timestamps across different temporal contexts.
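A sinusoidal timestamp encoding in the spirit of the abstract, which is vocabulary-free by construction; the dimension and max period below are assumptions, not POSTRA's settings.

import numpy as np

def time_encoding(t, dim=16, max_period=10000.0):
    # Sinusoidal encoding of a scalar timestamp; dim must be even.
    i = np.arange(dim // 2)
    freqs = 1.0 / (max_period ** (2 * i / dim))
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Timestamps at different granularities (days vs. seconds) still map to
# comparable fixed-size feature vectors with no learned time vocabulary.
print(time_encoding(3).round(3))
print(time_encoding(259200).round(3))   # 3 days expressed in seconds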
Authors:Jiyao Wang, Suzan Ayas, Jiahao Zhang, Xiao Wen, Dengbo He, Birsen Donmez
Abstract:
Accurately detecting drowsiness is vital to driving safety. Among all measures, physiological-signal-based drowsiness monitoring can be more privacy-preserving than a camera-based approach. However, conflicts exist regarding how physiological metrics are associated with different drowsiness labels across datasets. Thus, we analyzed key features from electrocardiograms (ECG), electrodermal activity (EDA), and respiratory (RESP) signals across four datasets, where different drowsiness inducers (such as fatigue and low arousal) and assessment methods (subjective vs. objective) were used. Binary logistic regression models were built to identify the physiological metrics that are associated with drowsiness. Findings indicate that different drowsiness inducers can lead to distinct physiological responses, and objective assessments were more sensitive than subjective ones in detecting drowsiness. Further, increased heart rate stability, reduced respiratory amplitude, and decreased tonic EDA are robustly associated with increased drowsiness. The results enhance understanding of drowsiness detection and can inform future generalizable monitoring designs.
Chinese: 本研究发现不同疲劳诱发因素会导致不同的生理反应,客观评估比主观评估更敏感,并指出心率稳定性增加、呼吸幅度降低和基础皮电活动减少是疲劳的可靠指标。
English: This study identifies that different drowsiness inducers trigger distinct physiological responses, with objective assessments proving more sensitive than subjective ones, and highlights increased heart rate stability, reduced respiratory amplitude, and decreased tonic EDA as robust indicators of drowsiness.
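The analysis pipeline (binary logistic regression over physiological features) is easy to sketch on synthetic data. The synthetic effect directions below merely mirror the reported associations and are not the study's data.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature columns: [heart-rate variability, respiratory amplitude,
# tonic EDA]; label 1 = drowsy. Synthetic stand-in for the ECG/EDA/RESP features.
rng = np.random.default_rng(0)
n = 200
drowsy = rng.integers(0, 2, n)
X = np.column_stack([
    -0.8 * drowsy + rng.normal(0, 1, n),   # HRV drops (HR more stable) when drowsy
    -0.6 * drowsy + rng.normal(0, 1, n),   # respiratory amplitude drops
    -0.7 * drowsy + rng.normal(0, 1, n),   # tonic EDA drops
])
clf = LogisticRegression().fit(X, drowsy)
print(clf.coef_.round(2))   # negative coefficients mirror the reported directions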
Authors:Leon Mayer, Tim Rädsch, Dominik Michael, Lucas Luttner, Amine Yamlahi, Evangelia Christodoulou, Patrick Godau, Marcel Knopp, Annika Reinke, Fiona Kolbinger, Lena Maier-Hein
Abstract:
While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform compared to generalist models across both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advancements to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.
Chinese: 视觉语言模型在基础手术感知任务中表现良好,但在需要医学知识的任务上表现显著下降,且专业医疗模型目前表现不及通用模型,突显了需进一步优化以适应手术环境复杂性的必要性。
English: Vision Language Models (VLMs) perform well on basic surgical perception tasks but struggle with medical knowledge-dependent tasks, with specialized medical models currently underperforming generalist ones, highlighting the need for further development to address surgical complexities.
Authors:Yujia Huo, Jianchun Liu, Hongli Xu, Zhenguo Ma, Shilong Wang, Liusheng Huang
Abstract:
Federated fine-tuning (FedFT) of large language models (LLMs) has emerged as a promising solution for adapting models to distributed data environments while ensuring data privacy.
Existing FedFT methods predominantly utilize parameter-efficient fine-tuning (PEFT) techniques to reduce communication and computation overhead.
However, they often fail to adequately address catastrophic forgetting, a critical challenge arising from continual adaptation in distributed environments. Traditional centralized fine-tuning methods, which are not designed for the heterogeneous and privacy-constrained nature of federated environments, struggle to mitigate this issue effectively. Moreover, the challenge is further exacerbated by significant variation in data distributions and device capabilities across clients, which leads to intensified forgetting and degraded model generalization. To tackle these issues, we propose FedBE, a novel FedFT framework that integrates an adaptive transformer block expansion mechanism with a dynamic trainable-block allocation strategy. Specifically, FedBE expands trainable blocks within the model architecture, structurally separating newly learned task-specific knowledge from the original pre-trained representations. Additionally, FedBE dynamically assigns these trainable blocks to clients based on their data distributions and computational capabilities. This enables the framework to better accommodate heterogeneous federated environments and enhances the generalization ability of the model. Extensive experiments show that, compared with existing federated fine-tuning methods, FedBE achieves 12-74% higher accuracy retention on general tasks after fine-tuning and a model convergence acceleration ratio of 1.9-3.1x without degrading the accuracy of downstream tasks.
Chinese: FedBE是一种新颖的联邦微调框架,通过自适应扩展Transformer模块和动态分配可训练参数,有效解决分布式环境中的灾难性遗忘问题,显著提升了模型精度保持率和收敛速度。
English: FedBE is a novel federated fine-tuning framework that addresses catastrophic forgetting in distributed environments by adaptively expanding transformer blocks and dynamically allocating trainable parameters, achieving significant improvements in accuracy retention and convergence speed.
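The dynamic allocation strategy can be sketched as a proportional rule over client capacities. FedBE also conditions on client data distributions, which this toy omits; the proportional rule and field names are illustrative assumptions.

def allocate_blocks(clients, total_blocks=8):
    # Assign each client a number of trainable transformer blocks in
    # proportion to its compute capacity; every client trains at least one.
    total_cap = sum(c["capacity"] for c in clients.values())
    plan = {}
    for name, c in clients.items():
        share = c["capacity"] / total_cap
        plan[name] = max(1, round(share * total_blocks))
    return plan

clients = {"phone": {"capacity": 1.0}, "laptop": {"capacity": 3.0},
           "server": {"capacity": 8.0}}
print(allocate_blocks(clients))   # {'phone': 1, 'laptop': 2, 'server': 5}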
Authors:Yimei Liu, Yakun Ju, Yuan Rao, Hao Fan, Junyu Dong, Feng Gao, Qian Du
Abstract:
Three-dimensional digital urban reconstruction from multi-view aerial images is a critical application where deep multi-view stereo (MVS) methods outperform traditional techniques. However, existing methods commonly overlook the key differences between aerial and close-range settings, such as varying depth ranges along epipolar lines and insensitive feature matching associated with low-detail aerial images. To address these issues, we propose an Adaptive Depth Range MVS (ADR-MVS), which integrates monocular geometric cues to improve multi-view depth estimation accuracy. The key component of ADR-MVS is the depth range predictor, which generates adaptive range maps from depth and normal estimates using cross-attention discrepancy learning. In the first stage, the range map derived from monocular cues breaks through predefined depth boundaries, improving feature-matching discriminability and mitigating convergence to local optima. In later stages, the inferred range maps are progressively narrowed, ultimately aligning with the cascaded MVS framework for precise depth regression. Moreover, a normal-guided cost aggregation operation is specially devised for aerial stereo images to improve geometric awareness within the cost volume. Finally, we introduce a normal-guided depth refinement module that surpasses existing RGB-guided techniques. Experimental results demonstrate that ADR-MVS achieves state-of-the-art performance on the WHU, LuoJia-MVS, and München datasets, while exhibiting favorable computational complexity.
Chinese: 本文提出ADR-MVS方法,通过单目几何线索和交叉注意力差异学习生成自适应深度范围图,有效解决航拍图像深度估计难题,在多个数据集上实现最优性能并显著提升计算效率。
English: This paper introduces ADR-MVS, an adaptive depth range multi-view stereo method that leverages monocular geometric cues and cross-attention learning to enhance depth estimation accuracy for aerial images, achieving state-of-the-art results across multiple datasets with improved computational efficiency.
Authors:Quan Shi, Carlos E. Jimenez, Shunyu Yao, Nick Haber, Diyi Yang, Karthik Narasimhan
Abstract:
Recent advancements in AI reasoning have driven substantial improvements across diverse tasks. A critical open question is whether these improvements also yield better knowledge transfer: the ability of models to communicate reasoning in ways humans can understand, apply, and learn from. To investigate this, we introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for evaluating human-AI knowledge transfer capabilities, and conduct the first large-scale human study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating the influence of model explanations on human understanding. Our findings reveal that although model benchmark performance correlates with collaborative outcomes, this relationship is notably inconsistent, featuring significant outliers, indicating that knowledge transfer requires dedicated optimization. Our analysis identifies behavioral and strategic factors mediating successful knowledge transfer. We release our code, dataset, and evaluation framework to support future work on communicatively aligned models.
Chinese: 近期人工智能推理能力的进步与人类知识传递效果存在不一致性,为此提出的KITE评估框架首次通过大规模实验证明:模型基准性能并不能可靠预测人类从AI解释中学习的效果。
English: Recent AI reasoning advances show inconsistent knowledge transfer to humans, prompting the development of the KITE framework which reveals that benchmark performance doesn't reliably predict human learning from AI explanations.
Authors:Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum
Abstract:
Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
Chinese: VideoMarathon数据集通过提供9700小时多样化视频和330万问答对,解决了长视频标注稀缺的问题,并在此基础上开发了Hour-LLaVA模型,该模型采用记忆增强机制,在多个长视频语言理解基准测试中取得了最优性能。
English: The VideoMarathon dataset addresses the scarcity of long video annotations by providing 9,700 hours of diverse videos and 3.3 million QA pairs, enabling the development of Hour-LLaVA, a memory-enhanced model that achieves top performance in long video-language understanding tasks.
Authors:Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, Tong Lu
Abstract:
Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve the model's counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
Chinese: 当前多模态大语言模型在视频计数任务中存在不足,为此开发了CG-AV-Counting基准和AV-Reasoner模型,该模型虽取得领先性能,但在跨领域泛化方面仍有局限。
English: Current MLLMs face challenges in video counting tasks, leading to the creation of the CG-AV-Counting benchmark and AV-Reasoner model, which achieves top results but struggles with out-of-domain generalization.
Authors:Tyler Chen, Akshay Seshadri, Mattia J. Villani, Pradeep Niroula, Shouvanik Chakrabarti, Archan Ray, Pranav Deshpande, Romina Yalovetzky, Marco Pistoia, Niraj Kumar
Abstract:
Shapley values have emerged as a critical tool for explaining which features impact the decisions made by machine learning models. However, computing exact Shapley values is difficult, generally requiring an exponential (in the feature dimension) number of model evaluations. To address this, many model-agnostic randomized estimators have been developed, the most influential and widely used being the KernelSHAP method (Lundberg & Lee, 2017). While related estimators such as unbiased KernelSHAP (Covert & Lee, 2021) and LeverageSHAP (Musco & Witter, 2025) are known to satisfy theoretical guarantees, bounds for KernelSHAP have remained elusive. We describe a broad and unified framework that encompasses KernelSHAP and related estimators constructed using both with- and without-replacement sampling strategies. We then prove strong non-asymptotic theoretical guarantees that apply to all estimators from our framework. This provides, to the best of our knowledge, the first theoretical guarantees for KernelSHAP and sheds further light on tradeoffs between existing estimators. Through comprehensive benchmarking on small and medium dimensional datasets for Decision-Tree models, we validate our approach against exact Shapley values, consistently achieving low mean squared error with modest sample sizes. Furthermore, we make specific implementation improvements to enable scalability of our methods to high-dimensional datasets. Our methods, tested on datasets such as MNIST and CIFAR10, provide consistently better results compared to the KernelSHAP library.
Chinese: 本文提出了一个包含KernelSHAP在内的沙普利值估计器统一框架,首次为KernelSHAP提供了理论保证,并在多个数据集上展示了改进的性能和可扩展性。
English: This paper introduces a unified framework for Shapley value estimators, including KernelSHAP, and provides the first theoretical guarantees for KernelSHAP while demonstrating improved performance and scalability across various datasets.
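For intuition, here is a generic permutation-sampling Shapley estimator. Note that KernelSHAP itself is a weighted regression estimator, which the paper's framework covers; this sketch is only the simplest member of the sampling family, with a toy model and baseline chosen for illustration.

import numpy as np

def shapley_permutation(model, x, baseline, n_perm=200, rng=None):
    # Estimate Shapley values by averaging marginal contributions over random
    # feature orderings; features outside the coalition take baseline values.
    rng = rng or np.random.default_rng(0)
    d = len(x)
    phi = np.zeros(d)
    for _ in range(n_perm):
        perm = rng.permutation(d)
        z = baseline.copy()
        prev = model(z)
        for j in perm:
            z[j] = x[j]                   # add feature j to the coalition
            curr = model(z)
            phi[j] += curr - prev         # marginal contribution of j
            prev = curr
    return phi / n_perm

model = lambda v: 2 * v[0] + v[1] * v[2]  # toy model with an interaction term
x, base = np.array([1.0, 1.0, 3.0]), np.zeros(3)
print(shapley_permutation(model, x, base).round(2))  # ~[2.0, 1.5, 1.5]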
Authors:Yuzhi Huang, Chenxin Li, Haitao Zhang, Zixu Lin, Yunlong Lin, Hengyu Liu, Wuyang Li, Xinyu Liu, Jiechao Gao, Yue Huang, Xinghao Ding, Yixuan Yuan
Abstract:
Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalous objects in videos -- either by identifying anomalous frames or objects -- they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose a new framework called Track Any Anomalous Object (TAO), which introduces a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework. Unlike methods that assign anomaly scores to every pixel, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to downstream tasks such as segmentation and tracking, our method removes the need for threshold tuning and achieves more precise anomaly localization in long and complex video sequences. Experiments demonstrate that TAO sets new benchmarks in accuracy and robustness. Project page available online.
Chinese: 提出的Track Any Anomalous Object (TAO)框架引入了一种细粒度视频异常检测流程,将精细对象检测与像素级追踪相结合,无需阈值调优即可在复杂视频序列中实现更优的准确性和鲁棒性。
English: The proposed Track Any Anomalous Object (TAO) framework introduces a granular video anomaly detection pipeline that integrates fine-grained object detection with pixel-level tracking, eliminating the need for threshold tuning and achieving superior accuracy and robustness in complex video sequences.
Authors:Srikar Yellapragada, Alexandros Graikos, Zilinghan Li, Kostas Triaridis, Varun Belagali, Saarthak Kapse, Tarak Nath Nandi, Ravi K Madduri, Prateek Prasanna, Tahsin Kurc, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras
Abstract:
The digitization of histology slides has revolutionized pathology, providing massive datasets for cancer diagnosis and research. Contrastive self-supervised and vision-language models have been shown to effectively mine large pathology datasets to learn discriminative representations. On the other hand, generative models, capable of synthesizing realistic and diverse images, present a compelling solution to address unique problems in pathology that involve synthesizing images; overcoming annotated data scarcity, enabling privacy-preserving data sharing, and performing inherently generative tasks, such as virtual staining. We introduce PixCell, the first diffusion-based generative foundation model for histopathology. We train PixCell on PanCan-30M, a vast, diverse dataset derived from 69,184 H&E-stained whole slide images covering various cancer types. We employ a progressive training strategy and a self-supervision-based conditioning that allows us to scale up training without any annotated data. PixCell generates diverse and high-quality images across multiple cancer types, which we find can be used in place of real data to train a self-supervised discriminative model. Synthetic images shared between institutions are subject to fewer regulatory barriers than would be the case with real clinical images. Furthermore, we showcase the ability to precisely control image generation using a small set of annotated images, which can be used for both data augmentation and educational purposes. Testing on a cell segmentation task, a mask-guided PixCell enables targeted data augmentation, improving downstream performance. Finally, we demonstrate PixCell's ability to use H&E structural staining to infer results from molecular marker studies; we use this capability to infer IHC staining from H&E images. Our trained models are publicly released to accelerate research in computational pathology.
Chinese: PixCell是首个基于扩散的组织病理学生成基础模型,通过无标注的大规模数据集训练,能合成高质量多样化图像,可替代真实数据训练判别模型、实现隐私保护的数据共享,并支持虚拟染色和针对性数据增强等任务。
English: PixCell is the first diffusion-based generative foundation model for histopathology, trained on a vast dataset without annotations to synthesize high-quality, diverse images that can replace real data for training discriminative models, facilitate privacy-preserving data sharing, and enable tasks like virtual staining and targeted data augmentation.
Authors:Mehdi Azarafza, Mojtaba Nayyeri, Faezeh Pasandideh, Steffen Staab, Achim Rettberg
Abstract:
Autonomous UAV operation necessitates reliable mathematical reasoning for tasks such as trajectory planning and power management. While traditional flight control relies on hardcoded equations, recent Large Language Models (LLMs) offer potential for more flexible problem-solving but struggle with reliably selecting and applying correct mathematical formulations and executing precise multi-step arithmetic. We propose RAG-UAV, a retrieval-augmented generation framework designed to improve the mathematical reasoning of several LLMs (including GPT o1/Turbo, Llama-3.2/3.3, Mistral, and DeepSeek R1) in UAV-specific contexts by providing access to relevant domain literature. To conduct an initial assessment, we introduce the UAV-Math-Bench, a 20-question problem set of UAV-centric mathematical problems across four difficulty levels. Our experiments demonstrate that incorporating retrieval substantially increases exact answer accuracy (achieving up to 75% with o1), reduces instances of incorrect formulation selection (from 25% without RAG to 5% with RAG), and decreases numerical errors, reducing Mean Squared Error (MSE) by orders of magnitude for the best-performing models. This pilot study indicates that RAG can enable general-purpose LLMs to function as more reliable tools for engineering analysis, although direct real-time flight control requires further investigation and validation on a larger scale. All benchmark data, questions, and answers are publicly available.
Chinese: 本研究提出RAG-UAV检索增强生成框架,通过提供领域文献来增强大型语言模型在无人机应用中的数学推理能力,在UAV-Math-Bench测试中显著提高了准确率并降低了错误率。
English: This study introduces RAG-UAV, a retrieval-augmented generation framework that enhances the mathematical reasoning of large language models for UAV applications by providing domain-specific literature, significantly improving accuracy and reducing errors as demonstrated on the UAV-Math-Bench.
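The retrieval-augmentation loop can be sketched with a toy overlap retriever and prompt assembly. The corpus snippets, scoring, and prompt format below are invented for illustration and are not the paper's pipeline.

def retrieve(corpus, query, k=2):
    # Rank domain-literature snippets by word overlap with the query.
    q = set(query.lower().split())
    return sorted(corpus, key=lambda s: -len(q & set(s.lower().split())))[:k]

corpus = [
    "Hover power scales with thrust^1.5 over sqrt(2 rho A).",
    "Range R = v * E / P for constant cruise power P and energy E.",
    "Shannon capacity C = B log2(1 + SNR).",
]
question = "What is the maximum range given battery energy E and cruise power P?"
context = "\n".join(retrieve(corpus, question))
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer with the formula."
print(prompt)   # the retrieved formulas steer the LLM toward the right formulation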
Authors:Andres Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esau Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
Chinese: 本文提出了一种增量半监督学习流程,通过整合少量领域内标注数据和辅助数据集,并应用基于共识或命名实体识别的过滤方法来优化伪标签选择,相比单步微调显著提升了模型性能。
English: This paper introduces an incremental semi-supervised learning pipeline that enhances fine-tuning of ASR models by integrating limited in-domain labeled data with auxiliary datasets and applying consensus-based or NER filtering to improve pseudo-label selection, achieving significant performance gains over single-step fine-tuning.
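Multi-model consensus filtering can be sketched as agreement among the systems' transcripts for each utterance. Exact-match agreement is a simplification of the consensus idea described; identifiers and the agreement threshold are illustrative.

def consensus_filter(hypotheses, min_agreement=2):
    # hypotheses: {utterance_id: [transcript from each ASR system]}.
    # Keep an utterance only when enough systems produce an identical transcript.
    kept = {}
    for utt, hyps in hypotheses.items():
        normed = [h.lower().strip() for h in hyps]
        best = max(set(normed), key=normed.count)
        if normed.count(best) >= min_agreement:
            kept[utt] = best              # pseudo-label = agreed transcript
    return kept

hyps = {
    "utt1": ["thanks for calling", "thanks for calling", "thanks for culling"],
    "utt2": ["refund please", "we fund fleas", "refunds lease"],
}
print(consensus_filter(hyps))             # only utt1 survives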
Authors:Vinay Joshi, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Abstract:
The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and mean centering to eliminate separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chains of thought.
Chinese: 提出的TaDA方法通过逐层自适应量化精度和均值中心化技术压缩Transformer的KV缓存,无需单独处理异常值即可将内存占用降至原有的27%,同时保持准确率。
English: The proposed TaDA method compresses the transformer's KV cache by adapting quantization precision per layer and using mean centering, eliminating the need for separate outlier handling while reducing memory usage to 27% of the original with maintained accuracy.
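Mean centering before uniform quantization is easy to demonstrate: subtracting the mean shrinks the dynamic range, so extreme entries need no separate handling. This per-tensor sketch omits TaDA's per-layer precision adaptation; the bit width and offset are illustrative.

import numpy as np

def quantize_centered(x, bits=4):
    # Mean-center the tensor, then uniformly quantize the residual.
    mu = x.mean()
    r = x - mu
    scale = np.abs(r).max() / (2 ** (bits - 1) - 1) + 1e-12
    q = np.clip(np.round(r / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q.astype(np.int8), scale, mu

def dequantize(q, scale, mu):
    return q.astype(np.float32) * scale + mu

x = np.random.randn(1024).astype(np.float32) * 0.1 + 3.0   # offset "key" activations
q, s, mu = quantize_centered(x, bits=4)
err = np.abs(dequantize(q, s, mu) - x).mean()
print(f"mean abs error: {err:.4f}")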
Authors:Xueqiang Xu, Jinfeng Xiao, James Barry, Mohab Elkaref, Jiaru Zou, Pengcheng Jiang, Yunyi Zhang, Max Giammona, Geeth de Mel, Jiawei Han
Abstract:
Entity structure extraction, which aims to extract entities and their associated attribute-value structures from text, is an essential task for text understanding and knowledge graph construction. Existing methods based on large language models (LLMs) typically rely heavily on predefined entity attribute schemas or annotated datasets, often leading to incomplete extraction results. To address these challenges, we introduce Zero-Shot Open-schema Entity Structure Discovery (ZOES), a novel approach to entity structure extraction that does not require any schema or annotated samples. ZOES operates via a principled mechanism of enrichment, refinement, and unification, based on the insight that an entity and its associated structure are mutually reinforcing. Experiments demonstrate that ZOES consistently enhances LLMs' ability to extract more complete entity structures across three different domains, showcasing both the effectiveness and generalizability of the method. These findings suggest that such an enrichment, refinement, and unification mechanism may serve as a principled approach to improving the quality of LLM-based entity structure discovery in various scenarios.
Chinese: ZOES是一种创新的零样本方法,通过丰富、精炼和统一机制,无需预定义模式或标注即可从文本中提取完整的实体结构,在多个领域展现出卓越的性能和泛化能力。
English: ZOES is a novel zero-shot method that employs enrichment, refinement, and unification to extract complete entity structures from text without predefined schemas or annotations, demonstrating superior performance and generalizability across multiple domains.
Authors:Farzad Farhadzadeh, Debasmit Das, Shubhankar Borse, Fatih Porikli
Abstract:
We introduce ProLoRA, enabling zero-shot adaptation of parameter-efficient fine-tuning in text-to-image diffusion models. ProLoRA transfers pre-trained low-rank adjustments (e.g., LoRA) from a source to a target model without additional training data. This overcomes the limitations of traditional methods that require retraining when switching base models, often challenging due to data constraints. ProLoRA achieves this via projection of source adjustments into the target model's weight space, leveraging subspace and null space similarities and selectively targeting aligned layers. Evaluations on established text-to-image models demonstrate successful knowledge transfer and comparable performance without retraining.
Chinese: ProLoRA通过将预训练的低秩调整参数投影到目标模型的权重空间中,实现了文本到图像扩散模型间的零样本适应,无需重新训练即可完成知识迁移并保持相当的生成性能。
English: ProLoRA enables zero-shot adaptation of pre-trained low-rank adjustments between text-to-image diffusion models by projecting source adjustments into the target model's weight space, eliminating the need for retraining while maintaining comparable performance.
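One simplified reading of the projection step: project the source low-rank update onto the target weights' dominant subspaces and re-factor it into low-rank form. The SVD-based construction and rank choices below are assumptions for illustration, not ProLoRA's exact procedure (layer alignment is assumed given).

import numpy as np

def project_lora(W_tgt, A, B, rank=32):
    # delta W = B @ A is the source LoRA update; project it onto the top
    # singular subspaces of the target weight matrix W_tgt.
    U, _, Vt = np.linalg.svd(W_tgt, full_matrices=False)
    P_out = U[:, :rank] @ U[:, :rank].T       # projector onto output subspace
    P_in = Vt[:rank].T @ Vt[:rank]            # projector onto input subspace
    delta_proj = P_out @ (B @ A) @ P_in
    # Re-factor the projected update back into low-rank A', B' via SVD.
    u, s, vt = np.linalg.svd(delta_proj, full_matrices=False)
    r = min(rank, A.shape[0])
    return (np.diag(s[:r]) @ vt[:r]), u[:, :r]   # A', B'

d_out, d_in, r = 64, 48, 8
W_tgt = np.random.randn(d_out, d_in)
A, B = np.random.randn(r, d_in), np.random.randn(d_out, r)
A_new, B_new = project_lora(W_tgt, A, B)
print(A_new.shape, B_new.shape)   # (8, 48) (64, 8)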
Authors:Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Ludwig Schmidt
Abstract:
Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond - improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on https://openthoughts.ai.
中文: OpenThoughts项目通过构建开源推理数据集,其最新模型OpenThoughts3-7B在AIME等核心基准测试中创下最佳性能记录,所有资源已在官网开源。
English: The OpenThoughts project develops open-source datasets for reasoning models, with its latest OpenThoughts3-7B model achieving state-of-the-art results on key benchmarks like AIME and LiveCodeBench through systematic pipeline improvements.
Authors:Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, Yi Ma
Abstract:
Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of \textit{only} self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations \textit{at a linear rate} with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.
Chinese: 本研究通过数学推导提出了一种完全可解释的Transformer架构,其目标是将带噪声的令牌表示压缩至低维子空间,仅使用自注意力和跳跃连接进行迭代去噪,在视觉和语言任务上取得了接近GPT-2等标准模型的性能。
English: This study proposes a fully interpretable transformer architecture by deriving it from a mathematical goal of compressing noisy token representations into low-dimensional subspaces through iterative denoising, achieving performance comparable to standard transformers like GPT-2 with only self-attention and skip connections.
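The unrolled architecture is easy to picture in code: each layer is nothing but multi-head self-attention plus a skip connection, with no MLP sub-block. The PyTorch sketch below shows that shape; the LayerNorm, width, head count, and depth are placeholder choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenoisingAttentionBlock(nn.Module):
    """One unrolled denoising step: tokens are updated by multi-head
    self-attention plus a skip connection. Dimensions are illustrative."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.norm(x)
        out, _ = self.attn(z, z, z, need_weights=False)
        return x + out  # skip connection = one step of iterative denoising

# Stacking L identical steps unrolls the denoising iteration into a network.
model = nn.Sequential(*[DenoisingAttentionBlock() for _ in range(12)])
tokens = torch.randn(2, 64, 256)      # (batch, sequence, dim)
print(model(tokens).shape)            # torch.Size([2, 64, 256])
```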
Authors:Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esau Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
Abstract:
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
中文: 本研究提出一种结合词错误率预测、命名实体识别和字错误率分析的鲁棒数据筛选方法,能从Whisper和Zipformer模型中选取高质量伪标签,仅需原数据量的1.4%即可实现与原数据集相当的语音识别微调效果。
English: This study introduces a robust data filtering method that combines WER prediction, NER, and CER analysis to select high-quality pseudo-labels from Whisper and Zipformer models, enabling effective ASR fine-tuning with only 1.4% of the original data while maintaining comparable performance.
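A filtering rule combining the three signals might look like the sketch below, where `wer_predictor` and `ner_tagger` are hypothetical stand-ins for trained components and the thresholds are invented; `jiwer` is an existing Python package for WER/CER computation.

```python
import jiwer  # existing package for WER/CER computation

def keep_segment(whisper_hyp: str, zipformer_hyp: str,
                 wer_predictor, ner_tagger,
                 cer_max: float = 0.1, wer_max: float = 0.15) -> bool:
    """Keep a pseudo-labeled segment only if (1) the two systems agree
    (low cross-system CER), (2) a learned WER predictor rates the
    transcript as reliable, and (3) both hypotheses carry the same
    named entities. Thresholds and components are illustrative."""
    if jiwer.cer(whisper_hyp, zipformer_hyp) > cer_max:
        return False                 # the decoders disagree too much
    if wer_predictor(whisper_hyp) > wer_max:
        return False                 # predicted transcript quality too low
    return set(ner_tagger(whisper_hyp)) == set(ner_tagger(zipformer_hyp))
```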
Authors:Hansen Feng, Lizhi Wang, Yiqi Huang, Tong Li, Lin Zhu, Hua Huang
Abstract:
The rapid advancement of photography has created a growing demand for a practical blind raw image denoising method. Recently, learning-based methods have become mainstream due to their excellent performance. However, most existing learning-based methods suffer from camera-specific data dependency, resulting in performance drops when applied to data from unknown cameras. To address this challenge, we introduce a novel blind raw image denoising method named YOND, which represents You Only Need a Denoiser. Trained solely on synthetic data, YOND can generalize robustly to noisy raw images captured by diverse unknown cameras. Specifically, we propose three key modules to guarantee the practicality of YOND: coarse-to-fine noise estimation (CNE), expectation-matched variance-stabilizing transform (EM-VST), and SNR-guided denoiser (SNR-Net). Firstly, we propose CNE to identify the camera noise characteristic, refining the estimated noise parameters based on the coarse denoised image. Secondly, we propose EM-VST to eliminate camera-specific data dependency, correcting the bias expectation of VST according to the noisy image. Finally, we propose SNR-Net to offer controllable raw image denoising, supporting adaptive adjustments and manual fine-tuning. Extensive experiments on unknown cameras, along with flexible solutions for challenging cases, demonstrate the superior practicality of our method. The source code will be publicly available at the \href{https://fenghansen.github.io/publication/YOND}{project homepage}.
中文: YOND是一种仅基于合成数据训练的新型盲原始图像去噪方法,通过三个关键模块克服了相机特定依赖性,能够稳健地泛化到各种未知相机。
English: YOND is a novel blind raw image denoising method trained solely on synthetic data that overcomes camera-specific dependency through three key modules, enabling robust generalization to diverse unknown cameras.
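For intuition, a bare-bones VST-based pipeline of this kind is sketched below using the generalized Anscombe transform. The noise parameters are assumed known here, standing in for CNE's estimates; the plain algebraic inverse ignores the expectation bias that YOND's EM-VST is designed to correct, and `gaussian_denoiser` stands in for SNR-Net.

```python
import numpy as np

def gat(x, gain, sigma):
    """Generalized Anscombe transform: approximately turns Poisson-Gaussian
    noise into unit-variance Gaussian noise."""
    return (2.0 / gain) * np.sqrt(
        np.maximum(gain * x + 0.375 * gain ** 2 + sigma ** 2, 0.0)
    )

def inverse_gat(y, gain, sigma):
    # Plain algebraic inverse; it ignores the expectation bias that the
    # paper's EM-VST corrects.
    return ((gain * y / 2.0) ** 2 - 0.375 * gain ** 2 - sigma ** 2) / gain

def denoise_raw(noisy, gaussian_denoiser, gain, sigma):
    """Bare-bones VST pipeline: stabilize, denoise, invert."""
    stabilized = gat(noisy.astype(np.float64), gain, sigma)
    return inverse_gat(gaussian_denoiser(stabilized), gain, sigma)
```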
Authors:Lin Mu, Guowei Chu, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can substantially degrade their performance. Despite advances in prompting techniques, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy specifically designed to enhance the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are then used to construct prompts that automatically correct input errors. In the Guidance stage, RoP generates an optimal guidance prompting based on the corrected input, steering the model toward more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs' robustness against adversarial perturbations. Notably, it maintains model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.
中文: 大型语言模型容易受到输入错误的影响,而本研究提出的提示鲁棒性(RoP)策略通过错误纠正和引导机制,显著提升了模型对抗干扰的稳定性。
English: Large Language Models are vulnerable to input errors, but the proposed Robustness of Prompting (RoP) strategy effectively enhances their resilience through error correction and guidance techniques.
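The two-stage structure translates directly into code. The sketch below chains an error-correction call and a guidance call before answering; the prompt wordings are illustrative, not the paper's templates, and `llm` is any completion callable.

```python
def rop_answer(llm, question: str) -> str:
    """Two-stage prompting in the spirit of RoP: repair the input first,
    then derive guidance from the corrected text before answering."""
    # Error Correction stage: undo typos and scrambled characters.
    corrected = llm(
        "Fix any typos or character-order errors and return only the "
        f"corrected text:\n{question}"
    )
    # Guidance stage: steer the model with a plan built on the clean input.
    plan = llm(f"Write a brief solution plan for:\n{corrected}")
    return llm(f"Question: {corrected}\nPlan: {plan}\nAnswer step by step.")
```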
Authors:Yuntian Wang, Zafer Yilmaz, Yuhang Li, Edward Liu, Eric Ahlberg, Farid Ghahari, Ertugrul Taciroglu, Aydogan Ozcan
Abstract:
Structural Health Monitoring (SHM) is vital for maintaining the safety and longevity of civil infrastructure, yet current solutions remain constrained by cost, power consumption, scalability, and the complexity of data processing. Here, we present a diffractive vibration monitoring system, integrating a jointly optimized diffractive layer with a shallow neural network-based backend to remotely extract 3D structural vibration spectra, offering a low-power, cost-effective and scalable solution. This architecture eliminates the need for dense sensor arrays or extensive data acquisition; instead, it uses a spatially-optimized passive diffractive layer that encodes 3D structural displacements into modulated light, captured by a minimal number of detectors and decoded in real-time by shallow and low-power neural networks to reconstruct the 3D displacement spectra of structures. The diffractive system's efficacy was demonstrated both numerically and experimentally using millimeter-wave illumination on a laboratory-scale building model with a programmable shake table. Our system achieves more than an order-of-magnitude improvement in accuracy over conventional optics or separately trained modules, establishing a foundation for high-throughput 3D monitoring of structures. Beyond SHM, the 3D vibration monitoring capabilities of this cost-effective and data-efficient framework establish a new computational sensing modality with potential applications in disaster resilience, aerospace diagnostics, and autonomous navigation, where energy efficiency, low latency, and high-throughput are critical.
中文: 本研究提出了一种衍射振动监测系统,通过空间优化的被动衍射层与浅层神经网络相结合,远程获取三维结构振动谱,为结构健康监测提供了低功耗、成本效益高且可扩展的解决方案,具备卓越精度和广泛的应用前景。
English: This study introduces a diffractive vibration monitoring system that combines a spatially-optimized passive diffractive layer with shallow neural networks to remotely capture 3D structural vibration spectra, offering a low-power, cost-effective, and scalable solution for Structural Health Monitoring with superior accuracy and broad application potential.
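On the electronic side, the backend is deliberately shallow. The toy PyTorch decoder below conveys only the scale of that component; detector count, spectrum length, and layer widths are invented placeholders, and the jointly optimized diffractive layer, being a physical element, is not modeled.

```python
import torch
import torch.nn as nn

# Toy stand-in for the electronic backend only: a shallow network decoding
# a handful of detector intensities into 3D displacement spectra. All
# sizes here are invented placeholders.
n_detectors, n_freq_bins = 4, 128
decoder = nn.Sequential(
    nn.Linear(n_detectors, 64),
    nn.ReLU(),
    nn.Linear(64, 3 * n_freq_bins),   # x/y/z spectra, flattened
)
readings = torch.rand(1, n_detectors)               # simulated detector outputs
spectra = decoder(readings).view(1, 3, n_freq_bins)
```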
Authors:Tyler Chen, Pradeep Niroula, Archan Ray, Pragna Subrahmanya, Marco Pistoia, Niraj Kumar
Abstract:
A litany of theoretical and numerical results has established the sketch-and-precondition paradigm as a powerful approach to solving large linear regression problems in standard computing environments. Perhaps surprisingly, much less work has been done on understanding how sketch-and-precondition performs on graphics processing unit (GPU) systems. We address this gap by benchmarking an implementation of sketch-and-precondition based on sparse sign-sketches on single and multi-GPU systems. In doing so, we describe a novel, easily parallelized, rejection-sampling based method for generating sparse sign sketches. Our approach, which is particularly well-suited for GPUs, is easily adapted to a variety of computing environments. Taken as a whole, our numerical experiments indicate that sketch-and-precondition with sparse sign sketches is particularly well-suited for GPUs, and may be suitable for use in black-box least-squares solvers.
中文: 本研究在GPU系统上对基于稀疏符号草图的素描与预处理方法进行了基准测试,提出了一种易于并行化的生成方法,并证明其特别适用于GPU且有望用于黑盒最小二乘求解器。
English: The study benchmarks sketch-and-precondition with sparse sign sketches on GPU systems, introducing a parallelizable generation method and demonstrating its suitability for GPUs and potential use in black-box solvers.
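Both ingredients are compact enough to sketch in NumPy/SciPy: a sparse sign sketch whose column supports are drawn by rejection sampling (redraw on collision, which is what makes the generator embarrassingly parallel, shown here as a plain serial loop), followed by a textbook sketch-and-precondition least-squares solve. Sketch size and sparsity below are conventional defaults, not the paper's tuned values.

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, lsqr

def sparse_sign_sketch(d, m, k=8, rng=None):
    """d x m sketch whose columns each hold k entries of +/-1/sqrt(k),
    with row indices drawn by rejection sampling (redraw on collision)."""
    rng = rng or np.random.default_rng()
    rows = np.empty((m, k), dtype=np.int64)
    for j in range(m):
        seen = set()
        while len(seen) < k:                 # reject duplicate row indices
            seen.add(int(rng.integers(d)))
        rows[j] = sorted(seen)
    vals = rng.choice([-1.0, 1.0], size=(m, k)) / np.sqrt(k)
    cols = np.repeat(np.arange(m), k)
    return csr_matrix((vals.ravel(), (rows.ravel(), cols)), shape=(d, m))

def sketch_and_precondition(A, b, sketch_rows=None):
    """Textbook sketch-and-precondition least squares: QR-factor the
    sketched matrix, run LSQR on the right-preconditioned system."""
    m, n = A.shape
    S = sparse_sign_sketch(sketch_rows or 4 * n, m)
    _, R = np.linalg.qr(S @ A)               # small (4n x n) factorization
    op = LinearOperator(
        (m, n),
        matvec=lambda x: A @ solve_triangular(R, x),
        rmatvec=lambda y: solve_triangular(R, A.T @ y, trans='T'),
    )
    y = lsqr(op, b)[0]                       # well-conditioned iteration
    return solve_triangular(R, y)            # undo the preconditioner

# Usage on a random tall problem:
rng = np.random.default_rng(0)
A, b = rng.normal(size=(2000, 50)), rng.normal(size=2000)
x = sketch_and_precondition(A, b)
```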
Authors:Alexandra González, Xavier Franch, David Lo, Silverio Martínez-Fernández
Abstract:
Open-Source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. To address this gap, we derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF). Our repository mining study began with a systematically gathered database of PTMs from the HF API, considering their model card descriptions and metadata, and the abstracts of the associated arXiv papers. We confirmed SE relevance through multiple filtering steps: detecting outliers, identifying near-identical PTMs, and the use of Gemini 2.0 Flash, which was validated with five pilot studies involving three human annotators. This approach uncovered 2,205 SE PTMs. We find that code generation is the most common SE task among PTMs, primarily focusing on software implementation, while requirements engineering and software design activities receive limited attention. In terms of ML tasks, text generation dominates within SE PTMs. Notably, the number of SE PTMs has increased markedly since 2023 Q2. Our classification provides a solid foundation for future automated SE scenarios, such as the sampling and selection of suitable PTMs.
中文摘要:本研究针对软件工程需求开发了涵盖147项任务的分类体系,通过对Hugging Face平台的2205个预训练模型进行分类分析,发现代码生成是最主要的软件工程任务,而需求工程和软件设计领域获得的关注明显不足。
English Summary: The study develops a taxonomy for 147 software engineering tasks and applies it to classify 2,205 pre-trained models from Hugging Face, revealing code generation as the dominant SE task while identifying gaps in requirements engineering and software design coverage.
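The data-gathering step can be approximated with the existing `huggingface_hub` client, as in the snippet below; the search query and printed fields are illustrative, and the study's downstream filtering (outlier detection, near-duplicate removal, validated Gemini 2.0 Flash labeling) is not reproduced here.

```python
from huggingface_hub import HfApi

# Sketch of the data-gathering step only: pull model metadata and card
# data from the Hugging Face Hub for later SE-task classification.
api = HfApi()
for model in api.list_models(search="code generation", cardData=True, limit=10):
    print(model.id, model.pipeline_tag, model.downloads)
```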
Authors:Chunlin Tian, Xinpeng Qin, Kahou Tam, Li Li, Zijian Wang, Yuanzhe Zhao, Minglei Zhang, Chengzhong Xu
Abstract:
Deploying large language models (LLMs) on edge devices is crucial for delivering fast responses and ensuring data privacy. However, the limited storage, weight, and power of edge devices make it difficult to deploy LLM-powered applications. These devices must balance latency requirements with energy consumption and model accuracy. In this paper, we first quantify the challenges of deploying LLMs on off-the-shelf edge devices and then we present CLONE, an in-depth algorithm-hardware co-design at both the model- and system-level that intelligently integrates real-time, energy optimization while maintaining robust generality. In order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 28nm scalable hardware accelerator system. We implement and extensively evaluate CLONE on two off-the-shelf edge platforms. Experiments show that CLONE effectively accelerates the inference process up to 11.92x, and saves energy up to 7.36x, while maintaining high generation quality.
Chinese: 本文提出了CLONE,一种算法与硬件协同设计方案,通过在边缘设备上优化大型语言模型的部署,实现推理速度最高提升11.92倍、能耗最多降低7.36倍,同时保持高性能。
English: This paper introduces CLONE, an algorithm-hardware co-design that optimizes the deployment of large language models on edge devices by accelerating inference up to 11.92 times and reducing energy consumption by up to 7.36 times while maintaining performance.
Authors:Masaki Sakata, Benjamin Heinzerling, Sho Yokoi, Takumi Ito, Kentaro Inui
Abstract:
We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions. We first formulate two problems of entity mentions -- ambiguity and variability -- and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated. Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9. Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers. Additionally, we clarify how the characteristics of entity representations influence word prediction performance. These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.
中文摘要:本研究通过分析语言模型内部表征的聚类模式,评估其如何表示命名实体,发现模型能有效识别实体(精确率/召回率达0.66-0.9),且实体信息在早期网络层中以低维线性子空间形式紧凑编码。
English Summary: This study evaluates how language models internally represent named entities by analyzing clustering patterns in their representations, revealing effective entity identification with precision/recall metrics between 0.66-0.9 and showing entity information is compactly encoded in early model layers.
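The clustering-quality framing can be made concrete with pairwise precision and recall over mention pairs, as sketched below on toy data; this is a generic formulation of such metrics, and the paper's exact definitions, layer choices, and clustering procedure may differ.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def pairwise_precision_recall(pred, gold):
    """Clustering analogue of precision/recall over mention pairs:
    precision = fraction of same-cluster pairs sharing a gold entity,
    recall = fraction of same-entity pairs placed in one cluster."""
    same_pred = same_gold = both = 0
    for i, j in combinations(range(len(gold)), 2):
        p, g = pred[i] == pred[j], gold[i] == gold[j]
        same_pred += p
        same_gold += g
        both += p and g
    return both / max(same_pred, 1), both / max(same_gold, 1)

# Toy stand-in for mention embeddings: 5 entities, 10 mentions each.
rng = np.random.default_rng(0)
entity_ids = np.repeat(np.arange(5), 10)
mention_vecs = rng.normal(size=(50, 16)) + entity_ids[:, None]
labels = KMeans(n_clusters=5, n_init=10).fit_predict(mention_vecs)
print(pairwise_precision_recall(labels, entity_ids))
```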
Authors:Xuewen Luo, Fengze Yang, Fan Ding, Xiangbo Gao, Shuo Xing, Yang Zhou, Zhengzhong Tu, Chenxi Liu
Abstract:
Knowledge-driven autonomous driving systems (ADs) offer powerful reasoning capabilities, but face two critical challenges: limited perception due to the short-sightedness of single-vehicle sensors, and hallucination arising from the lack of real-time environmental grounding. To address these issues, this paper introduces V2X-UniPool, a unified framework that integrates multimodal Vehicle-to-Everything (V2X) data into a time-indexed and language-based knowledge pool. By leveraging a dual-query Retrieval-Augmented Generation (RAG) mechanism, which enables retrieval of both static and dynamic knowledge, our system enables ADs to perform accurate, temporally consistent reasoning over both static environment and dynamic traffic context. Experiments on a real-world cooperative driving dataset demonstrate that V2X-UniPool significantly enhances motion planning accuracy and reasoning capability. Remarkably, it enables even zero-shot vehicle-side models to achieve state-of-the-art performance by leveraging V2X-UniPool, while simultaneously reducing transmission cost by over 99.9\% compared to prior V2X methods.
中文: V2X-UniPool通过整合多模态车联网数据构建知识池,解决了自动驾驶中的感知局限和幻觉问题,显著提升了推理能力和运动规划精度,同时将传输成本降低了99.9%以上。
English: V2X-UniPool is a unified framework that addresses perception limitations and hallucination in autonomous driving by integrating multimodal V2X data into a knowledge pool, enhancing reasoning accuracy and motion planning while drastically cutting transmission costs.
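A minimal version of the dual-query step might look like the following, where the pool entry structure, the time-window filter, and the toy `sim` score are all illustrative assumptions rather than the framework's actual retrieval logic.

```python
def sim(a: str, b: str) -> float:
    """Toy similarity; any sentence-embedding score could replace this."""
    return float(len(set(a.split()) & set(b.split())))

def dual_query_retrieve(static_pool, dynamic_pool, query, t_now,
                        window_s=2.0, k=3):
    """Sketch of a dual-query retrieval step: static environment entries
    are ranked by similarity alone, while dynamic traffic entries are
    first filtered to a recent time window. Entries are assumed to be
    dicts with "text" (and "t" for dynamic ones)."""
    static_hits = sorted(static_pool, key=lambda e: -sim(query, e["text"]))[:k]
    recent = [e for e in dynamic_pool if t_now - e["t"] <= window_s]
    dynamic_hits = sorted(recent, key=lambda e: -sim(query, e["text"]))[:k]
    return static_hits + dynamic_hits   # context handed to the reasoning LLM
```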
Authors:Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song
Abstract:
Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously. Thoroughly assessing their cybersecurity capabilities is critical and urgent, given the high stakes in this domain. However, existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope. To address this gap, we introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities found and patched across 188 large software projects. While it includes tasks of various settings, CyberGym primarily focuses on the generation of proof-of-concept (PoC) tests for vulnerability reproduction, based on text descriptions and corresponding source repositories. Solving this task is particularly challenging, as it requires comprehensive reasoning across entire codebases to locate relevant code fragments and produce effective PoCs that accurately trigger the target vulnerability starting from the program's entry point. Our evaluation across 4 state-of-the-art agent frameworks and 9 LLMs reveals that even the best combination (OpenHands and Claude-3.7-Sonnet) achieves only a 11.9% reproduction success rate, mainly on simpler cases. Beyond reproducing historical vulnerabilities, we find that PoCs generated by LLM agents can reveal new vulnerabilities, identifying 15 zero-days affecting the latest versions of the software projects.
中文: CyberGym是一个包含188个大型软件项目中1,507个真实漏洞的大规模网络安全评估框架,主要任务是根据文本描述和源代码库生成复现漏洞的概念验证测试;即使最佳的智能体与模型组合也仅达到11.9%的复现成功率,且智能体生成的测试还发现了15个零日漏洞。
English: CyberGym is a large-scale cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities across 188 software projects that tasks agents with generating proof-of-concept tests for vulnerability reproduction; even the best agent-model combination achieves only an 11.9% success rate, while agent-generated PoCs also uncover 15 zero-day vulnerabilities.
Authors:Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song
Abstract:
AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short, because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
中文: CyberGym作为一个包含1,507个真实漏洞的大规模基准平台,有效弥补了现有AI评估的不足,不仅能准确区分网络安全能力,还发现了35个零日漏洞和17个历史补丁缺陷。
English: CyberGym is introduced as a large-scale benchmark with 1,507 real-world vulnerabilities to address limitations in existing AI agent evaluations, effectively differentiating cybersecurity capabilities and leading to the discovery of 35 zero-day vulnerabilities and 17 incomplete patches.
Authors:Xiyu Zhao, Qimei Cui, Ziqiang Du, Weicai Li, Xi Yu, Wei Ni, Ji Zhang, Xiaofeng Tao, Ping Zhang
Abstract:
Personalized federated learning (PFL) offers a solution to balancing personalization and generalization by conducting federated learning (FL) to guide personalized learning (PL). Little attention has been given to wireless PFL (WPFL), where privacy concerns arise. Performance fairness of PL models is another challenge resulting from communication bottlenecks in WPFL. This paper exploits quantization errors to enhance the privacy of WPFL and proposes a novel quantization-assisted Gaussian differential privacy (DP) mechanism. We analyze the convergence upper bounds of individual PL models by considering the impact of the mechanism (i.e., quantization errors and Gaussian DP noises) and imperfect communication channels on the FL of WPFL. By minimizing the maximum of the bounds, we design an optimal transmission scheduling strategy that yields min-max fairness for WPFL with OFDMA interfaces. This is achieved by revealing the nested structure of this problem to decouple it into subproblems solved sequentially for client selection, channel allocation, and power control, and for the learning rates and PL-FL weighting coefficients. Experiments validate our analysis and demonstrate that our approach substantially outperforms alternative scheduling strategies by 87.08%, 16.21%, and 38.37% in accuracy, the maximum test loss of participating clients, and fairness (Jain's index), respectively.
Chinese: 本文针对无线个性化联邦学习提出了一种新颖的量化辅助高斯差分隐私机制,通过设计最优传输调度策略,在准确性、最大测试损失和公平性指标上显著优于其他方法。
English: This paper introduces a novel quantization-assisted Gaussian differential privacy mechanism for wireless personalized federated learning (WPFL) to enhance privacy and fairness, proposing an optimal transmission scheduling strategy that significantly outperforms alternatives in accuracy, maximum test loss, and fairness metrics.
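The mechanism's core step, clip, add Gaussian noise, then quantize so that quantization error contributes extra randomness on top of the injected noise, can be sketched in a few lines. The bit width and noise scale below are arbitrary placeholders, not calibrated to a formal privacy budget or to the paper's scheduling optimization.

```python
import numpy as np

def quantize_with_dp(update, n_bits=8, clip=1.0, sigma=0.01, rng=None):
    """Core step of a quantization-assisted Gaussian mechanism: clip the
    model update, add Gaussian noise, then quantize uniformly. Parameters
    here are illustrative, not privacy-calibrated."""
    rng = rng or np.random.default_rng()
    clipped = np.clip(update, -clip, clip)
    noisy = clipped + rng.normal(0.0, sigma, size=update.shape)
    step = 2.0 * clip / (2 ** n_bits - 1)          # uniform quantizer step
    return np.round((noisy + clip) / step) * step - clip
```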
Authors:Kedir Yassin Hussen, Walelign Tewabe Sewunetie, Abinew Ali Ayele, Sukairaj Hafiz Imam, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Abstract:
Large Language Models (LLMs) are transforming Natural Language Processing (NLP), but their benefits are largely absent for Africa's 2,000 low-resource languages. This paper comparatively analyzes African language coverage across six LLMs, eight Small Language Models (SLMs), and six Specialized SLMs (SSLMs). The evaluation covers language coverage, training sets, technical limitations, script problems, and language modelling roadmaps. The work identifies 42 supported African languages and 23 available public datasets, and it reveals a substantial gap: four languages (Amharic, Swahili, Afrikaans, and Malagasy) are covered consistently, while over 98\% of African languages remain unsupported. Moreover, the review finds that only the Latin, Arabic, and Ge'ez scripts are represented, while 20 actively used scripts are neglected. The primary challenges include data scarcity, tokenization biases, very high computational costs, and evaluation issues. Addressing them demands language standardization, community-driven corpus development, and effective adaptation methods for African languages.
中文: 大语言模型对非洲2000种低资源语言支持严重不足,仅覆盖42种语言且98%缺乏支持,面临数据匮乏和文字处理等挑战,亟需社区推动标准化和语料库建设。
English: Large Language Models largely overlook Africa's 2,000 low-resource languages, with only 42 supported and 98% unsupported, facing challenges like data scarcity and script limitations that require community-driven solutions.
Authors:Priyaranjan Pattnayak, Amit Agarwal, Hansa Meghwani, Hitesh Laxmichand Patel, Srikant Panda
Abstract:
Retrieval-Augmented Generation (RAG) systems and large language model (LLM)-powered chatbots have significantly advanced conversational AI by combining generative capabilities with external knowledge retrieval. Despite their success, enterprise-scale deployments face critical challenges, including diverse user queries, high latency, hallucinations, and difficulty integrating frequently updated domain-specific knowledge. This paper introduces a novel hybrid framework that integrates RAG with intent-based canned responses, leveraging predefined high-confidence responses for efficiency while dynamically routing complex or ambiguous queries to the RAG pipeline. Our framework employs a dialogue context manager to ensure coherence in multi-turn interactions and incorporates a feedback loop to refine intents, dynamically adjust confidence thresholds, and expand response coverage over time. Experimental results demonstrate that the proposed framework achieves a balance of high accuracy (95\%) and low latency (180ms), outperforming RAG and intent-based systems across diverse query types, positioning it as a scalable and adaptive solution for enterprise conversational AI applications.
中文摘要:本文提出一种融合检索增强生成与基于意图预设回复的混合框架,为企业级对话AI应用实现了高准确率和低延迟的平衡。
English Summary: This paper introduces a hybrid framework combining retrieval-augmented generation with intent-based responses, achieving high accuracy and low latency for enterprise conversational AI systems.
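The routing logic at the heart of the framework reduces to a confidence-gated dispatch, sketched below; the intent classifier, canned-response table, threshold, and RAG pipeline are assumed components, and the paper's dialogue context manager and feedback loop are omitted.

```python
def route(query, intent_classifier, canned, rag_pipeline, threshold=0.85):
    """Confidence-gated dispatch of the hybrid framework: serve a
    predefined answer on a high-confidence intent match, otherwise fall
    back to the RAG pipeline. All components are assumed stand-ins."""
    intent, confidence = intent_classifier(query)
    if confidence >= threshold and intent in canned:
        return canned[intent]        # fast path: near-constant latency
    return rag_pipeline(query)       # slow path: retrieve, then generate
```

In the paper's design, the feedback loop would adjust `threshold` and grow the `canned` table over time, shifting more traffic onto the fast path as intents accumulate.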
Authors:Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Aixin Sun, Yequan Wang
Abstract:
Humans naturally process real-world multimodal information in a full-duplex manner. In artificial intelligence, replicating this capability is essential for advancing model development and deployment, particularly in embodied contexts. The development of multimodal models faces two primary challenges: (1) effectively handling more than three modalities-such as vision, audio, and text; and (2) delivering full-duplex responses to rapidly evolving human instructions. To facilitate research on models that support both omnimodal processing and full duplexity, we present RoboEgo (alias: FLM-Ego), a unified model system designed to address both challenges. RoboEgo incorporates a backbone architecture and algorithms that natively support full duplexity, achieving a theoretical duplex latency of 80 ms. In streaming visually grounded conversations under real-world conditions, RoboEgo exhibits superior responsiveness and speech naturalness, while maintaining comparable content qualities to state-of-the-art semi-duplex omnimodal models-a feat previously considered unattainable by native full-duplex systems.
中文: RoboEgo是一个统一模型系统,旨在解决多模态人工智能中的挑战,支持全模态处理和全双工响应,在真实世界交互中实现低延迟和卓越性能。
English: RoboEgo is a unified model system designed to overcome challenges in multimodal AI by supporting omnimodal processing and full-duplex responses, achieving low latency and superior performance in real-world interactions.